Do CBLite replications give up reconnection attempts when offline?

.net

#1

We have two long-lived replication instances that we start on a timer to do a pull/push with the sync gateway. When we go offline, it appears that the replications will work for about 4 more interval ticks and then they stop raising events forever.

This seems to be consistent with the ExponentialBackoffStrategy log statements I have found:

2018-06-13 16:25:05,981 DEBUG [11664] [SYNC (ExceptionResolver)] Rule #3: Exception is WebException and status is ConnectFailure, Timeout, ConnectionClosed, RequestCanceled, or NameResolutionFailure, ruling transient...
System.Net.WebException: The remote name could not be resolved: '<redacted>'
   at System.Net.HttpWebRequest.EndGetRequestStream(IAsyncResult asyncResult, TransportContext& context)
   at System.Net.Http.HttpClientHandler.GetRequestStreamCallback(IAsyncResult ar)
2018-06-13 16:25:05,988 DEBUG [TabletPCR.exe] [11664] [SYNC (TransientErrorRetryHandler)] Out of retries for error, throwing
System.AggregateException: One or more errors occurred. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The remote name could not be resolved: '<redacted>'
   at System.Net.HttpWebRequest.EndGetRequestStream(IAsyncResult asyncResult, TransportContext& context)
   at System.Net.Http.HttpClientHandler.GetRequestStreamCallback(IAsyncResult ar)
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
---> (Inner Exception #0) System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The remote name could not be resolved: '<redacted>'
   at System.Net.HttpWebRequest.EndGetRequestStream(IAsyncResult asyncResult, TransportContext& context)
   at System.Net.Http.HttpClientHandler.GetRequestStreamCallback(IAsyncResult ar)
   --- End of inner exception stack trace ---<---

My question is: once a replication goes offline and the retry strategy is exhausted, is it expected behavior that it stops attempting to reconnect indefinitely?

If so, how can we identify this scenario so that we know that we need to re-instantiate the replications and start from scratch?


#2

Does this mean continuous? The behavior here is different between continuous and non-continuous. The former should keep trying until it either hits an error that is not recoverable or until told to stop. The latter will only retry a handful of times before giving up.

In either case, the way you can tell that you need to start again is by monitoring the Changed callback. If the replicator stops with an error, you can examine it to determine what to do.
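For a one-shot replicator, that check can be sketched roughly as follows against the 1.x .NET API (ReplicationStatus, LastError, and e.Source are the 1.x names as I recall them; RecreateReplications is a hypothetical placeholder for your own restart logic):

```csharp
// Sketch only: assumes Couchbase Lite .NET 1.x.
replication.Changed += (sender, e) =>
{
    var repl = e.Source;
    // A one-shot replicator that stops while still holding an error
    // has given up; recreate it instead of calling Start() again.
    if (repl.Status == ReplicationStatus.Stopped && repl.LastError != null)
    {
        RecreateReplications();  // hypothetical app-side restart logic
    }
};
```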


#3

By “long-lived” I mean we create the instance of the push and pull replications one time and re-use those instances for the lifetime of the application, as opposed to creating a fresh replication instance at each polling interval of our timer.

These replications are non-continuous, so it sounds like the behavior is consistent with what you described above. However, on the very last retry attempt, we do not get a Changed event:

2018-06-13 16:25:03,839 DEBUG [TabletPCR.exe] [2952 ] Replication timer elapsed. Documents will be synced if changes are detected.
2018-06-13 16:25:03,866 DEBUG [TabletPCR.exe] [2952 ] Setting session cookie on replicator Puller f3b93767-009b-4882-8f98-4f4c305a8c0a.
2018-06-13 16:25:03,882 DEBUG [TabletPCR.exe] [2952 ] Starting one-shot pull replication on pcr_tenant_bff750cb-338c-e711-80c4-0050569c053d for the following channels:

	<irrelevant channels here>

2018-06-13 16:25:03,887 DEBUG [TabletPCR.exe] [2952 ] Waiting to acquire lock for replication Puller f3b93767-009b-4882-8f98-4f4c305a8c0a.
2018-06-13 16:25:03,891 DEBUG [TabletPCR.exe] [2952 ] Lock acquired for replication Puller f3b93767-009b-4882-8f98-4f4c305a8c0a
2018-06-13 16:25:03,895 DEBUG [TabletPCR.exe] [2952 ] Starting Couchbase Lite replication: Puller f3b93767-009b-4882-8f98-4f4c305a8c0a with filters sync_gateway/bychannel
2018-06-13 16:25:03,899 DEBUG [TabletPCR.exe] [2952 ] [SYNC (Replication)] Start() called, firing Start...
2018-06-13 16:25:03,907 DEBUG [TabletPCR.exe] [2952 ] [SYNC (Replication)] Preparing to fire Start
2018-06-13 16:25:03,915 DEBUG [TabletPCR.exe] [2952 ] Waiting for replication Puller f3b93767-009b-4882-8f98-4f4c305a8c0a to finish...
2018-06-13 16:25:06,005 WARN  [TabletPCR.exe] [11664] [SYNC (Replication)] Unable to save remote checkpoint for 53128149-de7e-4bc9-b6ae-7ad470377362
2018-06-13 16:25:06,013 INFO  [TabletPCR.exe] [11664] [SYNC (Replication)] Pusher 53128149-de7e-4bc9-b6ae-7ad470377362 stopped.  Elapsed time 12.731 sec
2018-06-13 16:25:06,237 WARN  [TabletPCR.exe] [3940 ] [SYNC (Replication)] Unable to save remote checkpoint for 1dd83698-6fde-4a5c-bfb1-052e24052fe8
2018-06-13 16:25:06,245 INFO  [TabletPCR.exe] [3940 ] [SYNC (Replication)] Pusher 1dd83698-6fde-4a5c-bfb1-052e24052fe8 stopped.  Elapsed time 12.578 sec

The last entry we write is “Waiting for replication Puller f3b93767-009b-4882-8f98-4f4c305a8c0a to finish…”, which blocks indefinitely as we wait for a ManualResetEvent to be set when we get the Changed event.

I can send the full logs that show the last 2 polling intervals (currently about 350 lines), which shows that the events are being raised and handled. It’s just the last attempt where no event is raised.
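As a defensive measure on our side (not a fix), the wait on the ManualResetEvent could be bounded so that a missing Changed event cannot wedge the timer thread indefinitely. A minimal sketch, assuming the 1.x API names; the two-minute timeout is an arbitrary example value:

```csharp
// Sketch: bound the wait so a missing terminal event cannot block forever.
var finished = new ManualResetEventSlim(false);
replication.Changed += (s, e) =>
{
    // Treat Stopped or Offline as terminal for a one-shot run.
    if (e.Source.Status == ReplicationStatus.Stopped ||
        e.Source.Status == ReplicationStatus.Offline)
    {
        finished.Set();
    }
};
replication.Start();
if (!finished.Wait(TimeSpan.FromMinutes(2)))  // timeout value is arbitrary
{
    // No Changed event arrived: assume the replicator is wedged and
    // flag it for re-instantiation at the next timer tick.
}
```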


#4

Can you describe what the sequence of callbacks is that you receive? For one-shot I wouldn’t expect to get changed callbacks for any retry logic since it is transparent in the HTTP handler (as in, it’s not a part of the replicator logic, but a part of the HTTP messaging logic). Eventually when giving up the expected behavior is that the replicator will stop and callback with a stopped status. If that is not happening then it is probably a bug, but I’d need more information:

Is this happening from going offline in the middle of a running replicator, or before starting a new one? If the latter there might not be any events since the reachability component will report that the endpoint is not reachable (though I’m not sure I remember that correctly).


#5

When online:

  • Application timer ticks
  • Start replication
  • Replication raises changed event with status of “Active”
  • Replication downloads stuff (if there are changes)
  • Replications raises changed event with status of “Stopped”
  • Timer elapsed handler complete
  • Sequence repeats at next interval

Disconnect WiFi (before timer ticks, not during a replication)

  • Application timer ticks
  • Start replication
  • Replication raises changed event with status of “Active”
  • Replication raises changed event with status of “Active” (not sure why twice)
  • Replication raises changed event with status of “Offline”
  • Timer ticks 2 more times, same (expected) behavior of “Offline” being raised
  • On the 4th tick, the replication starts and no Changed event is ever received, either for “Active”, “Stopped”, or “Offline”

Correct, we do not get any events while CBLite is retrying. We just don’t get the final “stopped” or “offline” event once it gives up.


#6

I’m unclear about this part. If you are still offline, the replication cannot start. I don’t think receiving offline events in a non-continuous replicator is correct (is it, @Jens?). A replicator raises events whenever its status changes OR it gets a new error to report (that is probably the reason for the double Active).

I think I see why it would report offline, and I am wondering if that is correct or not. There seems to be a lack of distinction between inability to reach an endpoint because the network is shut off on the machine and inability to reach for other reasons. That’s probably a bug. Can you provide a log of the run so I can try to follow the flow?


#7

CBLite Offline Replications.zip (7.1 KB)

See attached zip.

You can search for statements like:

Pusher 1dd83698-6fde-4a5c-bfb1-052e24052fe8 status: Active

and

Replication changed event received for Pusher 53128149-de7e-4bc9-b6ae-7ad470377362. Current Status: [Active], Previous Status: [Stopped], Is Running: [True], Changes Count: [1], Completed Count: [1], Last Error: []

… to see the events being raised when the replication is offline, even after calling Start().

The CBLite logs are interspersed.


#8

I think there is a mistake in the codebase in the handling of starting a replication in an environment that already has no connectivity. It seems to enter the replicator retry loop regardless of whether it is continuous or non-continuous. I notice a change of exception from one that would be considered permanent (no such host) to one that is considered transient (cannot resolve DNS name), so that is likely playing a part. I also noticed some statements involving clearing the error (“Clearing LastError from Replication Puller f3b93767-009b-4882-8f98-4f4c305a8c0a”). What is this part actually doing?

As an immediate potential workaround, you could examine the exception that you are clearing and rule that the replication is “finished” based on it. I think a likely fix on the Couchbase side would be a more thorough analysis of what is going wrong at the beginning of a replication.
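A rough sketch of that workaround, assuming LastError surfaces the wrapped WebException seen in the logs (GetBaseException walks the InnerException chain down to the root cause):

```csharp
// Sketch: before clearing LastError, inspect the root cause and treat
// connectivity failures as terminal for this one-shot run.
var root = replication.LastError?.GetBaseException();
if (root is WebException webEx &&
    (webEx.Status == WebExceptionStatus.NameResolutionFailure ||
     webEx.Status == WebExceptionStatus.ConnectFailure ||
     webEx.Status == WebExceptionStatus.Timeout))
{
    replication.Stop();  // consider this run finished
    // ...and mark the replication for re-instantiation before retrying.
}
```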

Also as a side note, does the situation clear up if you put the device back online when it is in this state?


#9

This sets the replicator’s LastError property to null after we have inspected it and handled the error (if we can). We had to do this because we would continue to get “unauthorized” errors with each status change event, even after we had re-authenticated and updated the replication’s cookie.

No, once it has reached this state, it never goes back online to active again.

I notice that ReplicationOptions has a RetryStrategy property. Is it worth implementing our own strategy with a much larger number of attempts?


#10

I see; clearing the error will actually trigger another change callback in that case. Actually, the logging there is a bit careless: it will log that an error has occurred even when the error is null.

I am not sure that the retry strategy will have any effect. That implementation is for HTTP based failures, and will only retry if it receives a certain number of predefined transient HTTP status codes (none of which are happening here). What it looks like is that Start is being triggered, but the trigger is ignored because it is not in a valid state to receive “start”.

What is the final state that the Replicator is in when it is not responding?