Continuous Replication throws socket disconnect

We are creating 6 continuous replicators in our C# application. They sync up our gateway with a CBLite 2.1.2 database.

They start up fine and actively sync changes. After a few minutes, however, they all get “unclean socket disconnect” error messages associated with an errno of 108 (Unknown error). Then they successfully attempt a restart, work for a while, and fail again. It seems that we do eventually replicate the proper data.

Does this error signal a problem with our system, or is it a side effect of continuous replication? Do we have too many replicators running at the same time? Why is this socket disconnect happening, and what can we do to stop it?
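For context, each replicator is configured roughly like this (a simplified sketch using the Couchbase Lite 2.x .NET API; the database name, endpoint URL, and credentials are placeholders):

```csharp
using System;
using Couchbase.Lite;
using Couchbase.Lite.Sync;

// Simplified sketch of one of the six continuous replicators.
var database = new Database("favorites");
var target = new URLEndpoint(new Uri("wss://Blah-deBlah.com:443/favorites"));

var config = new ReplicatorConfiguration(database, target)
{
    Continuous = true,
    ReplicatorType = ReplicatorType.PushAndPull,
    Authenticator = new BasicAuthenticator("quiggle", "********")
};

var replicator = new Replicator(config);
replicator.AddChangeListener((sender, args) =>
{
    // This is where the POSIX 108 error surfaces before the retry.
    if (args.Status.Error != null)
        Console.WriteLine($"Replicator error: {args.Status.Error}");
});
replicator.Start();
```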

Below is the verbose logging of the cycle of one of our replicators.

(Replicator) [6] Replicator[<*> wss://Blah-deBlah.com/favorites]: Retrying…
11:09:39.648785| [Sync]: {Repl#69}==> class litecore::repl::Replicator I:\Blah-deBlah\qa\data\favorites.cblite2\ ->wss://Blah-deBlah.com:443/favorites/_blipsync
2019-05-22 11:09:39.6414|DEBUG|Replicating Resident Sync Status (0/0)
11:09:39.653697| [Sync]: {Repl#69} Push=continuous, Pull=continuous, Options={{auth:{password:"********", type:"Basic", username:"quiggle"}, headers:{User-Agent:"CouchbaseLite/2.1.2 (.NET; Microsoft Windows 10.0.17763 ) Build/13 LiteCore/ (15) Commit/9aebf28"}}}
11:09:39.660310| [Sync]: (WebSocketWrapper) [27] WebSocket connecting to Blah-deBlah.com:443
11:09:39.662591| [Sync]: (Replicator) [6] Replicator[<*> wss://Blah-deBlah.com/favorites] is Connecting, progress 0/0
11:09:39.666315| [Sync]: {Repl#69} activityLevel=connecting: connectionState=1
11:09:39.667901| [Sync]: (Replicator) [6] Replicator[<*> wss://Blah-deBlah.com/favorites] is Connecting, progress 0/0
11:09:39.670830| [Sync]: {DBWorker#70} activityLevel=idle: pendingResponseCount=0, eventCount=1
11:09:39.673763| [Sync]: {Repl#69} Local checkpoint 'cp-QpQuIpDkRRQEW7nxigrM0tOHShY=' is [46, '16195']; getting remote …
2019-05-22 11:09:39.6758|DEBUG|Replicating Favorites Sync Status (0/0)
11:09:39.678005| [Sync]: {Repl#69} activityLevel=connecting: connectionState=1
11:09:39.679314| [Sync]: {Repl#69} pushStatus=busy, pullStatus=busy, dbStatus=idle, progress=0/0
11:09:39.680962| [Sync]: {Repl#69} activityLevel=connecting: connectionState=1
11:09:39.909929| [Sync]: {Repl#69} activityLevel=connecting: connectionState=1
11:09:39.911321| [Sync]: (WebSocketWrapper) [68] WebSocket CONNECTED!
11:09:39.921368| [Sync]: {Repl#69} Connected!
11:09:39.927317| [Sync]: {Repl#69} activityLevel=busy: connectionState=2
11:09:39.928758| [Sync]: {Repl#69} now busy
11:09:39.943385| [Sync]: {DBWorker#68} activityLevel=idle: pendingResponseCount=0, eventCount=1
11:09:39.946282| [Sync]: (Replicator) [106] Replicator[<*> wss://Blah-deBlah.com/favorites] is Busy, progress 0/0
11:09:39.949695| [Sync]: {Repl#69} activityLevel=busy: connectionState=2
11:09:39.986338| [Sync]: {Repl#69} Received remote checkpoint: [46, '16195'] rev='0-13'
11:09:39.988633| [Sync]: {Repl#69} activityLevel=busy: connectionState=2
11:09:40.004159| [Sync]: {Repl#69} pushStatus=busy, pullStatus=busy, dbStatus=idle, progress=0/0
11:09:40.006388| [Sync]: {Repl#69} activityLevel=busy: connectionState=2
11:09:40.012199| [Sync]: {Repl#69} pushStatus=busy, pullStatus=busy, dbStatus=idle, progress=0/0
11:09:40.014216| [Sync]: {Repl#69} activityLevel=busy: connectionState=2
11:09:40.031008| [Sync]: {Repl#69} pushStatus=idle, pullStatus=busy, dbStatus=idle, progress=0/0
11:09:40.033762| [Sync]: {Repl#69} activityLevel=busy: connectionState=2
11:09:40.061380| [Sync]: {Repl#69} pushStatus=idle, pullStatus=idle, dbStatus=idle, progress=0/0
11:09:40.063015| [Sync]: {Repl#69} activityLevel=idle: connectionState=2
11:09:40.064692| [Sync]: {Repl#69} now idle
11:09:40.065412| [Sync]: (Replicator) [90] Replicator[<*> wss://Blah-deBlah.com/favorites] is Idle,
11:10:32.096331| [Sync]: (WebSocketWrapper) [3] Failed to read from stream too many times, signaling closed…
11:10:32.099446| [Sync]: (WebSocketWrapper) [3] WebSocket CLOSED WITH ERROR: Couchbase.Lite.CouchbasePosixException: CouchbaseLiteException (POSIXDomain / 108): Unknown error.
11:10:32.105582| [WS] WARNING: {C4SocketImpl#82}==> class litecore::repl::C4SocketImpl wss://Blah-deBlah.com:443/favorites/_blipsync
11:10:32.107983| [WS] WARNING: {C4SocketImpl#82} Unexpected or unclean socket disconnect! (reason=errno, code=108)
11:10:32.110013| [Sync]: {Repl#69} Connection closed with errno 108: "CouchbaseLiteException (POSIXDomain / 108): Unknown error." (state=2)
11:10:32.113462| [Sync] ERROR: {Repl#69} Got LiteCore error: CouchbaseLiteException (POSIXDomain / 108): Unknown error. (2/108)
11:10:32.116206| [Sync]: {Repl#69} activityLevel=stopped: connectionState=-1
11:10:32.117568| [Sync]: {Repl#69} now stopped

This log message: “(WebSocketWrapper) [3] Failed to read from stream too many times, signaling closed” sounds like the error comes from the .NET-specific WebSocket client code, not from the core replicator itself. @Sandy_Chuang or @borrrden should be able to find the source of this error.

The exception POSIXDomain / 108 looks weird, as there doesn’t seem to be a well-known errno value for 108 … at least, there isn’t one in <sys/errno.h> on my Mac.

Does Sync Gateway log any errors at the time this occurs? The problem might be on the server side, or at least we might get some clues by seeing the messages there.

(Actually, it appears that POSIX does not define specific numbers for errno codes, only their names. Well, that’s useful! I did find a cross-platform comparison of the numeric codes.)

According to that chart, the only error with code 108 is ESHUTDOWN, “Cannot send after transport endpoint shutdown”, on Linux (and only Linux).

We have a proxy load-balancing between 2 SG servers. The admin here says we are supposed to be sticky: a given client’s requests will always be sent to one server or the other.

Looking at the logs for both SGs, I see my ID in both files, so perhaps that is an issue.

I also see the following message at about the same time the client reports its 108 error:
http: TLS handshake error from 10.3.1.160:42238: tls: client offered an unsupported, maximum protocol version of 300

I get multiple messages like this - I assume one per replicator. The port number changes for each one.

Hope this info helps.

POSIX codes are hideously nonstandard above a certain point. In fact, I went through the trouble of documenting them all in the .NET code base. As you can see, 1-10, 12-14, 16-25, and 27-34 are consistent between platforms, but the rest are not.

108 is a Windows-specific POSIX code that represents ECONNRESET (FYI, that’s 54 on macOS and 104 on Linux). However, it is not being thrown by the system in this case. The log message you see means that there was a failure reading from the WebSocket stream, and .NET is interpreting it as a reset connection so that the replicator can try again.
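The per-platform divergence is easy to see from Python’s `errno` module, which exposes the local platform’s numbers for the standard names (the values in the comments are what you get on Linux and macOS):

```python
import errno

# POSIX standardizes the errno *names*; the *numbers* vary per platform.
print(errno.ECONNRESET)            # 104 on Linux, 54 on macOS
print(errno.errorcode.get(108))    # 'ESHUTDOWN' on Linux
```

So a raw number like 108 only means something once you know which platform’s table produced it.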

I am not a network guy, so first off: is this impacting the replicators? That is, do they simply reset and are able to pull/push as needed, until they go idle for a while and the connection is reset again?

Or do we need to get our network guy in on this, as it’s a failure on our network interface?

Our network guy is saying we are running on TLS 1.0 and would like to upgrade to 1.2.
Do you think that will help this problem?

Looks like we are pretty much running on TLS 1.2, so it’s not really an issue.

If you are seeing a lot of these, then it is likely a problem with the network. Specifically, a lot of network engineers like to close connections that stay open for a long time, and that’s not a good thing for WebSockets, since they are designed to stay open long-term. The majority of the time people complain about this kind of thing, it’s because of an overzealous proxy or something similar.

Yes, that’s the route we are going down now: checking the timeout configuration.
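For anyone checking the same thing, the relevant knobs on an nginx proxy in front of Sync Gateway look roughly like this (a hypothetical sketch; the hostnames and port 4984 are placeholders, and the point is simply that read/send timeouts must exceed the replicator’s ping interval):

```nginx
upstream sync_gateway {
    ip_hash;                      # sticky sessions: one client -> one SG node
    server sg1.example.com:4984;
    server sg2.example.com:4984;
}

server {
    listen 443 ssl;
    location / {
        proxy_pass http://sync_gateway;
        proxy_http_version 1.1;                 # required for WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 360s;                # > the 300 s ping interval
        proxy_send_timeout 360s;
    }
}
```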

thanks, bryan

Is the default heartbeat for continuous replicators still 30 seconds? Our client-side software is not setting any parameter, so I assume we are taking the default configuration. Again, Couchbase Lite v2.1.2.

Heartbeat does not apply anymore. The heartbeat in 1.x was there to keep an HTTP connection open, but 2.0+ uses WebSockets, which keep themselves open.

OK, so that means that whatever timeout value we set on our server, it will still close the replicator WebSockets if there is no activity for that period?

Do the replicators do anything to keep the WebSocket active? Long poll?

This page suggests the replicator pings every 300 seconds. Is this still accurate?
https://docs.couchbase.com/sync-gateway/current/load-balancer.html

If you have a server that shuts off incoming connections due to inactivity, then it is very possible that the replicator will be shut off prematurely. “Long poll” is not a relevant term anymore. WebSockets are designed to get rid of HTTP’s limitations with regard to long-running connections, so basically a connection is considered open until it receives a message that it is closed, or there is some error in the underlying transport (TCP). For this reason, Sync Gateway will send a ping to the client every 300 seconds (or maybe the reverse, or both, but at least one side) to make sure the other end is still there. If it has silently disconnected, the ping will fail (or no pong will be received in a timely manner) and the connection will be torn down.

TL;DR: Yes, 300 seconds is still the case, and you should not have any timeouts lower than that. WebSockets are better at taking care of themselves than HTTP connections are.
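For the curious, these keep-alives are WebSocket protocol frames (RFC 6455 ping/pong, opcodes 0x9/0xA), not HTTP requests. A short sketch of what a ping frame looks like on the wire:

```python
def ping_frame(payload: bytes = b"") -> bytes:
    """Build an unmasked (server-to-client) RFC 6455 ping frame.

    Only handles payloads under 126 bytes; longer payloads need
    an extended length field.
    """
    assert len(payload) < 126
    return bytes([0x89, len(payload)]) + payload  # 0x89 = FIN bit | opcode 0x9

# An empty keep-alive ping is just two bytes (before TLS wraps it):
print(ping_frame().hex())  # -> "8900"
```

(A pong is the same shape with opcode 0xA, i.e. a first byte of 0x8A.)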

Fantastic, thanks for all the info and help!

Hi @borrrden,

What software can capture the ping <–> pong packets between SG and Lite? Using Wireshark, I can’t see the heartbeat packets.
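(A likely reason Wireshark shows nothing: the replication runs over wss://, so the ping/pong frames are inside TLS and appear only as “Application Data”. You would need to capture at a point where the traffic is unencrypted, e.g. between the proxy and Sync Gateway if that hop is plain ws://, or give Wireshark the TLS session keys. A hypothetical tshark invocation for an unencrypted capture, where the interface and the default SG port 4984 are assumptions:)

```sh
# Decode traffic on Sync Gateway's port as HTTP/WebSocket, then show
# only ping (opcode 9) and pong (opcode 10) frames:
tshark -i eth0 -d tcp.port==4984,http \
       -Y 'websocket.opcode == 9 || websocket.opcode == 10'
```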

thanks

angular