Problem with Sync Gateway

In my test environment I have started to see problems with the Sync Gateway server. It basically ends up not responding on the network. This means I cannot SSH to it - and it doesn’t respond on port 4985. I can usually ping it, but the responses are slow and occasionally time out.

The server is running on a VM, so I can open a console directly on it, and it isn’t busy at all. Things I have tried:

  1. Changed from a fixed mapped DHCP address to a fixed IP address
  2. Restarted the network service
  3. Restarted the Sync Gateway service

Only restarting the SG makes the server respond again.

When the problem occurs I get error messages like:

Apr 30 15:02:54 sg1 bash: #033[1;34m2020-04-30T15:02:54.437+02:00 [INF] HTTP: c:[5fb520ab] #303:     --> BLIP+WebSocket connection error: read tcp 192.168.42.213:4984->192.168.42.240:42082: read: connection timed out#033[0m
Apr 30 15:02:54 sg1 bash: #033[1;34m2020-04-30T15:02:54.437+02:00 [INF] HTTP: c:[5fb520ab] #303:    --> BLIP+WebSocket connection closed#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;33m2020-04-30T15:06:22.517+02:00 [WRN] c:data-SG Error processing DCP stream - will attempt to restart/reconnect if appropriate: pkt.Receive, err: read tcp 192.168.42.213:52646->192.168.42.211:11210: read: connection reset by peer. -- base.(*DCPReceiver).OnError() at dcp_receiver.go:61#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;34m2020-04-30T15:06:22.618+02:00 [INF] CBGoUtilsLogger: Using plain authentication for user <ud>remoteSync</ud>#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;33m2020-04-30T15:06:22.739+02:00 [WRN] c:data-SG DCP RollbackEx request - rolling back DCP feed for: vbucketId: 329, rollbackSeq: 0. -- base.(*DCPCommon).rollbackEx() at dcp_common.go:175#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;33m2020-04-30T15:06:22.741+02:00 [WRN] c:data-SG DCP RollbackEx request - rolling back DCP feed for: vbucketId: 228, rollbackSeq: 0. -- base.(*DCPCommon).rollbackEx() at dcp_common.go:175#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;33m2020-04-30T15:06:22.742+02:00 [WRN] c:data-SG DCP RollbackEx request - rolling back DCP feed for: vbucketId: 415, rollbackSeq: 0. -- base.(*DCPCommon).rollbackEx() at dcp_common.go:175#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;33m2020-04-30T15:06:22.744+02:00 [WRN] c:data-SG DCP RollbackEx request - rolling back DCP feed for: vbucketId: 123, rollbackSeq: 0. -- base.(*DCPCommon).rollbackEx() at dcp_common.go:175#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;33m2020-04-30T15:06:22.744+02:00 [WRN] c:data-SG DCP RollbackEx request - rolling back DCP feed for: vbucketId: 407, rollbackSeq: 0. -- base.(*DCPCommon).rollbackEx() at dcp_common.go:175#033[0m
Apr 30 15:06:22 sg1 bash: #033[1;33m2020-04-30T15:06:22.744+02:00 [WRN] c:data-SG DCP RollbackEx request - rolling back DCP feed for: vbucketId: 455, rollbackSeq: 0. -- base.(*DCPCommon).rollbackEx() at dcp_common.go:175#033[0m

192.168.42.211 and …212 are my Couchbase servers - and they are both responding fine from any other computer.

In my setup the SG is behind an Nginx server for access from the mobile clients.

What can I do to troubleshoot this? Could it be lack of resources somewhere?

SG is version: Couchbase Sync Gateway/2.7.2(2;583d2dc) CE
CB is version: Community Edition 6.0.0 build 1693 ‧ IPv4

Thanks in advance for any advice!

/John

I don’t know if my problems are related to running many tests of an app against this server. Many of those tests end by just restarting the app (or it simply crashes - that’s the nature of developing), which could mean that open connections are left lingering around…

Are there any specific steps I should take to close down the database in my app “nicely”? I’m using C# in Xamarin Forms.

Have you tried checking whether there are a lot of connections stuck open on the Sync Gateway side to confirm that theory?
If you’re exhausting all available connections or file descriptors on your server, it would certainly manifest in the same way (unable to SSH, unreliable ping, etc.)
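
One way to check for connections stuck open is to count TCP sockets per state on the Sync Gateway host. The following is a minimal Python sketch, assuming a Linux host where `/proc/net/tcp` is readable. The helper name `count_states` is hypothetical and the port 4984 is taken from the log lines above; neither is part of any Couchbase tooling. A large and growing number of sockets in `ESTABLISHED` or `CLOSE_WAIT` would be consistent with the lingering-connection theory.

```python
from collections import Counter
from pathlib import Path

# Hex state codes used by the Linux kernel in /proc/net/tcp and /proc/net/tcp6.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, local_port):
    """Count sockets per TCP state for one local port (hypothetical helper)."""
    counts = Counter()
    # The first line of the table is a header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is "HEXADDR:HEXPORT"; the port is after the last colon.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    # 4984 is the public Sync Gateway port seen in the log lines (assumption:
    # that is the port the clients end up connected to).
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(table)
        if path.exists():
            print(table, dict(count_states(path.read_text(), 4984)))
```

Running it a few times while the tests are in progress should show whether the counts keep climbing or level off.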

There’s some info about tuning file descriptor limits and TCP options in here if you’ve not done so already:
https://docs.couchbase.com/sync-gateway/2.7/os-level-tuning.html

Have you also made sure to configure nginx for websocket timeouts/keepalives as per the docs?
https://docs.couchbase.com/sync-gateway/current/load-balancer.html#basic-nginx-configuration-for-sync-gateway

Thanks for the pointers. I had most of them set already - but not fs.file-max and the corresponding value in the sync_gateway.json file.

I have added them - and will see what happens :+1:
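
To see whether a new limit actually reached the running process, it can help to compare the system-wide `fs.file-max` with the per-process limit and the number of descriptors currently open. The following is a rough Python sketch, assuming a Linux host and that the Sync Gateway process ID is passed on the command line; the function names are hypothetical. Note that for a service started by systemd, the per-process limit typically comes from the unit's `LimitNOFILE` setting, so raising `fs.file-max` alone may not change it.

```python
import os
import sys
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits, or None."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            return soft, hard
    return None


def report(pid):
    """Print system-wide and per-process descriptor figures for one PID."""
    proc = Path("/proc") / str(pid)
    limits = parse_max_open_files((proc / "limits").read_text())
    # Listing /proc/<pid>/fd usually requires root or the process owner.
    open_fds = len(os.listdir(proc / "fd"))
    file_max = Path("/proc/sys/fs/file-max").read_text().strip()
    print(f"system-wide fs.file-max: {file_max}")
    print(f"process {pid} soft/hard limit: {limits}")
    print(f"process {pid} open descriptors: {open_fds}")


if __name__ == "__main__":
    if len(sys.argv) == 2:
        report(int(sys.argv[1]))
    else:
        print("usage: python fd_check.py <pid>")
```

If the open-descriptor count sits close to the soft limit when the server stops responding, descriptor exhaustion becomes a likely cause.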

Ok, it didn’t solve the issue… :frowning:

Not sure how to troubleshoot this further…?

I tried to look for open connections. I’m not sure I’m doing that correctly, but here are a couple of screenshots; to my knowledge there don’t seem to be too many connections.

I’m not sure what the IPv6 entries are doing there though as I have tried to disable IPv6 in the network settings…?

Just a quick follow up.

I made some more adjustments to the OS setup, specifically for systemd, and it seems to run more stably now :+1:

We are just about to put this into production - so I’ll know more about it in the coming weeks :innocent::grinning: