How does sync gateway handle a node failure in the cluster?

#1

I have a two-node cluster, and yesterday one of the nodes went down. After about 10 minutes it was back up and rejoined the cluster, and after about 25 minutes it had fully warmed up and was in the Ready state. Last night we noticed random lags in Sync Gateway responses from the node that did NOT go down, so I checked the logs this morning and saw this line over and over again:
14:42:24.297499 2016-06-03T14:42:24.297Z WARNING: Skipped Sequence 3966241 didn’t show up in MaxChannelLogMissingWaitTime, and isn’t available from the * channel view. If it’s a valid sequence, it won’t be replicated until Sync Gateway is restarted. – db.func·005() at change_cache.go:206

These messages completely fill the log (the sequence value in it is one higher each time) and after a few hours sync gateway became unresponsive. Restarting sync gateway fixed the issue and it is working fine now with none of these messages in the log.
Is this related to the node failing and rejoining the cluster? Is there any way we can mitigate this issue in the future, or is restarting Sync Gateway after all nodes in the cluster are in the Ready state the only way to fix it?
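
To make the symptom easier to spot before Sync Gateway becomes unresponsive, the log could be scanned for this warning. The following is only a rough sketch: the regular expression is based on the single sample line quoted above, and the function names and the `threshold` value are hypothetical, not part of Sync Gateway.

```python
import re

# Pattern derived from the one sample log line in this post (an assumption).
# The "." in "didn.t" accepts either a straight or a curly apostrophe.
SKIPPED_RE = re.compile(r"WARNING: Skipped Sequence (\d+) didn.t show up")


def skipped_sequences(lines):
    """Return the sequence numbers found in 'Skipped Sequence' warnings."""
    found = []
    for line in lines:
        match = SKIPPED_RE.search(line)
        if match:
            found.append(int(match.group(1)))
    return found


def needs_restart(lines, threshold=100):
    """Hypothetical rule of thumb: flag a restart once enough warnings pile up."""
    return len(skipped_sequences(lines)) >= threshold
```

It could be fed the lines of the log file, for example `needs_restart(open(path, encoding="utf-8"))`, where `path` is wherever your Sync Gateway log is written.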

#2

@alexegli,

You need a minimum of a three-node CB cluster for auto-failover to be used.

Also make sure your CB bucket has index replica checked (in the image below it is NOT checked).

Try your testing with three nodes.

#3

I didn’t failover the node because I was worried about data loss (a big red warning popped up when I clicked on Failover) and because I knew the node would be back up in a few minutes.

My question is about sync gateway though. Why did sync gateway fail to handle a node failing and then coming back online and is there anything I can do to prevent that issue from happening again?

#4

@alexegli,

The long failure time (25 minutes) is what got you. It happened to me. I set auto-failover to 30 seconds on CB Server and pulled the network cable from one of the nodes. SG gave me "not available" for 30 seconds, but as soon as auto-failover happened SG was up and running, serving the _changes feed.
The image below is the mini datacenter I used to test.
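
For reference, the 30-second auto-failover setting can also be applied through the Couchbase Server REST API (`POST /settings/autoFailover` on the admin port) instead of the web console. This is a minimal sketch that only builds the request: the host is a placeholder, the helper name is hypothetical, and authentication is left out, so check it against the documentation for your server version before using it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_autofailover_request(host="localhost", timeout_seconds=30):
    """Build (but do not send) a request that enables auto-failover.

    The host and the 8091 admin port are assumptions about the setup;
    admin credentials must be added before the request is sent.
    """
    body = urlencode({"enabled": "true", "timeout": timeout_seconds}).encode("ascii")
    url = f"http://{host}:8091/settings/autoFailover"
    return Request(url, data=body, method="POST")
```

Sending it would be a matter of adding a basic-auth header for an admin user and passing the request to `urllib.request.urlopen`.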

#5

That is an awesome diagram! Nice setup. So the issue was that I didn’t do failover and that sync gateway handles an actual failover better than a node failure and recovery. I guess in the future I will just restart it manually after the node comes back up.

#6

With only 2 nodes, if one fails it means a bunch of things:

  1. You lose 50% of your DB capacity :frowning:
  2. Auto-failover will not work with only two nodes.

Sync Gateway recovery.
Sync Gateway is just an app like any other, so asking it to recover from a huge pause or outage of about 30 minutes is never a good idea. We are talking about a major outage. CB has XDCR, so you should fail over to another data center with SG & CB all ready to go, just in case auto-failover does not happen for some odd reason or you want to keep to a very tight SLA.

#7

Thanks. Fortunately our SLA is not that tight and sync gateway did recover fine once I restarted it. If we have greater restrictions in the future I will look into increasing the number of nodes.