Cleanly Fail a node while using sync gateway

Caroline · August 21, 2017, 3:50am

Hi, I’m wondering what the proper procedure is when a server node fails (server goes down, graceful failover, or rebalance) while using a sync gateway.

In a 3 node setup, I noticed when a server node went down it stopped replication through the sync gateway. When I failed over that node, I started getting these messages on the gateway:

“(look for: time=2017-08-19T02:28:28.863+00:00 _level=ERROR _msg=go-couchbase: Error connecting to tap feed of IP:: dial tcp IP: getsockopt: connection timed out”

Again, no replication through the gateway.

Rebalancing after the node was failed over had similar effects, and I continually saw

“new configuration for bucket …” logs on the gateway

So my concerns are:
How can we safely remove/fail/add a node while running the sync gateway and not stop replication?

Is this possible?

Thanks,
Caroline

househippo · August 21, 2017, 9:51am

@Caroline what versions of SG and CB are you running?

Caroline · August 21, 2017, 2:35pm

Sorry, I forgot to mention that:
Gateway: 1.4.1
Server: 4.6.1

traun · August 21, 2017, 5:26pm

Hi @Caroline,

Sync Gateway 1.5 will support multiple Couchbase Server URLs (details in this issue), which will improve the situation a lot in that regard.

In the meantime, for Sync Gateway 1.4, here is a possible workaround:

1. Create your own monitoring script that monitors all Couchbase Server URLs.
2. If a node is detected to be down, rewrite the Sync Gateway config to point to one of the Couchbase Server URLs that is still alive.
3. Restart Sync Gateway.
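
The steps above could be sketched roughly as follows. This is only an illustration, not an official tool: it assumes the Sync Gateway config is JSON with a `databases.<name>.server` entry, that a plain TCP connect to the server's admin port is an adequate liveness check, and the function names (`is_reachable`, `pick_live_server`, `repoint_config`) are made up for this example.

```python
import json
import socket
from urllib.parse import urlparse


def is_reachable(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    port = parsed.port or 8091  # assumption: default Couchbase Server admin port
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_live_server(urls, probe=is_reachable):
    """Return the first URL that answers the probe, or None if none do."""
    for url in urls:
        if probe(url):
            return url
    return None


def repoint_config(config, db_name, new_url):
    """Return a copy of the config with the database's "server" replaced."""
    updated = json.loads(json.dumps(config))  # deep copy via JSON round trip
    updated["databases"][db_name]["server"] = new_url
    return updated
```

A wrapper script would load the config file, call `pick_live_server` with all known server URLs, write the result of `repoint_config` back to disk if the server changed, and then restart Sync Gateway using whatever service manager the host uses.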

Caroline · August 21, 2017, 7:51pm

@traun Thanks for the reply.

Unfortunately, our sync gateway was pointed to a server node that did not go down.
In fact, we had two sync gateways running. When the server node went down, both gateways stopped replicating (the same happened when it failed over).
When we started rebalancing we saw the

“new configuration for bucket …”

and replication either never came through, or was incredibly slow. Eventually, after rebalancing was nearly complete, the replication started coming through.

So, even though the gateway was not pointed to a server node that went down, it still stopped replicating when that node failed in the cluster…

In other words, if any of our server nodes in the cluster fail, all replication through the sync gateway stops. We need to figure out some way around this.

Thanks,
Caroline

traun · August 21, 2017, 8:37pm

Thanks for the update.

It sounds like the same issue that was reported in https://github.com/couchbase/sync_gateway/issues/2576, which should be fixed in Sync Gateway 1.5. If you have a way to test/reproduce this at will, it would be helpful if you could verify that the 1.5 beta 2 build fixes your issue.

Given what you reported, I think the following workaround might work:

1. Create your own monitoring script that monitors all Couchbase Server URLs.
2. If a node is detected to be down, restart Sync Gateway.
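
One possible way to detect a down node is to poll the Couchbase Server REST endpoint `/pools/default` on a healthy node and inspect the `nodes` array. The sketch below assumes each node entry carries `hostname`, `status`, and `clusterMembership` fields; the helper names are hypothetical, and the fetch and restart steps are left to the surrounding script.

```python
def unhealthy_nodes(pools_default):
    """List hostnames of nodes not reported as healthy, active members."""
    down = []
    for node in pools_default.get("nodes", []):
        healthy = node.get("status") == "healthy"
        active = node.get("clusterMembership") == "active"
        if not (healthy and active):
            down.append(node.get("hostname", "<unknown>"))
    return down


def should_restart(previous_down, current_down):
    """Restart only when a node has newly gone down since the last check."""
    return bool(set(current_down) - set(previous_down))
```

Comparing against the previous poll avoids restarting Sync Gateway on every cycle while the same node stays down.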

Since the only pointers to the failed Couchbase Server node should be in memory at this point, restarting Sync Gateway should have the effect of clearing out that memory. When it restarts and connects to the healthy node to pull the cluster map, that failed node should no longer be in the cluster map and there should be no more errors such as “Error connecting to tap feed of IP:: dial tcp IP: getsockopt: connection timed out”.

Caroline · August 21, 2017, 10:08pm

@traun thanks.

I can confirm that restarting the gateway does not resolve the replication issues. I also tried stopping and starting the gateway.
We will test with 1.5 beta 2 and see how that works.

Thanks,
Caroline

Caroline · August 28, 2017, 10:16pm

@traun, we experience the same issues on 1.5 beta.
I’m wondering what the cleanest solution will be until we have a version of the gateway that doesn’t break when a node goes down or the cluster rebalances.

We need to take a node down for maintenance in a few weeks. This should take roughly an hour. If we don’t fail over and rebalance, gateway replication would be down for an hour while the node is worked on. Is this safe? Ideally, the node would come back up, and replication would resume.

However, if we rebalance, that’ll add about 4 more hours of no replication, so I’d like to avoid that.
Any suggestions?

Thanks,
Caroline

Caroline · August 31, 2017, 3:19pm

@traun Is this issue being worked on in the latest 1.5 beta?

Thanks

adamf · August 31, 2017, 4:43pm

@Caroline There shouldn’t be any case where a restart of the Sync Gateway node, as you described above, fails to properly restore replication behaviour.

Can you provide specifics (or even better, file a github issue) of the steps you’re taking, the expected results, and the actual results you’re seeing?

Thanks.

traun · August 31, 2017, 4:43pm

Hi @Caroline,

Can you please file a ticket via https://github.com/couchbase/sync_gateway/issues/new?

The most important thing is to include the steps to reproduce the issue. The more details you can include, the better. Also mention the specific version of Sync Gateway you are using. If you go to localhost:4985 on the Sync Gateway host, it returns the version.
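
For anyone who wants to grab the version from a script rather than a browser, a minimal sketch might look like this. It assumes the root of the admin port returns a JSON object with a top-level `version` string; the function names are illustrative only.

```python
import json
from urllib.request import urlopen


def parse_version(root_body):
    """Extract the "version" string from the JSON body of GET /."""
    return json.loads(root_body).get("version")


def fetch_version(base_url="http://localhost:4985", timeout=5.0):
    """Fetch the root document from Sync Gateway and return its version."""
    with urlopen(base_url + "/", timeout=timeout) as response:
        return parse_version(response.read().decode("utf-8"))
```

Calling `fetch_version()` on the machine running Sync Gateway should return the string to paste into the ticket, provided the response has the shape assumed above.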

There is some ongoing testing for SG 1.5 related to failover/rollback here: https://github.com/couchbaselabs/mobile-testkit/issues/979