Any way to speed up graceful failover + rebalance?

janpaulb · March 21, 2019, 9:35pm

I’m trying to figure out if there’s a way to speed up restarting a Couchbase node for maintenance. A graceful failover followed by a rebalance seems like the best option for avoiding outages or data loss while restarting nodes, but it takes a really long time on our clusters: about 30-60 minutes per node in the worst case.

We’re running Couchbase on several pretty big clusters. We’ve opted for lots of small VMs, and some of our clusters have as many as 9 machines. We also have a lot of data: about 350 million items, XDCR’ed across 5 different datacenters. I’m guessing this affects the performance of these operations. It also means that performing the graceful failover + rebalance operation on every machine in a single cluster can take up to 9 hours in the worst case, which is pretty unmanageable if we need to do restarts more than once or twice a year.
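To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the nodes are handled strictly one at a time and that each one takes somewhere between 30 and 60 minutes; the function name is only illustrative and not part of any Couchbase tooling.

```python
def rolling_restart_minutes(nodes: int, per_node_minutes: int) -> int:
    """Total time for a one-node-at-a-time failover + rebalance cycle."""
    return nodes * per_node_minutes


NODES = 9  # largest cluster size mentioned above

# Assumed per-node range: 30 minutes at the low end, 60 in the worst case.
low = rolling_restart_minutes(NODES, 30)
high = rolling_restart_minutes(NODES, 60)

print(f"{low / 60:.1f} to {high / 60:.1f} hours")  # prints "4.5 to 9.0 hours"
```

So even at the low end of that range, a full pass over one 9-node cluster is about half a working day.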

The only other option that seems to be available is simply restarting a node without doing a failover first. This at least seems to be faster, since the node rejoins the cluster without any rebalance. But is there a possibility of data loss from doing this? Are there any other downsides to performing maintenance that way?