Question on recovering a cluster

pratheej · April 6, 2017, 12:36pm

Hi,
Consider a 100-node Couchbase cluster configured with 1 replica.

Suppose a hardware failure takes down 2 Couchbase Server (CBS) nodes in the cluster at the same time.
Since only one replica is configured, data loss is expected.

Questions:

What steps are needed to recover the setup with the remaining 98 nodes, so that we can avoid further data loss if any other node fails in the future?
How can we remove the 2 failed nodes from the cluster?

Request reply.

Regards
Pratheej

drigby · April 6, 2017, 3:27pm

You’d need to fail over the problematic nodes (the first will have already been failed over if you had auto-failover enabled), and then rebalance the cluster.

Note: If you /really/ have a 100-node cluster, you probably want more than one replica, given that the chance of two nodes failing in a given time window is high.
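
To put a rough number on that note, here is a minimal sketch. It assumes every node fails independently with the same probability within the time window; the value 0.01 and the function name are made up for illustration and are not Couchbase figures.

```python
def prob_at_least_two_down(n_nodes: int, p_fail: float) -> float:
    """Probability that two or more of n_nodes fail within one window.

    Assumes independent failures with identical per-node probability
    p_fail (an illustrative assumption, not a measured value).
    """
    # P(no node fails)
    p_none = (1.0 - p_fail) ** n_nodes
    # P(exactly one node fails)
    p_exactly_one = n_nodes * p_fail * (1.0 - p_fail) ** (n_nodes - 1)
    # Everything else is "two or more nodes fail"
    return 1.0 - p_none - p_exactly_one


if __name__ == "__main__":
    # Compare a 10-node and a 100-node cluster at the same per-node rate.
    for n in (10, 100):
        print(n, prob_at_least_two_down(n, 0.01))
```

With the assumed 1% per-node rate, the 100-node cluster comes out at roughly 0.26 against roughly 0.004 for 10 nodes, which is why a single replica gets riskier as the cluster grows.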

pratheej · April 7, 2017, 9:02am

Thanks for the reply Drigby.

We are trying to find out the technical behaviour.

I will put my question as below, please let me know:

Consider a cluster (say 10 nodes) with only one replica configured and auto-failover enabled.

If 2 nodes in the cluster go down abruptly at exactly the same time and stay down longer than the auto-failover timeout, will failover be triggered (for one of the nodes, as you said)?
If the failover is triggered, will it complete successfully? Will a rebalance be allowed after the failover?
If all of the above steps complete successfully, how can we remove/eject the other node that went down from the cluster and then rebalance?

Note: Please assume that the 2 nodes which went down are not going to rejoin the cluster.

Regards
Pratheej

drigby · April 7, 2017, 10:34am

Your questions should be answered in the Auto-Failover documentation

pratheej · April 7, 2017, 1:01pm

Thanks for the update drigby.

From the link it is clear that when 2 nodes go off at the same time, failover will not be triggered.

It says … “Designed to failover a node only if that node is the only one down at a given time.”
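
As I read that sentence, the rule can be sketched like this. It is only a toy model of the quoted rule, not Couchbase's actual implementation; `NodeStatus`, `auto_failover_candidate` and the timeout value are hypothetical names and numbers used for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeStatus:
    name: str
    seconds_down: float  # 0 means the node is healthy


def auto_failover_candidate(nodes: List[NodeStatus],
                            timeout_s: float) -> Optional[str]:
    """Return the node to auto-failover, or None if the rule does not apply.

    Toy version of the quoted rule: a node is failed over only if it is
    the only node down, and it has been down for at least timeout_s.
    """
    down = [n for n in nodes if n.seconds_down > 0]
    if len(down) != 1:
        # Zero nodes down: nothing to do. Two or more down: no auto-failover.
        return None
    node = down[0]
    return node.name if node.seconds_down >= timeout_s else None


if __name__ == "__main__":
    cluster = [NodeStatus(f"node{i}", 0) for i in range(10)]
    cluster[3].seconds_down = 120
    cluster[7].seconds_down = 120
    # Two nodes down at once: the toy rule yields no candidate.
    print(auto_failover_candidate(cluster, timeout_s=30))
```

In this model, two nodes down together never produce a candidate, so both would have to be failed over manually.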

I still have the same questions I raised when opening this topic.

What steps are needed to recover the setup with the remaining nodes, so that we can avoid further data loss if any other node fails in the future?
How can we remove the 2 failed nodes from the cluster?

Note: Please assume that the 2 nodes which went down are not going to rejoin the cluster.

Regards
Pratheej