How does Hard Failover work? Our nodejs client kept trying to hit a node that was in Pend state

alexegli · September 8, 2015, 7:00pm

We have a Couchbase cluster of two nodes, both running Enterprise v3.0.2 on Ubuntu 12.04. We recently had to upgrade the memory on one of our Couchbase Server nodes, which involved restarting the VM. Our buckets are configured with 1 replica, so I thought that when one node went down the other node would activate its replicas and take over for it. However, our Node.js client kept getting connection errors while the node was in the Pend state, and the errors stopped once the node was back up and running. Does a hard failover not happen if the node is in the Pend state, or do we have to do something special in the cluster to make sure our clients don’t try to contact nodes that have gone down?

cihangirb · September 8, 2015, 9:10pm

If you do not have auto-failover enabled, we won’t fail over a node automatically. You can still fail it over manually, but it is easier with auto-failover.
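For reference, auto-failover can also be switched on over the REST API instead of the settings page. The sketch below only builds the request rather than sending it. The `/settings/autoFailover` endpoint and its `enabled` and `timeout` form fields are written from memory of the Couchbase REST API, and the 30-second default is an assumption, so check them against the documentation for your server version. The helper name `auto_failover_request` is hypothetical.

```python
from urllib.parse import urlencode


def auto_failover_request(enabled, timeout_seconds=30):
    """Build (method, path, body) for changing the auto-failover setting.

    Assumption: the cluster accepts a form-encoded POST to
    /settings/autoFailover with 'enabled' and 'timeout' fields.
    """
    form = {"enabled": "true" if enabled else "false"}
    if enabled:
        # Seconds a node must be unresponsive before it is failed over.
        form["timeout"] = str(timeout_seconds)
    return "POST", "/settings/autoFailover", urlencode(form)


if __name__ == "__main__":
    # Send the result to http://<any-node>:8091 with admin credentials.
    print(auto_failover_request(True))
```

Couchbase's documentation has stated that auto-failover needs at least three nodes, so it is worth confirming whether it would trigger at all on a two-node cluster.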

In a situation like yours, it is easy to avoid these failures in the app:

1. Fail over the node that will be restarted.
2. Restart the Couchbase Server node.
3. Once it is back online, add the node back and rebalance.

With these steps, new incoming operations after the failover are sent to the node that takes over. Once you add the node back and rebalance, the restarted node starts taking traffic again, and the Node.js app keeps working through the failover automatically, avoiding mass failures.
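The sequence above can be sketched as an ordered list of REST calls. This is a rough outline, not a tested runbook. The endpoint paths (`/controller/failOver`, `/controller/setRecovery`, `/controller/rebalance`) and their form fields are as I recall them from the Couchbase REST API, the `ns_1@<host>` naming is an assumption about how your nodes are registered, and the helper names and IP addresses are made up for illustration.

```python
from urllib.parse import urlencode


def otp_name(host):
    # Assumption: nodes are registered under the usual "ns_1@<host>" name.
    return "ns_1@" + host


def maintenance_plan(restart_host, all_hosts):
    """Return the steps, in order, for a planned restart of one node."""
    node = otp_name(restart_host)
    known = ",".join(otp_name(h) for h in all_hosts)
    return [
        # 1. Hard failover: replicas on the other node are promoted to active.
        ("POST", "/controller/failOver", {"otpNode": node}),
        # 2. Done outside the REST API (VM restart, memory upgrade, ...).
        ("MANUAL", "restart the Couchbase Server node", {}),
        # 3. Mark the failed-over node for recovery (adds it back) ...
        ("POST", "/controller/setRecovery",
         {"otpNode": node, "recoveryType": "delta"}),
        # ... and rebalance so it starts taking traffic again.
        ("POST", "/controller/rebalance",
         {"knownNodes": known, "ejectedNodes": ""}),
    ]


if __name__ == "__main__":
    hosts = ["10.0.0.1", "10.0.0.2"]  # hypothetical addresses
    for method, target, form in maintenance_plan("10.0.0.2", hosts):
        print(method, target, urlencode(form))
```

Each POST would go to port 8091 on a node that stays up, with administrator credentials. The same steps are available through the web console buttons.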
thanks
-cihan

alexegli · September 8, 2015, 9:14pm

The node restarted because the VM ran out of memory and crashed. I then kept the node down and added more memory, but I thought couchbase would handle cases where a node dies in a cluster and becomes unresponsive. Is enabling auto-failover as simple as going to that Auto-Failover page in settings and clicking the enable checkbox? Are there any downsides to enabling it, or anything we need to do in preparation for it? Will the act of enabling it temporarily take down the cluster or impact the cluster in any way?