Auto-Failover not working with 5.1.1?

carl.duranleau · May 8, 2019, 6:20pm

Hi everyone,

I may need to have a better understanding of the Couchbase failover behavior, but here’s my situation:

We have 3 identical Couchbase server nodes. Before the upgrade to 5.1.1, we were using our own load-balancing solution to handle a node failures by simply not using the failed node until it gets back online. With 5.1.1, our solution doesn’t work anymore because the new Java client isn’t working fine when using a load-balancer address in the connection URI list. So we are using each node’s IP instead. Everything works fine. The only problem is the failover mechanism. When one of the nodes gets killed, or simply crashes, the Java client keeps sending request to the dead node. On the Couchbase web interface, we can see a message under the failing node stating that this server is not taking requests and that it can be failed-over. The Failover button is displayed and can be pressed to kill the node. But this is a manual operation. Not really a good production environment solution to have an administrator in front of a screen 24/7 to hit the Failover button in case something goes wrong. We have thousands of requests per second and 80% of these requests need to save or retrieve data from Couchbase. Since the requests keep hitting the dead node, our services get almost unusable.

I read the documentation and tried to activate the Auto-Failover option with the minimum delay of 30 seconds, but this doesn’t do anything. The documentation says that the Auto-Failover is triggered only on a minimum of 3 nodes. I don’t know if it means that 3 nodes must be alive (not my case here since I have 1 dead and 2 remaining nodes), or it means that the cluster must contain at least 3 nodes.

There’s must be a real production environment solution to set a real failover mechanism that simply removes the not responding node from the pool and stop sending requests to it until it gets back.

What are my options?

EDIT: Tried with 4 nodes and the result is the same. The failover mechanism does nothing and requests keep getting sent to the dead node.

carl.duranleau · May 10, 2019, 2:42pm

Looks like the failover mechanism is not working when there are different types of bucket in the cluster. I have 3 buckets; 2 memcached and 1 couchbase (persisted and replicated). I first tried to create all 3 buckets as memcached and the failover mechanism started working as expected. I then tried to create 2 clusters; one with my 2 memcached buckets and one with my persisted bucket. Surprisingly, the failover mechanism was working fine on both clusters.

poonam · May 13, 2019, 4:40pm

Autofailover requires at least 3 data nodes in the cluster. And, in 5.1.1, it will failover only when exactly one node has failed.
It works irrespective of types of buckets in the cluster.
As mentioned in the doc, there are various conditions that can prevent an auto-failover. We need logs to investigate.
If this is still an issue, can you please open a jira ticket with logs from when auto-failover was enabled but did not work?

carl.duranleau · May 13, 2019, 7:33pm

Thanks for your answer poonam. I’ve done my first tests with only two nodes, but upgraded to four nodes after reading the auto-failover documentation. That did not solve the problem, but when moving to all ‘persisted’, and then, to all in-memory buckets, everything started to work as expected. If we decide to keep our current configuration and want to activate the auto-failover mechanism, we’ll push the investigation further by providing logs and all needed technical information.

After all these tests, we have decisions to take for the next step. We may finish with only in-memory buckets to get faster recovery and faster crash management. For now, the persisted bucket almost kills the application environment when a crashed node is added back to the cluster. The rebalancing process slows everyone down and the system gets almost unusable. With a working auto-failover mechanism, the rebalancing would not be required to be done immediately because the replication already allows other nodes to answer correctly. Without the failover, the Couchbase Client always tries to hit the bad node. I really don’t understand why the client is unable to detect a replicated node then simply retry on another node before failing. I can’t find anything in the API to catch Timeouts and retry on the next replicated node.

Before upgrading our CB version to 5.1.1, our own generic load-balancer front service was way more efficient at handling node failures. It was faster, was supporting any number of nodes and did not rely on a complex logic to validate the node activity.