Failure of a specific node in a cluster causes an infinite rebalance loop

Greetings,

I am using Couchbase Operator 1.2 to manage a Couchbase Enterprise-6.0.1 cluster in Google Kubernetes Engine. Currently I have a 5 node cluster with 2 index/data nodes and 3 “everything else” nodes. I have 5 buckets, one of which is a memcached bucket, and each of these has 2 replicas. I have 7 indexes of varying complexity, each of which has 1 replica.

I have been testing node failure by deleting a server pod inside the Kubernetes cluster and letting the operator recreate the server and rebalance the cluster. If I remove one of the index nodes, the server pod is recreated, the persistent volume is reattached, the node is marked for delta recovery, and the rebalance runs to completion. The same is true for 2 of the “everything else” nodes. I can replace 4 of the 5 nodes, granted 1 at a time, without any significant issue or downtime.

However, if the server that fails is numbered 0000, the rebalance either gets stuck or fails, and then continues to fail as the operator keeps trying to rebalance the cluster. It doesn’t always get stuck at the same percentage of completion. I’ve seen it get to 70%, and I’m currently watching a cluster that’s been stuck at 55% rebalanced for almost an hour, for a 19MB database.

The database itself does continue to function, but if I lose another server while it is in this state, the cluster will never fully recover. I also have some evidence that backups are being prevented while the database is in this state but as yet I have not thoroughly tested that.

I would like to know what is causing this behaviour. I need to know how to recover from it.

Hi Tyrell,
Thanks for testing out the Operator. Could you provide a capture of the logs from the cluster while the rebalance is hanging?
Refer -> https://docs.couchbase.com/operator/1.2/cbopinfo.html
It’s not exactly clear to me whether the operator and the k8s networking/DNS have anything to do with this, or whether it is Couchbase itself. Thanks!

Hi Tommie,
I’ve uploaded the results of cbopinfo --collectinfo --system to DropBox:

Thanks!
I’m seeing some errors on the analytics node that I suspect to be the cause of the rebalance hanging. Just after node 0001 goes down, node 0000 reports ‘failed to sync’ due to ‘connection refused’, and this continues even after node 0001 is recovered and during the recovery attempt of 0000:

2019-06-27T18:22:39.721Z ERRO CBAS.api.PartitionReplica [Executor-35:5d37cc4c1653513f42831000c5aa79a2] Failed to sync replica {"id":"0@cb-cluster-member-0001.cb-cluster-member.default.svc:9120","status":"CATCHING_UP"}
java.nio.channels.UnresolvedAddressException: null

I’ll pass this along to the analytics team to have a look as well. Meanwhile, I suggest trying without the analytics service and seeing whether that fixes the issues with rebalancing.
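
For anyone reproducing this: a Java UnresolvedAddressException means the hostname could not be turned into an IP address, which is a name-resolution failure rather than a refused connection. A quick way to check that from another pod is sketched below. This is a minimal sketch, not something taken from the logs: the helper name `resolves` is made up for illustration, and it assumes Python is available in a pod in the same namespace.

```python
import socket


def resolves(host: str, port: int) -> bool:
    """Return True if the hostname can be resolved to at least one address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Name resolution failed (e.g. the pod's DNS record does not exist yet).
        return False
```

Calling `resolves("cb-cluster-member-0001.cb-cluster-member.default.svc", 9120)` with the host and port from the log line above shows whether the replica's DNS name is visible from where the check runs. A False result while the pod is up would point at Kubernetes DNS rather than at Couchbase.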

Thank you for taking a look. I tried standing up the cluster with no analytics nodes, then I deleted node 0001 and it recovered and rebalanced, which I expected. Then I dropped node 0000 and it was able to complete a rebalance after that, so I think you may have hit the nail on the head there. I’m going to run a few more drop and rebuild tests overnight and I will let you know the results.

Overnight I set the cluster to lose a node, wait for a rebuild, lose another node, etc until all 5 nodes had been recycled. Each time, the rebalance completed. The rebalance also completed faster with each iteration, which was unexpected. What I did notice that I’m a little nervous about is that the first rebalance took over 14 minutes on a database of less than 1000 items. To be fair, I’m new to Couchbase, so maybe that’s normal? Is that longer than you’d expect for such a tiny DB, or am I just jumping at shadows?
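
The overnight test described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was actually used: `recycle_nodes`, `wait_for`, and `kubectl_delete_pod` are hypothetical names, `kubectl` is assumed to be on the PATH and pointed at the right cluster, and `is_rebalancing` is a callable you supply yourself (for example, one that polls the server's rebalance status).

```python
import subprocess
import time
from typing import Callable, Dict, Iterable, Optional


def kubectl_delete_pod(name: str, namespace: str = "default") -> None:
    # Assumption: kubectl is installed and configured for the target cluster.
    subprocess.run(["kubectl", "delete", "pod", name, "-n", namespace], check=True)


def wait_for(predicate: Callable[[], bool], timeout_s: float, poll_s: float) -> bool:
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return False


def recycle_nodes(
    pods: Iterable[str],
    delete_pod: Callable[[str], None],
    is_rebalancing: Callable[[], bool],
    timeout_s: float = 3600.0,
    poll_s: float = 10.0,
) -> Dict[str, Optional[float]]:
    """Delete pods one at a time and record how long each recovery took.

    A value of None means the rebalance did not start or did not finish
    within timeout_s.
    """
    results: Dict[str, Optional[float]] = {}
    for pod in pods:
        delete_pod(pod)
        started = time.monotonic()
        # First wait for the operator to begin rebalancing, then for it to end.
        began = wait_for(is_rebalancing, timeout_s, poll_s)
        done = began and wait_for(lambda: not is_rebalancing(), timeout_s, poll_s)
        results[pod] = time.monotonic() - started if done else None
        if not done:
            break  # never take down a second node while the first is unrecovered
    return results
```

Stopping at the first stuck rebalance matters here, since losing a second node in that state is exactly the case reported earlier in the thread as unrecoverable.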

https://developer.couchbase.com/documentation/server/3.x/admin/Tasks/rebalance-questions.html

Rebalance speed is really tied to available resources like RAM, network, and disk speed. In your GKE setup, how many nodes do you have set up? You do have 4 CB buckets with two replicas. That means each rebalance would have to recreate 2048x4 vbuckets, so this initial “lag” could be tied to disk I/O.
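
To show where the 2048x4 figure comes from, here is the arithmetic as a small sketch. It assumes the default of 1024 vBuckets per Couchbase bucket; the function name `vbucket_copies` is made up for illustration, and the memcached bucket is left out because it has no replicas.

```python
VBUCKETS_PER_BUCKET = 1024  # assumed Couchbase Server default


def vbucket_copies(buckets: int, replicas: int) -> dict:
    """Count active and replica vBucket copies across all buckets."""
    active = buckets * VBUCKETS_PER_BUCKET
    replica = active * replicas
    return {"active": active, "replica": replica, "total": active + replica}


counts = vbucket_copies(buckets=4, replicas=2)
print(counts)  # {'active': 4096, 'replica': 8192, 'total': 12288}
```

The 8192 replica copies are the 2048x4 mentioned above. Together with the 4096 active vBuckets, the cluster tracks 12288 vBucket copies, even though the data set itself is tiny.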

Thanks Tim. I will try spinning the cluster up on SSDs to test this.
However, as we are using persistent disks, shouldn’t the delta recovery be preventing a bottleneck on disk IO?

I would allocate more RAM to the data nodes. I recall that you had a lot more free RAM per node. Delta recoveries typically are faster.