I am using Couchbase Operator 1.2 to manage a Couchbase Server Enterprise 6.0.1 cluster in Google Kubernetes Engine. Currently I have a 5-node cluster with 2 index/data nodes and 3 nodes running everything else. I have 5 buckets, one of which is memcached, and each of these has 2 replicas. I have 7 indexes of varying complexity, each of which has 1 replica.
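For reference, the topology looks roughly like the sketch below. This is a trimmed illustration following the Operator 1.x `CouchbaseCluster` schema, not my actual manifest: the names, memory quotas, and auth secret are placeholders, the service list on the second group is illustrative, and the bucket and `volumeClaimTemplates` (persistent volume) sections are omitted.

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: couchbase.com/v1
kind: CouchbaseCluster
metadata:
  name: cb-cluster
spec:
  baseImage: couchbase/server
  version: enterprise-6.0.1
  authSecret: cb-cluster-auth        # placeholder secret holding admin credentials
  cluster:
    dataServiceMemoryQuota: 1024     # MB; values here are illustrative only
    indexServiceMemoryQuota: 1024
    searchServiceMemoryQuota: 512
    autoFailoverTimeout: 120
  servers:
    - name: index-data               # the 2 index/data nodes
      size: 2
      services:
        - data
        - index
    - name: everything-else          # the 3 remaining nodes
      size: 3
      services:
        - query
        - search
  # buckets and volumeClaimTemplates omitted for brevity
EOF
```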
I have been testing node failure by deleting a server pod inside the Kubernetes cluster and letting the operator recreate the server and rebalance the cluster. If I remove one of the index nodes, the server pod is recreated, the persistent volume is reattached, the node is marked for delta recovery, and the rebalance runs to completion. The same is true for 2 of the 3 “everything else” nodes. I can replace 4 of the 5 nodes, granted one at a time, without any significant issue or downtime.
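The test itself is just deleting a pod and watching the operator react. The pod name below is illustrative (the operator numbers pods `<cluster-name>-NNNN`), and I am assuming the default `couchbase-operator` deployment name:

```sh
# Simulate a node failure by deleting one Couchbase server pod
kubectl delete pod cb-cluster-0003

# Watch the replacement pod come up, then follow the operator's recovery and rebalance activity
kubectl get pods -w
kubectl logs -f deployment/couchbase-operator
```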
However, if the server that fails is the pod numbered 0000, the rebalance either gets stuck or fails, and then continues to fail as the operator keeps retrying it. It doesn’t always get stuck at the same percentage of completion: I’ve seen it reach 70%, and I’m currently watching a cluster that has been stuck at 55% rebalanced for almost an hour, on a database that is only 19 MB.
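Rebalance progress can be checked directly against the Couchbase REST API from inside one of the healthy server pods; the pod name and credentials below are placeholders:

```sh
# Per-node rebalance progress, straight from the cluster manager
kubectl exec cb-cluster-0001 -- curl -s -u Administrator:password \
  http://localhost:8091/pools/default/rebalanceProgress
```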
The database itself does continue to function, but if I lose another server while it is in this state, the cluster never fully recovers. I also have some evidence that backups are prevented while the database is in this state, but I have not yet tested that thoroughly.
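If the stuck rebalance is still registered as a running task, that would be consistent with backups being blocked; the cluster’s task list can be inspected the same way (same placeholder pod name and credentials):

```sh
# List running cluster tasks; an in-flight rebalance shows up here
kubectl exec cb-cluster-0001 -- curl -s -u Administrator:password \
  http://localhost:8091/pools/default/tasks
```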
I would like to know what is causing this behaviour, and I need to know how to recover from it.