CRITICAL: Couchbase Cluster Stuck in Rebalance

ash · July 20, 2017, 1:51am

We have a 3-node cluster (CB 4.0 Community Edition) running on AWS m3.2xlarge instances.

The cluster has been running for a year without any issues and holds 100 million documents. The bucket size is about 50GB.

Today one of the nodes went into a pending state. We did a soft failover, added a new server of equivalent size running CB 4.5 to the cluster, and rebalanced. Everything ran smoothly.

Later on, we attempted to add another new, equivalent CB 4.5 server to the cluster, using the original Elastic IP (EC2 public DNS hostname) of the failed-over node.

On adding this node, the other new CB 4.5 node went into a yellow pending state, and the cluster now appears stuck at 0% rebalance.
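
For anyone checking the same symptom, here is a minimal sketch of reading the rebalance state over REST instead of from the UI. It assumes the /pools/default/tasks endpoint on port 8091 returns a JSON list in which the rebalance entry carries "type", "status" and "progress" fields; that is my understanding of the response shape, not something confirmed on this cluster. The host, user and password arguments are placeholders.

```python
import base64
import json
import urllib.request


def rebalance_summary(tasks):
    """Return (status, progress) for the rebalance entry in a task list.

    Assumes each task is a dict and the rebalance task has type "rebalance".
    Returns (None, None) if no such entry is present.
    """
    for task in tasks:
        if task.get("type") == "rebalance":
            return task.get("status"), task.get("progress")
    return None, None


def fetch_tasks(host, user, password, port=8091):
    """GET /pools/default/tasks from one node using HTTP basic auth."""
    req = urllib.request.Request(f"http://{host}:{port}/pools/default/tasks")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode())
```

Calling rebalance_summary(fetch_tasks(host, user, password)) against each node in turn would show whether the nodes agree that a rebalance is running and whether the reported progress ever moves off 0.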

All CLI attempts to fail over either of the problem nodes result in ["Unexpected server error, request logged."].

Currently the Buckets page shows that the bucket is still there and the document count is slowly increasing (we still have one application server pointed at the cluster, and 60% of its write operations are timing out).

We have also tried to XDCR the cluster to a new cluster we have set up, but XDCR appears to be stuck too.

As a precaution, I have backed up (copied) the raw data files from each of the servers, in case that is worth anything.

Can anyone please advise as soon as possible?

Sincerely
Ash.