After adding a new node to a near-full Couchbase cluster, rebalance hangs

Hi,

We run an 8-node Couchbase cluster that had been working well until recently, when we found RAM usage was too high and clients started to see temporary errors. We figured it was time to add a new node to the cluster.

Adding the new node was straightforward, and I started the rebalance once the node had joined the existing cluster. The old nodes' RAM usage is over 95% most of the time, and disk usage is about 190GB on each node. The new node's RAM usage is about 10% and its disk usage is 1.25GB. Rebalance progress has stayed at 0% for a long time on all running nodes.
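
For reference, the equivalent CLI steps to add the node and start/check the rebalance look roughly like this (HOST_OLD, HOST_NEW and PASSWORD are placeholders; this assumes the default admin port 8091):

/opt/couchbase/bin/couchbase-cli server-add -c HOST_OLD:8091 -u Administrator -p PASSWORD --server-add=HOST_NEW:8091 --server-add-username=Administrator --server-add-password=PASSWORD
/opt/couchbase/bin/couchbase-cli rebalance -c HOST_OLD:8091 -u Administrator -p PASSWORD
/opt/couchbase/bin/couchbase-cli rebalance-status -c HOST_OLD:8091 -u Administrator -p PASSWORD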

On the new node, there are no active items, only replica items:
/opt/couchbase/bin/cbstats HOST_NEW:11210 -b default all|grep curr
curr_connections: 22
curr_conns_on_port_11209: 15
curr_conns_on_port_11210: 5
curr_items: 0
curr_items_tot: 1071476
curr_temp_items: 0
vb_active_curr_items: 0
vb_pending_curr_items: 0
vb_replica_curr_items: 1071476

Here are some findings to help debug the problem:

bin/couchbase-cli server-list
shows all nodes are up and running.

bin/cbhealthchecker
also runs successfully.
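
The full invocations were along these lines (host and credentials are placeholders):

/opt/couchbase/bin/couchbase-cli server-list -c HOST_NEW:8091 -u Administrator -p PASSWORD
/opt/couchbase/bin/cbhealthchecker -c HOST_NEW:8091 -u Administrator -p PASSWORD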

So I think the new node is configured correctly.

/opt/couchbase/bin/cbstats HOST:11210 -b default tap | grep backfill
On the new node, it returns:
ep_tap_queue_backfillremaining: 0
On one of the old nodes, it returns that line plus entries such as:
eq_tapq:replication_ns_1@{HOST_X}:queue_backfillremaining: 0
eq_tapq:replication_ns_1@{HOST_Y}:queue_backfillremaining: 0
etc.

I have checked the free disk space and the data/index directory permissions, so disk is probably not the problem.
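
Concretely, the checks were something like this (the data path below is the default install location; ours may differ slightly):

df -h /opt/couchbase/var/lib/couchbase/data
ls -ld /opt/couchbase/var/lib/couchbase/data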

The error log and debug log on the new node do not help much. The only relevant error entry I see is:

Connection attempt from disallowed node 'ns_1@{HOST_X}' ** 

I have checked the “otpCookie” field in the bin/couchbase-cli server-info output and see no problem there.
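
Roughly, I compared the cookie across nodes like this (hosts and credentials are placeholders):

/opt/couchbase/bin/couchbase-cli server-info -c HOST_OLD:8091 -u Administrator -p PASSWORD | grep otpCookie
/opt/couchbase/bin/couchbase-cli server-info -c HOST_NEW:8091 -u Administrator -p PASSWORD | grep otpCookie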

Now I wonder whether the rebalance will ever finish. What is blocking the rebalance progress? Is rebalancing a near-full cluster supported?

We are running Couchbase 2.1.1 Community Edition (build-764-rel) on Ubuntu.