Rebalance stuck at 0% and does not cancel

#1

Hi,

We’re using the 2.5.1 Community edition. After one of our 24 nodes was failed over due to network problems, I added it back and started a rebalance. The rebalance has been at 0% for 4 days so far, and it does not respond to cancel. The UI logs are full of these (useless?) messages: “Metadata overhead warning. Over 62% of RAM allocated to bucket “XX” on node “XXXX” is taken up by keys and metadata.”

Please help as this is a serious problem for us.

#2

Update: the UI popup says “Rebalancing 0 nodes” even though we have 24 nodes in the cluster.

#3

There is no such thing as a useless log, even if it might be irrelevant to your current situation. @pvarley can surely help you on that.

#4

Sorry for saying that; we were just overwhelmed by the number of such messages.

Update: the command-line client did not help us:

/opt/couchbase/bin/couchbase-cli rebalance-status --cluster=XXX:8091 --user=Administrator --password=XXX
(u'running', None)
/opt/couchbase/bin/couchbase-cli rebalance-stop --cluster=XXX:8091 --user=Administrator --password=XXX
SUCCESS: rebalance cluster stopped
/opt/couchbase/bin/couchbase-cli rebalance-status --cluster=XXX:8091 --user=Administrator --password=XXX
(u'running', None)
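When the CLI keeps reporting the same status, it can help to script around the underlying REST endpoint, `GET /pools/default/rebalanceProgress`, instead. Below is a minimal sketch in Python; the endpoint path is real, but the exact response shape — a `"status"` field plus per-node entries like `{"progress": <0.0–1.0>}` — is an assumption about 2.x-era servers, so verify it against your cluster before relying on it.

```python
import json
from urllib.request import Request, urlopen  # only needed for the live call


def summarize_rebalance(progress):
    """Summarize a decoded /pools/default/rebalanceProgress response.

    Assumed shape: {"status": "running", "<node>": {"progress": 0.23}, ...}
    Returns (status, average per-node progress as a float in [0, 1]).
    """
    status = progress.get("status", "unknown")
    # Treat every dict-valued entry as a node's progress record.
    nodes = {k: v for k, v in progress.items() if isinstance(v, dict)}
    if not nodes:
        return status, 0.0
    avg = sum(n.get("progress", 0.0) for n in nodes.values()) / len(nodes)
    return status, avg


# Live call against your cluster (host and credentials are placeholders):
# req = Request("http://XXX:8091/pools/default/rebalanceProgress")
# ... add basic-auth headers, then: summarize_rebalance(json.load(urlopen(req)))
```

A per-node average stuck at 0.0 while `"status"` stays `"running"` matches the symptom described in this thread.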

#5

Hi @ksafonov, could you check the version number for your cluster? We don’t have a 2.5.1 Community edition, so you may be using an unreleased build. If we can identify the version, there may be a workaround we can suggest.
thanks
-cihan

#6

Sorry, my mistake. The version is 2.2.0.

#7

Hi guys, is there any workaround for our case? Rebalance is still at 0%…

#8

Update:

There’s a message in the UI logs that is probably related to my cancel attempts:
Server error during processing: ["web request failed",
{path,"/controller/stopRebalance"},
{type,exit},
{what,
{noproc,
{gen_fsm,sync_send_event,
[{global,ns_orchestrator},
stop_rebalance]}}},
{trace,
[{gen_fsm,sync_send_event,2},
{menelaus_web,handle_stop_rebalance,1},
{request_throttler,do_request,3},
{menelaus_web,loop,3},
{mochiweb_http,headers,5},
{proc_lib,init_p_do_apply,3}]}]

@ldoguin @cihangirb guys, any suggestions for us?

#9

Kiril reached me via gtalk (I don’t know how he found me) and I agreed to help him. After looking at his logs I found that he was hit by that famous Erlang master election issue. After searching my gmail history I found a similar case and advised him to restart ns_servers via the Erlang web shell, which “unstuck” the rebalance.

He then had another issue with name resolution (at least one of the machines was unable to resolve another machine, likely the one being added). So I think the case is closed now.
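For anyone wondering what “restart ns_servers via the Erlang web shell” looks like in practice: Couchbase exposes a `/diag/eval` endpoint that evaluates Erlang on a node, and `erlang:halt().` (used later in this thread) hard-stops that node’s Erlang VM. The node names below are placeholders, and the script is a deliberate dry run that only prints the commands; executing them for real causes an outage on each node, so treat this as a last-resort sketch, not an official procedure.

```shell
#!/bin/sh
# Placeholder node list -- substitute your own cluster hostnames.
NODES="cb-node-01 cb-node-02"

for node in $NODES; do
  # Build the command first so it can be inspected before running.
  # erlang:halt() hard-stops the Erlang VM on that node (outage!).
  cmd="curl -s -X POST -u Administrator:password http://${node}:8091/diag/eval --data 'erlang:halt().'"
  # Dry run: print the command; use `eval "$cmd"` to execute for real.
  echo "$cmd"
done
```

Running the printed commands one node at a time, and watching the cluster recover between nodes, is safer than hitting every node at once.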

#10

I happily confirm that Alexey’s advice helped us, and I would like to express my sincere gratitude to @alkondratenko for the support!

#11

What are the steps to resolve this issue? On Couchbase 4.0.0 Community edition we encountered pretty much the exact same scenario and are stuck with the exact same issue.

In our case we have 6 nodes in the cluster; one went down temporarily and auto-failover kicked in. When the node came back online we added it back to the cluster and rebalanced. Now we are stuck with “Rebalancing 0 nodes”, and neither stopping the rebalance nor restarting the nodes works.

Is this an issue since 2.5.1 through 4.0.0?

#12

I’ve tried several ways to work around this now, including:

  1. Taking down the node that the failover previously happened on.
  2. Adding this node back in.
  3. Adding a completely new node to the cluster.

None of these work, and no option in the UI works either. This seems like a critical bug to have across server versions 2.5.1 through 4.0.0, since there appears to be no immediate workaround available to the user. Why is the “Stop Rebalance” button broken?

#13

So… I finally found a workaround: Crash the cluster.

Basically I took down nodes one by one until the number of downed nodes exceeded the number of replicas by 1, then brought the nodes back up.

I’m really curious what the technical details behind this bug are that cause the rebalance to get stuck at 0 nodes. There have been reports of this bug in other forums as well.

We were going to look into our company’s policy on enterprise edition licenses and have been using the community edition for prototyping, but since this bug appears to be in the Erlang layer, would it occur in the enterprise edition as well?

Very curious on this one since getting stuck in such a state with crashing the cluster as the only recourse is highly undesirable behavior. I wouldn’t want to go to production with an issue like this present.

#14

Hello

Just had the same issue last night and I was wondering whether you could perhaps provide the commands to “restart ns_servers via erlang web shell”? I’m guessing the command will run on each node - how does this impact a running cluster?

Thanks!

#15

I have the exact same issue… I have a cluster of 18 nodes, and node 004 had a network issue that forced me to do a hard failover on it. After a reboot, with the network link fixed, the node was added back with “delta-recovery” and a rebalance was started.
After 3 days of no progress (stuck at “rebalancing 0 nodes”) I figured I needed to fix it myself before it came crashing down hard.

So I ran the following command on all the cluster nodes in parallel:
curl -X POST -u Administrator:Password http://localhost:8091/diag/eval --data 'erlang:halt().'

At that point the nodes were all in a standby state, and in the “Data Buckets” tab all buckets had a yellow pie in the “data nodes” column.
I quickly did a hard failover of my problematic node 004, and the cluster and all buckets went back to ready in less than a minute. I have over 9 billion items in there, so there was certainly no “warm up” involved.

So I rebooted node 004 again and once it came back up, I tried a full recovery this time.
Rebalance is now in progress (I can see it progressing) and I’ll update later on success or failure.

Update:
I’m not sure when the rebalance finished, but the cluster is now healthy with all 18 nodes in.