Problems with swap rebalance

Hi,

I’m trying to do a swap rebalance to upgrade the servers in the cluster. I’m running 2.2.0 with the default configuration.

I’m running on 3 m1.medium instances in AWS and would like to upgrade them. However, when I try to do the rebalance, it runs for a little while and then fails with error messages like:

“Rebalance failed. See logs for detailed reason. You can try rebalance again.”

In the logs I find messages like:

Rebalance exited with reason
    {badmatch,
     [{<0.27282.134>,
       {{badmatch,{error,nxdomain}},
        [{ns_replicas_builder_utils,kill_a_bunch_of_tap_names,3},
         {misc,try_with_maybe_ignorant_after,2},
         {gen_server,terminate,6},
         {proc_lib,init_p_do_apply,3}]}}]}
ns_orchestrator002   ns_1@production.couchbase.node.4   13:55:12 - Thu Nov 14, 2013

<0.27255.134> exited with
    {badmatch,
     [{<0.27282.134>,
       {{badmatch,{error,nxdomain}},
        [{ns_replicas_builder_utils,kill_a_bunch_of_tap_names,3},
         {misc,try_with_maybe_ignorant_after,2},
         {gen_server,terminate,6},
         {proc_lib,init_p_do_apply,3}]}}]}

or

Rebalance exited with reason
    {unexpected_exit,
     {'EXIT',<0.28389.133>,
      {badmatch,
       [{'EXIT',
         {shutdown,
          {gen_server,call,[<18927.20805.0>,had_backfill,30000]}}},
        {'EXIT',
         {{badmatch,{error,nxdomain}},
          {gen_server,call,[<12941.16232.16>,had_backfill,30000]}}}]}}}
ns_orchestrator002   ns_1@production.couchbase.node.4   13:50:17 - Thu Nov 14, 2013

<0.28352.133> exited with
    {unexpected_exit,
     {'EXIT',<0.28389.133>,
      {badmatch,
       [{'EXIT',
         {shutdown,
          {gen_server,call,[<18927.20805.0>,had_backfill,30000]}}},
        {'EXIT',
         {{badmatch,{error,nxdomain}},
          {gen_server,call,[<12941.16232.16>,had_backfill,30000]}}}]}}}

or

Rebalance exited with reason
    {unexpected_exit,
     {'EXIT',<0.24050.133>,
      {badmatch,
       [{'EXIT',
         {noproc,
          {gen_server,call,[<18927.20351.0>,had_backfill,30000]}}}]}}}
ns_orchestrator002   ns_1@production.couchbase.node.4   13:49:43 - Thu Nov 14, 2013

<0.24040.133> exited with
    {unexpected_exit,
     {'EXIT',<0.24050.133>,
      {badmatch,
       [{'EXIT',
         {noproc,
          {gen_server,call,[<18927.20351.0>,had_backfill,30000]}}}]}}}

or

Rebalance exited with reason
    {{{badmatch,[{<18927.7378.0>,noproc}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
       {misc,try_with_maybe_ignorant_after,2},
       {gen_server,terminate,6},
       {proc_lib,init_p_do_apply,3}]},
     {gen_server,call,[<0.8013.132>,get_replicators,infinity]}}
ns_orchestrator002   ns_1@production.couchbase.node.4   13:39:45 - Thu Nov 14, 2013

<0.7999.132> exited with
    {{{badmatch,[{<18927.7378.0>,noproc}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
       {misc,try_with_maybe_ignorant_after,2},
       {gen_server,terminate,6},
       {proc_lib,init_p_do_apply,3}]},
     {gen_server,call,[<0.8013.132>,get_replicators,infinity]}}
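
Several of these errors contain {error,nxdomain}, which as far as I can tell means an Erlang DNS lookup failed. To rule out name resolution problems I have been checking that every node hostname resolves; a minimal sketch of that check (the hostnames below are placeholders for my actual node names):

    import socket

    # Placeholder hostnames for the cluster nodes.
    NODES = [
        "production.couchbase.node.4",
        "production.couchbase.node.5",
        "production.couchbase.node.6",
    ]

    for host in NODES:
        try:
            # The same kind of lookup that nxdomain reports as failing.
            print(host, "->", socket.gethostbyname(host))
        except socket.gaierror as exc:
            print(host, "-> FAILED:", exc)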

The rebalances are done while the cluster is under load, and the instances are pretty close to 100% utilization.

I can get through the entire rebalance if I just start a new rebalance each time the previous one fails.
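
In effect I am doing that retry loop by hand. A rough sketch of what it amounts to against the REST API (credentials and node names are placeholders, and the endpoints are what I understand from the documentation):

    import time
    import requests

    BASE = "http://production.couchbase.node.4:8091"  # any node in the cluster
    AUTH = ("Administrator", "password")              # placeholder credentials

    # otpNode names as /pools/default reports them (placeholders).
    KNOWN = "ns_1@node1,ns_1@node2,ns_1@node3,ns_1@node4"
    EJECTED = "ns_1@node1"

    def rebalance_once():
        """Start a rebalance and block until it stops running."""
        r = requests.post(BASE + "/controller/rebalance", auth=AUTH,
                          data={"knownNodes": KNOWN, "ejectedNodes": EJECTED})
        r.raise_for_status()
        while True:
            time.sleep(10)
            progress = requests.get(BASE + "/pools/default/rebalanceProgress",
                                    auth=AUTH).json()
            if progress.get("status") != "running":
                return

    # I re-run this until the web UI shows the rebalance completed;
    # the sketch does not try to detect success programmatically.
    rebalance_once()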

Are the failures caused by high load? Could they be minimized if the clients put less pressure on the cluster? Are there other things I could do to try to minimize the failures?

Best Regards
Niels

Hello,

Another thing I find strange:

For a long time there is no progress in the rebalancing process, but I noticed that the node I’m trying to remove is doing something like 10-20k sets per second in the bucket that is currently being rebalanced. It just seems strange, but there is probably a good reason.
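
For reference, this is roughly how I am watching those numbers, by polling the bucket stats over the REST API (the endpoint and the cmd_set field are how I read the stats JSON, so treat them as assumptions):

    import time
    import requests

    BASE = "http://production.couchbase.node.4:8091"  # the node being removed
    AUTH = ("Administrator", "password")              # placeholder credentials
    BUCKET = "default"                                # placeholder bucket name

    while True:
        stats = requests.get(BASE + "/pools/default/buckets/" + BUCKET + "/stats",
                             auth=AUTH).json()
        # cmd_set holds per-second set counts in the sampled stats.
        samples = stats["op"]["samples"]["cmd_set"]
        print("sets/sec (latest sample):", samples[-1])
        time.sleep(5)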

Best Regards
Niels

Hey boldt,

I had a similar problem where my rebalance kept failing. I changed the automatic failover timeout to a higher value and the rebalance worked; then I set it back to my desired value. Have you got automatic failover configured?
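
If it helps, the timeout can be changed over the REST API; this is roughly what I did (placeholder credentials and values, and /settings/autoFailover is the endpoint as I recall it):

    import requests

    BASE = "http://localhost:8091"
    AUTH = ("Administrator", "password")  # placeholder credentials

    # Raise the auto-failover timeout before rebalancing...
    requests.post(BASE + "/settings/autoFailover", auth=AUTH,
                  data={"enabled": "true", "timeout": 600}).raise_for_status()

    # ...run the rebalance, then restore the original value afterwards.
    requests.post(BASE + "/settings/autoFailover", auth=AUTH,
                  data={"enabled": "true", "timeout": 120}).raise_for_status()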

Hi,

I do not have auto failover enabled.
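
I double-checked with the REST API (same caveat that the endpoint name is an assumption on my side):

    import requests

    resp = requests.get("http://localhost:8091/settings/autoFailover",
                        auth=("Administrator", "password"))  # placeholder credentials
    # "enabled" should be false when auto failover is off.
    print(resp.json())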

Best Regards
Niels