Swap rebalance failed and now REST API and admin console are not responding


#1

I have a two-node 3.0.1 cluster with ~350M objects running on m3.2xlarge instances in EC2. I tried a swap rebalance with an equivalently sized node, and the rebalance failed with the errors below. The console cannot maintain a connection to any of the nodes after initially connecting, and the REST API appears to be inaccessible. Otherwise the cluster appears to function: applications connected to it can do CRUD operations just fine. I just cannot do any admin operations, and I’m afraid to try anything else. Is there anything I can try? Any help would be greatly appreciated.

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.21262.1279>,
{bulk_set_vbucket_state_failed,
[{'ns_1@aaa.compute-1.amazonaws.com',
{'EXIT',
{{{{case_clause,
{error,
{{{badmatch,{error,badarg}},
[{dcp_replicator,init,1,
[{file,"src/dcp_replicator.erl"},
{line,48}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},
{line,304}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},
{line,239}]}]},
{child,undefined,
'ns_1@bbb.compute-1.amazonaws.com',
{dcp_replicator,start_link,
['ns_1@bbb.compute-1.amazonaws.com',
"cdi-master-catalog"]},
temporary,60000,worker,
[dcp_replicator]}}}},
[{dcp_sup,start_replicator,2,
[{file,"src/dcp_sup.erl"},{line,78}]},
{dcp_sup,
'-set_desired_replications/2-lc$^2/1-2-',
2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{dcp_sup,set_desired_replications,2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{replication_manager,handle_call,3,
[{file,"src/replication_manager.erl"},
{line,130}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]},
{gen_server,call,
['replication_manager-cdi-master-catalog',
{change_vbucket_replication,1022,
'ns_1@bbb.compute-1.amazonaws.com'},
infinity]}},
{gen_server,call,
[{'janitor_agent-cdi-master-catalog',
'ns_1@aaa.compute-1.amazonaws.com'},
{if_rebalance,<0.20994.1279>,
{update_vbucket_state,1022,replica,
undefined,
'ns_1@bbb.compute-1.amazonaws.com'}},
infinity]}}}}]}}} ns_orchestrator002 ns_1@ccc.compute-1.amazonaws.com

<0.21008.1279> exited with {unexpected_exit,
{'EXIT',<0.21262.1279>,
{bulk_set_vbucket_state_failed,
[{'ns_1@aaa.compute-1.amazonaws.com',
{'EXIT',
{{{{case_clause,
{error,
{{{badmatch,{error,badarg}},
[{dcp_replicator,init,1,
[{file,"src/dcp_replicator.erl"},
{line,48}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},
{line,304}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},
{line,239}]}]},
{child,undefined,
'ns_1@bbb.compute-1.amazonaws.com',
{dcp_replicator,start_link,
['ns_1@bbb.compute-1.amazonaws.com',
"cdi-master-catalog"]},
temporary,60000,worker,
[dcp_replicator]}}}},
[{dcp_sup,start_replicator,2,
[{file,"src/dcp_sup.erl"},{line,78}]},
{dcp_sup,
'-set_desired_replications/2-lc$^2/1-2-',
2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{dcp_sup,set_desired_replications,2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{replication_manager,handle_call,3,
[{file,"src/replication_manager.erl"},
{line,130}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]},
{gen_server,call,
['replication_manager-cdi-master-catalog',
{change_vbucket_replication,1022,
'ns_1@bbb.compute-1.amazonaws.com'},
infinity]}},
{gen_server,call,
[{'janitor_agent-cdi-master-catalog',
'ns_1@aaa.compute-1.amazonaws.com'},
{if_rebalance,<0.20994.1279>,
{update_vbucket_state,1022,replica,
undefined,
'ns_1@bbb.compute-1.amazonaws.com'}},
infinity]}}}}]}}} ns_vbucket_mover000 ns_1@ccc.compute-1.amazonaws.com
Bucket "cdi-master-catalog" rebalance appears to be swap rebalance ns_vbucket_mover000 ns_1@ccc.compute-1.amazonaws.com

The error log contains the following regarding web requests. It looks like a bad argument, but I can’t fathom how this could have gotten screwed up.

[ns_server:error,2015-06-18T22:53:48.530,ns_1@ec2-52-6-104-153.compute-1.amazonaws.com:<0.29870.1318>:menelaus_web:loop:170]Server error during processing: ["web request failed",
{path,"/pools/default/saslBucketsStreaming"},
{type,error},
{what,badarg},
{trace,
[{erlang,integer_to_list,[undefined],[]},
{ns_bucket,
'-json_map_with_full_config/3-fun-0-',3,
[{file,"src/ns_bucket.erl"},{line,527}]},
{lists,map,2,
[{file,"lists.erl"},{line,1224}]},
{lists,map,2,
[{file,"lists.erl"},{line,1224}]},
{ns_bucket,json_map_with_full_config,3,
[{file,"src/ns_bucket.erl"},{line,519}]},
{menelaus_web_buckets,
'-handle_sasl_buckets_streaming/2-fun-1-',
3,
[{file,"src/menelaus_web_buckets.erl"},
{line,343}]},
{lists,map,2,
[{file,"lists.erl"},{line,1224}]},
{menelaus_web_buckets,
'-handle_sasl_buckets_streaming/2-fun-2-',
2,
[{file,"src/menelaus_web_buckets.erl"},
{line,329}]}]}]


#2

Hi @kirkbcb,
Some of the issues you are facing look like known bugs that were fixed in version 3.0.3; see the release notes for more details:
http://docs.couchbase.com/admin/admin/rel-notes/rel-notes3.0.html

But the errors you describe could also happen if the cluster nodes are running low on resources. From what I can read here: http://aws.amazon.com/ec2/pricing/
Each node has 8 cores, 30 GB RAM and 2 x 80 GB SSD.
Cores and RAM seem okay, but how much free disk space do you have on each node?
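
If it helps, here is a minimal sketch for checking free disk space on each node with only the Python standard library. The function name `disk_report` is just something I made up for illustration, and the path in the usage note is the usual default Couchbase data directory on Linux; that is an assumption, so substitute whatever data and index paths your nodes actually use.

```python
import shutil
import sys


def disk_report(path="/"):
    """Return total/free space for the filesystem containing `path`.

    `disk_report` is a hypothetical helper for this thread, not a Couchbase tool.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    gib = 1024 ** 3
    free_pct = 100 * usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_gib": round(usage.total / gib, 1),
        "free_gib": round(usage.free / gib, 1),
        "free_pct": round(free_pct, 1),
    }


if __name__ == "__main__":
    # Check every path given on the command line, or "/" if none is given.
    for p in sys.argv[1:] or ["/"]:
        print(disk_report(p))
```

Run it on each node, for example as `python disk_check.py /opt/couchbase/var/lib/couchbase/data` (the script name is arbitrary), and post the numbers here.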