Rebalance Failure v4.0.0

Hello,

We are using Couchbase 4.0.0 Community Edition (with plans to upgrade, but it is what it is right now).

We have had a number of issues with rebalancing operations recently and are looking for some guidance. In this case we were adding a new node to the cluster, and the cluster ran with reduced performance for a few hours while the rebalance took place. Two of our four nodes eventually reached 100% rebalance, but two were stuck at 90%. The rebalance eventually failed, with the log showing the following error:

<0.6918.333> exited with {unexpected_exit,
{'EXIT',<0.25943.333>,
{{error,{badrpc,nodedown}},
{gen_server,call,
[{'janitor_agent-app-data','ns_1@10.0.0.116'},
{if_rebalance,<0.10043.328>,
{update_vbucket_state,821,active,undefined,
undefined}},
infinity]}}}}

This was followed by messages indicating that all of the buckets on the 10.0.0.116 node were shutting down.
The buckets came back up immediately afterwards, but by that point the replication process had failed.
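The {badrpc,nodedown} suggests ns_server lost contact with 10.0.0.116 mid-rebalance, so one thing worth verifying (a rough sketch, assuming the default 4.x port layout) is that the nodes can still reach each other on the ports Couchbase uses for node-to-node traffic:

# run from one of the other nodes towards 10.0.0.116 (adjust the address);
# 4369 = Erlang epmd, 8091/8092 = REST/views, 11209/11210 = data traffic
for port in 4369 8091 8092 11209 11210; do
  nc -z -w 2 10.0.0.116 "$port" && echo "port $port open" || echo "port $port unreachable"
done
# the Erlang distribution layer also uses a dynamic range (21100-21299 by default)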

Taking a look at the error.log on the 10.0.0.116 node itself, I found the following:

[ns_server:error,2017-03-07T22:14:23.473Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["auth"], those of them are active []
[ns_server:error,2017-03-07T22:14:26.173Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["app-data"], those of them are active []
[ns_server:error,2017-03-07T22:15:13.011Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["app-data"], those of them are active []
[ns_server:error,2017-03-07T22:15:15.717Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["auth"], those of them are active []
[ns_server:error,2017-03-07T22:18:44.455Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["app-data"], those of them are active []
[ns_server:error,2017-03-07T22:18:47.138Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["auth"], those of them are active []
[ns_server:error,2017-03-08T02:11:32.338Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.116': ["app-data",
"auth"], those of them are active []
[ns_server:error,2017-03-08T02:11:33.171Z,ns_1@10.0.0.116:wait_link_to_couchdb_node<0.6923.291>:ns_server_nodes_sup:do_wait_link_to_couchdb_node:156]ns_couchdb_port({undefined,'ns_1@10.0.0.116'}) died with reason noproc
[ns_server:error,2017-03-08T02:11:35.801Z,ns_1@10.0.0.116:ns_log<0.7718.291>:ns_log:handle_cast:209]unable to notify listeners because of badarg
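The repeated "buckets became not ready on node 'ns_1@10.0.0.46'" entries look like the cluster briefly marking that node unhealthy. A quick way to see how each node is reporting right now (credentials are placeholders) is the pools/default REST endpoint:

# node health and membership as ns_server currently sees them
curl -s -u Administrator:<password> http://10.0.0.116:8091/pools/default | \
  python -m json.tool | grep -E '"(hostname|status|clusterMembership)"'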
A longer tail of the same log (tail -n200 /opt/couchbase/var/lib/couchbase/logs/error.log) shows:
[ns_server:error,2017-03-02T04:42:47.519Z,ns_1@127.0.0.1:ns_log<0.1886.0>:ns_log:handle_cast:209]unable to notify listeners because of badarg
[ns_doctor:error,2017-03-02T04:42:49.310Z,ns_1@10.0.0.116:ns_log<0.2187.0>:ns_doctor:get_node:204]Error attempting to get node 'ns_1@10.0.0.52': {exit,
{noproc,
{gen_server,call,
[ns_doctor,
{get_node,
'ns_1@10.0.0.52'}]}}}
[stats:error,2017-03-02T04:42:55.784Z,ns_1@10.0.0.116:query_stats_collector<0.2338.0>:base_stats_collector:handle_info:109](Collector: query_stats_collector) Exception in stats collector: {error,
{badmatch,
{error,
{econnrefused,
[{lhttpc_client,
send_request,
1,
[{file,
"/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
{line,
220}]},
{lhttpc_client,
execute,
9,
[{file,
"/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
{line,
169}]},
{lhttpc_client,
request,
9,
[{file,
"/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
{line,
92}]}]}}},
[{query_rest,
send,3,
[{file,
"src/query_rest.erl"},
{line,
64}]},
{query_rest,
do_get_stats,
0,
[{file,
"src/query_rest.erl"},
{line,
43}]},
{base_stats_collector,
handle_info,
2,
[{file,
"src/base_stats_collector.erl"},
{line,
89}]},
{gen_server,
handle_msg,
5,
[{file,
"gen_server.erl"},
{line,
604}]},
{proc_lib,
init_p_do_apply,
3,
[{file,
"proc_lib.erl"},
{line,
239}]}]}

[ns_server:error,2017-03-02T04:42:55.786Z,ns_1@10.0.0.116:index_stats_collector<0.2431.0>:index_rest:get_json:45]Request to http://127.0.0.1:9102/stats?async=true failed: {error,
{econnrefused,
[{lhttpc_client,
send_request,1,
[{file,
"/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
{line,220}]},
{lhttpc_client,
execute,9,
[{file,
"/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
{line,169}]},
{lhttpc_client,
request,9,
[{file,
"/home/couchbase/jenkins/workspace/sherlock-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
{line,92}]}]}}
[ns_server:error,2017-03-02T06:47:23.003Z,ns_1@10.0.0.116:compaction_new_daemon<0.2356.0>:compaction_new_daemon:log_compactors_exit:1266]Compactor <0.30038.3> exited unexpectedly: {db_compactor_died_too_soon,
<<"app-data/master">>}. Moving to the next bucket.
[ns_server:error,2017-03-07T22:14:23.473Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["auth"], those of them are active []
[ns_server:error,2017-03-07T22:14:26.173Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["app-data"], those of them are active []
[ns_server:error,2017-03-07T22:15:13.011Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["app-data"], those of them are active []
[ns_server:error,2017-03-07T22:15:15.717Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["auth"], those of them are active []
[ns_server:error,2017-03-07T22:18:44.455Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["app-data"], those of them are active []
[ns_server:error,2017-03-07T22:18:47.138Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.46': ["auth"], those of them are active []
[ns_server:error,2017-03-08T02:11:32.338Z,ns_1@10.0.0.116:ns_doctor<0.2239.0>:ns_doctor:update_status:287]The following buckets became not ready on node 'ns_1@10.0.0.116': ["app-data",
"auth"], those of them are active []
[ns_server:error,2017-03-08T02:11:33.171Z,ns_1@10.0.0.116:wait_link_to_couchdb_node<0.6923.291>:ns_server_nodes_sup:do_wait_link_to_couchdb_node:156]ns_couchdb_port({undefined,'ns_1@10.0.0.116'}) died with reason noproc
[ns_server:error,2017-03-08T02:11:35.801Z,ns_1@10.0.0.116:ns_log<0.7718.291>:ns_log:handle_cast:209]unable to notify listeners because of badarg
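The older econnrefused entries from 2017-03-02 are the stats collectors failing to reach the local query/index services. Assuming those services are meant to be running on this node, a quick local check would be something like (9102 is the indexer stats port from the log, 8093 the query service's default port):

# run on 10.0.0.116 itself; both ports are assumptions based on the log and the defaults
curl -sS http://127.0.0.1:9102/stats
curl -sS http://127.0.0.1:8093/admin/ping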

Any help or ideas are appreciated. The typical procedure here would be to try the rebalance again; however, during the previous rebalance we saw significantly reduced client performance against the cluster, so we are hoping for some answers before we try again.
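For reference, the retry and its progress can also be driven from the CLI rather than the UI; a minimal sketch, with host and credentials as placeholders:

# kick off the rebalance again from any node already in the cluster
couchbase-cli rebalance -c 10.0.0.116:8091 -u Administrator -p <password>
# and poll progress while it runs, via the CLI or the REST endpoint
couchbase-cli rebalance-status -c 10.0.0.116:8091 -u Administrator -p <password>
curl -s -u Administrator:<password> http://10.0.0.116:8091/pools/default/rebalanceProgress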

@sfright,

Need some context on your:

  • use case
  • cluster size
  • % of active data in memory
  • Windows or Linux; if Linux (a quick way to check these is sketched below):
    1. THP off
    2. ulimits of the couchbase process
    3. swappiness = 0
  • any views
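For the Linux items, something along these lines gathers them in one go (paths assume a typical Linux layout; beam.smp is ns_server's Erlang VM, so its limits stand in for the couchbase process):

# transparent huge pages -- "never" should be the selected value
cat /sys/kernel/mm/transparent_hugepage/enabled
# swappiness -- 0 is the usual recommendation for Couchbase nodes
cat /proc/sys/vm/swappiness
# effective ulimits of the running couchbase (ns_server) process
cat /proc/$(pgrep -o beam.smp)/limits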