Couchbase rebalance errors

We are testing Couchbase, and we're quite frequently getting errors during rebalance.

We have three nodes in the cluster, hosted in Windows Azure. They run Ubuntu 12.04 on machines with 2 cores and 3.5 GB of memory, and are configured with separate disks for data and indexes. All nodes are joined into a VPN and located close to each other in the datacenter, so network latency is low.
We are using Couchbase Server version 2.5.1.
The cluster has three buckets: "default" and two custom buckets. Replication is enabled on all buckets, and auto-failover is turned on as well.
The most populated bucket has 5700 items and consumes 120 MB of disk and 139 MB of RAM.

Any advice on how to improve fault tolerance would be appreciated.

The most frequent errors are the following:
1.--------------------------------------------------------------------------------------------------------------------------------------------
[ns_server:error,2014-09-23T13:47:24.833,ns_1@cb-node-1.cloudapp.net:<0.12616.19>:misc:inner_wait_shutdown:1555]Expected exit signal from <13070.15971.1> but could not get it in 5 seconds. This is a bug, but process we're waiting for is dead (noproc), so trying to ignore...
[ns_server:error,2014-09-23T13:47:24.834,ns_1@cb-node-1.cloudapp.net:<0.12616.19>:misc:sync_shutdown_many_i_am_trapping_exits:1537]Shutdown of the following failed: [{<13070.15971.1>,noproc}]
[ns_server:error,2014-09-23T13:47:24.854,ns_1@cb-node-1.cloudapp.net:<0.12604.19>:misc:sync_shutdown_many_i_am_trapping_exits:1537]Shutdown of the following failed: [{<0.12616.19>,
{{badmatch,[{<13070.15971.1>,noproc}]},
[{misc,
sync_shutdown_many_i_am_trapping_exits,
1},
{misc,try_with_maybe_ignorant_after,2},
{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}},
{<0.12617.19>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<13070.15971.1>,had_backfill,
infinity]}}}]}}]
[error_logger:error,2014-09-23T13:47:24.854,ns_1@cb-node-1.cloudapp.net:error_logger<0.6.0>:ale_error_logger_handler:log_msg:119]** Generic server <0.12616.19> terminating
** Last message in was {'EXIT',<13070.15971.1>,{badmatch,{error,closed}}}
** When Server state == {state,"rex_production",90,'ns_1@cb-node-1.cloudapp.net',
[{'ns_1@cb-node-2.cloudapp.net',<13058.24044.5>},
{'ns_1@cb-node-3.cloudapp.net',<13070.15971.1>}]}
** Reason for termination ==
** {{badmatch,[{<13070.15971.1>,noproc}]},
[{misc,sync_shutdown_many_i_am_trapping_exits,1},
{misc,try_with_maybe_ignorant_after,2},
{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}

[ns_server:error,2014-09-23T13:47:24.854,ns_1@cb-node-1.cloudapp.net:<0.12604.19>:misc:try_with_maybe_ignorant_after:1573]Eating exception from ignorant after-block:
{error,
{badmatch,
[{<0.12616.19>,
{{badmatch,[{<13070.15971.1>,noproc}]},
[{misc,sync_shutdown_many_i_am_trapping_exits,1},
{misc,try_with_maybe_ignorant_after,2},
{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}},
{<0.12617.19>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<13070.15971.1>,had_backfill,infinity]}}}]}}]},
[{misc,sync_shutdown_many_i_am_trapping_exits,1},
{misc,try_with_maybe_ignorant_after,2},
{ns_single_vbucket_mover,mover,6},
{proc_lib,init_p_do_apply,3}]}
[error_logger:error,2014-09-23T13:47:24.855,ns_1@cb-node-1.cloudapp.net:error_logger<0.6.0>:ale_error_logger_handler:log_report:115]
=========================CRASH REPORT=========================
crasher:
initial call: new_ns_replicas_builder:init/1
pid: <0.12616.19>
registered_name: []
exception exit: {{badmatch,[{<13070.15971.1>,noproc}]},
[{misc,sync_shutdown_many_i_am_trapping_exits,1},
{misc,try_with_maybe_ignorant_after,2},
{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}
in function gen_server:terminate/6
ancestors: [<0.12604.19>,<0.1452.19>,<0.1354.19>]
messages: [{'EXIT',<0.12604.19>,shutdown}]
links: [<0.12604.19>]
dictionary: []
trap_exit: true
status: running
heap_size: 317811
stack_size: 24
reductions: 38764
neighbours:

[error_logger:error,2014-09-23T13:47:24.855,ns_1@cb-node-1.cloudapp.net:error_logger<0.6.0>:ale_error_logger_handler:log_report:115]
=========================CRASH REPORT=========================
crasher:
initial call: ns_single_vbucket_mover:mover/6
pid: <0.12604.19>
registered_name: []
exception exit: {unexpected_exit,
{'EXIT',<0.12617.19>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<13070.15971.1>,had_backfill,infinity]}}}]}}}
in function ns_single_vbucket_mover:spawn_and_wait/1
in call from ns_single_vbucket_mover:mover_inner/6
in call from misc:try_with_maybe_ignorant_after/2
in call from ns_single_vbucket_mover:mover/6
ancestors: [<0.1452.19>,<0.1354.19>]
messages: []
links: [<0.1452.19>]
dictionary: [{cleanup_list,[<0.12616.19>,<0.12617.19>]}]
trap_exit: true
status: running
heap_size: 4181
stack_size: 24
reductions: 7334


[ns_server:error,2014-09-23T12:10:47.321,ns_1@cb-node-2.cloudapp.net:<0.14731.149>:misc:inner_wait_shutdown:1555]Expected exit signal from <0.14732.149> but could not get it in 5 seconds. This is a bug, but process we're waiting for is dead (noproc), so trying to ignore...
[ns_server:error,2014-09-23T12:10:47.320,ns_1@cb-node-2.cloudapp.net:<0.5797.149>:ns_single_vbucket_mover:spawn_and_wait:87]Got unexpected exit signal {'EXIT',<0.5787.149>,
{unexpected_exit,
{'EXIT',<0.14692.149>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<13061.15101.183>,had_backfill,
infinity]}}}]}}}}
[error_logger:error,2014-09-23T12:10:47.321,ns_1@cb-node-2.cloudapp.net:error_logger<0.6.0>:ale_error_logger_handler:log_report:115]
=========================CRASH REPORT=========================
crasher:
initial call: ns_single_vbucket_mover:mover/6
pid: <0.14670.149>
registered_name: []
exception exit: {unexpected_exit,
{'EXIT',<0.14692.149>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<13061.15101.183>,had_backfill,infinity]}}}]}}}
in function ns_single_vbucket_mover:spawn_and_wait/1
in call from ns_single_vbucket_mover:mover_inner/6
in call from misc:try_with_maybe_ignorant_after/2
in call from ns_single_vbucket_mover:mover/6
ancestors: [<0.5787.149>,<0.5750.149>]
messages: []
links: [<0.5787.149>]
dictionary: [{cleanup_list,[<0.14687.149>,<0.14692.149>]}]
trap_exit: true
status: running
heap_size: 2584
stack_size: 24
reductions: 7304
neighbours:

[error_logger:error,2014-09-23T12:10:47.626,ns_1@cb-node-2.cloudapp.net:error_logger<0.6.0>:ale_error_logger_handler:log_report:115]
=========================CRASH REPORT=========================
crasher:
initial call: ns_single_vbucket_mover:mover/6
pid: <0.14330.149>
registered_name: []
exception exit: {unexpected_exit,
{'EXIT',<0.5787.149>,
{unexpected_exit,
{'EXIT',<0.14692.149>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<13061.15101.183>,had_backfill,
infinity]}}}]}}}}}
in function ns_single_vbucket_mover:spawn_and_wait/1
in call from ns_single_vbucket_mover:mover_inner/6
in call from misc:try_with_maybe_ignorant_after/2
in call from ns_single_vbucket_mover:mover/6
ancestors: [<0.5787.149>,<0.5750.149>]
messages: [{'EXIT',<0.5787.149>,
{unexpected_exit,
{'EXIT',<0.14692.149>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<13061.15101.183>,had_backfill,infinity]}}}]}}}}]
links: [<0.5787.149>]
dictionary: [{cleanup_list,[<0.14337.149>,<0.14365.149>]}]
trap_exit: true
status: running
heap_size: 2584
stack_size: 24
reductions: 4404

Note that a failed rebalance can frequently just be tried again. Does the rebalance eventually succeed when you retry? It is safe to retry a rebalance in all cases.
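Since retrying is safe, you can script it rather than babysitting the UI. Below is a minimal sketch (Python) of a retry loop; the start_rebalance and poll_status callables are hypothetical stand-ins for however you drive the REST API (e.g. POST /controller/rebalance and GET /pools/default/rebalanceProgress against any node on port 8091), not part of any Couchbase SDK:

```python
import time

def rebalance_with_retries(start_rebalance, poll_status,
                           max_attempts=3, poll_interval=5):
    """Start a rebalance and retry it on failure.

    start_rebalance and poll_status are hypothetical stand-ins for REST
    calls; poll_status is assumed to reduce the cluster's rebalance state
    to one of "running", "done", or "failed".
    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        start_rebalance()
        while True:
            status = poll_status()
            if status == "done":
                return attempt
            if status == "failed":
                break  # retrying a failed rebalance is safe
            time.sleep(poll_interval)
    raise RuntimeError("rebalance failed after %d attempts" % max_attempts)
```

A cap on attempts is still worth keeping: if the rebalance fails several times in a row, you likely have an underlying problem (node connectivity, memory pressure) that retrying won't fix.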

You may want to search the issue tracker to see whether the scenarios you outlined are fixed as of Couchbase Server 3.0. There were many improvements over 2.5.1, so it is quite possible this has already been addressed.