Server down while removing nodes + rebalancing

Dynamicscope · January 23, 2015, 8:27am

One of my 6 servers went down while I was removing 3 of the 6 and rebalancing.
This happens quite frequently during rebalancing, and not just on one specific server.
If I wait for the failed server to come back up (~1 hr) and retry the rebalance, another server goes down with the same error.

I got the following error from the failed server:

Control connection to memcached on 'ns_1@10.10.36.122' disconnected: {{badmatch,{error,timeout}},
 [{mc_client_binary,cmd_vocal_recv,5,[{file,"src/mc_client_binary.erl"},{line,151}]},
  {mc_client_binary,select_bucket,2,[{file,"src/mc_client_binary.erl"},{line,346}]},
  {ns_memcached,ensure_bucket,2,[{file,"src/ns_memcached.erl"},{line,1269}]},
  {ns_memcached,handle_info,2,[{file,"src/ns_memcached.erl"},{line,744}]},
  {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
  {ns_memcached,init,1,[{file,"src/ns_memcached.erl"},{line,171}]},
  {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

and “Rebalancing” aborted after the following message:

Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 134. Restarting. Messages:
Fri Jan 23 16:48:26.112455 KST 3: (xxx) TAP (Consumer) eq_tapq:anon_310 - disconnected
Fri Jan 23 16:48:26.361575 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Schedule the backfill for vbucket 206
Fri Jan 23 16:48:26.361700 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Fri Jan 23 16:48:26.361717 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 206
asssertion failed [bySeqno >= 0] at /home/buildbot/buildbot_slave/ubuntu-1204-x64-301-builder/build/build/ep-engine/src/item.h:346

What could cause this issue?
This time, the failed server is not coming back up.
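
A note on the "exited with status 134" line in the log: on Unix-like systems an exit status above 128 conventionally means the process was terminated by signal number (status - 128). Here that would be signal 6, which is SIGABRT on Linux and would be consistent with the failed assertion in the last log line. A minimal Python sketch of that arithmetic follows; the helper name `describe_exit_status` is made up for illustration and is not part of any Couchbase tooling.

```python
import signal


def describe_exit_status(status):
    """Interpret a Unix exit status using the 128 + signal convention."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name for this signal on the current platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


print(describe_exit_status(134))
```

This only decodes the number; it does not say why the assertion fired.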

asingh · January 23, 2015, 10:38am

Which version of Couchbase are you running? If you’re running Couchbase 3.0.1, there is a chance that you have hit MB-12305.

Dynamicscope · January 23, 2015, 11:43am

I am performing an online upgrade from 2.2.0 Community Edition to 3.0.1 Community Edition.
Could this also be related to MB-12305?

asingh · January 23, 2015, 11:59am

Yes, the fix for this issue is in 3.0.2. I suggest removing all 3.0.1 nodes from the cluster first and then doing a swap-rebalance upgrade to 3.0.2.
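
The ordering suggested here can be sketched as a small planning function. This is only an illustration under assumptions: the node names, the `plan_upgrade` helper, and the step labels are hypothetical, and it does not call any Couchbase API or CLI. It first rebalances out every 3.0.1 node, then swaps each remaining node for a fresh 3.0.2 node one at a time.

```python
def plan_upgrade(cluster, new_nodes, bad_version="3.0.1"):
    """Return ordered (action, nodes_in, nodes_out) steps.

    cluster:   dict mapping node name -> running version (hypothetical inventory)
    new_nodes: names of spare nodes already installed with the target version
    """
    steps = []

    # Step 1: take every node on the affected version out of the cluster.
    bad = [name for name, version in cluster.items() if version == bad_version]
    if bad:
        steps.append(("rebalance_out", [], bad))

    # Step 2: swap each remaining old node for one new node, one pair at a time.
    remaining = [name for name in cluster if name not in bad]
    if len(new_nodes) < len(remaining):
        raise ValueError("not enough new nodes for a one-for-one swap")
    for old, new in zip(remaining, new_nodes):
        steps.append(("swap_rebalance", [new], [old]))

    return steps


cluster = {"node-a": "2.2.0", "node-b": "2.2.0", "node-c": "3.0.1"}
for step in plan_upgrade(cluster, ["new-1", "new-2"]):
    print(step)
```

Swapping one pair per step is a choice made for this sketch; it keeps the node count constant during each rebalance.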

Dynamicscope · January 24, 2015, 11:44am

It looks like 3.0.2 is only available for Enterprise Edition.
Don’t I need to pay for EE (for example, a paid license)?

cihangirb · January 24, 2015, 3:15pm

In production, you can use EE for free on up to 2 nodes. For dev and test, you don’t need to license it unless you want support.
thanks
-cihan

Dynamicscope · January 24, 2015, 3:19pm

I see. I see. So it’s free up to 2 nodes.
If I need… say 6 nodes, it’s not free, correct?

cihangirb · January 26, 2015, 5:22pm

Correct. You can use our community edition as well. There are a few items that are different between them - they are documented here for the 3.0 version: http://docs.couchbase.com/admin/admin/enterprise-edition.html
thanks
-cihan