Server down while removing nodes + rebalancing

One of my 6 servers went down while I was removing 3 of the 6 and rebalancing.
This happens quite frequently during rebalancing, and not just on one specific server.
If I wait for the failed server to come back up (~1 hr) and try rebalancing again, another server goes down with the same error.

I got the following error from the failed server:

Control connection to memcached on 'ns_1@10.10.36.122' disconnected: {{badmatch,{error,timeout}},
 [{mc_client_binary,cmd_vocal_recv,5,[{file,"src/mc_client_binary.erl"},{line,151}]},
  {mc_client_binary,select_bucket,2,[{file,"src/mc_client_binary.erl"},{line,346}]},
  {ns_memcached,ensure_bucket,2,[{file,"src/ns_memcached.erl"},{line,1269}]},
  {ns_memcached,handle_info,2,[{file,"src/ns_memcached.erl"},{line,744}]},
  {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
  {ns_memcached,init,1,[{file,"src/ns_memcached.erl"},{line,171}]},
  {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

The rebalance was then aborted after the following message:

Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 134. Restarting. Messages:
Fri Jan 23 16:48:26.112455 KST 3: (xxx) TAP (Consumer) eq_tapq:anon_310 - disconnected
Fri Jan 23 16:48:26.361575 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Schedule the backfill for vbucket 206
Fri Jan 23 16:48:26.361700 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Fri Jan 23 16:48:26.361717 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 206
asssertion failed [bySeqno >= 0] at /home/buildbot/buildbot_slave/ubuntu-1204-x64-301-builder/build/build/ep-engine/src/item.h:346
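
A note on "exited with status 134" in the message above: assuming the status follows the usual Unix convention of 128 + signal number, 134 corresponds to signal 6 (SIGABRT), which is what a failed assertion typically raises. A minimal Python sketch of that arithmetic (the helper name describe_exit_status is made up for illustration):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status, assuming the 128 + signal convention."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name for this signal number on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # Unknown signal number on this platform; fall back to the raw value.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(134))
```

This only decodes the number; it says nothing about why the assertion itself failed.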

What could cause this issue?
This time, the failed server is not going back up again.

Which version of Couchbase are you running? If you're on Couchbase 3.0.1, there is a chance that you have hit MB-12305.

I am performing an online upgrade from 2.2.0 Community Edition to 3.0.1 Community Edition.
Could this also be related to MB-12305?

Yes, the fix for this issue is in 3.0.2. I suggest removing all 3.0.1 nodes from the cluster first and then doing a swap-rebalance upgrade to 3.0.2.
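
To see which nodes are currently on 3.0.1 in a mixed-version cluster, something like the following rough Python sketch may help. It assumes the cluster REST endpoint /pools/default on port 8091 returns a "nodes" list whose entries carry "hostname" and "version" fields (with version strings shaped like "3.0.1-1444-rel-community"). The function names, credentials, and sample data are placeholders, not anything taken from this cluster.

```python
import base64
import json
import urllib.request
from typing import List


def nodes_on_version(pool_info: dict, version: str) -> List[str]:
    """Return hostnames of nodes whose version string starts with `version`.

    Assumes version strings look like "3.0.1-1444-rel-community", so the
    part before the first "-" is compared exactly.
    """
    matches = []
    for node in pool_info.get("nodes", []):
        base_version = node.get("version", "").split("-")[0]
        if base_version == version:
            matches.append(node.get("hostname", "<unknown>"))
    return matches


def fetch_pool_info(host: str, user: str, password: str) -> dict:
    """Fetch /pools/default from one node using HTTP basic auth."""
    request = urllib.request.Request(f"http://{host}:8091/pools/default")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical sample data; replace with fetch_pool_info(...) on a real cluster.
    sample = {
        "nodes": [
            {"hostname": "10.10.36.122:8091", "version": "3.0.1-1444-rel-community"},
            {"hostname": "10.10.36.244:8091", "version": "2.2.0-837-rel-community"},
        ]
    }
    print(nodes_on_version(sample, "3.0.1"))
```

The nodes it lists would be the ones to take out before the swap rebalance to 3.0.2.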

It looks like 3.0.2 is only available for Enterprise Edition.
Wouldn't I need to pay for EE (a paid license, for example)?

In production, you can use EE for free on up to 2 nodes. For dev and test, you don't need to license it unless you want support.
thanks
-cihan

I see. So it's free up to 2 nodes.
If I need, say, 6 nodes, it's not free, correct?

Correct. You can use our Community Edition as well. A few features differ between the two editions; they are documented here for the 3.0 version: http://docs.couchbase.com/admin/admin/enterprise-edition.html
thanks
-cihan
