Re-balance fails when adding a 5.0.1 node to a 4.6.4-4590 cluster

I am trying to upgrade an existing 2-node 4.6.4-4590 Enterprise Edition cluster to 5.0.1 using the rolling upgrade process. I failed node 2 over so the cluster was running on node 1 only, uninstalled Couchbase on node 2, installed 5.0.1 on node 2, and joined it back to the cluster on node 1. Everything looks good until I try to rebalance: the process starts but then stops with the "Rebalance failed. See logs for detailed reason. You can try again." message. The logs from the UI are below. I captured the other log files as well - let me know what additional data would be useful. When I rebuilt the node as 4.6.4, the rebalance worked fine.
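For reference, I did everything through the web console; the steps map roughly onto the following couchbase-cli commands (host names and credentials are placeholders):

    # hard fail-over node 2 so the cluster runs on node 1 only
    couchbase-cli failover -c Prd-CCNode-01.sc.com:8091 -u Administrator -p <password> --server-failover=Prd-CCNode-02.sc.com:8091 --force
    # ...uninstall 4.6.4 and install 5.0.1 on node 2...
    # add the rebuilt node 2 back and rebalance
    couchbase-cli server-add -c Prd-CCNode-01.sc.com:8091 -u Administrator -p <password> --server-add=Prd-CCNode-02.sc.com:8091 --server-add-username=Administrator --server-add-password=<password>
    couchbase-cli rebalance -c Prd-CCNode-01.sc.com:8091 -u Administrator -p <password>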

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.7433.1>,
{{{child_interrupted,
{'EXIT',<0.6180.1>,socket_closed}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},
{line,231}]},
{dcp_replicator,handle_call,3,
[{file,"src/dcp_replicator.erl"},
{line,109}]},
{gen_server,handle_msg,5,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/gen_server.erl"},
{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,
"c:/cygw …
ns_orchestrator 000
ns_1@Prd-CCNode-02.sc.com
10:56:08 PM Sun Feb 11, 2018
<0.7058.1> exited with {unexpected_exit,
{'EXIT',<0.7433.1>,
{{{child_interrupted,
{'EXIT',<0.6180.1>,socket_closed}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},{line,231}]},
{dcp_replicator,handle_call,3,
[{file,"src/dcp_replicator.erl"},{line,109}]},
{gen_server,handle_msg,5,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/gen_server.erl"},
{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/proc_lib.erl"},
{line,239}]}]},
{gen_server,call,
[{'janitor_agent-sync_attachment',
'ns_1@Prd-CCNode-02.sc.com'},
{if_rebalance,<0.32138.0>,
{wait_index_updated,874}},
infinity]}}}}
ns_vbucket_mover 000
ns_1@Prd-CCNode-02.sc.com
10:56:03 PM Sun Feb 11, 2018
Haven’t heard from a higher priority node or a master, so I’m taking over.
mb_master 000
ns_1@Prd-CCNode-01.sc.com
10:56:02 PM Sun Feb 11, 2018
Bucket “sync_attachment” rebalance appears to be swap rebalance
ns_vbucket_mover 000
ns_1@Prd-CCNode-02.sc.com
10:54:43 PM Sun Feb 11, 2018
Started rebalancing bucket sync_attachment
ns_rebalancer 000
ns_1@Prd-CCNode-02.sc.com
10:54:43 PM Sun Feb 11, 2018
Starting rebalance, KeepNodes = ['ns_1@Prd-CCNode-01.sc.com',
'ns_1@Prd-CCNode-02.sc.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
ns_orchestrator 004
ns_1@Prd-CCNode-02.sc.com
10:54:43 PM Sun Feb 11, 2018
Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.20069.0>,
{{{child_interrupted,
{'EXIT',<0.19595.0>,socket_closed}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},
{line,231}]},
{dcp_replicator,handle_call,3,
[{file,"src/dcp_replicator.erl"},
{line,109}]},
{gen_server,handle_msg,5,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/gen_server.erl"},
{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/proc_lib.erl"},
{line,239}]}]},
{gen_server,call,
[{'janitor_agent-sync_attachment',
'ns_1@Prd-CCNode-02.sc.com'},
{if_rebalance,<0.12388.0>,
{wait_index_updated,958}},
infinity]}}}}
ns_orchestrator 000
ns_1@Prd-CCNode-02.sc.com
10:50:28 PM Sun Feb 11, 2018
<0.19609.0> exited with {unexpected_exit,
{'EXIT',<0.20069.0>,
{{{child_interrupted,
{'EXIT',<0.19595.0>,socket_closed}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},{line,231}]},
{dcp_replicator,handle_call,3,
[{file,"src/dcp_replicator.erl"},{line,109}]},
{gen_server,handle_msg,5,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/gen_server.erl"},
{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/proc_lib.erl"},
{line,239}]}]},
{gen_server,call,
[{'janitor_agent-sync_attachment',
'ns_1@Prd-CCNode-02.sc.com'},
{if_rebalance,<0.12388.0>,
{wait_index_updated,958}},
infinity]}}}}
ns_vbucket_mover 000
ns_1@Prd-CCNode-02.sc.com
10:50:27 PM Sun Feb 11, 2018
Bucket “sync_attachment” rebalance appears to be swap rebalance
ns_vbucket_mover 000
ns_1@Prd-CCNode-02.sc.com
10:49:06 PM Sun Feb 11, 2018
Started rebalancing bucket sync_attachment
ns_rebalancer 000
ns_1@Prd-CCNode-02.sc.com
10:49:06 PM Sun Feb 11, 2018
Starting rebalance, KeepNodes = ['ns_1@Prd-CCNode-01.sc.com',
'ns_1@Prd-CCNode-02.sc.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
ns_orchestrator 004
ns_1@Prd-CCNode-02.sc.com
10:49:05 PM Sun Feb 11, 2018
Haven’t heard from a higher priority node or a master, so I’m taking over. (repeated 1 times)
mb_master 000
ns_1@Prd-CCNode-01.sc.com
10:48:58 PM Sun Feb 11, 2018
IP address seems to have changed. Unable to listen on 'ns_1@Prd-CCNode-02.sc.com'. (POSIX error code: 'nxdomain')
menelaus_web_alerts_srv 000
ns_1@Prd-CCNode-02.sc.com
10:48:12 PM Sun Feb 11, 2018
Client-side error-report for user undefined on node ‘ns_1@Prd-CCNode-02.sc.com’:
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
Got unhandled javascript error:
message: The transition errored;

menelaus_web 102
ns_1@Prd-CCNode-02.sc.com
10:48:12 PM Sun Feb 11, 2018
Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.9777.0>,
{{{child_interrupted,
{'EXIT',<0.9154.0>,socket_closed}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},
{line,231}]},
{dcp_replicator,handle_call,3,
[{file,"src/dcp_replicator.erl"},
{line,109}]},
{gen_server,handle_msg,5,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/gen_server.erl"},
{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,
"c:/cygw …
ns_orchestrator 000
ns_1@Prd-CCNode-02.sc.com
10:48:12 PM Sun Feb 11, 2018
<0.9628.0> exited with {unexpected_exit,
{'EXIT',<0.9777.0>,
{{{child_interrupted,
{'EXIT',<0.9154.0>,socket_closed}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},{line,231}]},
{dcp_replicator,handle_call,3,
[{file,"src/dcp_replicator.erl"},{line,109}]},
{gen_server,handle_msg,5,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/gen_server.erl"},
{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,
"c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/proc_lib.erl"},
{line,239}]}]},
{gen_server,call,
…
ns_vbucket_mover 000
ns_1@Prd-CCNode-02.sc.com
10:48:12 PM Sun Feb 11, 2018
Haven’t heard from a higher priority node or a master, so I’m taking over.
mb_master 000
ns_1@Prd-CCNode-01.sc.com
10:47:44 PM Sun Feb 11, 2018
Bucket “sync_attachment” rebalance appears to be swap rebalance
ns_vbucket_mover 000
ns_1@Prd-CCNode-02.sc.com
10:47:30 PM Sun Feb 11, 2018
Started rebalancing bucket sync_attachment
ns_rebalancer 000
ns_1@Prd-CCNode-02.sc.com
10:47:30 PM Sun Feb 11, 2018

What do you mean by failover?
1. You should rebalance node 2 out (rough couchbase-cli equivalent below).
2. Install 5.0.1 on node 2.
3. Add node 2 back.
4. Click rebalance.
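The rebalance-out step would be something like this with couchbase-cli (host and credentials are placeholders):

    # rebalance node 2 out of the cluster instead of failing it over
    couchbase-cli rebalance -c Prd-CCNode-01.sc.com:8091 -u Administrator -p <password> --server-remove=Prd-CCNode-02.sc.com:8091

Then add it back with server-add and rebalance again once 5.0.1 is installed on it.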

Looks like you're running the Couchbase servers on your Windows laptop as Vagrant VMs.

The Couchbase replication stream (DCP) sockets are being closed for some reason, which is what is killing the rebalance.
If you're looking for an outside resource to use, you may want to look at DigitalOcean instances instead of your laptop.
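You can also check whether the DCP connections are actually dropping with cbstats on each node (assuming the default data port 11210; depending on your version you may also need to pass cluster credentials):

    # list DCP connections and stream state for the bucket being rebalanced
    cbstats Prd-CCNode-02.sc.com:11210 dcp -b sync_attachment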

I think there’s a bit of confusion in this post, sorry about that!

I’ve tried to separate out some of the key logging (unfortunately the forums tend to mangle it a bit…)

  • Rebalance starts, with node 02 as the orchestrator. (I’m assuming 02 was the one that was upgraded and running 5.0.1?)
10:54:43 PM Sun Feb 11, 2018
ns_1@Prd-CCNode-02.sc.com

Starting rebalance, KeepNodes = ['ns_1@Prd-CCNode-01.sc.com',
'ns_1@Prd-CCNode-02.sc.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes


10:54:43 PM Sun Feb 11, 2018
ns_1@Prd-CCNode-02.sc.com

Started rebalancing bucket sync_attachment


10:54:43 PM Sun Feb 11, 2018
ns_1@Prd-CCNode-02.sc.com

Bucket “sync_attachment” rebalance appears to be swap rebalance

  • Node 01 reports that it doesn’t have a connection to a “higher priority” node, and starts to take over as the orchestrator (at a very high level, the priority is based first on version, then on node name).
10:56:02 PM Sun Feb 11, 2018
ns_1@Prd-CCNode-01.sc.com

Haven’t heard from a higher priority node or a master, so I’m taking over.
  • Node 02 sees that a socket was closed as it performs part of the rebalance.
10:56:03 PM Sun Feb 11, 2018
ns_1@Prd-CCNode-02.sc.com

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.7433.1>,
{{{child_interrupted,
{'EXIT',<0.6180.1>,socket_closed}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},
...

Now, admittedly this is all “What went on” and not “Why it happened”, but at first glance there might be some kind of connection issue between the two nodes.
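If you want to rule that out, it might be worth a quick check of name resolution and the cluster/data ports from each node (the nxdomain alert earlier in the log makes me slightly suspicious of DNS); for example, from node 01 (credentials are placeholders):

    nslookup Prd-CCNode-02.sc.com
    curl -u Administrator:<password> http://Prd-CCNode-02.sc.com:8091/pools/default
    # 11210 is the data (DCP) port; on Windows you can check it with PowerShell:
    #   Test-NetConnection Prd-CCNode-02.sc.com -Port 11210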

It does seem weird that it only happens after an upgrade - was there anything different about how you installed 5.0.1 as opposed to when you went back and installed 4.6.4 again?

@househippo raises a good point about Failover - generally, it’s better to rebalance a node out entirely. That said, it shouldn’t fail either way.

On that note, @househippo, just to clarify: the c:/cygwin64/home/vagrant/OTP_SR~1/lib/stdlib/src/gen_server.erl path is a product of the build environment, so you'll probably see similar traces in any Windows install. Probably no need to make the jump to DigitalOcean today…

Both servers are running Windows 2012 R2.

At the start, both node-01 and node-02 were running 4.6.4-4590 in a single cluster.
I did the following:

  • Clicked fail over on node 2, so that I ended up with a single cluster running on node 1
  • On node 2, uninstalled CB 4.6.4 and installed CB 5.0.1
  • Joined the cluster on node 1 (still running 4.6.4)
  • Clicked the rebalance button; it eventually fails - sometimes it makes it further into the process, but it never seems to complete

To get back to a stable cluster, I did the same process on node 2 but with the 4.6.4 install, and it worked fine. Note that I was originally running version 4.5.0-2601. When I started having issues, I upgraded both nodes to 4.6.4 (using this same process) to see if it would help with the upgrade to 5.0.1. The upgrade to 4.6.4 went fine, but it did not help with the issue upgrading to 5.0.1. I did run this process in a test environment and it worked fine going to 5.0.1 - but the test environment has less data.

This environment has two buckets: one with 126K documents and the other with 2.9 million documents.

I noticed that CPU usage was very high during the rebalance process. I added 2 more cores to each server (for a total of 6) and the rebalance then completed successfully. It looks like 5.x requires more cores than 4.x for the same data set.
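For anyone else who runs into this, rebalance progress can also be watched from the command line while keeping an eye on CPU (host and credentials are placeholders):

    couchbase-cli rebalance-status -c Prd-CCNode-01.sc.com:8091 -u Administrator -p <password>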