CB Server 5.1.1 rebalance has made my server unusable

So we have had 2 servers in our cluster running for over 100 days without any issues. Today I went to add a new server to the cluster.

I was able to add the server successfully, but when I then tried to rebalance, the rebalance got to about 18% and then failed with errors (which I unfortunately did not capture). One of the original servers then seemed to go into a loop of warming up, and the second server soon started doing the same thing.

In the end both servers became completely unreachable, running at 100% CPU (I couldn’t even SSH in).

Now they seem to have become available to the cluster again, but they can’t seem to finish warming up.

I am getting a bunch of errors like this one for different buckets:

Compactor for view access/_design/main (pid [{type,view}, {name, <<"access/_design/main">>}, {important,false}, {fa, {#Fun<compaction_new_daemon.25.86110551>, [<<"access">>, <<"_design/main">>, {config, {30,undefined}, {30,undefined}, undefined,false,false, {daemon_config,30,131072, 20971520}}, false, {[{type,bucket}]}]}}]) terminated unexpectedly (ignoring this): {badmatch, {error, {{case_clause, {{error, vbucket_stream_not_found}, {bufsocket, #Port<11670.12123>, <<>>}}}, [{couch_dcp_client, init, 1, [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_dcp/src/couch_dcp_client.erl"}, {line, 312}]}, {gen_server, init_it, 6, [{file, "gen_server.erl"}, {line, 304}]}, {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 239}]}]}}}

Plus others like this for “projector” and “indexer”:

Service 'projector' exited with status 134. Restarting. Messages:
github.com/couchbase/indexing/secondary/projector.(*Projector).doMutationTopic(0xc4201260a0, 0xc44cc6dc20, 0xc44cc60016, 0x0, 0x0)
    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/projector/projector.go:375 +0x32b fp=0xc42f0f6f18 sp=0xc42f0f6d98
github.com/couchbase/indexing/secondary/projector.(*Projector).handleRequest(0xc4201260a0, 0x11e0100, 0xc4265f5800, 0x16)
    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/projector/adminport.go:94 +0x562 fp=0xc42f0f6f90 sp=0xc42f0f6f18
runtime.goexit()
    /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42f0f6f98 sp=0xc42f0f6f90
created by github.com/couchbase/indexing/secondary/projector.(*Projector).mainAdminPort
    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/projector/adminport.go:70 +0x8dc
[goport(/opt/couchbase/bin/projector)] 2019/01/22 14:24:32 child process exited with status 134

What does this all mean - what is going on?

All the nodes are on the same version (5.1.1 build 5723). They won’t seem to come out of the warmup state.

HELP!

Thanks,
Scott

Hi @scott, is this Enterprise Edition or Community Edition? Can you please collect logs and share them with us? Also, what OS are your servers running, and what is the size (RAM, number of CPUs) of the servers? Were there any operations ongoing at the time of the rebalance?
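
If it helps, the usual way to gather those logs is to run cbcollect_info on each node. Below is a minimal sketch in Python that just wraps the normal invocation; it assumes the default install path /opt/couchbase/bin and needs to be run as root on each node, so adjust for your setup. The resulting zip per node is what we would need (it should be equivalent to what the UI’s “Collect Information” page produces).

```python
# Minimal sketch: run cbcollect_info on this node and write the zip to /tmp.
# Assumes the default install path /opt/couchbase/bin; run as root on each node.
import socket
import subprocess

output = "/tmp/collectinfo-%s.zip" % socket.gethostname()
subprocess.run(["/opt/couchbase/bin/cbcollect_info", output], check=True)
print("Collected logs written to", output)
```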

Hello Mihir,

It is the Community Edition on Ubuntu 14.04 with 8 GB RAM and 2 CPUs. There were operations going on at the time of the rebalance.

Here is a link to the logs from one of the servers (I have not been able to get the others). It is obviously big (~256 MB).

https://media.smallcubed.com.s3.amazonaws.com/data/collectinfo-2019-01-22T130850-ns_1%40172.31.60.0.zip

Thanks for any help.
Scott

Thanks @scott. It would be great if you could provide logs from all Couchbase nodes, especially 172.31.68.27, and also from the new node that was being added.
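
While you are gathering those, it may also be worth keeping an eye on whether the nodes ever leave warmup. A rough Python sketch is below; it polls /pools/default on any node and prints each node’s reported status (normally healthy, warmup or unhealthy). The host and credentials are placeholders for your own cluster.

```python
# Rough sketch: poll the cluster REST API and print each node's reported
# status until every node shows as healthy. Host and credentials below are
# placeholders - replace them with your own.
import base64
import json
import time
import urllib.request

CLUSTER = "http://172.31.60.0:8091"  # any node's admin port
AUTH = base64.b64encode(b"Administrator:password").decode()

def node_statuses():
    req = urllib.request.Request(
        CLUSTER + "/pools/default",
        headers={"Authorization": "Basic " + AUTH},
    )
    with urllib.request.urlopen(req) as resp:
        nodes = json.load(resp)["nodes"]
    return [(n["hostname"], n["status"], n["clusterMembership"]) for n in nodes]

while True:
    statuses = node_statuses()
    for host, status, membership in statuses:
        print(f"{host}: status={status}, membership={membership}")
    if all(s == "healthy" for _, s, _ in statuses):
        break
    time.sleep(30)
```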

Hello Mihir,

Thanks for trying to help. However, in the end the cluster became unusable and I had to recreate it and restore from backups, so I cannot get those logs any more. :frowning:

Scott

This happened to one of the clusters in our environment too. Basically, the rebalance was stuck. Apparently CB rebalance is not very robust, which is really frustrating.