Server keeps going down when rebalancing the cluster!

This is getting really frustrating…

The server keeps going down while rebalancing the cluster.
I had to reboot my AWS instance, and the cluster is now re-indexing the bucket.
I have done this about 5 times in a row.
Please, can anyone help me?

Hardware spec:
EC2 m3.xlarge
vCPU: 4
RAM: 15043 MB
Disk: 300GB

Cluster configuration:
Couchbase Server CE 3.0.1
3 nodes
1 bucket
Per Node RAM Quota: 10000 MB
Cluster RAM Quota: 30000 MB (3 nodes × 10000 MB)
9 design views

Numbers:
Item Count: 21,000,000
Indexing time: 5 hrs


Logs:

Node 'ns_1@172.31.4.173' saw that node 'ns_1@172.31.9.105' went down. Details: [{nodedown_reason,
net_tick_timeout}] ns_node_disco005 ns_1@172.31.4.173 04:07:04 - Thu Jun 11, 2015

Rebalance exited with reason {{nodedown,'ns_1@172.31.9.105'},
{gen_server,call,
[{'janitor_agent-userhabit',
'ns_1@172.31.9.105'},
{if_rebalance,<0.3289.14>,
{uninhibit_view_compaction, #Ref<17012.0.122.9318>}},
infinity]}}
ns_orchestrator002 ns_1@172.31.13.128 04:07:04 - Thu Jun 11, 2015

<0.7072.15> exited with {{nodedown,'ns_1@172.31.9.105'},
{gen_server,call,
[{'janitor_agent-userhabit','ns_1@172.31.9.105'},
{if_rebalance,<0.3289.14>,
{uninhibit_view_compaction, #Ref<17012.0.122.9318>}},
infinity]}}

Node 'ns_1@172.31.13.128' saw that node 'ns_1@172.31.9.105' went down. Details: [{nodedown_reason,
net_tick_timeout}]


Some documents suggest turning off auto-compaction while rebalancing, so I tried that, but it didn't help. A server still goes down.
While rebalancing, RAM usage goes over 95%. Is this even normal?

I haven't been able to sleep for the last 2 days because of this.
Please, someone help me…

I'm seeing signs that the cluster is undersized in terms of CPU resources. We generally recommend 4 CPU cores for a basic Couchbase Server node,
+1 additional core for each design document,
+1 additional core for each XDCR relationship.
If those 9 views are in 9 separate design documents and you have no XDCR, that works out to roughly 4 + 9 = 13 cores per node, well above the 4 vCPUs you have.

If you want to speed up rebalance with the current hardware, you could try one of the following options and see if it helps:

  • Disable index-aware rebalance, or
  • Purge the view indexes for now and let rebalance complete. Once rebalance is done you can add the view indexes back (a sketch of doing this over the views REST API follows this list).
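
For the second option, here is a minimal sketch of saving, deleting, and later re-creating a design document through the views REST API on port 8092. The bucket name userhabit is taken from your logs, and by_user is just a placeholder design document name, so substitute your own:

# Save a copy of the design document before deleting it
curl -u Administrator:password http://<node-ip>:8092/userhabit/_design/by_user > by_user.json

# Delete the design document so rebalance no longer has to maintain its index
curl -u Administrator:password -X DELETE http://<node-ip>:8092/userhabit/_design/by_user

# After rebalance completes, re-create it from the saved copy
curl -u Administrator:password -X PUT -H 'Content-Type: application/json' -d @by_user.json http://<node-ip>:8092/userhabit/_design/by_user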

In the long term, you should add more CPU capacity to your nodes.

I can't say a lot based on the error message you posted, but I would recommend making sure your resident ratio is above 15% and that your disk I/O capacity is sufficient.
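
If it helps, here is a rough sketch of checking the active resident ratio on a node with the bundled cbstats tool (userhabit is assumed to be your bucket name, and the exact stat name can vary slightly between versions):

# Print memory-residency stats for the bucket; look for vb_active_perc_mem_resident
/opt/couchbase/bin/cbstats localhost:11210 -b userhabit all | grep -i resident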

How do I disable index-aware rebalance? I can't find a doc on that :frowning:

curl -v -u Administrator:password -X POST http://<cluster-ip>:8091/internalSettings -d indexAwareRebalanceDisabled=true

Documented here: http://docs.couchbase.com/couchbase-manual-2.0/#disabling-consistent-query-results-on-rebalance. Also, it mentions the implications of disabling this flag - so you should make sure that it’s acceptable for your application.
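
Once the rebalance has finished, the same endpoint should accept the flag being set back (a sketch, assuming the setting round-trips the same way):

curl -v -u Administrator:password -X POST http://<cluster-ip>:8091/internalSettings -d indexAwareRebalanceDisabled=false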


Thank you very much. I will try this.

I have solved the above problem by doing the following.

  • Purge the view indexes for now and let rebalance complete. Once rebalance is done you can add view indexes back.

Now I have to do the same job again: I need to scale out my cluster, and this time I want to try disabling index-aware rebalance.
But before I do, I have a couple of questions.

  1. Does disabling index-aware rebalance mean that indexing is not performed while rebalancing? (I have read the document, but I don't quite understand what actually happens. :frowning: ) If so, when does indexing resume?

  2. I am planning to add a higher-spec server (8 vCPUs, 30 GB RAM) to the cluster of 3 nodes (4 vCPUs, 30 GB RAM each). Would there be any side effects if I add a server with a different spec?

  3. Overall, if I perform this, is there any chance that a server goes down? I am very afraid of performing a rebalance because of my past experience. :frowning:

Thank you for your time.

Hi @Dynamicscope,

  • With index-aware rebalance turned off, rebalance is less expensive when views are present. View indexing will resume automatically without issue, but query results may not be 100% consistent during rebalance.
  • You can upgrade your cluster HW; however, the assumption is that eventually all nodes will get to the same HW, and at that point you will be clear to increase things like the RAM quota.
  • Rebalance should not cause nodes to go down. Rebalance may fail or time out if the cluster is under a lot of pressure, but it is a retryable operation (a CLI sketch follows this reply). Obviously, having a backup cluster via XDCR or taking regular backups is always recommended for full protection.

Thanks,
-cihan
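
For reference, a failed or timed-out rebalance can simply be started again; here is a minimal sketch with the bundled CLI (host and credentials are placeholders):

# Retry the rebalance across the nodes already in the cluster
/opt/couchbase/bin/couchbase-cli rebalance -c <cluster-ip>:8091 -u Administrator -p password

# Check whether a rebalance is currently running
/opt/couchbase/bin/couchbase-cli rebalance-status -c <cluster-ip>:8091 -u Administrator -p password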

Hi @cihangirb,

Of course, I will eventually move everything to the same HW. But I am curious about what happens while the HW differs between nodes. Would the cluster use the minimum HW of all the nodes?

Thanks,

Nodes are fairly independent in their resource usage (CPU, I/O, network), so with different HW you may see variance in throughput and latency per node if the resources are drastically different. There are also cluster-wide configurations, such as the RAM quota, that are inherited by every node. If a node's HW cannot operate effectively under an inherited setting, you may see issues (errors, etc.) on that node. The RAM quota can be checked and changed over the REST API, as sketched below.
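
For completeness, a sketch of inspecting and changing the cluster-wide RAM quota over the REST API (host and value are placeholders; the quota applies per node, so keep it within what your smallest node can handle):

# Show current cluster settings, including memoryQuota (MB per node)
curl -u Administrator:password http://<cluster-ip>:8091/pools/default

# Set the per-node RAM quota for the whole cluster, e.g. 10000 MB
curl -u Administrator:password -X POST http://<cluster-ip>:8091/pools/default -d memoryQuota=10000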
