One node crash causes several minutes of failure for the whole cluster

We ran some tests and also hit this critical failure of a Couchbase cluster in our production environment.

Issue summary

Deliberately shutting down one node will cause:

  1. Get ops on the cluster drop to 0
  2. Some of the records being written are lost
  3. It takes several minutes to "auto fail-over", after which the cluster recovers. (We set the auto-failover timeout to 30s, configured as shown below)
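For reference, we enabled auto-failover with the 30-second timeout via couchbase-cli, roughly as follows (host name and credentials here are illustrative):

    couchbase-cli setting-autofailover -c node1:8091 -u Administrator -p password \
        --enable-auto-failover=1 --auto-failover-timeout=30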

Test environment

  1. Couchbase 2.2
  2. 4 nodes
  3. Replicas: 2
  4. Each node: 4 CPUs, 16 GB memory
  5. C SDK version: 2.0.6

Testcase 1

Steps:

  0. 10M records are preloaded into the cluster; auto-failover is enabled
  1. Start 3 clients getting records from the cluster
  2. Shut down one node unexpectedly (simulating a crash)

Result:

  3. All gets time out; no data can be read (the read loop is sketched below)
  4. About 4 minutes later, the cluster recovers
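The read loop in our test clients is essentially the following, a minimal sketch against the 2.0.x C SDK API (the node address and key names are illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <libcouchbase/couchbase.h>

    /* One callback per get; after the node crashed, every get came back
     * with LCB_ETIMEDOUT instead of data. */
    static void get_callback(lcb_t instance, const void *cookie,
                             lcb_error_t err, const lcb_get_resp_t *resp)
    {
        if (err == LCB_SUCCESS) {
            printf("got %.*s\n", (int)resp->v.v0.nbytes,
                   (const char *)resp->v.v0.bytes);
        } else {
            fprintf(stderr, "get failed: %s\n", lcb_strerror(instance, err));
        }
        (void)cookie;
    }

    int main(void)
    {
        struct lcb_create_st options;
        lcb_t instance;
        lcb_get_cmd_t cmd;
        const lcb_get_cmd_t *cmds[1] = { &cmd };

        memset(&options, 0, sizeof(options));
        options.v.v0.host = "node1:8091";   /* illustrative node address */
        options.v.v0.bucket = "default";

        if (lcb_create(&instance, &options) != LCB_SUCCESS) {
            return 1;
        }
        lcb_connect(instance);
        lcb_wait(instance);
        lcb_set_get_callback(instance, get_callback);

        memset(&cmd, 0, sizeof(cmd));
        cmd.v.v0.key = "key-000001";        /* illustrative key */
        cmd.v.v0.nkey = strlen("key-000001");
        lcb_get(instance, NULL, 1, cmds);
        lcb_wait(instance);                 /* blocks until get_callback runs */

        lcb_destroy(instance);
        return 0;
    }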

Testcase 2

Steps:

  0. 10M records are preloaded into the cluster; auto-failover is enabled
  1. Start clients writing 500k records
  2. Shut down one node while the writes are running

Result:

  3. As soon as the node goes down, write ops drop to 0
  4. After about 1 minute the cluster recovers
  5. When checking the data:
     a) The write-timeout count is 13 (the write loop is sketched below)
     b) 5,087 of the 500k records cannot be found
     c) Some of the originally loaded 10M records are also lost
     d) The lost records cannot be found even after a manual rebalance
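The write loop referenced in 5a is roughly the following, again a sketch against the 2.0.x C SDK API (the node address, key format, and the timeout counter are illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <libcouchbase/couchbase.h>

    static int timeout_count = 0;   /* this counter ended at 13 in our run */

    /* Store callback: count timed-out writes so they can be reported later. */
    static void store_callback(lcb_t instance, const void *cookie,
                               lcb_storage_t op, lcb_error_t err,
                               const lcb_store_resp_t *resp)
    {
        if (err == LCB_ETIMEDOUT) {
            timeout_count++;
        } else if (err != LCB_SUCCESS) {
            fprintf(stderr, "store failed: %s\n", lcb_strerror(instance, err));
        }
        (void)cookie; (void)op; (void)resp;
    }

    int main(void)
    {
        struct lcb_create_st options;
        lcb_t instance;
        lcb_store_cmd_t cmd;
        const lcb_store_cmd_t *cmds[1] = { &cmd };
        char key[32];
        int i;

        memset(&options, 0, sizeof(options));
        options.v.v0.host = "node1:8091";   /* illustrative node address */
        options.v.v0.bucket = "default";
        if (lcb_create(&instance, &options) != LCB_SUCCESS) {
            return 1;
        }
        lcb_connect(instance);
        lcb_wait(instance);
        lcb_set_store_callback(instance, store_callback);

        for (i = 0; i < 500000; i++) {      /* 500k sequential keys */
            snprintf(key, sizeof(key), "key-%06d", i);
            memset(&cmd, 0, sizeof(cmd));
            cmd.v.v0.operation = LCB_SET;
            cmd.v.v0.key = key;
            cmd.v.v0.nkey = strlen(key);
            cmd.v.v0.bytes = "value";
            cmd.v.v0.nbytes = 5;
            lcb_store(instance, NULL, 1, cmds);
            lcb_wait(instance);
        }
        printf("write timeouts: %d\n", timeout_count);
        lcb_destroy(instance);
        return 0;
    }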
Attached screenshots:

Ops drop to 0 when one node crashes

Ops resume after 1 minute

One node is in failed-over state

Record count after the rebalance action

The count should be 10M + 500k. As the testcase 2 results describe, Couchbase lost some of the new data and also some of the old, previously stored data.

What clients are you using?

C SDK (client) version: 2.0.6

Hello,

The architecture of Couchbase is built to react as follows during a failure.

"One node shutdown wittingly will cause: …"
So when the node is down (and until the fail over):

  • you cannot write to this node, so a subset of your keys cannot be saved; with 4 nodes, that is roughly a quarter of the key space
  • you cannot read the active documents from this node, but you can use a replica read operation instead (see https://gist.github.com/tgrall/6540011 and the sketch after this list)
  • since you have replicas, you have not lost any data; you just lose read access (except via replica reads) and write access to that subset of the data. (Couchbase is a CP database if you look at the CAP theorem)
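Since you are on the C SDK, here is a minimal sketch of a replica read, assuming a libcouchbase 2.x build that exposes lcb_get_replica (check your 2.0.6 headers; you may need a slightly newer 2.x release). The reply arrives in the same callback you register with lcb_set_get_callback, and read_from_replica is just an illustrative helper name:

    #include <string.h>
    #include <libcouchbase/couchbase.h>

    /* Ask a replica for the key when the active node is unreachable.
     * The response is delivered to the normal get callback. */
    static lcb_error_t read_from_replica(lcb_t instance, const char *key)
    {
        lcb_get_replica_cmd_t cmd;
        const lcb_get_replica_cmd_t *cmds[1] = { &cmd };

        memset(&cmd, 0, sizeof(cmd));
        cmd.v.v0.key = key;
        cmd.v.v0.nkey = strlen(key);
        return lcb_get_replica(instance, NULL, 1, cmds);
    }

A common pattern is to try a normal lcb_get first and fall back to read_from_replica() only when the get fails with a network error or timeout.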

Once again, this is the behavior until you do the failover (or it is done automatically). The failover promotes some replicas to active documents (the failover operation itself is very fast, just changing flags in memory).
Once the failover is done, you have read and write access to all the data, but on one less node.

Usually you then do a rebalance, to be sure all the data and its replicas end up well balanced again.
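For example, with couchbase-cli a manual failover followed by a rebalance looks roughly like this (node addresses and credentials are illustrative):

    couchbase-cli failover -c node1:8091 -u Administrator -p password \
        --server-failover=node4:8091
    couchbase-cli rebalance -c node1:8091 -u Administrator -p password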

So based on this behavior, I do not see why you are seeing (or think you are seeing):

  • data loss
  • a long failure of the whole cluster (the only period when you may get errors is before the failover, and yes, when you use automatic failover that can take some time)

Regards
Tug
@tgrall