.NET Client Behavior During Node Failure

.net

#1

Hi All,

We are currently evaluating Couchbase and are doing some testing with the .NET SDK (we are a Microsoft shop). We have a test three node cluster and have created a small test program that spins up 50 threads that do 1000 upserts each. Couchbase and the .NET SDK handle this very well during normal operations as well as when we do a manual graceful failover of one node - no errors or data loss occurs and we see the expected results at the end of the run. However, we are now testing the scenario where one of the three nodes crashes (we are simulating this by simply turning off the CouchbaseServer Windows service) and the .NET SDK doesn’t seem to be behaving as we were expecting.

It may just be a misunderstanding on our part, but from looking at the SDK client configuration and taking a peek at the code, we were under the impression that if a node goes down the client will gracefully switch to another available node to perform operations. What we are seeing is that the client just gives the message “The node that the key was mapped to is either down or unreachable. The SDK will continue to try to connect every 1000 ms.” and keeps failing over and over. I pulled the SDK code from GitHub and stepped through it. I do see that Server._isDown is correctly set to true for the failed node, but I don’t see anywhere in the code where this condition would cause the operation to change affinity from the downed server to a live one.

Are we misunderstanding how the client is supposed to behave in this scenario? Here is our client config for reference:

var clientConfiguration = new ClientConfiguration
{
    Servers = new List<Uri>
    {
        new Uri("http://server1.XXXX.com:8091/pools"),
        new Uri("http://server2.XXXX.com:8091/pools"),
        new Uri("http://server3.XXXX.com:8091/pools")
    }
};

var cluster = new Cluster(clientConfiguration);
var bucket = cluster.OpenBucket("load-testing");

Any help or direction you could provide would be much appreciated.

Regards,

Craig


#2

I had a similar problem. You need to use ClusterHelper to manage connections. Starting a new connection takes time, and the helper will keep connections alive.
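For anyone following along, here is a rough sketch of what using ClusterHelper might look like with the 2.x SDK. The class name CouchbaseBootstrap and its method names are made up for illustration; the server URI and bucket name are taken from the original post. It assumes the CouchbaseNetClient package is referenced and a cluster is reachable.

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

// Hypothetical helper class; only the ClusterHelper calls come from the SDK.
public static class CouchbaseBootstrap
{
    // Call once at application start so all threads share one cluster instance.
    public static void Start()
    {
        var config = new ClientConfiguration
        {
            Servers = new List<Uri>
            {
                new Uri("http://server1.XXXX.com:8091/pools")
            }
        };
        ClusterHelper.Initialize(config);
    }

    // Worker threads reuse the cached bucket rather than opening their own.
    public static void Upsert(string key, string value)
    {
        var bucket = ClusterHelper.GetBucket("load-testing");
        var result = bucket.Upsert(key, value);
        if (!result.Success)
        {
            Console.WriteLine("Upsert failed for {0}: {1}", key, result.Status);
        }
    }

    // Call once at application shutdown to release the connections.
    public static void Stop()
    {
        ClusterHelper.Close();
    }
}
```

Note that this only changes how connections are shared and reused; it does not by itself change what happens to keys mapped to a node that is down.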

Keith


#3

It depends upon the operation and, more importantly, on whether you are using replica reads. Since keys are mapped directly to the node that they exist on, any mutation operation will fail for keys mapped to the down node. In this case, NodeUnavailable will be returned along with the message "The node that the key was mapped to is either down or unreachable." If the operation is a Get and you have replicas enabled, you can follow up with a replica read when NodeUnavailable is encountered.
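As a minimal sketch of that follow-up read, assuming the 2.x SDK and a bucket with at least one replica configured (the class and method names ReplicaReadExample and GetWithReplicaFallback are hypothetical):

```csharp
using Couchbase;
using Couchbase.Core;
using Couchbase.IO;

// Hypothetical helper; not part of the SDK.
public static class ReplicaReadExample
{
    // Tries the active node first and falls back to a replica read
    // only when the SDK reports that the key's node is unavailable.
    public static IOperationResult<T> GetWithReplicaFallback<T>(IBucket bucket, string key)
    {
        var result = bucket.Get<T>(key);
        if (!result.Success && result.Status == ResponseStatus.NodeUnavailable)
        {
            // The replica may lag behind the active copy, so the value can be stale.
            return bucket.GetFromReplica<T>(key);
        }
        return result;
    }
}
```

The caller should still check Success on the returned result, since the replica read can fail as well.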

When the _isDown flag is set for a node, a timer will fire every 1000 ms (configurable via ClientConfiguration.NodeAvailableCheckInterval) and a NOOP will be attempted on that node on a separate thread. Once the NOOP completes successfully, the _isDown flag will be set to false and the node will come back online.
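If you want the check to run more often, something like the following should work. The value of 500 is only an example, not a recommendation, and the class name ConfigExample is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Couchbase.Configuration.Client;

// Hypothetical wrapper class for illustration.
public static class ConfigExample
{
    public static ClientConfiguration Build()
    {
        return new ClientConfiguration
        {
            Servers = new List<Uri>
            {
                new Uri("http://server1.XXXX.com:8091/pools")
            },
            // Interval in milliseconds between NOOP checks on a down node.
            // 500 is an arbitrary example value; the default mentioned above is 1000.
            NodeAvailableCheckInterval = 500
        };
    }
}
```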

-Jeff


#4

My understanding matches what @jmorris explained: if a node is down, the application must MANUALLY check the error status and make an additional call to the cluster with the Replica flag on. This check should be done on every GET, regardless of whether the node is up or not.

Can an UPDATE also be called with a Replica flag?

Itay


#5

No, since an update will be to the master first; eventually the delta will be replicated out to any nodes configured to support replicas.

-Jeff


#6

Thanks, @jmorris,

So, for example, if one node in a cluster of 6 fails, is the entire app halted until the failed node is fixed (assuming that updates are a necessary part of the app execution), with no hot replacement?


#7

@itay -

No, the node will be put into a temporary down state and NodeUnavailable will be returned for any operation on a key mapped to it; operations on keys mapped to nodes other than the down node will be processed as usual.
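One way an application might cope with writes in the meantime is a bounded retry, sketched below. This is an assumption about application-side handling, not something the SDK does for you; the names UpsertRetryExample and UpsertWithRetry and the attempt and delay parameters are hypothetical. A retry can only succeed once the node is back or has been failed over.

```csharp
using System.Threading;
using Couchbase;
using Couchbase.Core;
using Couchbase.IO;

// Hypothetical helper; not part of the SDK.
public static class UpsertRetryExample
{
    // Retries an upsert a limited number of times while the key's node is reported down.
    public static IOperationResult<T> UpsertWithRetry<T>(
        IBucket bucket, string key, T value, int maxAttempts, int delayMs)
    {
        IOperationResult<T> result = bucket.Upsert(key, value);
        for (var attempt = 1; attempt < maxAttempts; attempt++)
        {
            // Only retry the specific "node down" case; return other failures as-is.
            if (result.Success || result.Status != ResponseStatus.NodeUnavailable)
            {
                break;
            }
            Thread.Sleep(delayMs);
            result = bucket.Upsert(key, value);
        }
        return result;
    }
}
```

Keys mapped to healthy nodes never enter the retry path, so the delay only affects the share of keys owned by the down node.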

-Jeff


#8

So practically, having a multi-node cluster is important mainly for scaling but not for availability, since if 1 node fails and the app continues, data consistency will be jeopardized (due to partial writes).

Is there a plan to direct writes to replicas instead of to the failed node?


#9

Hello,
Same question from me. I saw the same behavior with .NET.
I tried to store IIS sessions in Couchbase and found that some sessions were broken when one node went down.
How quickly does a replica take over as the active copy in the cluster?


#10

You definitely have automatic high availability if you have 3 or more nodes and auto-failover enabled. Auto-failover is not on by default.