Cluster Node Connection Failure Handling


#1

When some nodes of the configured cluster fail, couchnode does not try to remove them from the client connections, and I see continuous errors even though there are still healthy nodes in my cluster.
It also seems that couchnode does not try to reconnect when a failed node is up and available again.

Are there any plans to support connection reconnect/failover in the module?

There’s a simple implementation of this scenario in the influx Node.js module here:
https://github.com/node-influx/node-influx/blob/master/lib/InfluxRequest.js


#2

Nodes are only removed from the client if they are failed over, meaning that they have either been manually removed from the cluster or been removed via auto-failover (an optional feature that lets the cluster automatically remove servers it thinks are dead). Otherwise, the client library must assume that any sort of socket or connectivity error is temporary. Note that failover is not just a colloquial term for a dead server: it corresponds to an actual Couchbase Cluster Management API call (via REST) that marks a specific server as “failed over”, because failover is a potentially destructive operation with respect to data.

If a socket connection has been broken (ECONNRESET, for example), the client will try to reconnect. It will not try to reconnect on a timeout error, because a timeout does not indicate a broken connection: the connection is still open, but for some reason the remote host is not responding, often due to resource exhaustion. In such a scenario, it is certainly not a good idea to add more TCP connections to an already-exhausted server.
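
To illustrate the distinction above between a broken socket and a timeout, here is a minimal sketch of how an application could classify errors. The helper `shouldReconnect` and the set of codes are hypothetical and not part of couchnode; the sketch assumes errors carry Node.js-style string codes such as `ECONNRESET` or `ETIMEDOUT`, which may not match what the client library actually reports.

```javascript
// Hypothetical helper (not a couchnode API). Assumes errors expose a
// Node.js-style string `code` property.
const BROKEN_SOCKET_CODES = new Set(['ECONNRESET', 'EPIPE', 'ECONNREFUSED']);

function shouldReconnect(err) {
  if (!err || typeof err.code !== 'string') {
    return false; // unknown error shape: do not open new connections
  }
  if (err.code === 'ETIMEDOUT') {
    // A timeout means the connection is still open but the host is slow;
    // adding connections would only load an exhausted server further.
    return false;
  }
  // The socket is actually broken, so a reconnect attempt makes sense.
  return BROKEN_SOCKET_CODES.has(err.code);
}
```

For example, `shouldReconnect({ code: 'ECONNRESET' })` returns `true`, while `shouldReconnect({ code: 'ETIMEDOUT' })` returns `false`.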

It is possible to determine in advance which server a given document ID (key) will be routed to, so you can write code that checks, for each key, whether it would be routed to a server which is known to be “dead” (where “dead” is something defined in your application, and is broader than Couchbase’s definition of dead, which is failed-over). I am not sure if such an API is exposed in the node.js library, but it is certainly available in the underlying C library.
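
As a rough sketch of the per-key check described above: a key is hashed to a vBucket, and a map assigns each vBucket to a server. Everything here is an assumption for illustration. The function names, the `vbucketMap` and `servers` arrays, and the `deadServers` set are hypothetical, and the plain CRC-32 modulo used below is a stand-in, not the exact hash the real client applies. In practice the map comes from the cluster configuration.

```javascript
// Standard CRC-32 (IEEE polynomial), computed bit by bit over UTF-8 bytes.
function crc32(str) {
  let crc = 0xffffffff;
  for (const byte of Buffer.from(str, 'utf8')) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Hypothetical: vbucketMap[i] is the index into `servers` that owns vBucket i.
function serverForKey(key, vbucketMap, servers) {
  const vbucket = crc32(key) % vbucketMap.length; // simplified stand-in hash
  return servers[vbucketMap[vbucket]];
}

// `deadServers` is a Set maintained by the application, using its own
// definition of "dead" (for example, repeated timeouts).
function isRoutedToDeadServer(key, vbucketMap, servers, deadServers) {
  return deadServers.has(serverForKey(key, vbucketMap, servers));
}
```

An application could call `isRoutedToDeadServer` before issuing an operation and fail fast instead of waiting for a timeout.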


#3

Maybe I was not clear enough, @mnunberg.
When one of the nodes is temporarily down (not failed over or removed) and the node.js client faces timeouts or connection resets, it could circuit-break that node and pass requests to the other nodes in the cluster. However, I see that the node.js module still sends subsequent requests to all cluster nodes, so some of them get connection errors…
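
For reference, the kind of circuit breaking described here could be sketched at the application level as below. This is not something couchnode provides; the `CircuitBreaker` class, its options, and the server labels are all hypothetical. Since post #2 notes that each key is routed to a specific server, a breaker like this would only let the application fail fast for a node it considers dead; it would not by itself redirect those keys to another node.

```javascript
// Hypothetical application-level circuit breaker, keyed by server name.
class CircuitBreaker {
  constructor({ failureThreshold = 3, cooldownMs = 5000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, useful for testing
    this.failures = new Map(); // server -> consecutive failure count
    this.openedAt = new Map(); // server -> time the circuit was opened
  }

  // True while requests to `server` should be skipped.
  isOpen(server) {
    const opened = this.openedAt.get(server);
    if (opened === undefined) return false;
    if (this.now() - opened >= this.cooldownMs) {
      // Cooldown elapsed: allow a trial request; one more failure re-opens.
      this.openedAt.delete(server);
      this.failures.set(server, this.failureThreshold - 1);
      return false;
    }
    return true;
  }

  recordFailure(server) {
    const count = (this.failures.get(server) || 0) + 1;
    this.failures.set(server, count);
    if (count >= this.failureThreshold) {
      this.openedAt.set(server, this.now());
    }
  }

  recordSuccess(server) {
    this.failures.delete(server);
    this.openedAt.delete(server);
  }
}
```

The application would call `recordFailure` on timeouts or connection resets, `recordSuccess` on completed operations, and check `isOpen` before sending a request to a given node.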