[couchnode] How does couchnode handle unavailability of one or more nodes of the Couchbase cluster

Hello,

We have been using couchnode for a couple of months now and we really like it, but a couple of days ago we ran into an issue while adding a new node to an existing Couchbase cluster, and we would like to better understand how couchnode handles unavailability of one or more nodes in a Couchbase cluster.

The issue happened when we added a new node to Couchbase but forgot to allow DNS resolution for this new node from the application servers using couchnode.

As soon as the rebalancing started, some keys were moved to the new Couchbase node, but the application servers were not able to get/set them because they could not resolve the IP of the new node (we identify Couchbase nodes via DNS names rather than IP addresses).

We, of course, understand that there is no magic and that couchnode/libcouchbase cannot resolve a hostname that has not been declared, but the unfortunate thing is that gets and sets didn’t complain about anything (our usage looks roughly like the sketch below):

  • The “Connection” class didn’t emit any error
  • The couchbase .get, .set, etc. didn’t call back with any error
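For reference, here is a minimal sketch of how we use the client (the hostname and key are placeholders, not our real values):

    var couchbase = require('couchbase');

    var conn = new couchbase.Connection({
      host: 'cb1.example.com:8091',  // placeholder hostname
      bucket: 'default'
    });

    conn.on('error', function (err) {
      // This listener never fired during the incident
      console.error('connection error:', err);
    });

    conn.get('some-key', function (err, result) {
      // err was never set for keys owned by the unresolvable node
      if (err) {
        return console.error('get failed:', err);
      }
      console.log('value:', result.value);
    });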

We know that operations in couchnode are queued until the connection is ready, but what does this really mean:

  • Does “connection is ready” mean that couchnode can reach http://first_available_host:8091 ?
  • Does it check for real availability of each cluster node on each bucket port?
  • Do the “connectionTimeout” and/or “operationTimeout” options passed to “new Connection” include DNS resolution?
  • Is there any limit to the number of operations couchnode can queue before throwing an error? (Can we change this queue depth? What happens when the maximum depth is reached?)
  • Is there any way to watch this queue depth?
  • If auto-failover is set to false in Couchbase, what happens when a Couchbase node goes down?
    -> Are we supposed to receive errors, or will the operations destined for this node be queued, and if so, until when?

We are using couchnode 1.2.0 and Couchbase 2.1.1 CE.

Last question: does Couchbase 2.5 change anything about this “connection is ready” behaviour?
I am referring to the “Optimized connection management” introduced in 2.5, which now connects to port 11211 instead of 8091.

Thank you for any hints you could provide

Xavier

“Connection Readiness” indicates that we have a cluster map and know where to forward keys to. It does not indicate we actually have a connection to all of the nodes.

If you didn’t get prompt callbacks for set/get operations within the declared timeout intervals, this is certainly a bug within the library and I’d recommend you file an issue in our issue tracker (http://couchbase.com/issues).

The connectionTimeout specifies the amount of time the library will wait before it decides that it cannot bootstrap from the cluster. The operationTimeout only really takes effect once the client is actually connected.

Internally, the queue happens at the couchnode level, while the timeout parameters are forwarded directly to the underlying C library. So, effectively, the timeout in force before the instance has connected is the connection timeout, not the operation timeout.
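As a minimal sketch, both timeouts are passed as options to “new Connection” (the hostname and values below are illustrative, and I am assuming millisecond units here):

    var couchbase = require('couchbase');

    var conn = new couchbase.Connection({
      host: 'cb1.example.com:8091',   // placeholder hostname
      bucket: 'default',
      connectionTimeout: 5000,  // bounds the initial bootstrap (assumed ms)
      operationTimeout: 2500    // bounds individual ops once connected (assumed ms)
    }, function (err) {
      if (err) {
        // Bootstrap did not complete within connectionTimeout
        console.error('bootstrap failed:', err);
      }
    });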

We currently use blocking DNS host resolution. In most deployments this should not matter much, but you may run into unexpected issues in cases where the DNS server is very slow to respond.

The initial operation queue is there for developer convenience, specifically to eliminate the need to wait for a “connect” event before scheduling operations with the library. However, it is still possible to bypass the operation queue entirely by not scheduling operations until the ‘connect’ event has been emitted, as in the sketch below.
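A minimal sketch of that approach (placeholder hostname and key):

    var couchbase = require('couchbase');

    var conn = new couchbase.Connection({
      host: 'cb1.example.com:8091',  // placeholder hostname
      bucket: 'default'
    });

    conn.on('connect', function () {
      // Only schedule operations once the client has the cluster map,
      // so nothing ever sits in the internal operation queue.
      conn.get('some-key', function (err, result) {
        if (err) {
          return console.error('get failed:', err);
        }
        console.log('value:', result.value);
      });
    });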

Normally, if there is an initial connect error, you should receive an operation timeout within the connectionTimeout interval. In practice this may arrive a bit later than requested (much of this is fixed in the upcoming libcouchbase release), but it should not exceed something like (connectionTimeout * numberOfNodes).
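To illustrate the failure path, here is a sketch with a deliberately unreachable host (all names and values are illustrative, and I am again assuming millisecond units for the timeout):

    var couchbase = require('couchbase');

    var conn = new couchbase.Connection({
      host: 'unreachable.example.com:8091',  // placeholder, never resolves
      bucket: 'default',
      connectionTimeout: 5000  // assumed ms
    });

    // Scheduled before 'connect', so couchnode queues it internally.
    conn.set('some-key', { answer: 42 }, function (err, result) {
      if (err) {
        // Expected: a timeout/connect error within roughly
        // connectionTimeout, worst case ~connectionTimeout * numberOfNodes.
        return console.error('set failed:', err);
      }
      console.log('stored, cas:', result.cas);
    });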

When a node goes down, couchnode/libcouchbase handles updates to the cluster topology. In the specific case of a failover, the outcome depends on how many replicas you have configured and whether they are online.

In the worst-case scenario you have no replicas, and the client will continue to report errors for any operations destined for the failed-over node until a rebalance is complete (with the rate of failures decreasing as the rebalance progresses). If you do have a replica available, the failover promotes it to a master and your application should see little to no errors.
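For that error window, a common application-level pattern (not a couchnode feature; the helper name, delays and attempt count below are illustrative) is to retry failed operations with a backoff:

    // Sketch of application-level retry with backoff for operations that
    // fail while a rebalance is in progress.
    function retryGet(conn, key, attemptsLeft, delayMs, callback) {
      conn.get(key, function (err, result) {
        if (!err) {
          return callback(null, result);
        }
        if (attemptsLeft <= 0) {
          return callback(err);
        }
        // Back off and retry; failures should become rarer as the
        // rebalance progresses.
        setTimeout(function () {
          retryGet(conn, key, attemptsLeft - 1, delayMs * 2, callback);
        }, delayMs);
      });
    }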

Finally, the 2.5 cluster version does offer improvements, but you would also need a compatible client to take advantage of them. Currently the stable release of libcouchbase (2.2.0) does not support these new features, but the developer preview version (2.3.0-DP2) does. See http://blog.couchbase.com/libcouchbase-23-dp2-enhanced-configuration-updates for more about the new features and improvements.

The primary changes in 2.5 are new features that allow the client libraries to react to failure scenarios in a more reliable and efficient manner.

Hope this answers your questions,
Mark

Hello Mark,

Thank you very much for this really detailed answer (and for your great work on the C bindings).

We will open the issue today

Xavier