Couchbase read operation not attempting to create a new connection

I was able to reproduce the problem of couchbase SDK not reestablishing a connection. Here is how.

I created a couchbase client which connects to a bucket and try a read operation every 10 secs.

Configuration:
2 Node Couchbase cluster 2.2
1.4.9 Couchbase SDK

Test #1:
Fair close of a connection to the couchbase without killing the process.

Run the java client which gets a document every 10 sec
Find pid owned by process – sudo netstat -tanp | grep 11210
Find file descriptor for the port – lsof -np <pid> | grep <local_port>
Find the fd which is nothing but the 3rd column with u removed
Connect to process via gdb – gdp -p <pid>

gdb>call shutdown(<fd>)
gdb>cont

Everything worked fine and the sdk reconnected and continued the 10 sec loop of getting the document.

Logs>

2015-06-16 05:31:18.383 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 10 Opaque: 21
2015-06-16 05:31:18.383 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 0 Opaque: 22 Key: 00048938-46a6-4c3a-8b65-3f1034a171c0
2015-06-16 05:31:18.384 INFO com.couchbase.client.CouchbaseConnection:  Reconnecting due to exception on {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=1}
java.io.IOException: Disconnected unexpected, will reconnect.
at net.spy.memcached.MemcachedConnection.handleReadsWhenChannelEndOfStream(MemcachedConnection.java:911)
at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:823)
at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:720)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:683)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:436)
at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:325)
2015-06-16 05:31:18.385 WARN com.couchbase.client.CouchbaseConnection:  Closing, and reopening {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=1}, attempt 0.
05:31:18,386 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is java.util.concurrent.CancellationException: Cancelled
2015-06-16 05:31:20.389 INFO com.couchbase.client.CouchbaseConnection:  Reconnecting {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0}
2015-06-16 05:31:20.495 INFO net.spy.memcached.auth.AuthThread:  Authenticated to cqzsdb02.xyz.com/10.84.80.107:11210

Test #2:
Unfair close of a connection to the couchbase without killing the process.

Run the java client which gets a document every 10 sec
Find pid owned by process – sudo netstat -tanp | grep 11210
Find file descriptor for the port – lsof -np <pid> | grep <local_port>
Find the fd which is nothing but the 3rd column with u removed
Connect to process via gdb – gdp -p <pid>

gdb>call shutdown(<fd>)
gdb>cont

SDK went into an infinite loop of retries and got the document after 3hr 28min!

Logs>

    05:44:33,204 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { cqzsdb02.xyz.com/10.84.80.107:11210 active: true, authed: true, last read: 13,906 ms ago cqzsdb01.xyz.com/10.84.80.106:11210 active: true, authed: true, last read: 717 ms ago }
    05:44:45,719 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { cqzsdb02.xyz.com/10.84.80.107:11210 active: true, authed: true, last read: 26,422 ms ago cqzsdb01.xyz.com/10.84.80.106:11210 active: true, authed: true, last read: 4,221 ms ago }
    05:44:58,220 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { cqzsdb02.xyz.com/10.84.80.107:11210 active: true, authed: true, last read: 38,923 ms ago cqzsdb01.xyz.com/10.84.80.106:11210 active: true, authed: true, last read: 4,988 ms ago }

2015-06-17 02:13:18.442 WARN com.couchbase.client.CouchbaseConnection:  sun.nio.ch.SelectionKeyImpl@6fc757 exceeded continuous timeout threshold


2015-06-17 02:13:18.442 WARN com.couchbase.client.CouchbaseConnection:  Closing, and reopening {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=2996, #iq=0, topRop=null, topWop=Cmd: 10 Opaque: 51, toWrite=0, interested=4}, attempt 0.



2015-06-17 02:13:18.443 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 10 Opaque: 51
2015-06-17 02:13:18.443 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 0 Opaque: 52 Key: 00048938-46a6-4c3a-8b65-3f1034a171c0
2015-06-17 02:13:18.444 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 10 Opaque: 54
2015-06-17 02:13:18.444 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 10 Opaque: 56



2015-06-17 02:13:20.445 INFO com.couchbase.client.CouchbaseConnection:  Reconnecting {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=1, topRop=null, topWop=null, toWrite=0, interested=0}

This goes infinitely and I get the following exception after along time - exactly 3hour 28minutes . This is a very long time . After this SDK is able to reestablish the connection but how do you reduce this time to a lesser value like 2 minutes.I have not used any tuning setting from SDK documentation.

Problems:
Why is couchbase client sdk taking so long to connect back.

@arungeorge81 great to see people kicking the tires in that way. What you need to take into account for the full picture is that the client SDK has a configuration which it loads from the server and it will retry the sockets until it retrieves a new configuration noting that this node has been removed from the cluster.

So in order to reproduce this fully:

  • Was the node correctly removed from the cluster (so in the UI, failed over and subsequently removed or rebalanced out)
  • Can you share DEBUG or TRACE logs so we can see what configuration the cluster did send back to the server? Since you are using 2.2, the server will push a new configuration to the client.
  • We might end up in a pathological scenario where you are silently killing the streaming connection and the client never knows. We have heuristics in place (based on load) that proactively reestablish the configuration, but by one op/s they don’t work as good as under higher load.

So if you can shed more light on this, I’m sure we’ll track this down quickly.
Thanks,
Michael

@daschl - The node is not removed per se. The scenario we are trying is that we are killing the streaming connection without knowledge of SDK as you told, via gdb. Only gdb sends the connection closure and SDK is not aware of it. Now what is happening is that SDK is counting the read timeouts but SDK is doing that check as late as 3hr 28min . As you told the heuristics may not kick in on a very low load like 1 ops/sec. I will try the same with a high load like 10k/sec.

[Update] I tried with a load of 5k/sec and I can see the connection repair happening in 30 minutes without any SDK tuning parameter.

I have taken a look at the code of spymemcached and found that the MemcachedConnection@ checkPotentiallyTimedOutConnection indeed checks this with timeoutExceptionThreshold but why is this thread not detecting the timeouts after 1000 ops as configured which is 2.7hr(1000 * 10) instead it is doing that at 3hr 28min and discarding 5024 operations.

What is your suggested value of timeoutExceptionTheshold parameter assuming there can be max peak load of 15K and average load of 8k for a 10 node couchbase cluster.

Regards
Arun

@arungeorge81, do you have some sample code to share that shows how you reproduce the case? I’d like to run it like you do to get a better feeling on how the SDK behaves. If not, I’ll see if I can assemble you some params to try.