OperationTimeoutException when reading from replicas


#1

Hi,

I’m running Couchbase 2.2.0 on a 3-node cluster with 2 replicas. My client uses Java SDK 1.2.3.
I’m testing the failover scenario by shutting down one of the nodes and letting the client read from replicas, to minimize the impact of a node failure.
My code is as follows:

String l_value;
try {
    l_value = (String) couchbaseClient().get(_key);
} catch (CancellationException e) {
    // The read from the active node was cancelled; fall back to a replica.
    l_value = (String) couchbaseClient().getFromReplica(_key);
}

Unfortunately, about 20 out of every 100 replica reads fail with the following exception:

net.spy.memcached.OperationTimeoutException: Timeout waiting for value
at com.couchbase.client.CouchbaseClient.getFromReplica(CouchbaseClient.java:835) ~[couchbase-client-1.2.3.jar:1.2.3]
at com.couchbase.client.CouchbaseClient.getFromReplica(CouchbaseClient.java:822) ~[couchbase-client-1.2.3.jar:1.2.3]

I was wondering if there is anything that can be done to mitigate the issue. Any ideas would be greatly appreciated.

Thank you very much


#2

Do you have a custom timeout configured?

Try setting a higher timeout and see what happens. If you have a flaky network, you may be seeing latency spikes that exceed your configured (or the default) timeout.


#3

Hi @daschl, I am getting the same error for Couchbase queries. How do I check for the custom timeout that you mentioned?
I use a ForkJoinPool task executor in Java to execute tasks, and these tasks fire requests to Couchbase. My concern is that it spawns many threads at a time. (A ForkJoinPool is constructed with a given target parallelism level, by default equal to the number of available processors, which is 16 in my case, and the maximum number of threads is limited to 32767.)

I think this many requests may be too much for Couchbase to handle (2 replicas with 3 nodes each), which would explain the timeout exceptions.

I’m pasting the logs from my Tomcat below. Please advise:

2015-03-10 11:58:32 ERROR EmailHistoryDAOImpl:82 [ForkJoinPool-1-worker-3677] - Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { couch-03.cen-01.traveljigsaw.com/10.184.221.2:11210 active: true, authed: true, last read: 120 ms ago couch-05.cen-01.traveljigsaw.com/10.184.221.3:11210 active: true, authed: true, last read: 233 ms ago couch-01.cen-01.traveljigsaw.com/10.184.221.1:11210 active: true, authed: true, last read: 233 ms ago }
net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { couch-03.cen-01.traveljigsaw.com/10.184.221.2:11210 active: true, authed: true, last read: 120 ms ago couch-05.cen-01.traveljigsaw.com/10.184.221.3:11210 active: true, authed: true, last read: 233 ms ago couch-01.cen-01.traveljigsaw.com/10.184.221.1:11210 active: true, authed: true, last read: 233 ms ago }
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1240)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1257)
at com.rentalcars.email.scheduler.nosql.dao.impl.EmailHistoryDAOImpl.exists(EmailHistoryDAOImpl.java:49)
at com.rentalcars.email.queue.processor.helper.MailExclusionTask.call(MailExclusionTask.java:43)
at com.rentalcars.email.queue.processor.helper.MailExclusionTask.call(MailExclusionTask.java:14)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334)
at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604)
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784)
at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398)
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: couch-03.cen-01.traveljigsaw.com/10.184.221.2:11210
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:167)
at net.spy.memcached.internal.GetFuture.get(GetFuture.java:69)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1230)
… 12 more


#4

On the getFromReplica you have an overload where you can pass in a custom timeout (higher than the default one). Is that what you are looking for?


#5

Well, not really. We don’t use getFromReplica anywhere, if I’m not mistaken. We use the Couchbase client and open a connection to fetch documents.


#6

Ah, I see. I mentioned it because you asked on a thread about replica reads. The regular get() methods also provide additional overloads for timeouts.
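
A minimal sketch of the idea, using only java.util.concurrent so that it runs without the SDK: wait for the primary read with an explicit timeout and fall back to a second lookup if it does not finish in time. The class name, getWithTimeout, and the two stand-in lookups are hypothetical. In real code they would wrap your get() and getFromReplica() calls, and the exact timeout overloads available depend on your SDK version.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedGetSketch {

    // Waits up to timeoutMs for the primary lookup. On timeout or failure,
    // cancels it and returns the result of the fallback lookup instead.
    static <T> T getWithTimeout(ExecutorService pool, Callable<T> primary,
                                Callable<T> fallback, long timeoutMs) throws Exception {
        Future<T> future = pool.submit(primary);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true); // stop waiting on the slow or failed call
            return fallback.call();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // Hypothetical stand-ins for an active read and a replica read.
            Callable<String> slowActive = () -> { Thread.sleep(2000); return "active"; };
            Callable<String> replica = () -> "replica";
            System.out.println(getWithTimeout(pool, slowActive, replica, 100));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Running main prints replica, because the simulated active read takes 2 seconds and the wait is capped at 100 ms.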


#7

Thanks for the reply, @daschl. Just to make sure: does the stack trace I posted earlier confirm that the large number of threads is the cause of these intermittent failures? I ask because the error does not occur all the time, only occasionally during the day, and then continuously for 2 hours or so. It may be due to load, but I’m not sure. I don’t know how to address this issue. Please let me know if you have come across something like this.


#8

Well, timeouts like this are very hard to diagnose without more information. They could also be caused by network latency issues, garbage collection in your JVM, and so on.

Set the timeout to a threshold that fits the SLAs of your application. If you still see timeouts, investigate thread pool sizes, resource utilization, heap usage and GC pauses, network latencies and, last but not least, server timings.
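
As a starting point for that investigation, here is a rough sketch that wraps a call and logs how long it took, how much GC time the JVM accumulated while it ran, and how many threads were live. It uses only standard JDK management beans. The TimeoutDiagnostics class and the timed helper are hypothetical names, and the sleeping lambda stands in for a real Couchbase read.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Callable;

class TimeoutDiagnostics {

    // Total time in ms the JVM has spent in GC so far, summed over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Runs the call and logs wall-clock time, JVM-wide GC time accumulated
    // during the call, and the current number of live threads.
    static <T> T timed(String label, Callable<T> call) throws Exception {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        try {
            return call.call();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            long gcMs = totalGcMillis() - gcBefore;
            int threads = ManagementFactory.getThreadMXBean().getThreadCount();
            System.out.println(label + ": " + elapsedMs + " ms elapsed, "
                    + gcMs + " ms GC, " + threads + " live threads");
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a document read.
        String value = timed("lookup", () -> { Thread.sleep(50); return "value"; });
        System.out.println(value);
    }
}
```

If slow calls line up with large GC deltas or a sharply rising thread count, that points at the JVM or the thread pool rather than the network or the server.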