Frequent TimeoutException from Java SDK

sparky · January 2, 2021, 5:32am

Our environment
Couchbase Java client 2.7.6
Couchbase Server CE 6.6.0 build 7909
We have multiple clusters with 4-6 Couchbase server nodes in each of them.
Connection settings:

CouchbaseEnvironment environment = DefaultCouchbaseEnvironment
    .builder()
    .connectTimeout(20000)
    .retryStrategy(FailFastRetryStrategy.INSTANCE)
    .kvTimeout(5000)
    .autoreleaseAfter(5000)
    .queryTimeout(75000)
    .keepAliveInterval(50000)
    .keepAliveTimeout(5000)
    .maxRequestLifetime(150000)
    .build();

We decided to use FailFastRetryStrategy because requests were at times held up, which created a ripple effect in our system.

Our current problem is that our application frequently fails to insert into or read from Couchbase. This normally happens after 5-6 days. Restarting the application (with no change to the Couchbase cluster) invariably fixes the issue. The application always fails with the following error.

Failed to insert params in couchbase with error: java.lang.RuntimeException: java.util.concurrent.TimeoutException: {"b":"mybucket","s":"kv","t":5000000,"i":"0x79c97e"}
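
For anyone reading this message later: the JSON after TimeoutException is the SDK's context for the timed-out request. The sketch below assumes the 2.x convention that the "t" field is the timeout in microseconds, which is consistent with the kvTimeout(5000) set above. TimeoutContext is a hypothetical helper name, not part of the Couchbase SDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the Couchbase SDK. */
final class TimeoutContext {
    // Matches the "t" field of the timeout context, e.g. "t":5000000
    private static final Pattern TIMEOUT_MICROS =
            Pattern.compile("\"t\"\\s*:\\s*(\\d+)");

    private TimeoutContext() {}

    /**
     * Returns the timeout in milliseconds, assuming "t" is in microseconds,
     * or -1 if the message has no "t" field.
     */
    static long timeoutMillis(String message) {
        Matcher m = TIMEOUT_MICROS.matcher(message);
        return m.find() ? Long.parseLong(m.group(1)) / 1000 : -1;
    }
}
```

Under that assumption, the message above reports a 5000 ms timeout on the kv service ("s") for bucket "mybucket" ("b"), so it is the key-value timeout that fired rather than the query timeout.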

We also see a similar issue when a node fails over. A bunch of requests fail with the following error, which creates a service issue at the customer end.

in couchbase with error: com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.

Our cluster configuration is such that 3 nodes run the data, query, index and search services, 1 node runs data, query and index, and the remaining nodes (if there are more than 4) run just the data service.

We would appreciate some help to identify and resolve these two issues. Currently we are clueless as to why inserts and searches fail every few days. We suspect that the second issue is due to FailFastRetryStrategy.

graham.pople · January 4, 2021, 10:55am

Hi @sparky

For the RequestCancelledException, it usually indicates that the SDK has detected a connection to a server process has been disconnected, and it’s cancelled any requests that were in-flight to that process. So this is something you’d expect to see when a node fails over.
For the TimeoutException, this is more of a symptom than a cause. In any complex distributed system you will occasionally see timeouts, due to any number of issues, such as temporary network congestion, a transiently slow disk, a GC spike when working with the JVM, etc. All we can say from here is that we didn’t get a response from the server in time. Timeouts can be very hard to debug since they can come from so many sources (99% of the time, when investigated, the root cause is not the SDK itself). If you’re seeing one every few days I certainly wouldn’t be concerned, but it is something you’ll need to handle programmatically (or perhaps consider increasing the timeout). Bear in mind that with a timeout, the result of the operation is ambiguous: it may have succeeded or not.
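
As a minimal sketch of handling a timeout programmatically, the helper below retries an operation a bounded number of times with a doubling backoff. It uses only the JDK; TimeoutRetry, callWithRetry and isTimeout are hypothetical names, not SDK APIs. It walks the cause chain because, as in the error above, the TimeoutException arrives wrapped in a RuntimeException. Since the outcome of a timed-out operation is ambiguous, this is only safe for idempotent operations (for example a read, or an upsert of the same content).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of the Couchbase SDK. */
final class TimeoutRetry {
    private TimeoutRetry() {}

    /**
     * Runs op, retrying only when the failure is (or wraps) a TimeoutException.
     * Any other exception, or a timeout on the last attempt, is rethrown.
     * Use only for idempotent operations: a timed-out write may have succeeded.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (!isTimeout(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoff); // wait before the next attempt
                backoff *= 2;          // double the wait each time
            }
        }
    }

    /** True if t or any of its causes is a java.util.concurrent.TimeoutException. */
    static boolean isTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof TimeoutException) {
                return true;
            }
        }
        return false;
    }
}
```

A call site might look like TimeoutRetry.callWithRetry(() -> bucket.get(id), 3, 100), where bucket and id stand in for whatever the application already has. For non-idempotent writes, one option is to read the document back after a timeout and decide from its state whether to reapply the change.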

sparky · January 11, 2021, 6:03pm

Thanks for the suggestions. For the time being, we changed the retry strategy to BestEffort and increased the timeout a bit. I’m hoping the first change takes care of the node failure scenario, as the client will reattempt to establish a connection to the cluster. We’ll monitor this for the next 3 weeks and look at alternatives.

synesty · January 11, 2021, 9:55pm

We were also struggling with occasional timeouts. We finally tracked them down to full garbage collections (stop-the-world GC) that took longer than the configured Couchbase timeouts.
We are on JDK 8 and mostly had the issue with the “old” CMS GC. After switching to the “new” G1 GC and setting -XX:MaxGCPauseMillis to a value smaller than our Couchbase timeout, we have almost never seen a CB timeout again, for weeks now.
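
One rough way to check whether GC lines up with the timeouts is to read the JVM's own collector counters around a failing call. This is a sketch using the standard java.lang.management API; GcSnapshot is a hypothetical class name. Note that -XX:MaxGCPauseMillis is a target for G1 rather than a hard limit, so measuring is still worthwhile after the switch.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper for correlating GC activity with timeouts. */
final class GcSnapshot {
    private GcSnapshot() {}

    /** Approximate accumulated GC time in milliseconds across all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** One line per collector with its collection count and accumulated time. */
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Recording totalGcMillis() before a Couchbase call and again in the timeout handler gives a delta; if that delta is close to the configured timeout, a GC pause is a likely suspect. The numbers are approximate, so GC logs remain the more precise source.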

graham.pople · January 12, 2021, 3:08pm

@sparky that sounds like a plan. You might also want to consider a move to the 3.x SDK, which provides more automated retrying of more errors under the BestEffort retry strategy.
@synesty that’s very interesting, thanks for sharing.

sparky · January 12, 2021, 5:03pm

@graham.pople Does migrating to the 3.x SDK require a rewrite of code, or is it just a drop-in replacement for the 2.x jar? (I’ll also look at the documentation about this upgrade path.)
@synesty interesting observation. We noticed these errors mostly appear when GC kicks in. Thanks for pointing this out. We’ll try your suggestion.

graham.pople · January 12, 2021, 7:44pm

@sparky it does require some code modifications, but they’re pretty straightforward. The main change is to pass around a Collection rather than a Cluster object, and if you’re using the reactive API, it’s now based around Project Reactor rather than RxJava, so the primitives are a little different (Flux/Mono vs Observable/Single).

sparky · January 13, 2021, 5:25am

@graham.pople Thank you for the info. We’ll plan for this update.