Timeout Exception

We keep receiving a TimeoutException frequently.

Here is the exception:

    java.util.concurrent.TimeoutException: null
        at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:73)
        at com.couchbase.client.java.CouchbaseBucket.query(CouchbaseBucket.java:642)

The queries we perform succeed most of the time, but every now and then one ends with a timeout.

We are using Java Client SDK 2.3.3 with Couchbase Server 4.5.

These timeouts became more frequent after we upgraded to 2.3.3.

We enabled trace-level logging, but I don’t see any logs during the 5-second window in which the exception occurs.

There is Support Request #14214 open for this issue as well.

Hi @nkapre,

Timeouts can be an effect of some other cause: network slowness, a long-running query (since you have set a non-default timeout of 5s), an SDK issue, or GC pauses. Is it possible to get the trace-level logs for the SDK? Thanks.

And note that the IO occurs on a thread separate from the one where you experience the timeout, so looking at the stack trace for the exception alone won’t tell us much, @nkapre. You’ll want to look at your app-level logs for the kinds of things that may cause a timeout.
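To illustrate that point, the blocking pattern behind `blockForSingle` can be sketched with plain JDK futures (a simplified stand-in, not the SDK’s actual code): the caller waits with a deadline while the IO happens on another thread, so the stack trace on the waiting side carries no information about what went wrong on the IO side.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a query whose response never arrives on the IO thread.
        CompletableFuture<String> pendingQuery = new CompletableFuture<>();
        try {
            // The real client waits 5s; 100ms keeps the demo fast.
            pendingQuery.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The stack trace here shows only the waiting thread, not the cause.
            System.out.println("timed out client-side");
        }
    }
}
```

The TimeoutException is raised by the caller’s own deadline, which is why the cause has to be hunted down in logs rather than in the trace.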

Were there any other changes when you upgraded to 2.3.3?

I have enabled trace-level logs on the application side. Are there any specific logs I should look for?

I do see a lot of logs coming from the Couchbase package. Here are a few of the log messages:

Connected (DEGRADED) to Node couchbase-XXX/XXX

[XXX][ConfigEndpoint]: Channel Inactive.

Our application logs are pushed to Logstash and displayed via Kibana. Let me know if there are any logger classes I can filter on to narrow down the exact logs I should be looking at.

I would recommend trying to correlate one of the timeouts with the messages around that time on com.couchbase.client.core. As @subhashni mentioned, a TimeoutException is typically an effect, not the cause. The cause may be anything from a networking issue to a slow query to a busy JVM to an oversubscribed OS virtual machine.
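If the application logs through `java.util.logging`, narrowing verbosity to that package might look like the sketch below (logback/log4j have equivalent per-logger filters; only the package name comes from this thread, the rest is a generic JUL example):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CouchbaseLogFilter {
    public static void main(String[] args) {
        // Raise verbosity only for the Couchbase core package so the
        // rest of the application stays at its normal level.
        Logger core = Logger.getLogger("com.couchbase.client.core");
        core.setLevel(Level.FINEST); // FINEST is JUL's TRACE equivalent

        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL); // the handler must also let FINEST through
        core.addHandler(handler);

        core.finest("endpoint state change"); // now visible
        System.out.println("core logger level=" + core.getLevel());
    }
}
```

The same logger name can be used as a Kibana filter term to isolate the SDK’s core messages.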

One thing you may want to consider is enabling the in-client metrics collection. It could be that most of your operations are fast, but there are some outliers, and some of those outliers may be so high that they go over the 5s you’re allowing.

For example:

        import java.util.concurrent.TimeUnit;

        import com.couchbase.client.core.metrics.DefaultLatencyMetricsCollectorConfig;
        import com.couchbase.client.core.metrics.DefaultMetricsCollectorConfig;
        import com.couchbase.client.core.metrics.LatencyMetricsCollectorConfig;
        import com.couchbase.client.core.metrics.MetricsCollectorConfig;
        import com.couchbase.client.java.Bucket;
        import com.couchbase.client.java.Cluster;
        import com.couchbase.client.java.CouchbaseCluster;
        import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

        // Emit network latency metrics every 10s and runtime metrics every 30s.
        int networkMetricsInterval = 10;
        int runtimeMetricsInterval = 30;

        // A non-positive interval disables the collector entirely.
        LatencyMetricsCollectorConfig latencyConfig = networkMetricsInterval <= 0
                ? DefaultLatencyMetricsCollectorConfig.disabled()
                : DefaultLatencyMetricsCollectorConfig
                .builder()
                .emitFrequency(networkMetricsInterval)
                .emitFrequencyUnit(TimeUnit.SECONDS)
                .build();

        MetricsCollectorConfig runtimeConfig = runtimeMetricsInterval <= 0
                ? DefaultMetricsCollectorConfig.disabled()
                : DefaultMetricsCollectorConfig.create(runtimeMetricsInterval, TimeUnit.SECONDS);

        DefaultCouchbaseEnvironment.Builder builder = DefaultCouchbaseEnvironment
                .builder()
                .runtimeMetricsCollectorConfig(runtimeConfig)
                .networkLatencyMetricsCollectorConfig(latencyConfig);

        Cluster cluster = CouchbaseCluster.create(builder.build(), "10.112.150.101");
        Bucket bucket = cluster.openBucket("travel-sample");

That’ll output lines like this every 10s:

INFO: {"10.112.150.101":{"QUERY":{"GenericQueryRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":52953,"90.0":106954,"95.0":126353,"99.0":272629,"99.9":666894},"min":37748,"max":666894,"count":241,"timeUnit":"MICROSECONDS"}}}},"BINARY":{"KeepAliveRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":448,"90.0":479,"95.0":524,"99.0":524,"99.9":524},"min":382,"max":524,"count":7,"timeUnit":"MICROSECONDS"}}},"GetBucketConfigRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":1228,"90.0":3063,"95.0":3063,"99.0":3063,"99.9":3063},"min":602,"max":3063,"count":5,"timeUnit":"MICROSECONDS"}}}},"VIEW":{"KeepAliveRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":1122,"90.0":1613,"95.0":1695,"99.0":1695,"99.9":1695},"min":737,"max":1695,"count":10,"timeUnit":"MICROSECONDS"}}}}},"10.112.150.103":{"QUERY":{"GenericQueryRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":48758,"90.0":89128,"95.0":116391,"99.0":182452,"99.9":1543503},"min":38535,"max":1543503,"count":240,"timeUnit":"MICROSECONDS"}}}},"BINARY":{"KeepAliveRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":1179,"90.0":1384,"95.0":1384,"99.0":1384,"99.9":1384},"min":391,"max":1384,"count":4,"timeUnit":"MICROSECONDS"}}},"GetBucketConfigRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":790,"90.0":1564,"95.0":1564,"99.0":1769,"99.9":1769},"min":479,"max":1769,"count":14,"timeUnit":"MICROSECONDS"}}}},"VIEW":{"KeepAliveRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":897,"90.0":5144,"95.0":7438,"99.0":7438,"99.9":7438},"min":733,"max":7438,"count":10,"timeUnit":"MICROSECONDS"}}}}},"10.112.150.102":{"QUERY":{"GenericQueryRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":53477,"90.0":103809,"95.0":121634,"99.0":173015,"99.9":3405774},"min":40108,"max":3405774,"count":241,"timeUnit":"MICROSECONDS"}}}},"BINARY":{"KeepAliveRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":518,"90.0":2572,"95.0":2572,"99.0":2572,"99.9":2572},"min":378,"max":2572,"count":5,"timeUnit":"MICROSECONDS"}}},"GetBucketConfigRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":1245,"90.0":5308,"95.0":19529,"99.0":19529,"99.9":19529},"min":456,"max":19529,"count":10,"timeUnit":"MICROSECONDS"}}}},"VIEW":{"KeepAliveRequest":{"SUCCESS":{"metrics":{"percentiles":{"50.0":1064,"90.0":2834,"95.0":3096,"99.0":3096,"99.9":3096},"min":638,"max":3096,"count":10,"timeUnit":"MICROSECONDS"}}}}},"event":{"name":"NetworkLatencyMetrics","type":"METRIC"}}

Note that in mine, some operations are very fast, but there are some outliers. If you see some climbing into the seconds at the 95th or 99th percentile, you may want to see if you can correlate them with JVM/VM activities, etc.:

"percentiles":{"50.0":52953,"90.0":106954,"95.0":126353,"99.0":272629,"99.9":666894}```

My environment would have the console showing from 50ms to 3.5s, all one query type:
<img src="//discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/couchbase/original/2X/e/eb851ef40f3575f15678d8c04200aa0c152ef603.jpg" width="690" height="376">

The RuntimeMetricsEvent can help you spot GCs, though there are better profilers for that kind of thing.
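As a cross-check outside the SDK, the JVM’s own GC counters can also be sampled with standard `java.lang.management` (not a Couchbase API); a jump in cumulative collection time around a timeout points at GC as the culprit:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSnapshot {
    public static void main(String[] args) {
        // Counts and times are cumulative since JVM start; sample before
        // and after a timeout window and diff the values.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Logging this periodically alongside the application logs makes it easy to line GC pauses up with timeout timestamps in Kibana.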