Couchbase client randomly times out during document upsert

I am experiencing strange, seemingly random exceptions on my instance of Couchbase Server 3.0.1 Community Edition (build-1444). Sometimes a TimeoutException occurs when my app tries to save a document.

I’ve tried increasing the kvTimeout up to 30 seconds, which had no effect. What could be the problem?

2016-10-05 10:16:57,208 ERROR (c.p.s.c.CouchbaseConnection:109) Blad podczas zapisywania obiektu dla klucza FB;S115699-R220-L2;114287629037464
java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:75)
        at com.couchbase.client.java.CouchbaseBucket.upsert(CouchbaseBucket.java:358)
        at com.couchbase.client.java.CouchbaseBucket.upsert(CouchbaseBucket.java:353)
        at com.picadilla.services.couchbase.CouchbaseConnection.save(CouchbaseConnection.java:103)
        at com.picadilla.services.impl.tournaments.ScoreService.recordPlayerScore(ScoreService.java:80)
        at com.picadilla.services.impl.tournaments.ScoreService.set(ScoreService.java:55)
        at com.picadilla.commands.QualificationScoreCommand.execute(QualificationScoreCommand.java:117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.picadilla.services.sockets.CommandDispatcher.execute(CommandDispatcher.java:250)
        at com.picadilla.services.sockets.CommandDispatcher.execute(CommandDispatcher.java:418)
        at com.picadilla.services.sockets.CommandDispatcher.execute(CommandDispatcher.java:51)
        at com.picadilla.utils.service.PlayerLockExecutor.nextExecution(PlayerLockExecutor.java:203)
        at com.picadilla.utils.service.PlayerLockExecutor.run(PlayerLockExecutor.java:158)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: null
        ... 17 common frames omitted

My Couchbase java-client version is 2.2.4.

If you increase the log level to TRACE/FINEST, you should be able to check whether the custom timeout is honored. Timeouts are usually an indication of some other underlying cause.

I’ve set the logging level to TRACE, as you suggested. It seems that my custom timeout value is honored.

2016-10-06 08:55:28,485 INFO (c.c.c.c.l.Slf4JLogger:121) CouchbaseEnvironment: {sslEnabled=false, sslKeystoreFile='null', sslKeystorePassword='null', queryEnabled=false, queryPort=8093, bootstrapHttpEnabled=true, bootstrapCarrierEnabled=true, bootstrapHttpDirectPort=8091, bootstrapHttpSslPort=18091, bootstrapCarrierDirectPort=11210, bootstrapCarrierSslPort=11207, ioPoolSize=3, computationPoolSize=3, responseBufferSize=16384, requestBufferSize=16384, kvServiceEndpoints=1, viewServiceEndpoints=1, queryServiceEndpoints=1, searchServiceEndpoints=1, ioPool=NioEventLoopGroup, coreScheduler=CoreScheduler, eventBus=DefaultEventBus, packageNameAndVersion=couchbase-jvm-core/1.2.4 (git: 1.2.4), dcpEnabled=false, retryStrategy=BestEffort, maxRequestLifetime=75000, retryDelay=ExponentialDelay{growBy 1.0 MICROSECONDS; lower=100, upper=100000}, reconnectDelay=ExponentialDelay{growBy 1.0 MILLISECONDS; lower=32, upper=4096}, observeIntervalDelay=ExponentialDelay{growBy 1.0 MICROSECONDS; lower=10, upper=100000}, keepAliveInterval=30000, autoreleaseAfter=2000, bufferPoolingEnabled=true, tcpNodelayEnabled=true, mutationTokensEnabled=false, socketConnectTimeout=1000, dcpConnectionBufferSize=20971520, dcpConnectionBufferAckThreshold=0.2, queryTimeout=75000, viewTimeout=75000, kvTimeout=30000, connectTimeout=30000, disconnectTimeout=25000, dnsSrvEnabled=false}
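As an aside, the `reconnectDelay=ExponentialDelay{growBy 1.0 MILLISECONDS; lower=32, upper=4096}` entry in the environment dump above means reconnect attempts back off exponentially between 32 ms and ~4 s. A minimal re-implementation of that schedule, assuming the usual doubling-and-clamping formula (this is my sketch, not the SDK's actual `ExponentialDelay` class):

```java
public class ExponentialDelaySketch {
    // Hypothetical reconstruction of the schedule implied by
    // "ExponentialDelay{growBy 1.0 MILLISECONDS; lower=32, upper=4096}":
    // the raw delay doubles each attempt and is clamped to [lower, upper].
    static long delayMillis(int attempt, long growByMs, long lower, long upper) {
        long raw = growByMs * (1L << Math.min(attempt - 1, 62)); // cap shift to avoid overflow
        return Math.max(lower, Math.min(upper, raw));
    }

    public static void main(String[] args) {
        // Attempts 1..6 stay at the 32 ms floor, then 64, 128, ... up to the 4096 ms cap.
        for (int attempt = 1; attempt <= 14; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + delayMillis(attempt, 1, 32, 4096) + " ms");
        }
    }
}
```

The practical takeaway is that a dropped connection is retried quickly at first and then at most every ~4 seconds, so a single blip should not produce minutes of timeouts on its own.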

But I’ve discovered something else: the timeouts seem to occur close in time to failed attempts to fetch bucket configuration.
Another TimeoutException:

2016-10-06 12:47:06,512 ERROR (c.p.s.c.CouchbaseConnection:109) Blad podczas zapisywania obiektu dla klucza FB;S100564-R3860-L7;104590140009935
    java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:75)
        at com.couchbase.client.java.CouchbaseBucket.upsert(CouchbaseBucket.java:358)
        at com.couchbase.client.java.CouchbaseBucket.upsert(CouchbaseBucket.java:353)
        at com.picadilla.services.couchbase.CouchbaseConnection.save(CouchbaseConnection.java:103)
        at com.picadilla.services.impl.tournaments.ScoreService.recordPlayerScore(ScoreService.java:80)
        at com.picadilla.services.impl.tournaments.ScoreService.set(ScoreService.java:55)
        at com.picadilla.commands.QualificationScoreCommand.execute(QualificationScoreCommand.java:117)
        at sun.reflect.GeneratedMethodAccessor388.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.picadilla.services.sockets.CommandDispatcher.execute(CommandDispatcher.java:250)
        at com.picadilla.services.sockets.CommandDispatcher.execute(CommandDispatcher.java:418)
        at com.picadilla.services.sockets.CommandDispatcher.execute(CommandDispatcher.java:51)
        at com.picadilla.utils.service.PlayerLockExecutor.nextExecution(PlayerLockExecutor.java:203)
        at com.picadilla.utils.service.PlayerLockExecutor.run(PlayerLockExecutor.java:158)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.util.concurrent.TimeoutException: null
        ... 16 common frames omitted

And the failed attempt to get a bucket configuration at nearly the same time:

2016-10-06 12:50:02,625 TRACE (c.c.c.c.l.Slf4JLogger:71) [cb-hb-fb-test.awspicadilla.eu/10.91.133.32:11210][KeyValueEndpoint]: Started decoding of UpsertRequest{observable=rx.subjects.AsyncSubject@7885bd5b, bucket='hb'}
2016-10-06 12:50:02,625 TRACE (c.c.c.c.l.Slf4JLogger:71) [cb-hb-fb-test.awspicadilla.eu/10.91.133.32:11210][KeyValueEndpoint]: Finished decoding of UpsertRequest{observable=rx.subjects.AsyncSubject@7885bd5b, bucket='hb'}
2016-10-06 12:50:03,548 DEBUG (c.c.c.c.l.Slf4JLogger:111) Could not fetch config from bucket "hb_ranking" against "cb-hb-fb-test.awspicadilla.eu/10.91.133.32".
com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
    at com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:206)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:76)
    at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
2016-10-06 12:50:03,549 DEBUG (c.c.c.c.l.Slf4JLogger:111) Error while refreshing bucket config, ignoring.
com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
    at com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:206)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:76)
    at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
2016-10-06 12:50:03,549 DEBUG (c.c.c.c.l.Slf4JLogger:111) Could not fetch config from bucket "hb_ranking_archive" against "cb-hb-fb-test.awspicadilla.eu/10.91.133.32".
com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
    at com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:206)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:76)
    at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
2016-10-06 12:50:03,549 DEBUG (c.c.c.c.l.Slf4JLogger:111) Error while refreshing bucket config, ignoring.
com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
    at com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:206)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:76)
    at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
2016-10-06 12:50:03,549 DEBUG (c.c.c.c.l.Slf4JLogger:111) Could not fetch config from bucket "hb" against "cb-hb-fb-test.awspicadilla.eu/10.91.133.32".
com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
    at com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:206)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:76)
    at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
2016-10-06 12:50:03,549 DEBUG (c.c.c.c.l.Slf4JLogger:111) Error while refreshing bucket config, ignoring.
com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
    at com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:206)
    at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:76)
    at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

Are these the problem? And if so, what could be causing them?

It looks like one of your Couchbase nodes (10.91.133.32:11210) is not reachable from the client, possibly due to a firewall. That could be causing the timeouts.

It definitely isn’t a firewall problem: all my Couchbase nodes are AWS instances with all of the necessary ports open in the security group, and I don’t use any additional firewalls.

@roman.sorokowski can you try with 2.3.4, and if it doesn’t work, send us a TRACE log from when the error happens? That might help us pin it down.

@daschl

I’ve now been able to reproduce this reliably. It occurs 100% of the time for me if I have a breakpoint (even one whose condition evaluates to false) in a static resource. Obviously this isn’t a real-world scenario, but whatever happens causes timeouts 100% of the time, and the client does not recover. We have similar issues with the Java clients when we perform cluster maintenance. I run this in IntelliJ; I don’t know if the IDE makes a difference, but I’d hope not. Edit: it occurs in NetBeans too.

Attached is my demo project.

ClientTimeoutTroubleshooter.zip (3.4 KB)

Hopefully this helps find some stuff out.

@daschl @ingenthr Have you guys had a chance to look at this? We’ve been struggling with Couchbase cluster maintenance for quite a long time, and this is the only thing I’ve come up with so far that points to the client-instability symptoms we’ve seen. I’ve been (intermittently) attempting to step through the connection process, but I don’t really have insight into how it works or what to look for.

In the future, it might be better to create a new thread rather than pick up this old one, as there’s a lot of mixed info here…

Glad to try to help. If you have an enterprise subscription, you can also open a case with support to get assistance. There are some tools support can use to look for common problems/causes in the deployment.

All a timeout really means is that a thread managed outside the client started a timer, and after the specified time period the result hadn’t been received. There can be many causes.

A few things to check that are common causes:

  • Turn on GC logging. Do you have long GC pause times?
  • Configure the metrics collection on the Java client to be more frequent. Do you see a correlation between high latency and GC activity?
  • See if the server side stats (cbstats timings) show some long backfills. Owing to MB-10291 this can lead to congestion of a sort that would cause more timeouts. You can tune around it by increasing the KV endpoints.
  • Run SDK-doctor in the environment just to check for common configuration issues. We’ve seen, for instance, DNS requests timing out from misconfiguration cause issues.
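On the GC point above, you don’t necessarily have to restart with GC logging flags to get a first read: the JVM already exposes per-collector counts and accumulated pause time via `GarbageCollectorMXBean`, which the app can poll itself. A rough sketch (the polling strategy and any thresholds would be yours):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseProbe {
    // Sum the cumulative collection time across all collectors. Polled
    // periodically, a sudden jump in this value around a burst of
    // TimeoutExceptions points at GC pauses as the culprit.
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
        System.out.println("total GC time: " + totalGcTimeMillis() + " ms");
    }
}
```

Full GC logging is still the authoritative tool; this is just a cheap way to correlate pauses with timeouts from inside a running service.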

And some questions:

  • What timeout are you using? The default of 2.5s?
  • What client version are you using?
  • What are you trying to migrate from and to?
  • What can you tell us about the environment?
  • What kind of ‘maintenance’ are you doing?

The most common cause for timeouts tends to be overcommitment of virtualized resources. GC is another common cause.

In some cases, like a throughput-oriented workload, it can make sense to turn up the timeout you’re using. If you’re going for throughput rather than latency, you want to keep all of the buffers fairly full to maximize data movement through the various components of the system, but that means higher latencies.

Thanks. We do have Enterprise Support, so I’ll follow up there. We haven’t had many issues of late because we have not performed maintenance. It’s also hit or miss: perhaps 5-10% of our services fail and don’t recover.

The example I provided above (which requires debugging) is 100% reproducible, even if the breakpoint is conditional and set not to break. Once the failure occurs, there is absolutely no recovery. This is the same behavior we see in production; we have to bounce our failed services to get them functioning again.

I’ll run off to investigate the causes you suggest.

  1. I’ll see what I can find out about GC logging.
  2. Increasing logging in production may not be feasible. We’ve done some sampling in the past, but never came up with anything since other services failed.
  3. Server side tends to be fine. The failures / timeouts are application side. Other application instances continue to function perfectly.
  4. Will do (SDK Doctor).

Some details about our usage:

  1. Timeouts - we use, by default, 10s for connect, KV, and query timeout. We have, by default, 75s for view queries. Our developers are free to override our defaults, but I’m not aware of where they have.
  2. Client version - this behavior has affected us for a long time. We had some issues with the 1.x clients, but the later versions improved our situation greatly. We have had this issue across many versions of the 2.x client. My test, for example, fails with 2.5.1. It’s safe to assume all revisions.
  3. We most recently went from 3.1.3 to 4.5.1.
  4. This happens mostly in production, but we’ve seen similar behaviors in dev / alpha / staging.
  5. When I say “maintenance”, I mean adding / removing nodes from the cluster. Mostly, this is to upgrade Couchbase to a newer version.

Again, my primary concern is not necessarily the timeouts; it’s the fact that they do not recover. Would you please take a look at the example?

I did have a look, and I can see it with IntelliJ on my Mac as well. There are some other oddities, though: setting breakpoints on java.util.concurrent.TimeoutException doesn’t pause the JVM and drop into the debugger as expected. I’ll look further.

Filed JCBC-1155. That said, I don’t think it’s a valid test, since it messes up the debugger itself. It feels like something related to debugging during static initialization.

How does this manifest in production?

@ingenthr - Thanks for the quick reply.

This specific scenario, of course, does not manifest in production. It does, however, mimic the symptoms we see in production while performing cluster maintenance: we wind up in a state where clients begin to time out and do not recover. This then becomes an outage where we have to bounce our services, after which they function properly again.
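Until the underlying bug is found, the "bounce the service" workaround can at least be narrowed to bouncing just the client: track consecutive timeouts per client and recreate it past a threshold. This is only a sketch of the pattern; `reopenBucket` here is a stand-in for whatever your own connection bootstrap code does, not an SDK call:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TimeoutWatchdog {
    private final AtomicInteger consecutiveTimeouts = new AtomicInteger();
    private final int threshold;
    private final Runnable reopenBucket; // stand-in for your own reconnect logic

    public TimeoutWatchdog(int threshold, Runnable reopenBucket) {
        this.threshold = threshold;
        this.reopenBucket = reopenBucket;
    }

    // Call after every successful operation: any success resets the streak,
    // so only sustained failure triggers a reconnect.
    public void onSuccess() {
        consecutiveTimeouts.set(0);
    }

    // Call after every TimeoutException; returns true if a reconnect was triggered.
    public boolean onTimeout() {
        if (consecutiveTimeouts.incrementAndGet() >= threshold) {
            consecutiveTimeouts.set(0);
            reopenBucket.run(); // "bounce" just the client instead of the whole service
            return true;
        }
        return false;
    }
}
```

This doesn’t fix the non-recovery, but it automates the recovery step you’re currently doing by hand, and the trigger counts give you a log trail of exactly when the client wedged.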

I was excited when I stumbled upon this by accident, because it duplicated the symptoms we see and was easily reproducible. I have not yet been able to decipher the async code that manages connections to actually see what the cause is…