Java client throwing timeout exceptions under load

We are planning on replacing our current NoSQL servers with Couchbase. Our evaluation and testing of Couchbase Server (3.0) showed a massive improvement over our current choice, and we’re looking forward to switching over completely to Couchbase. However, we ran into some unexpected trouble on our staging servers.

We are using the Java Client 2.0.0 in our Scala Play application. We have a global Cluster and Bucket object, and we use those to get and upsert. Sometimes (not always), with the Couchbase server still doing minimal ops/sec (~30), the Java client starts throwing timeout exceptions, during which the service is unusable. This goes away after some time and seems to be related to load. Here’s what we see in our logs:

Started at Sat Nov 08 06:55:08 UTC 2014","ex":"java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:481) ~[io.reactivex.rxjava-1.0.0-rc.3.jar:1.0.0-rc.3]
        at rx.observables.BlockingObservable.singleOrDefault(BlockingObservable.java:382) ~[io.reactivex.rxjava-1.0.0-rc.3.jar:1.0.0-rc.3]
        at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:76) ~[com.couchbase.client.java-client-2.0.0.jar:2.0.0-beta-14-gbe5dc12-dirty]
        at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:71) ~[com.couchbase.client.java-client-2.0.0.jar:2.0.0-beta-14-gbe5dc12-dirty]
...
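
For context, our access layer looks roughly like this (a simplified sketch with placeholder node addresses, bucket name, and helper names, not our real code):

    import com.couchbase.client.java.{Bucket, CouchbaseCluster}
    import com.couchbase.client.java.document.JsonDocument
    import com.couchbase.client.java.document.json.JsonObject

    // One Cluster and one Bucket, opened once and shared by the whole Play app.
    object Couch {
      lazy val cluster = CouchbaseCluster.create("node1.internal", "node2.internal")
      lazy val bucket: Bucket = cluster.openBucket("our-bucket")
    }

    // Typical request path: blocking get and upsert through the shared bucket.
    object DocRepository {
      def load(id: String): Option[JsonDocument] =
        Option(Couch.bucket.get(id)) // get() returns null when the document is missing

      def save(id: String, payload: JsonObject): JsonDocument =
        Couch.bucket.upsert(JsonDocument.create(id, payload))
    }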

I’ve not seen this in the development environment. We just started testing the 2.0.1 client in dev, but since we never saw the timeout exceptions there with 2.0.0 either, dev won’t tell us whether upgrading our staging machines to 2.0.1 would fix this issue. I’m currently working on upgrading the staging servers to the new client, so I will know soon whether that helps.

So, the documentation suggests we use a global Cluster object to share connections. What about Bucket? Is opening and closing a bucket around each request a bad idea? Are there any other settings we should look at tuning (timeouts, etc.)?
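
To make the tuning part concrete, this is the kind of thing I mean; a rough sketch only, and I’m assuming the 2.0 environment builder and the per-operation timeout overloads look like this (values and names are made up):

    import java.util.concurrent.TimeUnit
    import com.couchbase.client.java.{Bucket, CouchbaseCluster}
    import com.couchbase.client.java.document.JsonDocument
    import com.couchbase.client.java.env.DefaultCouchbaseEnvironment

    object TunedCouch {
      // Assumed environment-level defaults (values in milliseconds).
      lazy val env = DefaultCouchbaseEnvironment.builder()
        .connectTimeout(10000)
        .kvTimeout(5000)
        .build()

      lazy val cluster = CouchbaseCluster.create(env, "node1.internal", "node2.internal")
      lazy val bucket: Bucket = cluster.openBucket("our-bucket")

      // Per-operation timeout overload as an alternative to the environment default.
      def loadWithTimeout(id: String): JsonDocument =
        bucket.get(id, 2500, TimeUnit.MILLISECONDS)
    }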

Any other ideas or pointers on how to reproduce this consistently (that would be a start!) or how to resolve it would be highly appreciated.

Thanks

Hi @elephas_maximus,

Thanks for asking this here; I’ll do my best to get to the bottom of it with you.

So a TimeoutException can happen for one of two reasons:

  1. For whatever reason the operation really does take too long (server issues, network issues, or client issues, including the app server and OS)
  2. There is a bug in the SDK that prevents the operation from finishing at all

With the “old” SDK I’d immediately jump to investigating 1), but since 2.0 is pretty new, there’s a chance it could still be 2) as well (though I think that’s less likely).

So, in order to track it down (and I don’t know how far you want to go to get to the bottom of it), here’s what would help:

  • When you say it goes away, can you give me a sense of what that means? How many ops/s are failing? Only gets?
  • Can you give me more info on what your cluster setup looks like (nodes, environment)?
  • Can you give me more info on what state the cluster is in when those timeouts happen (rebalance going on? steady state?)? 30 ops/s doesn’t sound like much at all, so I’m curious to see what’s going on
  • You may want to enable debug logging to dig deeper, in case something is screwed up internally in the state machines
  • Can you get me code that I can run to see what’s probably going wrong (even something as small as the sketch after this list would be a start)? If you don’t want to share it publicly you can email it to me (michael [dot] nitschinger [at] couchbase [dot] com)
  • Is it possible for you to run some comparisons against the 1.4 SDK and see if it behaves very differently?
Let’s start with that.
Cheers
Michael

Hi @daschl,

So, it looks like it was some strange combination of the 2.0.0 client and the poor timeout configuration we used when setting up the Couchbase environment. I wish I had kept better notes on what was going on.

We moved all environments over to 2.0.1 and we’ve not seen this issue again! We felt confident enough to push it out to production last night for one of the services, and it has been working great.

Thanks,

@elephas_maximus great to hear! Please feel free to reach out again if you run into any issues in production, or, if you have paid support, of course go straight through our support crew.

Cheers,
Michael