Couchbase server performance degradation

The upsert operations on our dev/staging/production Couchbase servers degrade over time, and we constantly see "java.util.concurrent.TimeoutException". This looks like a serious issue. Could someone please take a look and let us know what is happening?

We are using the Couchbase Java SDK, version 2.7.9, to perform our operations.

Couchbase Server version: 6.0.2 build 2413

Please let me know if you need more logs.

Exception in thread "pool-1005-thread-42" java.lang.RuntimeException: java.util.concurrent.TimeoutException
    at rx.exceptions.Exceptions.propagate(Exceptions.java:57)
    at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:463)
    at rx.observables.BlockingObservable.single(BlockingObservable.java:340)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
    at rx.internal.operators.OnSubscribeTimeoutTimedWithFallback$TimeoutMainSubscriber.onTimeout(OnSubscribeTimeoutTimedWithFallback.java:166)
    at rx.internal.operators.OnSubscribeTimeoutTimedWithFallback$TimeoutMainSubscriber$TimeoutTask.call(OnSubscribeTimeoutTimedWithFallback.java:191)
    at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189)
    at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    ... 3 more

@varun the timeout is always the effect, never the cause. You need to share more information so we can help you triage this:

  • what has changed since you started seeing more timeouts? do they come in spikes? is there a pattern w.r.t. time/date or workload?
  • does it get better if you restart your java app?
  • can you share sdk logs?
  • can you share gc logs?
  • what is your workload, and what system are you running it on?

Thank you for looking into this, Michael! (@daschl)
Here are my answers, in order:
* Just to give you some background, here's how the application is structured:
We have 4 k8s pods, and each pod has 8 threads. Each thread does some data processing and finally performs a bulk upsert followed by a single upsert (roughly the pattern sketched after this list). Each pod has one Couchbase environment. The pattern of the timeouts looks like this: for the first 2-4 hours I do not see any issues, after which I see timeouts in spikes. The workload is constant.
* It gets better if I restart the pod.
* I can definitely share the SDK logs, but could you please let me know the process, i.e. how do I enable and collect these logs?
* I can share the GC logs too, but could you please let me know the process for that as well?
P.S. I prefer sharing the logs to a private location rather than uploading them here in the forum. Could you please give me an email address or a secure location so that I can send these logs?
* The client is running in a Debian-based Docker container in a different VPC. The Couchbase server is running in a secondary VPC; all of these are r5.4xlarge EC2 instances. The Java application communicates with the secondary VPC via VPC peering.
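
To make that concrete, here is a minimal sketch of the bulk-upsert-and-block pattern referenced above, written against the 2.7.x async API. The helper class, the per-operation 2500 ms timeout, and the document list are illustrative only, not the exact production code:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

import com.couchbase.client.java.AsyncBucket;
import com.couchbase.client.java.document.JsonDocument;

import rx.Observable;

public class BulkUpserter {

    private final AsyncBucket asyncBucket;

    public BulkUpserter(AsyncBucket asyncBucket) {
        this.asyncBucket = asyncBucket;
    }

    /**
     * Upserts a batch of documents in parallel and blocks until every
     * operation has completed. If any single operation exceeds its timeout,
     * the blocking call rethrows it as a RuntimeException wrapping a
     * TimeoutException, as in the stack trace above.
     */
    public List<JsonDocument> bulkUpsert(List<JsonDocument> docs) {
        return Observable
            .from(docs)
            .flatMap(doc -> asyncBucket
                .upsert(doc)
                .timeout(2500, TimeUnit.MILLISECONDS)) // illustrative per-op timeout
            .toList()
            .toBlocking()
            .single();
    }
}
```

The single upsert after the batch would typically go through the same async bucket, so anything that stalls the application or the SDK's event loops (long GC pauses included) would tend to show up first as spikes of exactly these timeouts.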

@daschl - any update on this?

We (CB 6.0 CE) are also seeing timeouts from time to time, and we discovered that this happens when the Java heap usage is high (close to the maximum). E.g. with an Xmx of 12GB we start to see CB timeouts when heap usage reaches around 10GB or more. So maybe it is related to GC.

We could reduce it by tweaking other parts of the application which were causing high memory usage, so we got it down to a tolerable level, although sometimes even a 240s timeout is reached. The affected queries are usually larger SELECTs which query meta.id over a set of around 300k documents. Those queries do not cause timeouts when memory usage is low (e.g. around 4GB).
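
For reference, a sketch of how such a SELECT can be given an explicit per-request timeout with the 2.x Java SDK instead of relying on the environment-wide default; the bucket name, predicate, and the 240-second value are illustrative:

```java
import java.util.concurrent.TimeUnit;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;

public class MetaIdSelect {

    /**
     * Runs a META().id SELECT with a per-request 240-second timeout.
     * Statement, bucket name, and predicate are placeholders.
     */
    public static N1qlQueryResult selectIds(Bucket bucket) {
        N1qlQuery query = N1qlQuery.simple(
                "SELECT META(b).id FROM `my-bucket` b WHERE b.type = 'order'");
        return bucket.query(query, 240, TimeUnit.SECONDS);
    }
}
```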

Hey @varun, at the moment @daschl is on vacation so I wanted to pick up the replies.

It sounds like it might be best for you to review the logs. What you'll be looking for in the GC logs is any long, stop-the-world GCs. It depends on the collector you're using, but a ParNew that runs for several hundred milliseconds or even seconds would be the kind of thing to look for.

The reason this is important is that none of the app or client threads can make progress during a stop-the-world GC. Items are processed generally in order, so if you have a few of those in a row, that might be an indication that you need more memory or you’re making too much garbage.
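
On Java 8 the straightforward route is to run with GC logging flags such as -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file> and scan for long pauses. If wiring that up in the pods takes a while, a rough in-process alternative is to poll the JVM's GC MXBeans and flag large jumps in accumulated collection time; a minimal sketch (the 500 ms threshold is arbitrary):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

public class GcPauseWatcher implements Runnable {

    // cumulative collection time (ms) per collector, as of the previous poll
    private final Map<String, Long> lastCollectionTimes = new HashMap<>();

    @Override
    public void run() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long total = gc.getCollectionTime(); // cumulative ms spent in this collector
            long delta = total - lastCollectionTimes.getOrDefault(gc.getName(), 0L);
            lastCollectionTimes.put(gc.getName(), total);
            if (delta > 500) { // arbitrary threshold for "a lot of GC since last poll"
                System.out.printf("%s spent %d ms collecting since the last check%n",
                        gc.getName(), delta);
            }
        }
    }
}
```

Scheduled every few seconds (e.g. via a ScheduledExecutorService), spikes in those deltas should roughly line up with the timeout bursts if GC pauses are indeed the cause.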

In the client logs, you’ll want to look for indications of connections being interrupted or anything happening with a lot of frequency at the WARN/ERROR level.
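
As for actually enabling the SDK logs: the 2.x Java SDK logs through SLF4J if a binding is on the classpath and otherwise falls back to java.util.logging. A minimal sketch of turning up the java.util.logging fallback (call it once at startup; FINE vs FINEST is just a question of how chatty you want it):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CouchbaseLogSetup {

    // strong reference so the JUL logger (and the level set on it) isn't GC'd
    private static final Logger COUCHBASE_LOGGER = Logger.getLogger("com.couchbase.client");

    /** Raises verbosity for the SDK's java.util.logging fallback. */
    public static void enableVerboseCouchbaseLogging() {
        COUCHBASE_LOGGER.setLevel(Level.FINE);
        // the root console handler defaults to INFO, so raise it as well
        for (Handler handler : Logger.getLogger("").getHandlers()) {
            if (handler instanceof ConsoleHandler) {
                handler.setLevel(Level.FINE);
            }
        }
    }
}
```

If you already have Logback or Log4j in the containers, it is usually easier to set the com.couchbase.client logger to DEBUG in that framework's configuration and write it to a file you can collect from the pod.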

The forums here are community support. There are a number of us from Couchbase on them, but the kind of assistance you're referring to would come with an Enterprise Subscription. If you do have one, reach out to support@couchbase.com and they can set this up for you.

From your description, I wonder if there is a slow, small leak in your code. If you’re maintaining references to something unexpectedly, it can lead to the kind of behavior you’re describing. I would consider running under something like Java Flight Recorder so you can see if there are refs unexpectedly growing over time. Before that though, just looking at a simple GC log with timestamps will probably get you pretty far.