Strange TimeoutException on some jobs (others work fine)


#1

Hi,

I’m having a hard time finding out an issue causing TimeoutException. I have a WAR file which runs several batch jobs internally. Those batch jobs are managed by a Spring Scheduler.

Java Client version: 2.1.4
Couchbase Server: 3.0.1 Community (single EC2 instance, T2.m)

This is the kind of exception i’m seeing each time the batch job is triggered:

ERROR c.g.c.s.spring.SchedulerConfig - Error in scheduled task java.lang.RuntimeException: java.util.concurrent.TimeoutException
at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:472)
at rx.observables.BlockingObservable.single(BlockingObservable.java:341)
at com.grinstone.design.crud.dao.vu.VirtualUserService.findDescriptionByProjectId(VirtualUserService.java:98)
at com.grinstone.design.cleanup.task.BlobResourcesCleanup.call(BlobResourcesCleanup.java:82)
at com.grinstone.design.cleanup.task.BlobResourcesCleanup.call(BlobResourcesCleanup.java:45)
Caused by: java.util.concurrent.TimeoutException: null
at rx.internal.operators.OperatorTimeoutBase$TimeoutSubscriber.onTimeout(OperatorTimeoutBase.java:169)
at rx.internal.operators.OperatorTimeout$1$1.call(OperatorTimeout.java:42)
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
at rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker.run(ExecutorScheduler.java:98)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

I’m also seeing:

2015/07/26 09:26:16 [ip-10-0-0-147] [taskScheduler-2] ERROR c.g.c.s.spring.SchedulerConfig - Error in scheduled task java.lang.IllegalStateException: The Content of this Observable is already released. Subscribe earlier or tune the CouchbaseEnvironment#autoreleaseAfter() setting. at com.couchbase.client.core.utils.UnicastAutoReleaseSubject$OnSubscribeAction.call(UnicastAutoReleaseSubject.java:230) at com.couchbase.client.core.utils.UnicastAutoReleaseSubject$OnSubscribeAction.call(UnicastAutoReleaseSubject.java:202) at rx.Observable$1.call(Observable.java:145) at rx.Observable$1.call(Observable.java:137) at rx.Observable$1.call(Observable.java:145) 2015/07/26 09:26:16 [ip-10-0-0-147] [taskScheduler-1] ERROR c.g.c.s.spring.SchedulerConfig - Error in scheduled task java.lang.RuntimeException: java.util.concurrent.TimeoutException at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:472) at rx.observables.BlockingObservable.firstOrDefault(BlockingObservable.java:198) at ...

This WAR is running on an EC2 t2.micro instance, with 1 cpu / 1Gb RAM. Other batch jobs inside this same war are running fine. Only some of the batch jobs are failing.

Here is a code example of what a job looks like:

public void run() {
final long now = System.currentTimeMillis();
  users
  .asyncFindAll()
  .filter(user -> now - user.getLastLoginTimestamp() >= 86400L)
  .map(User::getId)
  .filter(indices::exists)
  .toBlocking()
  .forEach(indices::close);
}

The underlying users.asyncFindAll() operation does a view query (with reduce=false):

public Observable<T> queryView(final ViewQuery query) {
return bucket.query(query)
  .flatMap(AsyncViewResult::rows)
  .flatMap(row -> row.document(RawJsonDocument.class))
  .map(deserializer)
  .timeout(timeout, unit);
}

Nothing special there, and it still doesn’t work.

What i already tried:

  • Check if the couchbase view are correctly created, happened to be the case. Tried to query views via the web UI, without any issue. Queried the database through the rest API from the EC2 batch work instance successfully,
  • Tune the io / compute pool number of threads; i thought it may be a deadlock, but the pools are already with a minimum size of 3 threads each,
  • Tune RxJava Compute pool through RxJavaPlugins hook: didn’t change anything,
  • Tried to run this war on my local machine (4 cores): could not manage to reproduce the issue with a local couchbase database,
  • Tried to understand why only specific batches are failing whereas other in the same war using the same couchbase connection don’t, could not figure out,
  • Tried to switch to blocking observable to ensure the scheduled task waits until the observable is exhausted,
  • Tried to increase Observable timeout (configured to 30sec) to 5min but didn’t change anything,
  • Tried to increase autorelease timeout without success,
  • Tried to profile the Java App for possible Deadlock, YourKit didn’t found anything nore i did,
  • Tried to run those failing batches every 2min instead of every day to see if the error is the same, and it is,
  • Tried to check jvm health with JVisualVM, sampling CPU and memory but nothing interesting showed up.

Note that:

  • the connection to the database is working fine, other batches are running fine inside the same WAR and they are connecting to the database too without any error.
  • The abstraction layer above the couchbase client is also successfully used in another war without any issue.

I’m absolutely clueless now on how to fix this issue. I would really appreciate some help to find out what i can do to debug this and find the root cause. We can even setup a meeting with a profiler with the couchbase team if it helps.

Thanks in advanced for your time.


#2

Still no clue?

We’re stuck with this, and have no idea why we have random TimeoutException. :frowning:


#3

@vanadium would it be possible for you to run the scenario with TRACE logging enabled (at least on the com.couchbase namespace)? Maybe from the network traffic we can see, maybe something is stuck or returns too slow. You can also DM Me the logs here in the forums if you don’t want them publicly visible.

Cheers,
Michael


#4

I have DM you the log files in TRACE mode :slight_smile:


#5

I finally found out how to “solve” the issue. It’s not a good solution, because it forces to have all queried items into memory, causing a possible out of mem in the future when there will be many ones.

What solved the issue:
Switching to completely synchronous code in all batch jobs.

For example, the following code was failing with a timeout:

  bucket
  .query(query)
  .flatMap(AsyncViewResult::rows)
  .flatMap(row -> row.document(RawJsonDocument.class))
  .map(deserializer)
  .timeout(timeout, unit)
  .toBlocking()
  .forEach(renew);

The following code does not fail anymore:

  bucket
  .query(query)
  .flatMap(AsyncViewResult::rows)
  .flatMap(row -> row.document(RawJsonDocument.class))
  .map(deserializer)
  .timeout(timeout, unit)
  .toList()
  .toBlocking()
  .single()
  .forEach(renew);

The “renew” instance implements Action1. It performs queries via couchbase SDK too. In the second code, the renew instance now implements java.util.function.Consumer interface.

Some batch jobs were running fine because they weren’t doing queries to couchbase inside an Observable map / filter operation emitted by the couchbase SDK itself. But, they were randomly failing with a TimeoutException when nested database calls were done in other batch jobs.

I think there is a deadlock somewhere in the Java SDK 2.1.4 causing the timeouts when doing nested queries. IO and Computation pools have a thread pool size of 3, which is the default config.

To reproduce the issue, run more threads performing nested queries simultaneously than the configured pool size and you should be able to reproduce the tlmeoutexception issue. If you need my help to reproduce the issue, you’re welcome :slight_smile:


#6

Hi,
thanks for posting. I just had this problem, too (also with 2.1.3) and found out that making all calls synchronous fixes this, just like you.

Here is a unit test which reproduces this behaviour:

@Test(expected=TimeoutException.class)
//this test shows the problem that exists when mixing async.get and get
public void testTimeoutProblems() throws Throwable {
    
    CountDownLatch latch = new CountDownLatch(1);
    Bucket bucket = /*get normal couchbase bucket here*/;
    
    bucket.insert(JsonDocument.create("top"));
    bucket.insert(JsonDocument.create("1"));
    bucket.insert(JsonDocument.create("2"));
    
    Observable.just("top")
        .flatMap(bucket.async()::get)
        .subscribe(new Subscriber<JsonDocument>() {

        @Override
        public void onCompleted() {
        }

        @Override
        public void onError(Throwable e) {
            e.printStackTrace();
        }

        @Override
        public void onNext(JsonDocument doc) {
            
            List<String> keys = Arrays.asList("3", "1", "5", "2", "4", "1");
            
            Observable
            .from(keys)
            .doOnEach(System.out::println)
            .map(bucket::get)
            .doOnEach(System.out::println)
            .subscribe(new Subscriber<JsonDocument>() {
                
                @Override
                public void onCompleted() {
                    latch.countDown();
                }
                
                @Override
                public void onError(Throwable e) {
                    e.printStackTrace();
                    
                    ThrowableWrapper.e = e;
                    latch.countDown();
                }
                
                @Override
                public void onNext(JsonDocument doc) {
                    if (doc == null) {
                        return;
                    }
                    System.out.println("Read document from multiple documents: " + doc.id());
                }
            });
        }
    });
    
    latch.await();
    
    if (ThrowableWrapper.e != null) {
        throw ThrowableWrapper.e.getCause();
    }
}

This on my machine throws a timeout for document “2”. If you only try to get “2” by itself, it works fine. We predominantly had these problems when nesting bucket.get in another bucket.get, but also in other cases. It seems random.

Changing all calls we ever make to async does not fix this, either (even though it does for this test). We now use Observable.just(key).map(bucket::get) and so on so we can easily switch to flatMap(bucket.async()::get) once the issue is fixed.

Hope I could help and please fix!


#7

Nice you could reproduce this in a JUnit. :smile:

I’m pleased to see that i’m not alone having this issue. The randomness of the issues seems to indicate there is a race condition or a deadlock inside the Java SDK. Anyway, it seems to be a concurrency issue. Hope this will be fixed soon too!


#8

@vanadium, @SebastianF this is most likely because you are blocking all the threads on your computation pool.

For example mine is set to 8 by default and your test case passes. If I set it to 3 it deadlocks and times out. You can’t use blocking calls in async callbacks on the computation scheduler. IF you want to do that, and you shouldn’t, you need to offload it to a different scheduler, like the IO one for rx.

Change the code to

Observable
.from(keys)
.doOnEach(System.out::println)
.observeOn(Schedulers.io())
… (blocking calls down here)

and it will work, I’m sure.


#9

@daschl Thanks for your reply.
The thing is, we didn’t change anything in our implementation, it just occurred around 3-4 weeks ago - maybe we were lucky before, ok.

So what is the real solution to this problem? Yours sounds like a workaround which we shouldn’t be doing.
Like I said, using only async calls everywhere does not solve the issue in general (though the unit testI provided would work).

If I want all my Couchbase bucket calls to be as concurrently executing as possible and if I want to have as many calls per time unit as possible, what is the Couchbase setting and rx call structure you’d recommend?