Problem with timeouts, eventually

So I’m setting up a connection to Couchbase and pulling data as needed. This works fine for several hours. Then it starts getting timeouts, a few at first but then more and more, until I have a rash of them and am completely unable to fetch data from Couchbase. The relevant code to pull the data from Couchbase is included below; the 5 second timeout gets triggered over and over again. :disappointed:

The Couchbase cluster doesn’t seem to be under load, and it can dispatch 99% of queries in under 10 milliseconds, so the 5 second timeout should be plenty. Instead it collapses, as timeouts seem to breed more timeouts. :sob:

```java
private Observable<Entry<V>> bulkGet(final Iterable<? extends String> ids) {
    return rx.Observable
            .from(ids)
            .flatMap(
                this::fetchFromCouchbase,
                this::logError,
                this::logCompletion,
                100
            )
            .timeout(5000, MILLISECONDS)
            .filter(v -> v.data != null)
            .onBackpressureBuffer(1000, () -> logError("transform"), BackpressureOverflow.ON_OVERFLOW_DROP_LATEST);
}

private Observable<Entry<V>> fetchFromCouchbase(String id) {
    if (id == null) {
        return Observable.<Entry<V>>empty().onBackpressureDrop();
    }
    return couchbaseBucket.async().get(id, LegacyDocument.class)
            .cacheWithInitialCapacity(1)
            .singleOrDefault(null)
            .map((Func1<Document, Entry<V>>) val -> transform(id, val));
}
```

To give some context here: on slow machines the timeout death spiral is far more common; on fast machines it is rare. Also, it tends to work fine early on when the load is heaviest because nothing is cached, but when traffic begins to drop it gets more timeouts that seem to spawn timeouts in turn. It leaves me completely baffled.

If it was having trouble under heavy load that would be one thing, but the trouble appears well after peak load. And even while some things are timing out, other things still get fetched blazingly fast.

Hi @Dirk_walter
That is indeed strange, especially after peak load…

One thing that is probably not related, but I was wondering: in the fetchFromCouchbase method, any particular reason why you used cacheWithInitialCapacity?

Since it doesn’t sound like something that you can reproduce quickly and at will, I guess it’ll be difficult to obtain a TRACE-level log of operations in flight when the timeouts occur, but if by chance you can get one, that would greatly help in figuring out the issue.
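In case it helps with capturing such a log, here is a minimal sketch of raising the SDK's log verbosity. It assumes the 2.x Java client is logging through `java.util.logging` (which I believe is its fallback when no SLF4J or Log4j binding is on the classpath) and that its loggers live under the `com.couchbase.client` name. JUL has no TRACE level, so `FINEST` is used as the closest equivalent. The class and method names are made up for illustration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of the Couchbase SDK.
final class CouchbaseTraceLogging {
    // Keep a strong reference: JUL holds loggers weakly, so a level set on an
    // otherwise unreferenced logger can be lost after garbage collection.
    private static final Logger SDK_LOGGER = Logger.getLogger("com.couchbase.client");

    static Logger enable() {
        // Assumption: the SDK's loggers are children of "com.couchbase.client".
        SDK_LOGGER.setLevel(Level.FINEST);
        // Handlers filter as well; the default ConsoleHandler only passes INFO and above.
        for (Handler handler : Logger.getLogger("").getHandlers()) {
            if (handler instanceof ConsoleHandler) {
                handler.setLevel(Level.FINEST);
            }
        }
        return SDK_LOGGER;
    }
}
```

Calling `CouchbaseTraceLogging.enable()` once at startup, before the cluster connection is opened, should be enough. Expect a lot of output at this level, so it is probably only practical on a test machine.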

The cache with initial capacity is used to hold the result. It’s probably no longer needed, since it was used to work around an issue with a recent release of RxJava, but we upgraded to a bugfixed version once it was available. What it is supposed to do is hold the result from the Couchbase observable and pass it on, which is important for dealing with backpressure in the observables and if you have multiple subscribers. Since one key in Couchbase can only return one result, a capacity of one is sufficient.
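To illustrate the idea of "hold the one result and hand it to every subscriber", here is a rough plain-Java sketch. It is not how RxJava implements `cache()`; it only mimics the behaviour described above using `CompletableFuture`, and `SingleResultCache` is a hypothetical name.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Runs the fetch at most once and hands the same result to every caller. */
final class SingleResultCache<T> {
    private final Supplier<CompletableFuture<T>> fetch;
    private CompletableFuture<T> cached; // guarded by "this"

    SingleResultCache(Supplier<CompletableFuture<T>> fetch) {
        this.fetch = fetch;
    }

    synchronized CompletableFuture<T> get() {
        if (cached == null) {
            cached = fetch.get(); // only the first caller triggers the fetch
        }
        return cached; // later callers share the same (possibly still pending) result
    }
}
```

Every caller of `get()` sees the same future, so a second subscriber does not cause a second round trip for the same key.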

I have been trying to get logs from the timeouts but it’s indeed difficult.

Using RxJava 1.1.5 and Couchbase client 2.2.5. Our Couchbase is 3.0.1.

Still investigating this issue, though my datacenter caught fire and was flooded in two unrelated incidents.

I managed to replicate the issue in a controlled environment and what is happening is weird… there are still requests being sent and answers received from Couchbase, it just seems the observables don’t fire their onNext. I’m going to try and trace it better to figure out where it is going wrong.
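One simple way to trace this kind of symptom is to count how many requests were issued against how many results were actually delivered to the caller. Below is a minimal plain-Java sketch of that idea using `CompletableFuture`; `InFlightTracker` is a hypothetical helper, not something from the SDK. In the RxJava pipeline above, the equivalent hooks would presumably be `doOnSubscribe` and `doOnNext`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

/** Counts requests issued versus results delivered (value or error). */
final class InFlightTracker {
    private final AtomicLong issued = new AtomicLong();
    private final AtomicLong delivered = new AtomicLong();

    <T> CompletableFuture<T> track(CompletableFuture<T> request) {
        issued.incrementAndGet();
        // The callback runs once the request completes, normally or exceptionally.
        return request.whenComplete((value, error) -> delivered.incrementAndGet());
    }

    /** Requests that were issued but have not delivered anything yet. */
    long pending() {
        return issued.get() - delivered.get();
    }
}
```

If `pending()` keeps growing while the server-side stats show the responses going out, the results are getting lost somewhere between the network layer and the subscriber.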

And I think I found the root cause… a JVM bug that was fixed years ago… but we are still using an ancient version of Java.

Interesting! Which version of Java have you observed this on?

1.8.0_60 I think. We upgraded machines so I can’t go back and check.

Can you please share more details on this error? We are facing something similar and would need some more insight.