Couchbase Java Client 2.2.5 stuck in bad state

unhuman · November 28, 2016, 7:09pm

We are running Java Client v2.2.5, and our client seems to get into a state where it only returns TimeoutException. For example, when we request a document that doesn’t exist, we still get a TimeoutException. After restarting the service, the proper result (not found) is returned.

We have had some cluster issues in the past week, but this service was restarted yesterday and our issues were resolved.

Here is part of the stack trace:

exception: java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:75)
at com.couchbase.client.java.CouchbaseBucket.counter(CouchbaseBucket.java:746)
at com.couchbase.client.java.CouchbaseBucket.counter(CouchbaseBucket.java:731)
at **OUR CODE**
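
As the first line of the trace shows, the TimeoutException arrives wrapped as the cause of a RuntimeException, so a plain `catch (TimeoutException e)` would not match it. A minimal sketch of how calling code might detect this case, using only the JDK; the class and method names are hypothetical and not part of the Couchbase SDK:

```java
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of the Couchbase SDK): checks whether an
// exception, or anything in its cause chain, is a TimeoutException.
final class TimeoutCauses {
    private TimeoutCauses() {}

    static boolean isTimeout(Throwable t) {
        // The depth bound guards against a cyclic cause chain.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof TimeoutException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Same shape as the trace above: RuntimeException caused by TimeoutException.
        RuntimeException wrapped = new RuntimeException(new TimeoutException());
        System.out.println(isTimeout(wrapped));
        System.out.println(isTimeout(new IllegalStateException("unrelated")));
    }
}
```

A check like this only classifies the failure for logging or metrics; it says nothing about why the operation timed out.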

ingenthr · November 28, 2016, 10:08pm

Unfortunately, there is not much information that can be gleaned from the TimeoutException itself. The IO for that timeout exception occurs on a different thread so if it’s related to the ‘cluster issues’, we’d see more in that other thread.

We did come across a tiny bug in the last week or so where the 2.3 client would try to reconnect correctly, but didn’t correctly log info about the reconnect attempt. It would log only at DEBUG level rather than at WARNING/INFO level, so from examining the logs it looked as if the connection was being lost and not recovering.

What else is in your logs that correlates with this TimeoutException?

There’s no issue I know of that would affect this, but 2.2.5 is about 8 months old now. There are known issues with the query streaming parser that could lead to this (fixed in the 2.3 series), but that’s not the case here since it appears to only be with KV operations.

p.s.: good meeting you at Couchbase Connect '16!

unhuman · November 28, 2016, 11:11pm

Likewise! Connect was great! My 2nd year.

I didn’t find anything else in the logs… I’m not sure what that other thread is, or whether we need to adjust our logging settings to get it logged… But I don’t see anything. Any pointers on what to look for / how to expose it? Of course, maybe it’s not logging anything…

-H

subhashni · November 29, 2016, 12:13am

The log level has to be set to DEBUG to see the server reconnection logs. They are logged with thread name as "cb-io… Some information on how to set the log level: http://developer.couchbase.com/documentation/server/4.5/sdk/java/collecting-information-and-logging.html (you may be aware of this one already)
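
For reference, a minimal sketch of raising the level programmatically. It assumes the client is falling back to java.util.logging (that is, no SLF4J or Log4j binding is on the classpath) and that its loggers live under the `com.couchbase.client` name; if your app uses another logging framework, set the level in that framework's config instead. With java.util.logging, DEBUG corresponds to `Level.FINE`. The class and method names are made up for illustration:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: turns on DEBUG-equivalent (FINE) output for the
// "com.couchbase.client" logger tree when java.util.logging is in use.
final class CouchbaseDebugLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured logger could otherwise be garbage collected and lose its level.
    private static final Logger CB_LOGGER = Logger.getLogger("com.couchbase.client");

    private CouchbaseDebugLogging() {}

    static synchronized Logger enable() {
        CB_LOGGER.setLevel(Level.FINE);

        // Handlers filter by level too, so the console handler must also allow FINE.
        boolean hasConsoleHandler = false;
        for (Handler handler : CB_LOGGER.getHandlers()) {
            if (handler instanceof ConsoleHandler) {
                handler.setLevel(Level.FINE);
                hasConsoleHandler = true;
            }
        }
        if (!hasConsoleHandler) {
            ConsoleHandler handler = new ConsoleHandler();
            handler.setLevel(Level.FINE);
            CB_LOGGER.addHandler(handler);
        }

        // Avoid printing each record twice via the root logger's handlers.
        CB_LOGGER.setUseParentHandlers(false);
        return CB_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = enable();
        logger.fine("FINE (DEBUG) output is now visible for " + logger.getName());
    }
}
```

Call `CouchbaseDebugLogging.enable()` before the cluster connection is created so that the bootstrap messages are captured as well.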

ingenthr · November 29, 2016, 12:45am

When the client starts and bootstraps, it INFO logs the environment settings. Do you at least see that in the logs somewhere? We should verify your logging settings are capturing Couchbase logging output at all.

If we validate that we’re getting that logging at least and you still don’t see anything just before the TimeoutExceptions started to happen that we can correlate to, then a couple questions…

Have you restarted the app server? If so, does it get back into this state again?

If not, then don’t necessarily do it yet. Can you check the state of the connection at the client and at the Couchbase servers? In other words, does the client think the connection is ESTABLISHED while the cluster somehow doesn’t think so? If so, we’ll want to try to work out why the client isn’t recovering.
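
The state of the existing connection is best read with OS tools such as `netstat` or `ss` on both ends. As a complementary check, here is a minimal sketch that tests whether a fresh TCP connection to a node's key-value port can be opened from the app host. It assumes the default data service port 11210 and a hypothetical host name; it says nothing about the health of the connection the client already holds:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, JDK only: can a NEW TCP connection be opened?
final class KvPortProbe {
    private KvPortProbe() {}

    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // connect() throws on refusal, timeout, or an unresolvable host.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "cb-node-1.example.com" is a placeholder; 11210 is the default KV port.
        String host = args.length > 0 ? args[0] : "cb-node-1.example.com";
        System.out.println(host + ":11210 reachable = " + canConnect(host, 11210, 2000));
    }
}
```

If a new connection succeeds while the client keeps timing out, that would point toward a stale or half-open existing connection rather than a blocked port.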

So you know, we do have specific functionality in the client to detect these half-open kinds of situations and recover from them. The basic idea is that if we see consistent timeouts from a connection, we drop it and rebuild it. We also check at a regular interval to be sure the cluster topology hasn’t changed. The former relies on a certain amount of regular workload though, so it’s possible it won’t kick in if you don’t have enough workload. We test regularly to be sure this works as expected.
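
To illustrate the general idea only (this is not the SDK's actual implementation, and the class name and threshold are invented), a sketch of a consecutive-timeout tracker might look like the following. It also shows why the mechanism depends on workload: with no operations in flight, nothing is ever recorded and the threshold is never reached.

```java
// Illustrative sketch, NOT the Couchbase SDK's code: signals that a connection
// should be dropped and rebuilt after N timeouts in a row.
final class ConsecutiveTimeoutTracker {
    private final int threshold;
    private int consecutiveTimeouts;

    ConsecutiveTimeoutTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    // Any successful response proves the connection is alive, so the run resets.
    synchronized void onSuccess() {
        consecutiveTimeouts = 0;
    }

    // Returns true when the caller should drop and rebuild the connection.
    synchronized boolean onTimeout() {
        consecutiveTimeouts++;
        if (consecutiveTimeouts >= threshold) {
            consecutiveTimeouts = 0; // start fresh for the rebuilt connection
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ConsecutiveTimeoutTracker tracker = new ConsecutiveTimeoutTracker(3);
        tracker.onTimeout();
        tracker.onTimeout();
        System.out.println("reconnect after third timeout: " + tracker.onTimeout());
    }
}
```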

unhuman · November 29, 2016, 7:12pm

Yes, we have restarted, and the app is getting into this state again. We are upgrading to 2.3.5 and we’ll see how that goes. If we continue to have failures, we’ll up our logging detail. The funny bit is that this is happening in our Staging env, but not in our development envs. Will update as I get more info.

Thanks for all the pointers.

ingenthr · November 29, 2016, 7:40pm

Interesting. That makes it sound like a possible network port issue that’s transient?

One thing to possibly try is a new experimental project that @brett19 has been working on to make it easier to diagnose environmental problems. Its working name at the moment is SDK doctor. Maybe try building it and running it in that env? It may not find anything, but it’s worth a shot.

@brett19: maybe you have a place to point to a binary?

unhuman · December 2, 2016, 4:22pm

I’m wondering, is it possible that our cluster is out of connections? We have about 14k connections to our 3 node cluster (each node running all 3 concerns: data, index, query). We really don’t think this is it, but… throwing it out there.

ingenthr · December 7, 2016, 6:46am

I wouldn’t think that’d be the case. It’d certainly cause some issues, but I think you’d see that in the logs for sure. Definitely you’d see the reconnects if you’re getting logging from Couchbase properly.

ingenthr · March 13, 2017, 4:31pm

@unhuman: did you work this one out? Are things good from later versions of the client?

unhuman · March 27, 2017, 6:05pm

We aren’t 100% sure yet. We are upgrading our clients to 2.4.3.

We won’t really know if we have problems until we perform Couchbase maintenance.

Thanks - H

ingenthr · March 27, 2017, 9:53pm

No worries, thanks. We’re of course interested in how things get on when you get there.