Sporadic timeouts: opening bucket fails with Java SDK for 4.5.X

Hi!

  • 3 nodes, 2 buckets.
  • Java SDK 2.3.3/1.3.3
  • CB 4.5.1-2844 (built from source, but the error also happens on 4.5.0-GA; I’ve checked separately)

Opening a bucket sporadically fails with a timeout, and it happens often: roughly 1-2 good tries for every 2-3 bad ones. The error never occurs on 4.1.X (many, many runs with exactly the same OS/nodes/buckets configuration). The log ends with:

Sep 26 09:06:49 82.node [cb-core-3-1] com.couchbase.client.core.ResponseHandler Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@1524e7c2, bucket='links'} with a delay of 100000 MICROSECONDS
Sep 26 09:06:49 82.node [cb-core-3-1] com.couchbase.client.core.ResponseHandler Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@60196094, bucket='links'} with a delay of 100000 MICROSECONDS
Sep 26 09:06:49 82.node [cb-core-3-1] com.couchbase.client.core.ResponseHandler Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@1eb1fbb4, bucket='links'} with a delay of 100000 MICROSECONDS
Sep 26 09:06:49 82.node [cb-core-3-1] com.couchbase.client.core.ResponseHandler Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@1524e7c2, bucket='links'} with a delay of 100000 MICROSECONDS
Sep 26 09:06:49 82.node #011java.lang.RuntimeException: java.util.concurrent.TimeoutException
Sep 26 09:06:49 82.node #011at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:73)
Sep 26 09:06:49 82.node #011at com.couchbase.client.java.CouchbaseCluster.openBucket(CouchbaseCluster.java:307)
Sep 26 09:06:49 82.node #011at com.couchbase.client.java.CouchbaseCluster.openBucket(CouchbaseCluster.java:285)

A detailed log captured with Level.ALL is attached (34.4 KB).

[ UPDATE ] Downloadable link: https://aws1.discourse-cdn.com/couchbase/original/2X/5/5d0128a3ee363db68d3193ec54b28220649a93e6.zip

Could you please take a look? Is this a problem on my side or a bug (maybe a 4.5.X branch bug)?

Hi @egrep,

I do see a problem when connecting to the bucket links. It would be great if we could get a tcpdump; can you also run the test with the GA build? I created a tracking ticket here: https://issues.couchbase.com/browse/JCBC-1008. Are you building Couchbase Server from source for all the versions? Was there a successful connection to links before? I see only the retries and the timeout.

@subhashni,

Are you building the couchbase server source code for all the versions?

No, only 2844; GA was downloaded as a .deb from the Couchbase website.

Was there a successful connection to links before as I see only the retries and timeout?

No, it was an “error during application server initialization”. But as I mentioned before, after a restart there is a “chance” of a normal connection.

  1. Is this a Java SDK connection problem or a server-side connection-handling problem?
  2. Do you want me to reinstall with GA, retest, and provide new logs? If you just need a “verbal confirmation”, I can say “yes, with GA it happens even more often than with 2844”.
  3. tcpdump is not a problem, but do you really need it when you have a detailed application-level log? Maybe you, @daschl or @simonbasle could take a look at the application-level frame contents and try to determine the potential cause?

And the most impressive thing, of course, is this one:

Sep 26 09:06:39 82.node [cb-io-1-1] com.couchbase.client.core.endpoint.Endpoint [null][KeyValueEndpoint]: Endpoint connect completed, but got instructed to disconnect in the meantime.

Is this normal? According to the following it’s not:

At this stage of the connection I definitely don’t want to disconnect, but it looks like the server tells the Java SDK something like “please, disconnect”. tcpdump won’t help in this case; most likely:

  1. Either there is some kind of problem with the server’s connection state machine
  2. Or the server’s instructions are interpreted wrongly by the Java SDK
  • or maybe something I missed or misinterpreted because I lack knowledge of “how all this works together”
  • also: iptables rules are absolutely clear

@egrep I was able to reproduce the behavior. I think I know where the problem lies: it looks like you bootstrapped with the public IPs of the server nodes, while the logs show that the client and the servers reach each other via private IPs. Can you change the bootstrap list so that hostname resolution is consistent between the clients and the servers? Note that at setup time you can set the hostname a cluster node advertises to the other cluster nodes. The cluster sets things up automatically if there is a single interface, but with multiple interfaces you may need to configure something that is consistently resolvable among all nodes. This can be either a hostname or an IP address.
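
For anyone who wants to check this kind of mismatch quickly, here is a minimal sketch in plain Java (no SDK dependency). It compares the list of hosts the client bootstraps with (the strings you would pass to CouchbaseCluster.create(...)) against the hosts the cluster nodes advertise. The class BootstrapCheck, its methods, and all addresses are hypothetical and made up for illustration. How you obtain the advertised list (for example by copying it from the server list in the web console) is left out, and the port stripping assumes entries of the form host or host:port.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; NOT part of the Couchbase Java SDK. */
final class BootstrapCheck {

    /** Returns the bootstrap hosts that do not appear among the advertised hosts. */
    static Set<String> notAdvertised(List<String> bootstrapHosts, List<String> advertisedHosts) {
        Set<String> advertised = new LinkedHashSet<>();
        for (String host : advertisedHosts) {
            advertised.add(normalize(host));
        }
        Set<String> mismatched = new LinkedHashSet<>();
        for (String host : bootstrapHosts) {
            String normalized = normalize(host);
            if (!advertised.contains(normalized)) {
                mismatched.add(normalized);
            }
        }
        return mismatched;
    }

    /** Trims, lower-cases, and drops a trailing ":port" (assumed input form: host or host:port). */
    static String normalize(String host) {
        String h = host.trim().toLowerCase(Locale.ROOT);
        int colon = h.indexOf(':');
        // Strip the port only when there is exactly one colon, so IPv6 literals are left alone.
        if (colon >= 0 && colon == h.lastIndexOf(':')) {
            h = h.substring(0, colon);
        }
        return h;
    }

    public static void main(String[] args) {
        // Made-up addresses: the client bootstraps via one IP block,
        // while the nodes advertise themselves on another.
        List<String> bootstrap = Arrays.asList("192.168.1.11", "192.168.1.12", "192.168.1.13");
        List<String> advertised = Arrays.asList("10.0.0.11:8091", "10.0.0.12:8091", "10.0.0.13:8091");
        System.out.println("Bootstrap hosts not advertised by the cluster: "
                + notAdvertised(bootstrap, advertised));
    }
}
```

An empty result means every bootstrap entry matches an address the nodes advertise. Any entry that is returned is a candidate for the kind of inconsistency described above. The check is purely textual: it does not resolve DNS names, so a hostname and its IP address count as different entries.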

@subhashni,

for lack of a “cool girl is on the bridge” emoji (while @daschl and @simonbasle are somewhere far away), please take this one: :kissing_heart:
All IP addresses are, of course, private, but your idea is absolutely correct: the servers were bootstrapped using one block of IPs while the clients tried to connect via another (each node has 2 IPs, but one is a “public IP emulation”). After the correction and multiple restarts, the error appears to be gone.

Thank you!