Couchbase java client fails to reconnect to cluster

jzeng · June 5, 2018, 5:55pm

It seems to be a common issue that the Couchbase client fails to reconnect to the cluster after a connection timeout. See https://www.couchbase.com/forums/t/couchbase-java-client-not-reconnecting-after-connection-timeout/12583

We are experiencing the same issue. We had 20 app instances pointing to a 6-node Couchbase cluster. We shut down 1 Couchbase node and started it again after a few minutes. The KeyValueEndpoint connection was never re-established. We had to restart the apps to restore the connection.

@daschl
From checking the code, this seems to be intended behavior: the reconnect is disabled if there was a socket connection timeout!

We wonder why a socket connection timeout does not trigger a reconnect the way a refused connection does.

ingenthr · June 6, 2018, 6:09am

How did you do the shutdown of the 1 node? Note that there is a recent known issue that impacts query/fts/views where it may take until a TCP timeout in some circumstances to rebuild the connection, but this shouldn’t impact apps since the pool should grow and it will automatically resolve itself over time. JVMCBC-543 covers it.

Looking at the code you point to, indeed it warns about a closed channel, but the next request through or a retry would build a new connection (an endpoint). Correct me if I’m wrong @daschl.

Can you identify the specific version of the client/cluster you’re using and how you cause a shutdown/startup?

jzeng · June 6, 2018, 5:17pm

Thanks for the info.

Regarding the shutdown, we just ran “sudo systemctl stop couchbase-server”. To start, we ran “sudo systemctl start couchbase-server”.

It is good to know the endpoint can be rebuilt. Can you point me to the code where this happens? My understanding of the code so far is that once the endpoint enters the DISCONNECTED state, a reconnect is never triggered and dispatch always fails because of the disconnected state.
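To make that reading concrete, here is a minimal toy model of the behavior as I understand it. It is not the client's actual code: `ToyEndpoint`, `Outcome`, `onConnectResult`, `willRetry` and `dispatch` are all hypothetical names made up for this sketch, and the rule "timeout disables reconnect, refused keeps retrying" is only my interpretation, which may well be wrong.

```java
/** Toy model of the endpoint lifecycle as described above; NOT the real client code. */
final class ToyEndpoint {
    /** Hypothetical result of a single connect attempt. */
    enum Outcome { SUCCESS, REFUSED, TIMEOUT }

    enum State { DISCONNECTED, CONNECTING, CONNECTED }

    private State state = State.CONNECTING;
    private boolean reconnectEnabled = true;

    /** Feeds the result of one connect attempt into the model. */
    void onConnectResult(Outcome outcome) {
        if (!reconnectEnabled) {
            return; // assumption: once disabled, no further attempts are made
        }
        switch (outcome) {
            case SUCCESS:
                state = State.CONNECTED;
                break;
            case REFUSED:
                state = State.CONNECTING; // assumption: refused keeps the retry loop alive
                break;
            case TIMEOUT:
                state = State.DISCONNECTED; // assumption: timeout ends the retry loop
                reconnectEnabled = false;
                break;
        }
    }

    /** True while the model would still schedule another connect attempt. */
    boolean willRetry() {
        return reconnectEnabled && state != State.CONNECTED;
    }

    /** A request can only be dispatched on a connected endpoint. */
    boolean dispatch() {
        return state == State.CONNECTED;
    }

    State state() {
        return state;
    }
}
```

In this model, an endpoint that sees REFUSED and later SUCCESS ends up CONNECTED, while one that sees TIMEOUT stays DISCONNECTED even after the node is back, which matches what we observe. If a later request or retry really does build a new endpoint, that step is what the sketch is missing.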

We are running v4.5 for the server and v2.2.3 for the client. We also tried client v2.5.8. We only use key-value lookups (no queries or views), so the endpoint will be KeyValueEndpoint.

We particularly want to know why a connection timeout is treated differently from a refused connection.

Also, we observed a Netty error “c.c.c.d.i.n.u.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it’s garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.” when a node is shut down. It is 100% reproducible. I will start a new thread on that.

It would be helpful to get a better understanding of the expected client behavior when a node is down.