Java client doesn't connect to another cluster node when the first is down


#22

@slodyczka.stanislaw but that kind of makes sense, right? If you have a cluster of those 3 nodes, the downed node is still part of the server cluster map… so the client connects to one of the others, gets a new config, and then keeps trying to connect to the downed node until you bring it up again or remove it from the cluster.

We need to try reconnecting to the downed node since it contains 1/3rd of the data!


#23

@daschl but I cannot save a document in this bucket when the node is down.


#24

Of course! Your downed node contains 1/3rd of the partitions, so you need to either bring it up again or fail it over.
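
To make the "1/3rd of the partitions" point concrete, here is a rough sketch of how keys end up on nodes. Everything in it is an assumption for illustration: the partition count of 1024, the simplified CRC32 hash, and the round-robin layout are hypothetical stand-ins, not the real SDK or server logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class PartitionSketch {
    // Assumed partition count, for illustration only.
    static final int NUM_PARTITIONS = 1024;

    // Hypothetical, simplified key-to-partition hash (not the SDK's exact scheme).
    static int partitionFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_PARTITIONS);
    }

    // Hypothetical round-robin layout: partition p lives on node p % nodeCount.
    static int nodeFor(int partition, int nodeCount) {
        return partition % nodeCount;
    }

    // Number of partitions whose active copy sits on the given node.
    static int partitionsOwnedBy(int node, int nodeCount) {
        int owned = 0;
        for (int p = 0; p < NUM_PARTITIONS; p++) {
            if (nodeFor(p, nodeCount) == node) {
                owned++;
            }
        }
        return owned;
    }

    // Counts how many of the generated keys map to the downed node.
    static int countUnavailable(int keys, int nodeCount, int downNode) {
        int failed = 0;
        for (int i = 0; i < keys; i++) {
            if (nodeFor(partitionFor("doc::" + i), nodeCount) == downNode) {
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        int total = 30000;
        int failed = countUnavailable(total, 3, 0);
        System.out.printf("node 0 owns %d of %d partitions%n",
                partitionsOwnedBy(0, 3), NUM_PARTITIONS);
        System.out.printf("%d of %d keys map to the downed node%n", failed, total);
    }
}
```

Under these assumptions, the keys that hash to the downed node's partitions cannot be served until that node comes back or is failed over, while keys on the other two nodes keep working.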


#25

@daschl so if one node is down, the whole cluster is unusable until I run a failover or bring this node up again?


#26

@slodyczka.stanislaw, yes. Failover “rebuilds” the entire cluster, excluding the dead node.
As I understand it, your problem (writing to the cluster while a node is dead) should be solved at the application level, for any write call, with a persistence flag; see http://docs.couchbase.com/sdk-api/couchbase-java-client-2.3.5/com/couchbase/client/java/PersistTo.html (you can also read from replicas, see bucket.getFromReplica).


#27

@egrep so if I use e.g. bucket.insert(doc, PersistTo.THREE) and my nodes hold replicas, this should resolve my problem?


#28

@slodyczka.stanislaw,
no, with three nodes use PersistTo.ONE. As I understand it, with a three-node cluster (and a 1-replica setup for the bucket) PersistTo.THREE can never succeed: you have 1 master + 1 replica = 2 copies of the document, so where would the third copy be placed? Now, if 1 node dies (no matter whether it holds no copy, the master copy, or the replica copy of your document), you can still successfully persist to at least one node (i.e. the alive master or the alive replica; one of these two is definitely still alive). Run a simple experiment with failover during insertion and you’ll see it for yourself.
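
The counting argument above can be written down as a tiny check. This is only a sketch of the arithmetic: the class and method names are hypothetical, it does not call the SDK, and it does not model which node actually accepts the write.

```java
public class PersistCheck {
    // Total copies of a document: 1 active (master) copy plus the configured replicas.
    static int totalCopies(int replicas) {
        return 1 + replicas;
    }

    // Hypothetical helper: can `required` persisted copies be reached, given how many
    // of the document's copies sit on nodes that are currently down?
    static boolean canPersistTo(int required, int replicas, int copiesOnDownNodes) {
        int aliveCopies = totalCopies(replicas) - copiesOnDownNodes;
        return required <= aliveCopies;
    }

    public static void main(String[] args) {
        // 1 replica configured, all nodes up: only 2 copies exist, so 3 is unreachable.
        System.out.println("THREE, all up:   " + canPersistTo(3, 1, 0));
        // 1 replica configured, one copy on a dead node: one copy is still alive.
        System.out.println("ONE, one down:   " + canPersistTo(1, 1, 1));
        // Asking for two persisted copies with one copy on a dead node cannot work.
        System.out.println("TWO, one down:   " + canPersistTo(2, 1, 1));
    }
}
```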


#29

@daschl
I am getting the same issue. My Couchbase client stops working after the first node goes down.
As you mentioned in your comment:

"When you have a 3 node cluster and one node is down but still part of the cluster it is expected that the client tries to reconnect to that one node until it is removed from the cluster! And if 1 node is down then 1/3 of your reads/writes won’t succeed if that node still has partitions on it, which is likely the case "

How can I remove the failed node from the cluster?
I am using couchbase-client, version 1.4.12.


#30

@nitinvavdiya if a node is down you need to fail it over in the cluster UI, this will remove it from the cluster. The SDK will pick up the topology change.


#31

@daschl
In production we may not be able to remove the node from the UI as soon as it goes down. How can I handle that?


#32

You either need to fail it over or bring it up again, otherwise the data for the partitions on that node won’t be available. I recommend checking out our documentation, which covers the architectural decisions in great detail!


#33

@daschl
I enabled Auto-Failover with a 30-second timeout, so when 1 node goes down it should be considered failed and removed from the cluster after 30 seconds. But when I stop the service on one node and wait for 1 minute, I am still facing the same issue.
If I do the failover from the UI console, the application works fine.


#34

Are you sure you enabled it and set it to 30s, and do you have at least 3 nodes in the cluster? For everything regarding auto-failover, see: https://developer.couchbase.com/documentation/server/current/clustersetup/automatic-failover.html


#35

TL;DR
To get the ClusterInfo, the Couchbase client always uses the first IP from the list of configured hosts.
If this host is not available, a ConnectTimeoutException is thrown.
Looks like a bug to me.

After some debugging, it looks like the Couchbase Java SDK cannot make an initial connection if the node with the first IP in the configuration is not available.

Many applications need to access com.couchbase.client.java.cluster.ClusterInfo.

Spring needs this, for example, to create the CouchbaseTemplate.

To get the ClusterInfo you need to call com.couchbase.client.java.Cluster#clusterManager.
This calls DefaultAsyncClusterManager#ensureServiceEnabled, which does:

    return Observable
        .just(connectionString.hosts().get(0).getHostName())

Only the first host from the connection string is used.
When this first host is not available, you get an exception:

ConnectTimeoutException: connection timed out: /193.168.21.22:8091

In the exception message you can also see that the first host name was used.

So it simply does not work as expected.
The above is true only for access to the ClusterInfo.

When connecting to buckets, the Couchbase client correctly iterates over the list of IPs, tries to connect to each node, and uses the first successful connection.

In this thread, it was stated that Couchbase would iterate over the IPs in reverse order. I think that is not true.
Instead, java.util.Collections#shuffle(java.util.List<?>) is used to put the IPs in a random order.
However, this shuffle does not change the order of the original connection string, so access to ClusterInfo always uses the first IP from the configuration.
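
To illustrate the difference, here is a small stand-alone sketch using only the JDK. It is not the SDK's code: the class, the method names, and the `reachable` predicate are hypothetical, and the host names are placeholders. It shows that shuffling a copy leaves the original list untouched, and what a fallback over all configured hosts could look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Predicate;

public class SeedNodeSketch {
    // Mirrors the behaviour described in this post: only the first configured host is used.
    static String firstConfigured(List<String> hosts) {
        return hosts.get(0);
    }

    // Shuffles a copy, so the order of the original list does not change.
    static List<String> shuffledCopy(List<String> hosts, Random rnd) {
        List<String> copy = new ArrayList<>(hosts);
        Collections.shuffle(copy, rnd);
        return copy;
    }

    // Hypothetical fallback: try each host in order and return the first reachable one.
    static Optional<String> firstReachable(List<String> hosts, Predicate<String> reachable) {
        for (String host : hosts) {
            if (reachable.test(host)) {
                return Optional.of(host);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("node1", "node2", "node3");
        // Assumption for the demo: node1 is the node that is down.
        Predicate<String> reachable = h -> !h.equals("node1");

        System.out.println("shuffled copy:    " + shuffledCopy(hosts, new Random()));
        System.out.println("first configured: " + firstConfigured(hosts));
        System.out.println("first reachable:  " + firstReachable(hosts, reachable).orElse("none"));
    }
}
```

With node1 down, picking the first configured host always lands on the dead node, while the fallback moves on to node2.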

I am using Couchbase Java client 2.2.8.


#36

It seems the issue is https://issues.couchbase.com/browse/JCBC-999 and it is marked as resolved in SDK versions 2.4.1 and 2.3.7. Can someone confirm whether the issue is already fixed?

UPDATE:
Related to this issue, I have found the open issue https://issues.couchbase.com/browse/JCBC-996.

Thanks.


#37

I’m using client v2.4.6 and still experiencing the issue described.
I’ve checked that the source code for DefaultAsyncClusterManager#ensureServiceEnabled still does:

connectionString.hosts().get(0)).getAddress().getHostAddress()

as @stefan-isele spotted.

In my case I’ve got Couchbase nodes containerised with Docker and Kubernetes. The idea is to be able to take a node down for maintenance in production (applying security fixes, for instance).

I find that after restarting any Couchbase node I need to restart the client applications, otherwise they don’t work at all, even when the node is up and running again.

Is there any configuration or procedure that I can apply?
NOTE: My application clients use Spring.


#38

Thanks @jjurado1982. I have also reviewed the master branch of the Java Client SDK and it seems the problem is still there.


#39

In my case the problem seems more related to this other thread:
https://forums.couchbase.com/t/couchbase-java-client-not-reconnecting-after-connection-timeout/12583/6