Connecting to initially disconnected/off cluster nodes

I’m monitoring NodeConnectedEvent and NodeDisconnectedEvent for a cluster in an attempt to log/alarm on potential issues from the Java client.

I notice on initialization we will connect to a single node and it will discover the rest of the nodes in the cluster. However, if I start with one of those nodes down and bring it up sometime later, I don’t see a NodeConnectedEvent for that node, and the client doesn’t appear to attempt to connect to the now-up node.

Is this behavior by design? Perhaps I have a configuration problem?

It appears new nodes introduced to the cluster do get connected but I still find no activity against the node brought up sometime after the client starts.

@jberglund hmm so you are saying you add and rebalance a node into the cluster after the connections have been established and it is not picking it up?

What does your cluster setup look like, which SDK version are you using, and what kind of workload are you running?

@daschl Couchbase 4.1, three-node cluster (call them node155, node158, and node159), Java SDK 2.2.8

I turn off node158 before starting the client, via /etc/init.d/couchbase-server stop

Then I start the client and wait for it to spin up, observing the system events as they come in through the environment event bus. I see the NodeConnectedEvent for node155 and the ConfigUpdatedEvent after it connects to it.

Then it discovers that the other two nodes exist, but I only see the NodeConnectedEvent for node159, which I expected.

I wait a minute or two, then start node158 with /etc/init.d/couchbase-server start

I never see a NodeConnectedEvent for this node.

@jberglund interesting, okay… and I guess you also don’t see a log message indicating the same but the app works as expected?

The now-up node is never used by the client. I see N1QL hits against 155 and 159, but not 158.

2016.07.11 13:33:36:205 EDT
NodeConnectedEvent{host=10.17.4.155/10.17.4.155}

2016.07.11 13:33:36:700
ConfigUpdatedEvent{bucketNames=[bwecl], clusterNodes=[10.17.4.155/10.17.4.155, 10.17.4.159/10.17.4.159, 10.17.4.158/10.17.4.158]}

2016.07.11 13:33:36:789
NodeConnectedEvent{host=10.17.4.159/10.17.4.159}

2016.07.11 13:35:10:877 EDT
ConfigUpdatedEvent{bucketNames=[bwecl], clusterNodes=[/10.17.4.155, /10.17.4.159, /10.17.4.158]}

//10.17.4.158 brought up, no more SYSTEM events received :frowning:
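The gap in the event log above (a node listed in ConfigUpdatedEvent that never gets a NodeConnectedEvent) can be caught with a little bookkeeping on the subscriber side. The following is only a sketch in plain Java: NodeTracker and its method names are hypothetical, not part of the SDK, and wiring it to the environment event bus is left out. It assumes host strings have already been reduced to a bare IP.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not an SDK class): tracks configured nodes without a connection. */
final class NodeTracker {
    private final Set<String> configured = new LinkedHashSet<>();
    private final Set<String> connected = new LinkedHashSet<>();

    /** Call with the node list of a config update; it replaces the previous list. */
    public synchronized void onConfigUpdated(Collection<String> clusterNodes) {
        configured.clear();
        configured.addAll(clusterNodes);
        // Forget connections to nodes that are no longer part of the cluster.
        connected.retainAll(configured);
    }

    /** Call when a node connected event is seen for the host. */
    public synchronized void onNodeConnected(String host) {
        connected.add(host);
    }

    /** Call when a node disconnected event is seen for the host. */
    public synchronized void onNodeDisconnected(String host) {
        connected.remove(host);
    }

    /** Nodes that are in the current config but have no known connection. */
    public synchronized Set<String> missingNodes() {
        Set<String> missing = new LinkedHashSet<>(configured);
        missing.removeAll(connected);
        return missing;
    }
}
```

Replaying the sequence from the log (155 connects, the config lists 155, 159 and 158, then 159 connects) leaves 10.17.4.158 in missingNodes(), which is the condition an alarm could be raised on after a grace period.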

Side note: it would be useful to us if the ConfigUpdatedEvent contained the name of the cluster it is reporting the nodes for. We configure a single Couchbase environment, so we only get one event bus, but multiple clusters can share that environment. When we receive the config updated event with a new list of nodes (like on a failover-rebalance) we have to map that back to the affected cluster.

Hm, we currently don’t get a cluster name as part of the server config, since it’s “self contained”. Would you be able to do some kind of identification checking on the nodes that are part of the cluster?

I’ll see if I can reproduce your case

Cluster in ConfigUpdatedEvent: since it’s not in the event, we were thinking of looking up each node of the updated list in a cached list of the ClusterManager.rawInfo nodes until we get a hit. That way we can detect from the client when, say, a node we saw a disconnected event for is removed from a cluster (in other words, we don’t care anymore and can lower an alarm), or when a new node is added that we need to watch for, and we can cache it against the right cluster.
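A rough sketch of that lookup in plain Java, under a few assumptions: ClusterLookup, cache and clusterFor are hypothetical names, the cluster names are whatever the application chooses, and the cached node lists are filled by the application from the cluster info. Host strings are normalized first, because the events above print hosts both as 10.17.4.155/10.17.4.155 and as /10.17.4.155.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper (not an SDK class): maps a node list back to a cached cluster. */
final class ClusterLookup {
    private final Map<String, Set<String>> nodesByCluster = new LinkedHashMap<>();

    /** Reduces "host/ip" or "/ip" to the part after the last slash. */
    static String normalize(String host) {
        return host.substring(host.lastIndexOf('/') + 1);
    }

    /** Stores (or replaces) the known nodes of an application-named cluster. */
    public synchronized void cache(String clusterName, Collection<String> nodes) {
        Set<String> normalized = new LinkedHashSet<>();
        for (String node : nodes) {
            normalized.add(normalize(node));
        }
        nodesByCluster.put(clusterName, normalized);
    }

    /** Returns the first cached cluster that contains any of the updated nodes. */
    public synchronized Optional<String> clusterFor(Collection<String> updatedNodes) {
        for (Map.Entry<String, Set<String>> entry : nodesByCluster.entrySet()) {
            for (String node : updatedNodes) {
                if (entry.getValue().contains(normalize(node))) {
                    return Optional.of(entry.getKey());
                }
            }
        }
        return Optional.empty();
    }
}
```

After a hit, calling cache again with the updated list keeps the cached view current, so a node that was just added can be matched the next time. If two clusters could ever share a node address, the first-hit rule would be ambiguous and a stricter match would be needed.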

Some more detailed version info:
Couchbase Version: 4.1.0-5005 Enterprise Edition (build-5005) on CentOS release 6.7
couchbase-core-io-1.2.9.jar
couchbase-java-client-2.2.8.jar
rxjava-1.0.17.jar

I wonder how the client should be informed that the node is now available. Perhaps I can enable a log on the client?

Actually I think you hit a bug there, I’m currently investigating :slight_smile:

During bootstrap we’re swallowing an error on the socket, and the endpoint stays in CONNECTING indefinitely; as a result you don’t get the event. This logic is different from the one used when already connected and something happens during runtime, so as far as I can see it’s an issue isolated to the bootstrap of the client.

I’ll keep you posted.

@jberglund I think I’ve merged a fix for your issue https://github.com/couchbase/couchbase-jvm-core/commit/9948e5a28d4dd78779328d2a1c0e972c99baa3bc - are you in the mood to compile and test it? If so I can tell you how.

If not, this will most likely make it into 1.3.2 (so 2.3.2 java-client) at the beginning of August.

Thanks.

I’m chasing another issue that I see intermittently, and I wonder whether it could be related to the same root cause: I get a NodeConnectedEvent followed immediately by a NodeDisconnectedEvent within a ridiculously short amount of time, typically during the openBucket operation. I never see a NodeConnectedEvent for it again.

2016.07.13 12:29:57:632 EDT
Initializing DB/cluster: db1, cluster1

(my logs weren’t putting out the full event string, but this is our event bus subscriber consuming the NodeConnectedEvent)
2016.07.13 12:29:58:013 EDT
Couchbase Node Connected: 10.17.4.155

(…and disconnected event)
2016.07.13 12:29:58:029 EDT
Couchbase Node Disconnected: 10.17.4.155

I actually have two more apps connecting to .155 at the same time. In this instance only one of the apps saw this; the other two connected to 155 just fine.
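In the log above the disconnect arrives 16 ms after the connect. A subscriber could flag that pattern with something like the sketch below. It is plain Java, FlapDetector is a hypothetical name, and the threshold and the source of the timestamps are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not an SDK class): flags a disconnect that closely follows a connect. */
final class FlapDetector {
    private final long thresholdMillis;
    private final Map<String, Long> lastConnect = new HashMap<>();

    FlapDetector(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Records when the host was last reported as connected. */
    public synchronized void onConnected(String host, long timestampMillis) {
        lastConnect.put(host, timestampMillis);
    }

    /**
     * Returns true if this disconnect came within the threshold of the last
     * recorded connect for the same host. The connect record is consumed.
     */
    public synchronized boolean onDisconnected(String host, long timestampMillis) {
        Long connectedAt = lastConnect.remove(host);
        return connectedAt != null && timestampMillis - connectedAt <= thresholdMillis;
    }
}
```

With a threshold of, say, 1000 ms, the connect at 12:29:58:013 and the disconnect at 12:29:58:029 would be reported as a flap, while a disconnect after a long-lived connection would not.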

Which services are running on that node?

We’re running index, kv, and n1ql on all three nodes.

Okay, so to give you some of the background: those events are sent out based on state transitions in the underlying components… is it by chance the node you bootstrap from in the list on the SDK?

Btw if I could get my hands on a TRACE or DEBUG log from that bootstrap where it happens that would be awesome :slight_smile:
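One way to collect such a log is to raise the JDK logging level before the cluster is created. This is a minimal sketch, assuming the SDK is logging through java.util.logging (that is, no other logging framework has been picked up) and that its loggers live under the com.couchbase.client namespace; SdkLogging is a hypothetical helper. Level.FINEST roughly corresponds to TRACE and Level.FINE to DEBUG.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: raises java.util.logging verbosity for the SDK namespace. */
final class SdkLogging {
    // Assumption: the SDK loggers are children of this namespace.
    // A strong reference is kept because JUL only holds loggers weakly.
    private static final Logger SDK_LOGGER = Logger.getLogger("com.couchbase.client");

    /** Sets the level on the SDK logger and on the root console handlers. */
    static Logger enableVerbose(Level level) {
        SDK_LOGGER.setLevel(level);
        // Handlers filter too, so the console handler must allow the same level.
        for (Handler handler : Logger.getLogger("").getHandlers()) {
            if (handler instanceof ConsoleHandler) {
                handler.setLevel(level);
            }
        }
        return SDK_LOGGER;
    }
}
```

Calling SdkLogging.enableVerbose(Level.FINEST) at the start of the application, before the environment and cluster are set up, should make the bootstrap phase show up in the console output. If the application uses a different logging framework, the level would need to be set in that framework’s configuration instead.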

Yes that is the node I bootstrapped the SDK with. I’ll see about getting a log on that

PM sent with the logs I could grab.