What determines a NodeDisconnectedEvent?

#1

This section of the documentation http://developer.couchbase.com/documentation/server/4.1/sdks/java-2.2/event-bus-metrics.html
mentions monitoring and watching for NodeDisconnectedEvent.

I’m attempting to integrate this with our monitoring system, but I’m having difficulty understanding what triggers a NodeDisconnectedEvent and how, so that I can test this type of notification.

I’m running Couchbase 4.1.0-5005 Enterprise Edition (build-5005). I have tested with Java SDK 2.2.3 and 2.2.6.

I’m running a 3 node cluster. After bringing up my environment I see the 3 NodeConnectedEvents on my event bus.

I then block access to one of the nodes from the client machine by dropping all packets to and from it:
iptables -A INPUT -s IP_HERE -j DROP
iptables -A OUTPUT -d IP_HERE -j DROP

While trying to use the SDK, every third request times out. I don’t see a NodeDisconnectedEvent until about 20-25 minutes later.

#2

Dropping packets is a bit different than terminating the connection. Is there a regular workload? The way we’ve approached this is that once we see a continuous number of timeouts to a given node (tuneable by a threshold), we drop and attempt to rebuild the connection at the client.

Normally this will happen within seconds or minutes, but it could take as long as 20-25 minutes if there isn’t any workload.
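To make the threshold idea concrete, here is a minimal sketch of counting consecutive timeouts per node and signalling a connection rebuild once a limit is reached. It is not the SDK's implementation: `TimeoutTracker`, `recordTimeout`, `recordSuccess`, and the threshold value are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Couchbase SDK.
class TimeoutTracker {
    private final int threshold;
    private final Map<String, Integer> consecutiveTimeouts = new HashMap<>();

    TimeoutTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records a timeout for the node; returns true when its connection should be rebuilt. */
    boolean recordTimeout(String node) {
        int count = consecutiveTimeouts.merge(node, 1, Integer::sum);
        if (count >= threshold) {
            consecutiveTimeouts.remove(node); // start counting afresh after the rebuild
            return true;
        }
        return false;
    }

    /** A successful response breaks the streak of timeouts for the node. */
    void recordSuccess(String node) {
        consecutiveTimeouts.remove(node);
    }
}
```

In this model the counter only advances when a request is actually sent to the node and times out, which would be consistent with detection being slow when there is little or no workload.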

This is a great test by the way. We do something like this regularly.

One way you can probably simulate a NodeDisconnectedEvent is if you kill the memcached process on one of the nodes. That will terminate the TCP connections (sending TCP FINs) and the client would then have to rebuild them.

#3

Thanks for the information! I’m not running a regular workload, but I am running some ad-hoc N1QL queries against the cluster after enabling the iptables firewall rules to block one of the nodes.

I’m trying to simulate the connection being broken from the perspective of the client SDK (for example, a firewall cutting a stale TCP connection) without isolating the node from the rest of the cluster, so I am reluctant to kill the memcached process on the node.

I was expecting the Socket Keepalive Interval (set to the default 30s) mentioned here http://developer.couchbase.com/documentation/server/current/sdks/java-2.2/env-config.html would trigger the NodeDisconnectedEvent after some (shorter) period of time.

What configuration represents the number of timeouts to a given node that you mentioned above? I would like to try tuning it.

#4

Perhaps https://issues.couchbase.com/browse/JVMCBC-340 is contributing to my confusion.

#5

Maybe @daschl can weigh in here when he has a moment.

#6

The NodeDisconnectedEvent is triggered at the same time you’d see a node disconnect in the logs, that is, when the node’s internal state goes from CONNECTED to DISCONNECTED. Most commonly this is the case when all sockets of a node go down (shutdown, failover, rebalance out).

We don’t do TCP-level keepalive, and the app-level keepalive (sending various messages over the app protocol in idle states) has no direct effect on this.

Make sure not to cut just one stale TCP connection, but rather perform actual actions that will get the node removed, like a failover or rebalancing the node out. If you just cut one TCP socket, the client will try to reconnect (since the node is still part of the server config) and you won’t see the event!
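The behavior described above can be modeled roughly as follows. This is a simplified, hypothetical sketch and not SDK code: `NodeModel` and `socketChanged` are invented names, only two states are modeled, and events are recorded as plain strings. It only illustrates that the event is tied to the node-level transition, not to any single socket going down.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a node whose state is derived from its sockets.
class NodeModel {
    enum State { CONNECTED, DISCONNECTED }

    private final String host;
    private final Map<String, Boolean> socketUp = new HashMap<>();
    private final List<String> events = new ArrayList<>();
    private State state = State.DISCONNECTED;

    NodeModel(String host) {
        this.host = host;
    }

    /** Updates one socket and records an event only if the node-level state changes. */
    void socketChanged(String socketId, boolean up) {
        socketUp.put(socketId, up);
        // The node counts as connected while at least one socket is still up.
        State next = socketUp.containsValue(true) ? State.CONNECTED : State.DISCONNECTED;
        if (next != state) {
            state = next;
            String name = (next == State.CONNECTED) ? "NodeConnectedEvent" : "NodeDisconnectedEvent";
            events.add(name + ":" + host);
        }
    }

    State state() {
        return state;
    }

    List<String> events() {
        return events;
    }
}
```

With two sockets up, taking one of them down leaves the model in CONNECTED and records nothing; only when the last socket goes down does a NodeDisconnectedEvent entry appear.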

#7

thanks for the information!

#8

@jberglund does my explanation align with the behavior you see?

#9

Yes, that does align with what I am seeing. I was thinking the client would trigger this event on a broken TCP connection, but that is not the case. I see the event when turning off the Couchbase service, on failover, and on removal.