Issue with number of java client connections increasing rapidly after fail over on single node in cluster

We’re having an issue with client after a fail over. After the node has been removed (as soon as fail over and rebalance has been started) the client seems to rapidly try to reconnect to the removed node, each time creating a new connection without the previous connection being closed so the number of connections keeps going up until the client runs out of file handles (5000 connections or so). Steps to re-produce:

Create a cluster with 3 nodes and multiple buckets (we using 8)
Create a client that connects to each bucket and reads a bunch of documents every second or so (creating about 500 documents)
Do a hard fail over on one node and start the rebalance
As soon as the rebalance happens the client connections start to increase rapidly

I’ve tried to increase retry interval etc without any affect. It doesn’t happen every time but seems to be more consistent with more buckets and more documents. Haven’t been able to reproduce for a single bucket.

Client version: 2.4.6 (Java)
Server version: 4.5.0

Here is an excerpt of the log (couchtest3 is the node that was removed):
14:26:03.855 [RxComputationScheduler-5] DEBUG c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Endpoint connect completed, but got instructed to disconnect in the meantime.
14:26:03.864 [cb-io-1-1] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Requesting supported features: [TCPNODELAY, XATTR, SELECT_BUCKET]
14:26:03.864 [cb-io-1-2] DEBUG c.c.c.c.e.AbstractGenericHandler - [couchtest3.aquto.internal/10.98.76.206:11210][KeyValueEndpoint]: Channel Inactive.
14:26:03.864 [cb-io-1-2] INFO c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
14:26:03.865 [cb-io-1-2] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Requesting supported features: [TCPNODELAY, XATTR, SELECT_BUCKET]
14:26:03.865 [cb-io-1-5] DEBUG c.c.c.c.e.AbstractGenericHandler - [couchtest3.aquto.internal/10.98.76.206:11210][KeyValueEndpoint]: Channel Inactive.
14:26:03.865 [cb-io-1-5] INFO c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
14:26:03.865 [cb-io-1-7] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Negotiated supported features: [TCPNODELAY]
14:26:03.865 [cb-io-1-3] DEBUG c.c.c.c.e.AbstractGenericHandler - [couchtest3.aquto.internal/10.98.76.206:11210][KeyValueEndpoint]: Channel Inactive.
14:26:03.865 [cb-io-1-3] INFO c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
14:26:03.866 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.869 [cb-io-1-3] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Requesting supported features: [TCPNODELAY, XATTR, SELECT_BUCKET]
14:26:03.871 [RxComputationScheduler-6] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
14:26:03.871 [cb-computations-3] DEBUG c.c.c.c.config.ConfigurationProvider - New Bucket datarewards config proposed.
14:26:03.871 [RxComputationScheduler-6] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.871 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.872 [cb-computations-3] DEBUG c.c.c.c.c.refresher.CarrierRefresher - Completed refreshing config for bucket "datarewards"
14:26:03.874 [RxComputationScheduler-7] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
14:26:03.874 [RxComputationScheduler-7] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.874 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.877 [cb-io-1-1] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Negotiated supported features: [TCPNODELAY]
14:26:03.877 [cb-io-1-2] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Negotiated supported features: [TCPNODELAY]
14:26:03.877 [cb-io-1-3] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Negotiated supported features: [TCPNODELAY]
14:26:03.885 [cb-io-1-6] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Negotiated supported features: [TCPNODELAY]
14:26:03.900 [cb-io-1-7] DEBUG c.c.c.c.e.AbstractGenericHandler - [couchtest3.aquto.internal/10.98.76.206:11210][KeyValueEndpoint]: Channel Inactive.
14:26:03.900 [cb-io-1-7] INFO c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
14:26:03.900 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.904 [cb-computations-7] DEBUG c.c.c.c.config.ConfigurationProvider - New Bucket datarewards config proposed.
14:26:03.905 [cb-computations-7] DEBUG c.c.c.c.c.refresher.CarrierRefresher - Completed refreshing config for bucket "datarewards"
14:26:03.908 [RxComputationScheduler-8] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
14:26:03.908 [RxComputationScheduler-8] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.908 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.910 [RxComputationScheduler-1] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
14:26:03.910 [RxComputationScheduler-1] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.910 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.918 [cb-computations-2] DEBUG c.c.c.c.config.ConfigurationProvider - New Bucket datarewards config proposed.
14:26:03.918 [cb-computations-2] DEBUG c.c.c.c.c.refresher.CarrierRefresher - Completed refreshing config for bucket "datarewards"
14:26:03.925 [RxComputationScheduler-2] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
14:26:03.925 [RxComputationScheduler-3] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
14:26:03.925 [RxComputationScheduler-4] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
14:26:03.925 [RxComputationScheduler-4] DEBUG com.couchbase.client.core.node.Node - Disconnected (CONNECTING) from Node couchtest3.aquto.internal/10.98.76.206
14:26:03.925 [RxComputationScheduler-2] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.925 [RxComputationScheduler-3] WARN c.c.client.core.endpoint.Endpoint - Error during reconnect:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.925 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.925 [RxComputationScheduler-4] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.925 [RxComputationScheduler-3] WARN c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[core-io-1.4.5.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[core-io-1.4.5.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:460) [rxjava-1.2.7.jar:1.2.7]
at rx.Single$1.call(Single.java:456) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.2.7.jar:1.2.7]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.2.7.jar:1.2.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
14:26:03.926 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.926 [cb-core-3-1] DEBUG c.c.c.c.config.ConfigurationProvider - Received signal to proactively refresh (a maybe outdated) configuration.
14:26:03.928 [RxComputationScheduler-5] DEBUG c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Endpoint connect completed, but got instructed to disconnect in the meantime.
14:26:03.931 [cb-io-1-6] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Requesting supported features: [TCPNODELAY, XATTR, SELECT_BUCKET]
14:26:03.931 [cb-io-1-8] DEBUG c.c.c.c.e.kv.KeyValueFeatureHandler - Requesting supported features: [TCPNODELAY, XATTR, SELECT_BUCKET]
14:26:03.934 [cb-computations-6] DEBUG c.c.c.c.config.ConfigurationProvider - New Bucket datarewards config proposed.
14:26:03.934 [cb-computations-6] DEBUG c.c.c.c.c.refresher.CarrierRefresher - Completed refreshing config for bucket "datarewards"
14:26:03.935 [cb-io-1-6] DEBUG c.c.c.c.e.AbstractGenericHandler - [couchtest3.aquto.internal/10.98.76.206:11210][KeyValueEndpoint]: Channel Inactive.
14:26:03.935 [cb-io-1-3] DEBUG c.c.c.c.e.AbstractGenericHandler - [couchtest3.aquto.internal/10.98.76.206:11210][KeyValueEndpoint]: Channel Inactive.
14:26:03.935 [cb-io-1-6] INFO c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
14:26:03.935 [cb-io-1-8] DEBUG c.c.c.c.e.AbstractGenericHandler - [couchtest3.aquto.internal/10.98.76.206:11210][KeyValueEndpoint]: Channel Inactive.
14:26:03.935 [cb-io-1-3] INFO c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
14:26:03.935 [cb-io-1-8] INFO c.c.client.core.endpoint.Endpoint - [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.

Hi @poreilly,

I’m tagging @subhashni on this, since she may be able to help, or at least be able to determine if this is a bug.

Thanks @matthew.groves. @subhashni, any idea what that issue might be here?

@matthew.groves or @subhashni any progress on this? Is there someone else that we should be seeking an answer on this from?

Note that I created a JIRA for this issue: https://issues.couchbase.com/browse/JVMCBC-435

1 Like

Hi @poreilly,

How many kv endpoints have you configured?
Can you provide the debug logs or even better would be trace level logs from the client application.

Thanks.

@subhashni I am taking over this issue for @poreilly. When you say kv endpoints do you mean how many buckets? I think there are 8 with 6 of them being Couchbase persistent bucket types and 2 of them being plain Memcached bucket types. I will try and provide trace level logs for you some time in the next day or so.

EDIT
It appears that we are using the default setting of 1 for kvEndpoints

@cbax007 can you check with netstat in which states those connections are?

@daschl I need to get some time and get a repro going. Once I do that I can get you that info.

So @daschl looks like all the connections that pile up during this issue are in the ESTABLISHED state

@subhashni I have attached the trace level logs for this issue to the JIRA I created

Note @cbax007 that 2.4.7 released recently has some new capabilities for detecting half-open connections and getting rid of them. We had a deployment where a stateful firewall between the clients and servers was occasionally somehow causing this condition. I can see yours is on the release just before this from the log.

We’ll have a further look at the log, but if you have an environment handy, it might be useful to see if this update has an impact.

It introduces three new environment properties as uncommitted (meaning it may change before being committed in a future release, probably 2.5.0):

/**
 * If set to true, KeepAlives will be sent on a regular basis in the interval also if there
 * is traffic on the socket,  not only if its idle.
 *
 * @return true if enabled, false otherwise.
 */
continuousKeepAliveEnabled();

/**
 * Specifies the number of times a KeepAlive message on a socket can fail before the connection
 * is recycled.
 *
 * @return the number of keepalive errors allowed to happen before the socket is reopened.
 */
keepAliveErrorThreshold();

/**
 * Specifies the timeout of a keepalive operation on the socket in milliseconds.
 *
 * @return
 */
keepAliveTimeout();

public static final boolean CONTINUOUS_KEEPALIVE_ENABLED = true;
public static final long KEEPALIVE_ERROR_THRESHOLD = 4;
public static final long KEEPALIVE_TIMEOUT = 2500;

@ingenthr I’ve run scenario again with version 2.4.7 and see similar results. It is possible that the connections are released a bit quicker than before but other than that it seems to be about the same. It runs up to maybe 1000 connections in my last tests and then drops back down to normal over the course of maybe 20 seconds.

With the defaults, we wouldn’t expect that code to clean things up for 2 minutes. Hm.

@subhashni- can you have a look at the log?

@ingenthr @subhashni @daschl can we get a status update on this? Is anyone looking into it at all?

Sorry about the delay. I’ll have a look here in a couple minutes.