Couchbase java client not reconnecting after connection timeout

Hi there,

We are using Couchbase server 4.1.1 and java client 2.3.6 and 2.4.4 (tried both the versions). We run the couchbase server using docker container. The steps and the exception stack traces that we get are provided below:
Step 1: Couchbase server is up and running and we start our spring application running in tomcat which uses java client to connect to couchbase server. The logs provide the couchbase configuration details as below:
com.couchbase.client.core.CouchbaseCore - CouchbaseEnvironment: {sslEnabled=false, sslKeystoreFile=‘null’, sslKeystorePassword=false, sslKeystore=null, bootstrapHttpEnabled=true, bootstrapCarrierEnabled=true, bootstrapHttpDirectPort=8091, bootstrapHttpSslPort=18091, bootstrapCarrierDirectPort=11210, bootstrapCarrierSslPort=11207, ioPoolSize=4, computationPoolSize=4, responseBufferSize=16384, requestBufferSize=16384, kvServiceEndpoints=1, viewServiceEndpoints=12, queryServiceEndpoints=12, searchServiceEndpoints=12, configPollInterval=10000, ioPool=NioEventLoopGroup, kvIoPool=null, viewIoPool=null, searchIoPool=null, queryIoPool=null, coreScheduler=CoreScheduler, memcachedHashingStrategy=DefaultMemcachedHashingStrategy, eventBus=DefaultEventBus, packageNameAndVersion=couchbase-java-client/2.4.4 (git: 2.4.4, core: 1.4.4), dcpEnabled=false, retryStrategy=BestEffort, maxRequestLifetime=75000, retryDelay=ExponentialDelay{growBy 1.0 MICROSECONDS, powers of 2; lower=100, upper=100000}, reconnectDelay=ExponentialDelay{growBy 1.0 MILLISECONDS, powers of 2; lower=32, upper=4096}, observeIntervalDelay=ExponentialDelay{growBy 1.0 MICROSECONDS, powers of 2; lower=10, upper=100000}, keepAliveInterval=30000, autoreleaseAfter=2000, bufferPoolingEnabled=true, tcpNodelayEnabled=true, mutationTokensEnabled=false, socketConnectTimeout=1000, dcpConnectionBufferSize=20971520, dcpConnectionBufferAckThreshold=0.2, dcpConnectionName=dcp/core-io, callbacksOnIoPool=false, disconnectTimeout=25000, requestBufferWaitStrategy=com.couchbase.client.core.env.DefaultCoreEnvironment$2@348b8445, queryTimeout=75000, viewTimeout=75000, searchTimeout=75000, analyticsTimeout=75000, kvTimeout=2500, connectTimeout=100000, dnsSrvEnabled=false}

2017-04-21 20:09:23,088 [cb-io-1-1] INFO com.couchbase.client.core.node.Node - Connected to Node localhost

2017-04-21 20:09:23,567 [cb-computations-4] INFO com.couchbase.client.core.config.ConfigurationProvider - Opened bucket default

Step 2:- We shutdown the couchbase server and the java client asyncronously try to connect to the server resulting in connect exception as below:
java.net.ConnectException: Connection refused: localhost/127.0.0.1:11210
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at com.couchbase.client.deps.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:223)
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:285)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:589)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:513)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:427)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:399)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
2017-04-24 12:01:16,592 [cb-io-1-4] WARN com.couchbase.client.core.endpoint.Endpoint - [localhost/127.0.0.1:8092][ViewEndpoint]: Could not connect to remote socket.
2017-04-24 12:01:16,592 [cb-io-1-4] WARN com.couchbase.client.core.endpoint.Endpoint - [localhost/127.0.0.1:8092][ViewEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
java.net.ConnectException: Connection refused: localhost/127.0.0.1:8092
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at com.couchbase.client.deps.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:223)
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:285)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:589)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:513)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:427)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:399)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
2017-04-24 12:01:16,606 [cb-io-1-1] WARN com.couchbase.client.core.endpoint.Endpoint - [localhost/127.0.0.1:8093][QueryEndpoint]: Could not connect to remote socket.
2017-04-24 12:01:16,606 [cb-io-1-1] WARN com.couchbase.client.core.endpoint.Endpoint - [localhost/127.0.0.1:8093][QueryEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
java.net.ConnectException: Connection refused: localhost/127.0.0.1:8093
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at com.couchbase.client.deps.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:223)
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:285)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:589)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:513)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:427)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:399)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
2017-04-24 12:01:16,843 [cb-io-1-2] WARN com.couchbase.client.core.endpoint.Endpoint - [localhost/127.0.0.1:11210][KeyValueEndpoint]: Could not connect to remote socket.
2017-04-24 12:01:16,843 [cb-io-1-2] WARN com.couchbase.client.core.endpoint.Endpoint - [localhost/127.0.0.1:11210][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
java.net.ConnectException: Connection refused: localhost/127.0.0.1:11210
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at com.couchbase.client.deps.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:223)
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:285)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:589)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:513)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:427)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:399)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)

Step 3:- After connection timeout we get the following exception:
2017-04-24 12:14:20,263 [RxComputationScheduler-1] WARN com.couchbase.client.core.endpoint.Endpoint - [localhost/127.0.0.1:11210][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:317)
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:310)
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69)
at rx.Single$18.onError(Single.java:1883)
at rx.observers.SafeSubscriber._onError(SafeSubscriber.java:152)
at rx.observers.SafeSubscriber.onError(SafeSubscriber.java:115)
at rx.observers.SerializedObserver.onError(SerializedObserver.java:153)
at rx.observers.SerializedSubscriber.onError(SerializedSubscriber.java:78)
at rx.internal.operators.OperatorTimeoutBase$TimeoutSubscriber$1.onError(OperatorTimeoutBase.java:187)
at rx.Single$1$1.onError(Single.java:85)
at rx.Single$3.call(Single.java:505)
at rx.Single$3.call(Single.java:501)
at rx.Single$1.call(Single.java:90)
at rx.Single$1.call(Single.java:70)
at rx.Observable.unsafeSubscribe(Observable.java:9861)
at rx.internal.operators.OperatorTimeoutBase$TimeoutSubscriber.onTimeout(OperatorTimeoutBase.java:200)
at rx.internal.operators.OperatorTimeout$1$1.call(OperatorTimeout.java:41)
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189)
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Step 4:- After the elapse of connection time out, I start teh couchbase server. But the java client is still not able to connect to couchbase server. The stack trace look like below:

EVERE: Servlet.service() for servlet [ourApp] in context with path [/ourApp] threw exception [Request processing failed; nested exception is java.lang.RuntimeException: java.util.concurrent.TimeoutException] with root cause

java.util.concurrent.TimeoutException

at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:73)

at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:139)

at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:134)

at com.couchbase.client.spring.cache.CouchbaseCache.get(CouchbaseCache.java:166)

at org.springframework.cache.interceptor.AbstractCacheInvoker.doGet(AbstractCacheInvoker.java:68)

at org.springframework.cache.interceptor.CacheAspectSupport.findInCaches(CacheAspectSupport.java:533)

at org.springframework.cache.interceptor.CacheAspectSupport.findCachedItem(CacheAspectSupport.java:499)

at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:389)

at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:327)

The java configuration we use for the couchbase client will look like bleow:

@Bean
public CouchbaseEnvironment couchbaseEnvironment() {
	return DefaultCouchbaseEnvironment
	    .builder()
            .connectTimeout(connectTimeout)
            .build();
}

@Bean(destroyMethod = "disconnect")
public Cluster cluster() {
	return CouchbaseCluster.create(couchbaseEnvironment(),seedNode);
}

@Bean(destroyMethod = "close")
public Bucket bucket() {
	return cluster().openBucket(bucketName, bucketPassword, bucketTimeOut, TimeUnit.MILLISECONDS);
}

@Bean CacheBuilder cacheBuilder(){
	return CacheBuilder.newInstance(bucket());
}

@Bean
public CacheManager cacheManager(){
	return new CouchbaseCacheManager(caches());
}

public Map<String, CacheBuilder> caches(){
	//The named cached are externalized using this utility class
	CacheDynamicEnum[] enumValues = CacheDynamicEnum.values();
    // Each named cache is set with a predefined TTL in seconds
	Map<String, CacheBuilder> caches = new HashMap<String, CacheBuilder>();
	for(int i=0;i<enumValues.length;i++){
		caches.put(enumValues[i].name(), cacheBuilder().withExpiration(Integer.parseInt(enumValues[i].timeToLive())));
	}
	return caches;
}

Please advice us on what we are missing here. Looks like there was a bug on Couchbase on reconnect after timeout and being fixed, but clueless on this behaviour. Appreciate any sort of help to move forward.

Thanks
Sethukumar

Well, so you are seeing timeouts or connection issues after the node is brought back up - can you share at least DEBUG logs from a full shutdown/startup on the client side so we can see what the client is doing?

@daschl ,

Please find the zipped log file attached.
Initially we have the CB server and our application server up and running. Then we shutdown CB server. then we wait till timeout. Then we restart CB server. These are the steps we followed in this log file.

couchbase.log.zip (137.8 KB)

@sethunr hm, from the logs it looks like the connect callback did not return, we fail it because of our safeguard timeout and then there is no more reconnect attempt until shutdown. Can you also share the code which drives that program so I can try to reproduce? I see you are running on osx localhost with jdk 1.8, which server version is this?

@daschl Could you please elaborate a bit on the safeguard timeout? We’ve had intermittent cases where we have services fail, primarily during cluster maintenance, and they require restart to be resolved.

I’m experiencing the same issue. Any update or workaround please?
When I restart any Couchbase nodes, client applications need restarting as well, otherwise they won’t work.

Client version: Java SDK 2.4.6 (with Spring Boot 1.5.1 + Spring-Data Couchbase 2.2.0)
Server version: 4.6.2 (as Docker image)

I am also experiencing the same issue. It looks like a known BUG for couchbase.

I think we have 2 different posts going on about the same symptom… I awakened this one:

with a goofy way to reproduce the symptoms, but might not be the same underlying cause.