SDK 2.1.2+2.1.3: Connection refused without accessing bucket

#1

Exactly every 30 minutes I get an error message in my logs stating

[<URL HERE>/<IP HERE>:8092][ViewEndpoint]: Could not connect to endpoint, retrying with delay x MILLISECONDS: 

about 5-10 times. After that, the following appears around 5 times:

[/<IP HERE>:8092][ViewEndpoint]: Could not connect to endpoint, retrying with delay x MILLISECONDS:

The full stack trace for that exception is

java.net.ConnectException: Connection refused: /<IP HERE>:8092
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_31]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[?:1.8.0_31]
    at com.couchbase.client.deps.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208) ~[core-io-1.1.3.jar:1.1.3]
    at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281) [core-io-1.1.3.jar:1.1.3]
    at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) [core-io-1.1.3.jar:1.1.3]
    at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [core-io-1.1.3.jar:1.1.3]
    at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [core-io-1.1.3.jar:1.1.3]
    at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [core-io-1.1.3.jar:1.1.3]
    at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) [core-io-1.1.3.jar:1.1.3]
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [core-io-1.1.3.jar:1.1.3]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_31]

Please note the leading slash (/) before the IP address and the difference between the two error messages: the first contains both the URL and the IP, the second only the IP.
How could this happen?
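On the leading slash: it matches the standard text form of a Java `InetAddress`, which is documented as `hostname/literal-IP`, with an empty host name part when no host name is attached to the address. Below is a minimal sketch of that formatting. The host name `cb1.example.com` and the IP `10.0.0.1` are made up for illustration, and whether the SDK builds its log line exactly this way is an assumption on my part.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public final class AddressFormatDemo {

    // Builds an InetAddress from raw bytes (no DNS lookup is performed)
    // and returns its standard string form "hostname/literal-IP".
    // hostName may be null, in which case the part before the slash is empty.
    static String format(String hostName, byte[] ip) throws UnknownHostException {
        return InetAddress.getByAddress(hostName, ip).toString();
    }

    public static void main(String[] args) throws UnknownHostException {
        byte[] ip = {10, 0, 0, 1}; // hypothetical address

        // Address created together with a host name: prints "cb1.example.com/10.0.0.1"
        System.out.println(format("cb1.example.com", ip));

        // Address created from the IP only: prints "/10.0.0.1"
        System.out.println(format(null, ip));
    }
}
```

So the two log variants could simply mean that the endpoint was first addressed by host name and later by bare IP, but that is a guess, not something the logs confirm.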

The cluster consists of 3 servers, freshly set up. I have another cluster replicating to this cluster into one bucket via XDCR and another bucket which is empty. The error messages occur for all 3 servers in the cluster with the repetitions mentioned above. After that the client can connect to the cluster again.
If I am in the web admin console while the exceptions are happening, in the “Data Buckets” view, I can see all buckets being in the state of rebalancing for a short period of time (yellow portion of the green circle). Reloading the web console during this state takes longer than usual.
Besides that nothing is happening on that cluster other than a client being connected to the empty bucket, but not accessing it.
Another cluster set up just like this one (using Chef) works without problems.
I am using Couchbase 3.0.2 (1603) Enterprise + Java SDK 2.1.2 (happens with 2.1.3, too).

Googling the exception yields exactly one result: “There is a problem when i using java sdk 2.1.x”, which is the topic before this one.

#2

I played around a bit on one cluster server.
I tried to fail over a node and was then flooded with error messages (“Difficulties reaching node” or something like that).

I reloaded that node’s web interface and had to set it up again; all data was lost! That node now forms its own cluster, and it is impossible to add it back to the “old” one.

What happened…?

The other two nodes still appear in the client log, but now each server only once, with a message saying that the connection will be retried with a delay of 4096 milliseconds. That is now happening every 30 minutes.
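The varying “delay x MILLISECONDS” values and the 4096 figure (a power of two) would be consistent with a capped exponential backoff between reconnect attempts. The sketch below only illustrates that general technique. The 32 ms starting delay, the 4096 ms cap and the method name `delayMillis` are assumptions for illustration, not the SDK’s actual implementation.

```java
public final class BackoffSketch {

    // Returns the delay before the given reconnect attempt (1-based):
    // the base delay doubled once per previous attempt, never exceeding the cap.
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        long delay = baseMillis;
        // Stop doubling as soon as the cap is reached.
        for (int i = 1; i < attempt && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    public static void main(String[] args) {
        // Assumed values: start at 32 ms, cap at 4096 ms.
        for (int attempt = 1; attempt <= 10; attempt++) {
            System.out.println("attempt " + attempt + ": retrying with delay "
                    + delayMillis(attempt, 32, 4096) + " ms");
        }
    }
}
```

With these assumed values the delay reaches 4096 ms on the eighth attempt and stays there, which would explain a log line that always reports 4096 once a node has been unreachable for a while.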

#3

We managed to fix this issue by including

"default_attributes": {
  "couchbase": {
    "server": {
      "database_path": "/data/xvdb/data",
      "index_path": "/data/xvdb/index"
    }
  }
}

in the knife environment for Chef. We can’t explain why one cluster worked and the other one didn’t.

#4

Ah, good to hear it’s fixed now… I looked over the ticket. Errors like this very often indicate an issue on the network or server side: the client keeps trying to reach the server but can’t for some reason.
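For anyone hitting the same message, one quick way to separate a client problem from a network/server problem is to test the view port (8092, as seen in the log above) with a plain TCP connect from the client machine. This is a minimal sketch; the class name `PortProbe`, the default host `127.0.0.1` and the 2 second timeout are placeholders I picked, not anything from the SDK.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public final class PortProbe {

    // Tries a plain TCP connect and reports whether it succeeded.
    // "Connection refused", timeouts and unresolvable hosts all return false.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the node's host name or IP as the first argument.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int viewPort = 8092; // port shown in the ViewEndpoint log lines
        System.out.println(host + ":" + viewPort + " reachable: "
                + canConnect(host, viewPort, 2000));
    }
}
```

If this also fails while the exceptions are being logged, the service on that port is not accepting connections at that moment, and the cause is most likely on the server or network side rather than in the SDK.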