Java - Initial connection through VIP

vip
java
client

#1

We want to use a VIP to front our Couchbase cluster to simplify configuration and maintenance. Our VIP is configured to round-robin port 8091 to the nodes in our Couchbase cluster. No other ports are forwarded.

Basically we’d expect to go through the VIP to get the latest cluster map at bootstrap and then the clients should use that for all communication afterwards.

This seems to work fine with the Java 1.4.11 Client and the .NET client (2.2.2). It does not work with Java 2.2.3.

The error I get is:
> WARNING: [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 32 MILLISECONDS:
> java.net.ConnectException: Connection refused: /1.2.3.4:11210

where 1.2.3.4 is the IP of our VIP

For this simple example, I’m connecting to the cluster by:

CouchbaseCluster cluster = CouchbaseCluster.create(host);
bucket = cluster.openBucket(bucketName, bucketPassword);

Thanks!


#2

I noticed in the logs: “bootstrapCarrierDirectPort=11210” which correlates with the error… I wouldn’t expect to need access to this port until after the client is fully connected.


#3

Maybe @simonbasle can help you on that one.


#4

@unhuman actually, 11210 is used for the bootstrap process before falling back to HTTP. So both 11210 and optionally 8091 are needed in the bootstrap process for buckets… you can see it falling back to 8091 if it can’t do 11210, but if you can make 11210 work.

Note that you can explicitly also disable carrierBootstrap on the environment and it will go to 8091 directly, but we don’t recommend doing that.


#5

Well, if I had 11210 disabled, the client wouldn’t connect at all, hence this topic. It didn’t successfully fall back. But, enabling 11210 did work for us. We can do that going forward, just good to know, as it appears to be a new change that didn’t affect the old Java client nor the .NET client.

Thanks for the explanation. We can work with it.


#6

Can you disable 11210 just for testing, enable TRACE logging and send me the full log? So I can see the bootstrap process, communication and where/why it fails.

Thanks,
Michael


#7

@daschl Thanks!
Attached logs. I changed IP to 11.22.33.44, but otherwise they’re as reported.
CouchbaseVIPStartup11210PortProblem.zip (5.4 KB)


#8

Hey @daschl Any update on this?


#9

@daschl
Checking in again.

We’re actually having an issue with the 1.4 client (with port 11210 open):

We’re seeing port 11210 connections through the VIP being maintained even after the Cluster Map is received and the client should, in theory, have closed the connections on port 11210 and reconnected to what was provided in the Cluster Map. This is using Java Client 1.4.10. I’ve checked and the diff in 1.4.11 shouldn’t have an effect here.

I guess my hope for the above is that it’s a leaked / unused connection.

We see this in the call stack:
Exception in thread “Memcached IO over {CouchbaseConfigConnection to /10.15.16.30:11210}” java.lang.NullPointerException
at com.couchbase.client.vbucket.provider.BucketConfigurationProvider$2.connectionLost(BucketConfigurationProvider.java:276)
at net.spy.memcached.MemcachedConnection.lostConnection(MemcachedConnection.java:636)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:703)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:436)
at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:325)

On line 276 of BucketConnectionProvider, I can’t figure out how that could throw an NPE.


#10

Also, we are seeing a ton of errors on our clients that are not reconnecting back to the nodes that have been rebalanced. We see that with a ton of warnings / errors in the logs, no traffic for a particular bucket on a particular node (even though the other nodes are taking traffic for that bucket). Our .NET Clients are not having similar problems.

{"timestamp":"2016-03-03T19:48:55.606Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"com.couchbase.client.CouchbaseConnection","m essage":"Handling IO for: sun.nio.ch.SelectionKeyImpl@48756c48 (r=true, w=false, c=false, op={QA sa=10.52.40.123/10.52.40.123:112 10, #Rops=1, #Wops=0, #iq=0, topRop=Cmd: 2 Opaque: 1190090 Key: S2::XYZXYZ-rabbitmq-consumer::string::message::lock::S2::555 58684-e267-4326-bf47-597c1b2aa8b0 Cas: 0 Exp: 605 Flags: 0 Data Length: 3, topWop=null, toWrite=0, interested=1})","context":"defa ult"} {"timestamp":"2016-03-03T19:48:55.606Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"com.couchbase.client.CouchbaseConnection","m essage":"Read 7885 bytes","context":"default"} {"timestamp":"2016-03-03T19:48:55.606Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"net.spy.memcached.protocol.binary.StoreOpera tionImpl","message":"Reading 24 header bytes","context":"default"} {"timestamp":"2016-03-03T19:48:55.606Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"net.spy.memcached.protocol.binary.StoreOpera tionImpl","message":"Reading 7861 payload bytes","context":"default"} {"timestamp":"2016-03-03T19:48:55.606Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"net.spy.memcached.protocol.binary.StoreOpera tionImpl","message":"Transitioned state from READING to RETRY","context":"default"} {"timestamp":"2016-03-03T19:48:55.609Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"com.couchbase.client.vbucket.config.Couchbas eConfig","message":"Nodes with active VBuckets: [10.52.40.123, 10.52.40.121, 10.52.40.122]","context":"default"} {"timestamp":"2016-03-03T19:48:55.609Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"com.couchbase.client.CouchbaseConnection","m essage":"Reschedule read op due to NOT_MY_VBUCKET error: Cmd: 2 Opaque: 1190090 Key: S2::XYZXYZ-rabbitmq-consumer::string::m essage::lock::S2::55558684-e267-4326-bf47-597c1b2aa8b0 Cas: 0 Exp: 605 Flags: 0 Data Length: 3 ","context":"default"} {"timestamp":"2016-03-03T19:48:55.612Z","level":"DEBUG","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40. 121:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"net.spy.memcached.protocol.binary.BinaryMemc achedNodeImpl","message":"Setting interested opts to 0","context":"default"} {"timestamp":"2016-03-03T19:48:55.612Z","level":"WARN","thread":"Memcached IO over {MemcachedConnection to 10.52.40.121/10.52.40.1 21:11210 10.52.40.122/10.52.40.122:11210 10.52.40.123/10.52.40.123:11210}","logger":"com.couchbase.client.CouchbaseConnection","me ssage":"Cancelling operation Cmd: 2 Opaque: 1190090 Key: S2::XYZXYZ-rabbitmq-consumer::string::message::lock::S2::55558684-e 267-4326-bf47-597c1b2aa8b0 Cas: 0 Exp: 605 Flags: 0 Data Length: 3because it has been retried (cloned) more than 100times.","context":"default"}

And, if we look locally, we have no connectivity to 10.52.40.121 on port 11210, while the Cluster Map alluded to above does know about 10.52.40.121:

netstat -anp | grep *PID* | grep 11210

tcp 0 0 ::ffff:10.15.16.18:56246 ::ffff:10.52.40.122:11210 ESTABLISHED 28077/java tcp 0 0 ::ffff:10.15.16.18:56338 ::ffff:10.52.40.122:11210 ESTABLISHED 28077/java tcp 0 0 ::ffff:10.15.16.18:57210 ::ffff:10.52.40.123:11210 ESTABLISHED 28077/java tcp 0 0 ::ffff:10.15.16.18:40679 ::ffff:10.52.40.123:11210 ESTABLISHED 28077/java tcp 0 0 ::ffff:10.15.16.18:40676 ::ffff:10.52.40.123:11210 ESTABLISHED 28077/java tcp 0 0 ::ffff:10.15.16.18:55815 ::ffff:10.52.40.122:11210 ESTABLISHED 28077/java


#11

Restarting our applications, of course, resolves the problem… Until we continue the maintenance.


#12

@unhuman something seems to be wrong with the download - let me follow up via PM