Java client not recovering after failover (server shutdown)

I have 6 Couchbase nodes with 3 replicas and auto-failover enabled. If I stop one node with “service couchbase-server stop”, the failover runs fine and the Java client gets notified and removes the node from its internal map.
2015-04-10 15:38:50 c.c.client.core.node.Node [INFO] Disconnected from Node xxx

If, instead of stopping the couchbase-server service, I shut down the machine, the Java client never gets notified and tries to hit the dead node on every request. The auto-failover happens, but the client does not update its internal node map.

Any ideas on this?

I am using Couchbase Server 3.0.1 on 64-bit Ubuntu, and 3.0.3 Enterprise.
Couchbase Java Client 2.1.2

Thanks

What’s the workload? If you’re shutting down the server by “shutdown -h” or the like, the behavior should be the same, but simulating a failure is rather different. You would not get a TCP RST, so it may take a certain amount of ‘failed’ workload for the client to update its internal map. There’s a backstop as well that should eventually get the client to update.
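
To illustrate the difference, here is a minimal sketch using only plain JDK sockets, not the SDK (the class and method names are made up for the example). When the peer closes its socket cleanly, a blocked read ends right away; when the peer just goes silent, the client only notices once its own timeout fires. A peer that stays connected but never writes is only an approximation of a powered-off host.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; nothing here comes from the Couchbase SDK.
class SilentPeerDemo {

    // Peer closes its end cleanly, roughly what a stopped service does.
    static String readAfterCleanClose() {
        return readOnce(true);
    }

    // Peer keeps the connection open but never sends anything.
    static String readFromSilentPeer() {
        return readOnce(false);
    }

    private static String readOnce(boolean peerCloses) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            client.setSoTimeout(300); // give up on a read after 300 ms
            if (peerCloses) {
                peer.close(); // the client side then sees end-of-stream
            }
            int b = client.getInputStream().read();
            return b == -1 ? "closed" : "data";
        } catch (SocketTimeoutException e) {
            return "timeout"; // nothing arrived: the only signal is our own timer
        } catch (IOException e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println("clean close -> " + readAfterCleanClose());
        System.out.println("silent peer -> " + readFromSilentPeer());
    }
}
```

The clean-close case returns immediately, while the silent case costs a full timeout per attempt, which is why a certain amount of failed workload may be needed before the client reacts.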

The scenario you describe is like a whole section of our testing, so I’m pretty confident it’s correct. Our test has a basic workload. More info on the scenario would be appreciated.

I am shutting it down by stopping the EC2 instance through the AWS Management Console.
The workload I am using in this particular case is low, just a few requests, because I was only trying to test that particular scenario.

I will try with a heavy workload.

Thanks!

I tried with a heavier workload hitting couchbase with 800 get ops per second for 60 seconds without any success. The client does not update its configuration unless I restart it.

I think that it has something to do with the Carrier Publication. I disabled it and it worked as expected.
CouchbaseEnvironment environment = DefaultCouchbaseEnvironment
    .builder()
    .bootstrapCarrierEnabled(false)
    .bootstrapHttpDirectPort(8080)
    .build();

But this is a workaround.
When using the carrier publication the client never gets notified when a node shuts down in a “hard” fashion or if it has network issues.

Now I am testing it with a very simple scenario. One client and 2 couchbase nodes doing the failover manually.
Could you test the same scenario? An easy way to test it is by disconnecting one couchbase node from the network.

Thanks in advance.

Adding more information to the issue.
In the logs I see a keepalive request every 20-30 seconds, with no errors and no responses.
2015-04-13 11:28:48 c.c.c.c.e.AbstractGenericHandler [DEBUG] [node-a/10.10.9.135:11210][KeyValueEndpoint]: KeepAlive fired.
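
One way to read that log line: a keepalive that is sent but never answered is itself a signal. Below is a hypothetical sketch in plain Java of counting unanswered keepalives and flagging the endpoint as suspect after a few misses. It is not how the SDK implements its keepalive; `KeepAliveWatchdog` and its threshold are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; not part of the Couchbase SDK.
class KeepAliveWatchdog {
    private final int maxUnanswered;
    private final AtomicInteger unanswered = new AtomicInteger();

    KeepAliveWatchdog(int maxUnanswered) {
        this.maxUnanswered = maxUnanswered;
    }

    // Call when a keepalive request is written to the socket.
    void keepAliveSent() {
        unanswered.incrementAndGet();
    }

    // Call when any response arrives from the node.
    void responseReceived() {
        unanswered.set(0);
    }

    // True once too many keepalives in a row went unanswered.
    boolean isSuspect() {
        return unanswered.get() >= maxUnanswered;
    }
}
```

With a 20-30 second keepalive interval and a threshold of 3, a node like this one would be flagged after roughly a minute to a minute and a half of silence.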

@yorugua is it possible for you to share the code that you are using and the steps to reproduce? That would greatly help. Also, if you can share TRACE level logging that would be great.

If you don’t want to share it publicly you can also drop me an email.

Document with id A in Node 1
Document with id B in Node 2
Api client with JAVA sdk 2.1.2

  • Get key A (OK)
  • Get key B (OK)

Unplug Node 2 network cable

  • Get key A (OK)
  • Get key B (FAIL expected until failover)
    java.lang.RuntimeException: java.util.concurrent.TimeoutException
    at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:93) ~[java-client-2.1.2.jar:2.1.2]

Failover Node 2 (without doing the rebalance)

  • Get key A (OK)
  • Get key B (FAIL, not the expected behavior)
    java.lang.RuntimeException: java.util.concurrent.TimeoutException
    at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:93) ~[java-client-2.1.2.jar:2.1.2]

The same happens in the cloud when stopping the EC2 instance of Node 2.
However, if I stop the Couchbase service on Node 2 with “service couchbase-server stop”, it works as expected.

If I disable the carrier bootstrap, all scenarios work as expected, but that is only a workaround, not what I want.


Code:

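import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class CouchbaseRepository {

	private Cluster cluster;
	private Bucket bucket;

	public CouchbaseRepository() {
		// Initialization of cluster and bucket
		CouchbaseEnvironment environment = DefaultCouchbaseEnvironment
				.builder().requestBufferSize(16384)
				.build();
		cluster = CouchbaseCluster.create(environment, "10.10.8.189,10.10.9.135");
		bucket = cluster.openBucket("default");
	}

	// Blocking lookup; returns null when the key does not exist
	public JsonDocument getByKey(String key) {
		return bucket.get(key);
	}
}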
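
For anyone repeating the steps above without a full application, a small probe helper can make the OK/FAIL results easier to log. This is a sketch under assumptions: `TimeoutProbe` and `probe` are hypothetical names, and the lookup is passed in as a plain `Supplier` (for example `() -> repository.getByKey("B")`) so the snippet itself has no SDK dependency. It relies only on what the stack trace shows, namely a `RuntimeException` whose cause is a `java.util.concurrent.TimeoutException`.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper for the reproduction steps; not part of the SDK.
class TimeoutProbe {

    // True when the exception matches the shape seen in the stack trace:
    // a RuntimeException wrapping java.util.concurrent.TimeoutException.
    static boolean isTimeout(RuntimeException e) {
        return e.getCause() instanceof TimeoutException;
    }

    // Runs the lookup once and reports "OK", "TIMEOUT" or "ERROR".
    static String probe(Supplier<?> lookup) {
        try {
            lookup.get();
            return "OK";
        } catch (RuntimeException e) {
            return isTimeout(e) ? "TIMEOUT" : "ERROR";
        }
    }
}
```

Calling `probe` for key A and key B once a second while unplugging and failing over Node 2 shows whether the result for key B ever changes from TIMEOUT back to OK.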

Thanks!

Would it be possible for you to share the TRACE-level logs as well?

Also btw, how are you executing the load? How many ops/s?

I run the load with ApacheBench, and I have tested with different numbers of requests and threads, sometimes up to 5000 ops per second.

I store the objects with the following method

public MyObject store(MyObject object) throws Exception {
	// Reserve the next id from a counter document
	JsonLongDocument doc = bucket.counter("mycounter::", 1, 1);
	Long id = doc.content();
	object.setId(id);

	// Serialize the object to JSON and convert it to a JsonObject
	ObjectMapper mapper = new ObjectMapper();
	String json = mapper.writeValueAsString(object);
	JsonTranscoder tr = new JsonTranscoder();
	JsonObject jsonObject = tr.stringToJsonObject(json);

	JsonDocument myDoc = JsonDocument.create(KEY_PREFIX + id, jsonObject);
	bucket.insert(myDoc);
	return object;
}

I’ll grab the logs and send them to you.

Thanks