Newbie questions - failover, java sdk remove/persistence


#1

Hello. Just started testing out Couchbase and have a few questions for you all.
I’m running CE 3.0.1 in a 2 node cluster on centos 6, ec2.

  1. In my testing it seems that the cluster works such that, if 1 node fails (in a 2 node cluster), even if it is shut down gracefully, the other node will not process upserts or removes, and only processes gets if the data is on the remaining good node.
  2. That if you manually intervene to failover the failed node (mentioned above), then all the data would be available on the remaining good node and upserts/removes/gets now succeed.
  3. That you can auto-failover only in a >= 3 node cluster, and that auto-failover will only work if 1 node fails. If >= 2 nodes fail (no matter how many nodes there are in the cluster), it will not work and manual intervention would be required for failover.

Is anything in the above not accurate?

  4. Using the Java SDK 2.0.3 (in the above 2 node cluster scenario with no failed nodes) I’m trying to remove a document from a bucket.
My goal is to be able to wait until the delete is persisted to disk on one node and persisted to memory on the other node.
I’m using:
bucket.remove(mydocname, PersistTo.MASTER, ReplicateTo.ONE);
Is this correct usage?
What is the difference between PersistTo.MASTER and PersistTo.ONE?
When I run the above code the document is deleted but I receive the error:
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.TimeoutException
    at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:93)
    at com.couchbase.client.java.CouchbaseBucket.remove(CouchbaseBucket.java:381)
    at com.couchbase.client.java.CouchbaseBucket.remove(CouchbaseBucket.java:361)
    at delete.main(delete.java:32)
Caused by: java.util.concurrent.TimeoutException
    ... 4 more
If I leave out the ReplicateTo, it succeeds.
Using the same PersistTo/ReplicateTo combination with upsert()/replace() does not produce the error.

Thanks folks!

-Tony


#2

hey @simonbasle could you advise ?


#3

ok first off, sorry for the late answer :alarm_clock:

On to your first three questions:

  1. If the node is shut down but not failed over, the replicas are not promoted, so no node can serve the subset of data (call it A) that the node was handling.

  2. Failing over a node means the replica takes over and starts managing the subset of data “A”. However, the cluster is now in an unbalanced state: some subsets of data are still replicated, while A is not. Reintroducing a healthy node, or just downsizing the cluster, and then doing a rebalance will bring the cluster back to a balanced state where every node holds the same share of the data with the same replication factor.

  3. Auto-failover is limited, on purpose. It can only fail over the first failure that happens; any subsequent failure before an operator has rebalanced the cluster will need manual intervention. I’m not entirely sure about the 3 node requirement for auto-failover, but it kind of makes sense that a 3 node cluster is a good minimum: 1 node can go down and the data can still be replicated once.
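
To make points 1 and 2 a bit more concrete, here is a rough toy model in plain Java. It is not Couchbase code and does not reflect how the server is implemented: the class `FailoverSketch`, the node names and the subsets "A" and "B" are all made up for illustration. It only shows the idea that a shut down node leaves its subset unserved until a failover promotes the replica, and that the promoted subset then has no replica left.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Toy model only: every name here is hypothetical, nothing comes from the Couchbase SDK.
class FailoverSketch {
    private final Map<String, String> activeNode = new HashMap<>();   // subset -> node with the active copy
    private final Map<String, String> replicaNode = new HashMap<>();  // subset -> node with the replica copy
    private final Map<String, Boolean> nodeUp = new HashMap<>();      // node -> is it reachable

    FailoverSketch() {
        // Assumed layout: 2 nodes, 1 replica, each node is active for one subset.
        nodeUp.put("node1", true);
        nodeUp.put("node2", true);
        activeNode.put("A", "node1");
        replicaNode.put("A", "node2");
        activeNode.put("B", "node2");
        replicaNode.put("B", "node1");
    }

    // The node goes away, but nobody has failed it over yet.
    void shutDown(String node) {
        nodeUp.put(node, false);
    }

    // Failover: promote replicas for subsets that were active on the failed node,
    // and drop replicas that lived on it.
    void failOver(String node) {
        nodeUp.put(node, false);
        for (String subset : new ArrayList<>(activeNode.keySet())) {
            String replica = replicaNode.get(subset);
            if (node.equals(activeNode.get(subset)) && replica != null) {
                activeNode.put(subset, replica); // the replica becomes the active copy
                replicaNode.remove(subset);      // and there is no replica left for it
            } else if (node.equals(replica)) {
                replicaNode.remove(subset);      // the replica was on the failed node
            }
        }
    }

    // A subset can be served only if the node holding its active copy is up.
    boolean canServe(String subset) {
        return Boolean.TRUE.equals(nodeUp.get(activeNode.get(subset)));
    }

    // A subset is replicated only if it still has a replica on a node that is up.
    boolean isReplicated(String subset) {
        String replica = replicaNode.get(subset);
        return replica != null && Boolean.TRUE.equals(nodeUp.get(replica));
    }

    public static void main(String[] args) {
        FailoverSketch cluster = new FailoverSketch();
        cluster.shutDown("node1");
        System.out.println("after shutdown, A served: " + cluster.canServe("A"));
        System.out.println("after shutdown, B served: " + cluster.canServe("B"));
        cluster.failOver("node1");
        System.out.println("after failover, A served: " + cluster.canServe("A"));
        System.out.println("after failover, A replicated: " + cluster.isReplicated("A"));
    }
}
```

In this toy 2 node layout nothing is replicated after the failover, which is the unbalanced state a rebalance (with a healthy node added back) is meant to fix.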

About the Java SDK: your usage should be correct. You’re instructing the client to wait for the “main” node to acknowledge having written to disk, and for one of the replicas to have received the data.

However, it looks like there was a slight delay in replication and the operation timed out. Have you since tried increasing the timeout on the operation?
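
One plausible reading of "the document is deleted but I get a TimeoutException" is that the remove itself has already been applied while the client is still waiting for the replica acknowledgement when the timeout fires. The sketch below imitates that with plain `java.util.concurrent`, not with the SDK: `DurabilityTimeoutSketch`, `removeAndWait` and the delays are invented for illustration, and the real client may well work differently internally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustration only: hypothetical names, no Couchbase classes involved.
class DurabilityTimeoutSketch {

    // Pretends the replica acknowledges the delete after replicaDelayMs and
    // blocks for at most timeoutMs. Returns true if the acknowledgement
    // arrived in time, false if the wait timed out first.
    static boolean removeAndWait(long replicaDelayMs, long timeoutMs) {
        CompletableFuture<Void> replicaAck = CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(replicaDelayMs); // simulated replication lag
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            replicaAck.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The delete may still reach the replica later; only the wait gave up.
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("short timeout acked: " + removeAndWait(300, 50));
        System.out.println("long timeout acked: " + removeAndWait(50, 2000));
    }
}
```

On the SDK side, I believe the blocking API has overloads that take a timeout and a `TimeUnit` as the last two arguments, something like `bucket.remove(mydocname, PersistTo.MASTER, ReplicateTo.ONE, 10, TimeUnit.SECONDS)`, but please check the Javadoc for your exact version before relying on that.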