Unexpected Fail over warning

I’m trying to fail over a node and keep getting a warning (from the web admin) indicating that some data are not replicated:
“Attention – There are not replica (backup) copies of all data on this node! Failing over the node now will irrecoverably lose that data when the incomplete replica is activated and this node is removed from the cluster. If the node might come back online, it is recommended to wait. Check this box if you want to failover the node, despite the resulting data loss”

I’m very confused, as I just built this cluster for failover testing. I had a one-node cluster running for a while, so I added a second node and rebalanced the cluster. Everything went fine. Then I gracefully stopped one node (couchbase-server stop) and tried to fail it over from the web console, and I keep getting the above warning. It doesn’t matter which node I stop.

Is this warning supposed to show up in any case?
Is there any way to see which data is not replicated?

I’m using the 2.0.0 community edition (build-1976).

Thanks,
Bruno

I see the same errors with 2.01. This seems to occur even with an idle cluster, i.e. one with no activity going on. So either replication doesn’t work, or this error always appears regardless of the state of the data.

Hello,

This is the “expected” behavior. Let me explain it, with a cluster of 3 nodes and 1 replica.

You started with 1 node, so at that point you had only “active” documents (no replicas).

Then you added another node and rebalanced. Once that is done, each node holds 50% of the active data and 50% of the replica data.

Let’s add a third node, just to have a more “realistic” 3-node cluster. Once the node is added and the cluster is rebalanced, each node holds, as you can guess, 33.33% of the active data and 33.33% of the replica data.
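To make the percentages concrete, here is a minimal sketch of how active and replica copies spread as nodes are added. It is a toy model, not Couchbase's actual placement algorithm: the round-robin layout, the function names, and the node names are all assumptions made for illustration, and 1024 is used as the number of vBuckets (partitions).

```python
NUM_VBUCKETS = 1024  # assumed partition count for this illustration


def build_map(nodes, num_vbuckets=NUM_VBUCKETS):
    """Toy layout: vBucket i is active on node i % n, replica on the next node.

    With a single node there is nowhere to put a replica, so it is None.
    """
    n = len(nodes)
    vmap = []
    for i in range(num_vbuckets):
        active = nodes[i % n]
        replica = nodes[(i + 1) % n] if n > 1 else None
        vmap.append((active, replica))
    return vmap


def share(vmap, nodes):
    """Return {node: (active %, replica %)} for the given map."""
    total = len(vmap)
    result = {}
    for node in nodes:
        active = sum(1 for a, _ in vmap if a == node)
        replica = sum(1 for _, r in vmap if r == node)
        result[node] = (round(100 * active / total, 2),
                        round(100 * replica / total, 2))
    return result


if __name__ == "__main__":
    for cluster in (["node1"], ["node1", "node2"], ["node1", "node2", "node3"]):
        print(len(cluster), "node(s):", share(build_map(cluster), cluster))
```

With one node every vBucket is active and nothing is replicated; with two nodes each holds half of the active and half of the replica copies; with three nodes each holds roughly a third of both.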

As you may have noticed, a rebalance is an expensive operation, since the cluster has to move data (both active and replica) between all the nodes.

You now have a well-balanced 3-node cluster.

Now you stop one node, or one node crashes. This means that some of the data is not accessible (it is still there, just not available; you do not lose anything).

Here you have 2 options:

  • You restart the server. There is nothing else to do; the whole cluster is back online (a well-balanced 3-node cluster).

  • You fail over the node that is off. Let’s look at this in detail.

Failover:
So what happens here? Couchbase does the failover as fast as possible, to make sure all the data is available again (for reads and writes). The only thing it does is promote replicas to active, for the keys that were active on the node that is now off.

So what is the status now?

  • All the data is accessible for reads and writes by the application on 2 nodes, so you have 50% of the active data on each node.
  • BUT you do not have all the replicas, since:
    • the replicas that were on the node that is off are “not present”
    • the replicas that have been promoted are now active copies, so they no longer exist as replicas

This is why you see the message “Fail Over Warning: Rebalance required, some data is not currently replicated!” in your console.
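The failover step described above can be sketched as a small simulation. This is a toy model under stated assumptions, not Couchbase's real implementation: the replica placement rule, the function names, and the node names are hypothetical, and 1024 is used as the number of vBuckets.

```python
NUM_VBUCKETS = 1024  # assumed partition count for this illustration


def build_map(nodes, num_vbuckets=NUM_VBUCKETS):
    """Toy layout: active copies round-robin, replicas spread over the other nodes."""
    n = len(nodes)
    vmap = []
    for i in range(num_vbuckets):
        active_idx = i % n
        if n > 1:
            # Offset of 1..n-1 so a replica never lands on its own active node.
            offset = 1 + (i // n) % (n - 1)
            replica = nodes[(active_idx + offset) % n]
        else:
            replica = None
        vmap.append({"active": nodes[active_idx], "replica": replica})
    return vmap


def failover(vmap, failed):
    """Promote replicas for vBuckets that were active on the failed node.

    Nothing is copied between nodes here; that is the job of a rebalance.
    """
    for vb in vmap:
        if vb["active"] == failed:
            vb["active"] = vb["replica"]  # replica becomes the active copy
            vb["replica"] = None          # and no replica is left for it
        elif vb["replica"] == failed:
            vb["replica"] = None          # replica was on the failed node
    return vmap


def unreplicated(vmap):
    """Indexes of vBuckets that currently have no replica copy."""
    return [i for i, vb in enumerate(vmap) if vb["replica"] is None]


if __name__ == "__main__":
    cluster = build_map(["node1", "node2", "node3"])
    failover(cluster, "node3")
    missing = unreplicated(cluster)
    print(len(missing), "of", len(cluster), "vBuckets have no replica")
```

In this model, all vBuckets stay readable and writable after the failover, split roughly evenly between the two remaining nodes, but about two thirds of them have no replica until a rebalance recreates the missing copies.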

Does it make sense?

So to get back to a “balanced” state, you need to do a rebalance.

Note: when you fail over a node, it is removed from the cluster. To bring it back, you need to add it again and rebalance. (The data that is on that server is simply ignored.)

Hope this clarifies the message.

Some pointers about this:

Regards
Tug
@tgrall
