What is the difference between failover and removal of an Index node when each index has 1 replica?
How will indexes be redistributed if we add another node after failover/removal of a node?
@pruthvi Failover of a node means all copies of indexes and/or partitions that node hosts are no longer available. If every index had one replica (two copies in total), at least one copy of every index and partition should still be available on some other node. (One or two copies may remain for a given index/partition, depending on whether the failed node hosted a copy of that particular index/partition.)
Rebalance will rebuild missing replicas if there are enough remaining nodes and they have enough memory to host the recovered replicas. The Planner runs at the start of Rebalance to try to find an optimal placement of all the indexes, so it can move indexes/partitions to other nodes to try to keep resource usage as level as possible across the available Index nodes.
If you bring a new Index node in, Rebalance will give that new node indexes, again to even out the resource usage across Index nodes. Which ones go to the new node depends on the outcome of the Planner’s optimization algorithm, which is based on Simulated Annealing and is thus stochastic.
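To give a feel for what a Simulated Annealing placement search looks like, here is a toy sketch. It is not the actual Planner: the cost function (memory spread between the busiest and emptiest node), the index sizes, the node names and every parameter are assumptions made up for illustration. The only rule taken from the discussion here is that a node can hold at most one copy of a given index.

```python
import math
import random

def imbalance(placement, sizes, nodes):
    """Spread between the most and least loaded node (lower is better)."""
    load = {n: 0 for n in nodes}
    for (index, _copy), node in placement.items():
        load[node] += sizes[index]
    return max(load.values()) - min(load.values())

def anneal(placement, sizes, nodes, steps=5000, temp=10.0, cooling=0.999, seed=0):
    """Toy simulated annealing over placements of (index, copy) -> node."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = imbalance(current, sizes, nodes)
    best, best_cost = dict(current), cost
    keys = sorted(current)
    for _ in range(steps):
        key = rng.choice(keys)
        target = rng.choice(nodes)
        temp *= cooling
        # Constraint: a node may hold at most one copy of a given index.
        if any(n == target for (i, _c), n in current.items() if i == key[0]):
            continue
        old = current[key]
        current[key] = target
        new_cost = imbalance(current, sizes, nodes)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        else:
            current[key] = old  # reject the move
    return best, best_cost

# Hypothetical cluster: two copies of each index on n1/n2, n3 newly added.
sizes = {"idx_a": 4, "idx_b": 2, "idx_c": 2}
nodes = ["n1", "n2", "n3"]
start = {(i, c): ("n1" if c == 0 else "n2") for i in sizes for c in (0, 1)}
best, best_cost = anneal(start, sizes, nodes)
```

Because moves are picked at random, a different seed can send different indexes to the new node, which is why you cannot predict in advance exactly which ones will move.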
Thanks for the info. I just wanted to know which is the better option for an Index node, remove or failover, when you have 1 replica for each index.
@pruthvi Apparently I did not understand what you were originally asking. Now I think I get it – you are asking about the differences between
Remove a Node and Rebalance (Couchbase Docs: "Remove a Node and Rebalance")
Failover a node (Couchbase Docs: "Failover")
In the case of the Index Service, Failover is sort of like a “kill -9” of the service. Partitions / indexes / replicas hosted on that node will be lost at the time of Failover. That node will instantly be removed from the cluster and all traffic will be directed to other nodes. Normally one would only use Failover when the node is unresponsive, because Remove and Rebalance needs a working node and so cannot be used in that situation.
Remove a Node and Rebalance is instead a graceful way to take a currently responding node out of the cluster. This process will move all the partitions / indexes / replicas hosted by this node onto other Index nodes, if there are enough other Index nodes and they have enough memory. “Enough other Index nodes” means that to have one replica, there must be at least two other Index nodes remaining in the cluster, as each node can host only one copy of an index.
Thus if the cluster started with three Index nodes and every index had one replica, either choice will leave the cluster with only two Index nodes. However, they differ as follows:
Remove and Rebalance – at all times every index will still have one replica (assuming the two surviving Index nodes have enough memory), since all the indexes / replicas that were hosted on the removed node would be moved to one of the two surviving nodes before the target node was removed.
Failover – any indexes / replicas that were hosted on the failed-over node would no longer have a replica in the cluster, since all the indexes / replicas that were hosted on the failed-over node would just be lost. A later Rebalance will recreate missing replicas on the two surviving nodes (again assuming they have enough memory).
Thus in the above scenario, the Failover case contains a period of time where some indexes do not have replicas. If one of the remaining Index nodes also fails during that time period, then any such indexes would become entirely unavailable and queries that need them will start failing. (If a subsequent Rebalance is not performed, the replicas would remain missing indefinitely.) In the Remove and Rebalance case, there is no time at which any replicas are missing, so if one of the two surviving nodes subsequently also fails, all indexes still remain available.
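The three-node scenario above can be modelled in a few lines. This is only a sketch: the node names (n1 to n3), the index names (a, b, c) and the helper functions are all hypothetical, memory limits are ignored, and the graceful removal simply moves each copy to the least-loaded surviving node that does not already hold that index.

```python
def copies_per_index(placement):
    """Map index name -> number of nodes currently holding a copy."""
    counts = {}
    for indexes in placement.values():
        for index in indexes:
            counts[index] = counts.get(index, 0) + 1
    return counts

def failover(placement, node):
    """Hard removal: every copy hosted on `node` is simply lost."""
    return {n: set(ix) for n, ix in placement.items() if n != node}

def remove_and_rebalance(placement, node):
    """Graceful removal: move each copy off `node` before it leaves."""
    survivors = {n: set(ix) for n, ix in placement.items() if n != node}
    for index in sorted(placement[node]):
        # A node can hold only one copy of an index, so only survivors
        # that do not already have it are eligible.
        candidates = [n for n in sorted(survivors) if index not in survivors[n]]
        if candidates:
            target = min(candidates, key=lambda n: len(survivors[n]))
            survivors[target].add(index)
    return survivors

# Three Index nodes, three indexes, two copies of each (one replica).
cluster = {"n1": {"a", "b"}, "n2": {"a", "c"}, "n3": {"b", "c"}}

after_failover = failover(cluster, "n3")             # b and c drop to one copy
after_removal = remove_and_rebalance(cluster, "n3")  # every index keeps two copies

# A second node (n1) now fails in each case.
second_failure_after_failover = failover(after_failover, "n1")  # index b is gone
second_failure_after_removal = failover(after_removal, "n1")    # all still present
```

In this model, a second failure after the Failover path leaves index b with no copy at all, while after Remove and Rebalance the last node still holds one copy of every index.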
Thanks a lot for all your inputs and info