According to the documentation, the failure of one node in a 3-node cluster does not bring down the cluster or block any read operations. However, the documentation also states that some write operations are blocked: each data item has one master location in the cluster, so there must be a failover and rebalance before writes to those items can proceed. (See the quote below.)
My concern is how long this will take. Couchbase is not really “highly available” if some writes are blocked for hours until the DBA gets an alert about the node failure and completes a rebalance. Some applications do more writes than reads, so this is a big problem. My questions:
Do I misunderstand the situation here?
How long does it typically take for a 3-node cluster with a failed node to be fully back online (reads and writes)?
Is the problem improved by using 4-node clusters, so that a node failure still leaves 3 nodes?
We are doing mostly plain key/value reads and writes, at least for now.
“If a single node fails, the data on a node that failed will not accept writes until the node is failed over, although reads can be serviced from replicas if desired.” (from http://developer.couchbase.com/documentation/server/4.5/concepts/data-management.html)
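
To make my understanding concrete, here is a toy model of the behavior the quote describes. It is not Couchbase code. It assumes 1024 partitions (vBuckets) per bucket, a CRC32-based key hash, and an even round-robin assignment of active partitions to nodes; the layout and every function name are my own hypothetical simplifications, not the real client or server mapping.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed partition count per bucket


def vbucket_for(key: str) -> int:
    """Toy key-to-partition hash (assumption, not the real client mapping)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def active_node(vbucket: int, num_nodes: int) -> int:
    """Hypothetical layout: active copies are spread round-robin over the nodes."""
    return vbucket % num_nodes


def can_write(key: str, num_nodes: int, failed_node: int) -> bool:
    """A write is blocked only if the key's active copy lives on the failed node."""
    return active_node(vbucket_for(key), num_nodes) != failed_node


def blocked_vbucket_fraction(num_nodes: int, failed_node: int) -> float:
    """Share of partitions whose writes are blocked until the node is failed over."""
    blocked = sum(
        1 for vb in range(NUM_VBUCKETS)
        if active_node(vb, num_nodes) == failed_node
    )
    return blocked / NUM_VBUCKETS


if __name__ == "__main__":
    for nodes in (3, 4):
        share = blocked_vbucket_fraction(nodes, failed_node=0)
        print(f"{nodes} nodes: {share:.1%} of partitions reject writes")
```

If this model is roughly right, a 4-node cluster lowers the share of blocked writes from about one third to one quarter but does not remove it, so the time until failover still matters.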