We’ve been testing HA configurations and recovery using 4.0.0-4051 Community Edition. I am using a three-node cluster with one replica copy on each bucket, view index replicas enabled, and auto-failover set to 2 minutes. I took one node down hard. Immediately following the auto-failover timeout, the remainder of the cluster started what looked like a full re-index of several views. During that indexing, requests against the server (using the couchbase-1.3.12 ruby gem) would throw timeout errors.
Couchbase::Error::Timeout error=failed to execute HTTP request, Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout
Since this is in a test environment, each remaining server has plenty of capacity (as shown in the attached screenshot). Given the replication configuration including view index replicas, why would we be unable to perform queries against the cluster in this situation? Why are we suffering any kind of timeouts once the failover of the downed node has begun?
It took a full 18 minutes for us to have restored functionality, essentially once it had finished reindexing.