Behavior of View Queries During Failover/Rebalance

I'm curious about others' experiences with view queries using the different stale settings (ok, update_after, false) during failover, and separately during rebalance, on Couchbase 4.0 Community. In general, should we assume that anything other than stale=ok has the potential for significant application impact or failure?
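For concreteness, this is roughly how the three settings look from the Java SDK 2.x (a sketch only; the host, bucket, design document, and view names are placeholders):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.Stale;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;

public class StaleSettingsExample {
    public static void main(String[] args) {
        // Placeholder host and bucket names.
        Bucket bucket = CouchbaseCluster.create("127.0.0.1").openBucket("default");

        // stale=ok: serve whatever is already in the index, never trigger an update.
        ViewResult cached = bucket.query(
                ViewQuery.from("ddoc", "by_type").stale(Stale.TRUE).limit(10));

        // stale=update_after: serve the current index, then kick off an update.
        ViewResult updateAfter = bucket.query(
                ViewQuery.from("ddoc", "by_type").stale(Stale.UPDATE_AFTER).limit(10));

        // stale=false: wait for the index to catch up before responding --
        // the setting most exposed to timeouts while indexes rebuild.
        ViewResult fresh = bucket.query(
                ViewQuery.from("ddoc", "by_type").stale(Stale.FALSE).limit(10));

        System.out.println(cached.success() + " " + updateAfter.success() + " " + fresh.success());
    }
}
```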

For example, imagine losing a node in a 3-node cluster and then failing it over. Replicas and “view index replicas” are enabled, yet most of the views start to reindex, and during this period “stale: false” queries time out over and over until the respective views finish their post-failover re-index. Things can stay this way for tens of minutes, and even “stale: ok” queries occasionally time out. This seems to indicate that any system built on “stale: false” is suspect and subject to failure if it feeds anything of consequence: back-to-back timeouts are not going to get you a response.
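To illustrate the kind of workaround we've been experimenting with (again a sketch only, Java SDK 2.x with placeholder names; as far as I can tell the blocking API surfaces an exceeded per-query timeout as a RuntimeException): try stale=false with a short timeout, then fall back to stale=ok so the caller at least gets the last indexed data.

```java
import java.util.concurrent.TimeUnit;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.view.Stale;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;

public class StaleFalseFallback {
    /**
     * Try a stale=false query with a short per-query timeout; if the index is
     * still rebuilding and the call times out, fall back to stale=ok.
     */
    static ViewResult queryWithFallback(Bucket bucket, String ddoc, String view) {
        try {
            return bucket.query(ViewQuery.from(ddoc, view).stale(Stale.FALSE),
                    10, TimeUnit.SECONDS);
        } catch (RuntimeException timedOut) {
            // Coarse catch for the sketch: in practice you'd inspect the cause
            // (expected to be a wrapped TimeoutException) before falling back.
            return bucket.query(ViewQuery.from(ddoc, view).stale(Stale.TRUE),
                    10, TimeUnit.SECONDS);
        }
    }
}
```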

Does this match others’ experiences? 1) That a failover triggers a protracted re-index despite the replicas already being indexed (which seems counter-intuitive)? And 2) that stale:ok with retry is the only safe way to make it through a failover and subsequent rebalance if you need to avoid timeouts and get reasonable query response times during the recovery (say 10 s or less per query)?
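By “stale:ok with retry” I mean something like the following sketch (Java SDK 2.x again, placeholder names; the retry count and backoff are arbitrary):

```java
import java.util.concurrent.TimeUnit;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.view.Stale;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;

public class StaleOkRetry {
    /**
     * Retry a stale=ok view query a few times with a short backoff, to ride
     * out intermittent timeouts during failover/rebalance.
     */
    static ViewResult queryWithRetry(Bucket bucket, String ddoc, String view)
            throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                return bucket.query(ViewQuery.from(ddoc, view).stale(Stale.TRUE),
                        10, TimeUnit.SECONDS);
            } catch (RuntimeException e) {
                last = e;                       // likely a wrapped TimeoutException
                Thread.sleep(1000L * attempt);  // linear backoff between attempts
            }
        }
        throw last;  // all attempts failed; surface the last error
    }
}
```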

Hi @polfer
There are cases where you may get timeouts because of resource issues or because the system is not able to respond. Are you seeing the timeouts during failover or during rebalance?

The timeouts occurred during a failover. We have since learned this may have been an issue with Couchbase 4.0.0 Community that has already been addressed in 4.1.1. One of our 3 nodes had been failed over following a memory issue on the box, and as soon as the failover started we saw most of our views begin to reindex, along with very high CPU utilization on the 2 remaining nodes.

Similar load issues during rebalance attempts (we’d see the rebalance stall with two nodes at 0% CPU and the third near 100%) prevented us from completing a rebalance, so over the weekend we built another cluster and carefully used a combination of backup restoration and XDCR to bring the new cluster into service.

OK, sorry to hear you had issues with this. We are about to put out the 4.1 CE release, which includes a few other fixes in this area that may help. Please watch the downloads page for an update.