How about the hard way?
Couchbase failure scenarios tend to cause downtime, which is a real problem in the minds of system administrators accustomed to memcached, where the failure of one node doesn’t cause an outage for the whole farm. When a Couchbase node fails, the cluster freezes until the failed node is failed over. This is hard for me to explain in a way that’s acceptable to people who wear the pager for an environment that really tolerates zero downtime, particularly when the point of failure is just a cache cluster. It’s especially hard to explain because Couchbase is supposed to be all about high availability and resilience.
Has anyone developed a monitoring system that can make a Couchbase cluster more reliable from the point of view of a large number of clients? The scenario we keep discussing is “taking a Sawzall to two drives shouldn’t cause a total outage of the farm.”
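To make the question concrete, here is roughly the kind of watchdog logic I have in mind. It is only a sketch of the decision step, not something we run. It assumes each poll yields a list of node records with `otpNode` and `status` fields, which is how I understand the output of `GET /pools/default`, and that the names it returns would then be passed to `POST /controller/failOver`. The class name, the threshold, and the failover cap are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverWatchdog:
    """Hypothetical helper: decides which nodes to fail over from poll results."""
    threshold: int = 3      # consecutive unhealthy polls before acting (assumed value)
    max_failovers: int = 1  # cap so a network blip can't eject the whole farm (assumed value)
    strikes: dict = field(default_factory=dict)
    failed_over: set = field(default_factory=set)

    def observe(self, nodes):
        """Take one poll's node list; return the otpNode names to fail over now."""
        to_fail = []
        for node in nodes:
            name = node["otpNode"]
            if name in self.failed_over:
                continue  # already handled, don't act twice
            if node["status"] == "healthy":
                self.strikes[name] = 0  # any healthy poll resets the count
                continue
            self.strikes[name] = self.strikes.get(name, 0) + 1
            over_threshold = self.strikes[name] >= self.threshold
            under_cap = len(self.failed_over) < self.max_failovers
            if over_threshold and under_cap:
                self.failed_over.add(name)
                to_fail.append(name)
        return to_fail
```

The cap is the part I am least sure about: for the two-drive scenario it would have to be at least 2, and presumably no higher than the number of replicas configured, but that is exactly the kind of thing I am hoping someone with production experience can confirm.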