Handling durability requirements failure in Couchbase


#1

Recently I’ve started investigating Couchbase Server (for the first time) as a candidate for a project. A particular scenario I’m looking at right now is how to make Couchbase act as a reliable “source of truth”, which is why I’m digging into the durability aspect.

So here is a snippet from ACID Properties and Couchbase:

If the durability requirements fail, then Couchbase may still save the document and eventually distribute it across the cluster. All we know is that it didn’t succeed as far as the SDK knows. You can choose to act on this information to introduce more ACID properties into your application.

So imagine this: I insert/update a document and the primary node fails before the data makes it to any replica. Let’s say the primary is gone for a long time. At this point I don’t know whether the data was written to disk… The scary part here is that “Couchbase may still save the document and eventually distribute it across the cluster”. Meaning, as far as the client can tell, the data didn’t make it, so a user would see an error, but then all of a sudden the data may appear in the system if the primary comes back online.
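
For concreteness, here’s roughly the kind of call I have in mind. This is just a sketch against the Java SDK 2.x API; the cluster address, bucket, and document contents are made up for illustration:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.PersistTo;
import com.couchbase.client.java.ReplicateTo;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.error.DurabilityException;

public class DurabilityExample {
    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("localhost").openBucket("default");

        JsonDocument doc = JsonDocument.create("user::42",
                JsonObject.create().put("name", "Alice"));

        try {
            // Don't return until at least one replica has the mutation in memory.
            bucket.upsert(doc, PersistTo.NONE, ReplicateTo.ONE);
        } catch (DurabilityException e) {
            // The requirement wasn't met in time. The document may or may not
            // have been stored on the primary: this is exactly the ambiguous
            // state the documentation describes.
            System.err.println("Durability requirement failed: " + e);
        }
    }
}
```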

Am I reading this statement correctly? If I am, what’s the best practice to handle it with Couchbase?


#2

This was discussed on StackOverflow: https://stackoverflow.com/questions/52648356/handling-durability-requirements-failure-in-couchbase


#3

The challenge in answering this one is that it depends on the actions you take operationally. The system is flexible here and, in effect, how you handle it is partially in your control.

One scenario could be that the node fails after the item has been acknowledged as persisted, but before it has been replicated. In that case, if you do not fail over the node, that item will be replicated when the node comes back online.

Another scenario is that you have autofailover enabled, and after the item is received by the primary but before it is replicated or persisted, autofailover kicks in and promotes a replica to primary. In this case, your application will have seen the failure to achieve the requested durability requirement. If the previous primary does come back online, it will resync with the state of the current cluster before rejoining, meaning the current cluster’s view, where the promoted replica is now active, wins.
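
To make that concrete, here is one way an application might react to seeing that failure, assuming its writes are idempotent. A sketch, not a prescription; the method name, retry count, and backoff are hypothetical (Java SDK 2.x API):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.PersistTo;
import com.couchbase.client.java.ReplicateTo;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.error.DurabilityException;

public class DurabilityRetry {
    static JsonDocument upsertWithRetry(Bucket bucket, JsonDocument doc, int attempts)
            throws InterruptedException {
        for (int i = 1; i <= attempts; i++) {
            try {
                // Require one in-memory replica copy before considering it done.
                return bucket.upsert(doc, PersistTo.NONE, ReplicateTo.ONE);
            } catch (DurabilityException e) {
                // Ambiguous state: the write may still surface later if the old
                // primary recovers. A full-document upsert is idempotent, so
                // retrying (e.g. after autofailover promotes a replica) is safe
                // for this data shape.
                Thread.sleep(1000L * i);
            }
        }
        throw new IllegalStateException("durability requirement not met after retries");
    }
}
```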

There isn’t a single best practice, arguably, because there isn’t a single “what do I do in this situation?” answer. Many of our users prioritize availability over higher levels of durability in the face of failure: they turn on autofailover and let their applications, which have more context, decide what to do for particular data mutations. Other users prefer not to use auto-failover and have a human involved in deciding what to do next. Yet others accept a longer period of unavailability while they attempt to recover the node.

I hope that helps. Is there a specific goal you have with this update? If so, we can maybe suggest some options.


#4

Yeah, I know, as I was the one who kicked it off :smiley:


#5

Nice explanation! Thank you!

So please help me understand the second scenario with autofailover. When the former primary comes back online with locally persisted but not replicated items and starts resyncing, do those items get wiped out or something?

As for my goal, it’s just a general investigation of how Couchbase deals with node failures.


#6

Yes, and that’s really intentional. You could look at those previously persisted items as an “alternate history” that didn’t play out. When the failure occurred, the cluster picked a new starting place, got everyone to agree, and started the universe moving forward from there. When the old node recovers and tries to join this universe, it has to do so with a shared understanding of that universe, which potentially means dropping data that wasn’t passed along.

Of course, in practice, since replication is memory-to-memory and disk IO tends to have higher latency (the replication and the persistence of an item are scheduled concurrently), items will tend to be replicated before they are persisted, but there is no guarantee. Also, the app (through the SDK) has some ability to influence outcomes with the Durability Requirements features we were talking about.
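
For example, you can require both persistence on the active node and replication to a replica in a single call. A small sketch, reusing the bucket and document names from the earlier snippet (Java SDK 2.x):

```java
// PersistTo.MASTER waits for the active node to write the item to disk;
// ReplicateTo.ONE waits for one replica to hold it in memory. Together
// they narrow, but do not eliminate, the window described above.
bucket.upsert(doc, PersistTo.MASTER, ReplicateTo.ONE);
```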

There are some other things we’re thinking about adding in the future (we can share more on that later), but I’d expect they’ll just increase flexibility over what we’re talking about here. If that fits into your “investigation”, we can certainly chat directly.


#8

Thank you so much! It’s really awesome and very good news! I’ll share the details you’ve provided on StackOverflow.