Couchbase high durability


#21

But, I think I get what you’re asking. Assuming your application never crashes, and you never lose your network connection to the Couchbase cluster, what specifically can you do to improve durability with Couchbase? Is that right?

Yes. My main issue is that once I have told my end user that a transaction is completed, it must not be lost. If they view their transaction history, they must see the completed transaction at any time after my confirmation, no matter what happens. It must be durable after confirmation, or no confirmation should be sent to the end user if we cannot ensure durability.

I know that your example is idempotent, but I don’t see how it can help me. I know that idempotency is the solution to my problem; I just don’t understand how exactly to implement it.

See https://blog.couchbase.com/multi-document-transactions-acid-couchbase-2/
My logic is more complex than what is in the blog’s tutorial. Let’s change the states a bit:

public enum TransactionStates
{
    Initial = 0,
    Pending = 1,
    Committed = 2,
    Done = 3,
    Cancelling = 4,
    Cancelled = 5,
    Failing = 6,
    Failed = 7
}

If a transaction remains in processing for more than 15 minutes, we must move it to Failing.

My concern is just the Done state. I want to create a mechanism that guarantees we don’t enter the Done state unless we can guarantee that it either remains in Done, or, after a failure, returns to Done again.

My concern is that I cannot create this mechanism with idempotency alone, even though all operations that change transaction states are idempotent.
My issue is the 15-minute timeout, and the rare case where the transaction is completed on the primary node but is still being processed on a secondary node. On failure, if we restart with the secondary node and repeat the logic, after a while everything is OK; but after 15 minutes the transaction goes to Failing and then Failed instead of Done. I have sent the end user a confirmation that is no longer correct, and there is no way for human reconciliation, so the payee loses the money.
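One way to make that 15-minute timeout safe is to treat Done (and the other terminal states) as a point of no return that the timeout sweeper is never allowed to cross: the sweep only fires on transactions that have not yet committed. A minimal sketch of that idea, with entirely hypothetical class and method names (this is not from the Couchbase SDK or the blog post):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.EnumSet;

// Hypothetical model of the transaction state machine discussed above.
public class TxnStateMachine {
    public enum State { INITIAL, PENDING, COMMITTED, DONE, CANCELLING, CANCELLED, FAILING, FAILED }

    // States the timeout sweeper is allowed to touch. Once a transaction is
    // COMMITTED or DONE it has passed the point of no return and must never
    // be moved to FAILING, no matter how old it is.
    private static final EnumSet<State> TIMEOUT_ELIGIBLE =
            EnumSet.of(State.INITIAL, State.PENDING);

    private static final Duration TIMEOUT = Duration.ofMinutes(15);

    public State state = State.INITIAL;
    public Instant startedAt = Instant.now();

    // Idempotent: calling this any number of times yields the same result.
    public State sweep(Instant now) {
        boolean expired = Duration.between(startedAt, now).compareTo(TIMEOUT) > 0;
        if (expired && TIMEOUT_ELIGIBLE.contains(state)) {
            state = State.FAILING;
        }
        return state;
    }
}
```

With this guard, a transaction that reached Done on the primary is left untouched by the sweep even 20 minutes later, so a restart on a secondary node cannot produce a Failed state that contradicts a confirmation already sent.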

How should I log it? In a simple file? My application is distributed too; must I use a central logging system? Is logging really needed? It makes my application more complex.


#22

Ah, then I think you’re ok. Say you’ve got 1 active + 1 replica, and you do your write with ReplicateTo=1. If that write succeeds then, unambiguously, you know it’s reached the active and the replica, and is therefore now persisted. Once you’ve done that for all mutations, only then can you return success to your user. (Under the hood, the SDK is effectively polling the replicas until it knows the mutation has been written.)
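To make the polling idea concrete, here is a self-contained toy model of observe-style durability (an in-memory stand-in, not the actual Couchbase SDK; all class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of observe-based durability: a write only "succeeds" once it is
// visible on the active node plus the requested number of replicas.
public class DurableWriteModel {
    static class Node {
        final Map<String, String> data = new HashMap<>();
        boolean up = true;
    }

    final Node active = new Node();
    final List<Node> replicas = new ArrayList<>();

    boolean writeWithReplicateTo(String key, String value, int replicateTo) {
        active.data.put(key, value);        // the mutation lands on the active first
        int replicated = 0;
        for (Node r : replicas) {           // "poll" each replica
            if (r.up) {
                r.data.put(key, value);     // replication observed on this node
                replicated++;
            }
        }
        // Only report success once enough replicas have seen the mutation.
        return replicated >= replicateTo;
    }
}
```

Note the failure case: if the replica is down, the call reports failure even though the active already holds the mutation, which is exactly the ambiguity discussed further down in this thread.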

How should I log it? In a simple file? My application is distributed too; must I use a central logging system? Is logging really needed? It makes my application more complex.

Well, a banking app, I would think, is probably going to be rather complex :slight_smile: Yes, you would need your logging to be reliable, and probably centralised, so another instance of your app can take over if needed. Take Couchbase out of the equation: your end user clicks a button to start a transfer from one account to another. Do you ever want to lose a record of that event?

This is what I’ve been trying to convey throughout: achieving ultra-high durability is not (solely) a database problem; it requires persistence and reliability throughout your system. For 100% durability, I feel you need at least two sources of truth for the data: e.g. the log maintained by the application, and the database. This allows you to reconcile in cases where, for example, you start the write to the database and then your application immediately crashes.
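As a sketch of that reconciliation idea (all names hypothetical, with plain in-memory maps standing in for a real durable log and a real database):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the "two sources of truth" idea: an application-maintained
// intent log plus the database itself.
public class Reconciler {
    final Map<String, String> intentLog = new LinkedHashMap<>(); // source of truth #1
    final Map<String, String> database = new HashMap<>();        // source of truth #2

    // 1. Record the intent durably BEFORE touching the database.
    void beginTransfer(String txnId, String details) {
        intentLog.put(txnId, details);
    }

    // 2. Apply it to the database (the app may crash between steps 1 and 2).
    void applyToDatabase(String txnId) {
        database.put(txnId, intentLog.get(txnId));
    }

    // 3. On restart, replay any logged intent the database never saw.
    int reconcile() {
        int replayed = 0;
        for (Map.Entry<String, String> e : intentLog.entrySet()) {
            if (!database.containsKey(e.getKey())) {
                database.put(e.getKey(), e.getValue());
                replayed++;
            }
        }
        return replayed;
    }
}
```

The key property is that step 3 is idempotent, so it can run on every restart, on any instance, without double-applying a transfer.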


#23

Ah, then I think you’re ok. Say you’ve got 1 active + 1 replica, and you do your write with ReplicateTo=1

I just cannot understand why I must have a replica. If I have 1 active + 1 replica with ReplicateTo=1, then when one node goes down, I cannot write any more. I know that I can still read, but all writes are persisted only on the primary, and it is unclear to me what I must do afterwards: if the replica node returns and syncs with the primary, then all the writes that had failed will succeed (Done state), even though I told the end user the transaction was not completed!!! Oh my God, it is a vicious cycle :cry:

I know you told me I must have a logging system too, and that the problem is not a Couchbase problem; it is a fundamental problem.

Can you suggest a book or article that would help me? I need to create a reliable application.


#24

Yes, well, if you’re going for maximum durability, 1 active + 1 replica isn’t really enough, for the reason you’ve given. So I would suggest the following (with my usual disclaimer that these are the personal thoughts of a fairly new employee rather than an official Couchbase recommendation, that these settings will, I think, be overkill for most users, and that they are suggested purely in the context of trying to absolutely maximise your durability):

  1. Run lots of nodes, diffusing the risk/impact of hardware failure. At least 4, to support:

  2. Use the active + the full 3 replicas, and write with ReplicateTo=2. Hardware is unreliable, and nodes can and will go down or be unavailable. This lets you survive 1 node going down while your ReplicateTo=2 writes continue to succeed, and your data is safely replicated to 3 locations. As long as your durable write has succeeded, I don’t think you can lose data in this 1-node-down scenario (though I would welcome comment from my more experienced colleagues). Once that downed node’s replica is back, it will safely sync back up with the up-to-date active.
    You could be even safer and write with ReplicateTo=3, but then you have to manage some more complexity in the app (e.g. retrying failed/timed-out writes with ReplicateTo=2).

  3. Use autofailover, with a low timeout (say 10 seconds). If a node does go down, this will fail it over quickly, promoting its replicas so the cluster stays writable. Note that autofailover will fail over a maximum of 1 downed node.

  4. Possibly, use PersistTo in addition to ReplicateTo. We don’t generally recommend it, as ReplicateTo is faster and more suited to a distributed system, but it is an extra layer of protection if you’re going for maximum durability. You’ll want fast I/O hardware if you go with this, I feel.
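The retry strategy mentioned in point 2 might look roughly like this; the writer function here is a made-up stand-in for a real SDK write, purely to show the fallback shape:

```java
import java.util.function.BiPredicate;

// Sketch of the fallback strategy from point 2: attempt the write at the
// strictest durability level first, and on failure/timeout retry at a
// lower (but still acceptable) level.
public class DurabilityFallback {
    // writer.test(key, replicateTo) returns true if the write was observed
    // on that many replicas before timing out. Purely illustrative.
    static int writeWithFallback(BiPredicate<String, Integer> writer,
                                 String key, int preferred, int minimum) {
        for (int level = preferred; level >= minimum; level--) {
            if (writer.test(key, level)) {
                return level;   // durability level actually achieved
            }
        }
        return -1;              // could not meet even the minimum: surface an error
    }
}
```

Surfacing the -1 (rather than silently accepting a weaker write) is what keeps the “no confirmation without durability” contract intact.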

Also, as Shivani has mentioned above, we are currently working actively on improving durability semantics. Unfortunately there’s nothing I can go into details on, as it’s still under active discussion, but the improvements will certainly help in these situations.