Detecting deletions vs partial replication in CBL 2.0

thebnich · May 21, 2018, 10:09pm

Let’s say I have a simple store of Account and Transaction documents, where each transaction holds the corresponding Account ID. I have two clients that sync the Accounts and Transactions. If the user deletes an Account, all Transactions associated with that Account should be deleted.

Now, assume Client A has these documents:
{ id: “abc123”, type: “Account” }
{ id: “def456”, type: “Transaction”, account: “abc123”, amount: 10.0 }
{ id: “ghi789”, type: “Transaction”, account: “abc123”, amount: 25.0 }

And Client B looks like this:
{ id: “ghi789”, type: “Transaction”, account: “abc123”, amount: 25.0 }

I can envision at least two possible scenarios causing B to arrive at this state:

Client B has only Account “abc123” and Transaction “def456” stored; Client A then adds transaction “ghi789” that hasn’t yet been pulled to B. The user deletes the account on B, deleting all associated transactions (just “def456”). After the Account is deleted, “ghi789” gets synced to B, creating a dangling transaction that points to a non-existent account.
The user has all three documents on Client A, then adds Client B as a new second device. A partial pull happens, and “ghi789” happens to be the first document to be synced.

To clean up the dangling Transactions caused by scenario 1, Client B has a change listener that automatically deletes created Transactions that point to missing Accounts. However, if scenario 2 has happened, B will actually be deleting new Transactions that would have been valid once the sync has completed.

Unfortunately, database.document(withID: "abc123") will return nil in the case of a deleted document, but also in the case of a document that hasn’t been created in the first place – so how can a client differentiate between cases 1 and 2?

priya.rajagopal · May 22, 2018, 1:38pm

You cannot handle transactions of this kind (atomicity spanning documents) on the client side in a distributed system. Factoring in network connectivity, that’s not practical . There will be conflicts that you will have to handle depending on business logic that makes sense.

Couple of options

I think having clients delete transaction documents when they don’t see corresponding accounts is potentially dangerous because you can never guarantee that a particular client has list of transactions.
Re-think your data model and embed your transactions in a single account document so you get the document level atomicity that you need.
The source of truth should probably be on your server / backend which has a more complete view of documents from all the clients. Even then, given that this is a distributed system, it may not be guaranteed to achieve atomicity across documents. But one option would be to expand your business logic to track deleted accounts and when transactions come in for deleted accounts, you delete those transactions.

BTW, while written for Couchbase server, this blog post should be a useful read .

thebnich · May 22, 2018, 11:10pm

Re-think your data model and embed your transactions in a single account document so you get the document level atomicity that you need.

Interesting idea; I hadn’t considered grouping transactions because I assumed that it would become much harder to query data, but these examples seem promising.

Could this scale on a much larger set of data, say 10,000 or even 100,000 transactions? I assume document subdata like this can’t be indexed. It’s also hard to imagine that replication would work well with such a huge document size (I’m assuming it would have to calculate a full diff for each write).

The source of truth should probably be on your server / backend which has a more complete view of documents from all the clients. Even then, given that this is a distributed system, it may not be guaranteed to achieve atomicity across documents. But one option would be to expand your business logic to track deleted accounts and when transactions come in for deleted accounts, you delete those transactions.

I suppose this would be easier to do on the server side; the issue above where database.document(withID: "abc123") returns nil won’t happen on the server, but will instead have a _deleted: true property IIUC.

Though looking at the docs, it seems that Couchbase Lite 2.0 should also have _deleted tombstones, but that wasn’t the case from my testing as I mentioned. Are those docs correct?

borrrden · May 22, 2018, 11:34pm

Your testing and docs are both correct at the same time. They will have tombstones, it’s just that those are not exposed from the CBL API to the user. Documents that have been deleted are the same externally as documents that never existed.

priya.rajagopal · May 23, 2018, 2:20am

Yes. All valid trade offs to be considered when your choose the “embed” versus “reference” model for JSON.
CBL 2.0 does not yet support indexing on nested arrays . Well- as far as replication , Its not really the diff on each write - we don’t do delta compression as yet but very large documents would impact bandwidth and memory usage. Of course “large” is a relative term and it would depend on size of each transaction in your application - do you have an idea of the document size with 100,000 transactions . Since embed data model come with own set of unique benefits (no joins needed , atomicity etc), I want to ensure this is thought through .

( BTW CB server does support sub document queries and un nest operations ).

Also, as an aside - something to consider (could be applicable in your case) , if the number of transactions can grow significantly high, determine if you need to always have all transactions synced to the clients. One rough idea is to, segregate account / transaction documents by month for example (assuming embed model) - have a current channel and a series of history channels for each previous month . The current channel only holds the transactions that are recent (current month) and thats what gets synced - the rest can be synced on demand. Of course you need server side logic to manage some of this stuff so older documents are moved to the appropriate channel at end of each month but this may become relevant if your data set grows massively

thebnich · May 23, 2018, 5:34pm

Well- as far as replication , Its not really the diff on each write - we don’t do delta compression as yet but very large documents would impact bandwidth and memory usage.

Does this mean each new Transaction would require that the entire Account document – with every child Transaction – be fully synced? Each Transaction will be around 50 bytes, so even just 10,000 transactions would require sending half a megabyte of data per update.