Are Couchbase documents replicated between clusters in a specific order?

andreas.nilsson · April 16, 2019, 8:43am

Hi,

We have an issue with Couchbase replication, and are now wondering if our expectations are perhaps a bit off.

The part of our stack that is relevant include the following pieces:

Couchbase - version 5.0.x Community edition, clustered
Kafka
Kafka Connect with the Couchbase Source connector
Our application

The flow of data and logic here follows the following pattern:

Data comes into Couchbase - documents being written
Kafka Connect Couchbase source connector picks of changes and writes events to a topic in Kafka
Our application sees the events in the Kafka topic and starts reading from Couchbase

Our setup normally consists of a single Couchbase cluster in a single data center, and this have been working fine for a long time now. However, as of late one of our clients have been wanting to expanding into a secondary data center. The first thought was to let Couchbase handle all the replication since everything else in the setup is actually derived in one way or another from Couchbase data (more or less). We set this up using uni-directional replication (master -> slave).

What we have been experiencing however is that while we do get the events in the slave cluster as we would expect, the order is not the same as in the master cluster. I can’t be sure whether this means the documents in Couchbase are written in this order or whether the issue lies with the source connector and the events, but I’m guessing Couchbase actually replicates vbucket by vbucket separately, and thus this is actually “expected” behaviour.

It’s a bit hard to find exact details on how Couchbase replication (at least in the main documentation) works, but that might be because as users we are not supposed to care about the details.

Now, our application doesn’t exactly depend on the order of events, but there are some dependencies between documents in Couchbase. Meaning one single logical entity consists of 3 separate documents. In this case, what seems to be happening is that one of the documents (the one triggering the Kafka event the application cares about) is written in Couchbase and the resulting event seems to reach the application before the other documents are replicated.

What is the semantics for Couchbase replication? I realize we probably shouldn’t depend on a certain replication order always being true, and we can certainly handle this case with fallbacks and retries, but if there is something intrinsically incorrect in our understanding of Couchbase replication that makes this kind of behaviour unfeasible, we certainly need to educate ourselves. Also, what kind of “lag” / latency can we realistically expect in an otherwise fully working and connected environment?

If the issue here lies with the replication being done on a vbucket level, I guess the only real solution is to either fallback / retry until the document do exist or to make sure all linked documents are hashed in the same bucket?

Thanks in advance for any pointers or tips!

Regards,
Blasphemic

matthew.groves · April 19, 2019, 1:07pm

Hi @andreas.nilsson,

I’ll do my best to provide you some help, and I’m also going to tag some of our engineers to assist.

As far as XDCR (cross-data center replication), I think you are right to assume that documents will come over in undefined order. I couldn’t find any documentation about vbucket replication to this effect, but it makes sense given what you are experiencing. XDCR also does batching, which could also be a culprit in the situation you describe.

I don’t think the issue lies with the Kafka connector, though again, I don’t think you can (or should) expect any particular ordering of messages being put into Kafka via the connector.

As far as your question about ‘lag’ (latency) between clusters, that’s a hard question to answer generally, because it depends on a number of factors. However, we do publish a lot of benchmarks, including XDCR.

I’m tagging @david.nault as someone who might be able to provide some more insight.

david.nault · April 19, 2019, 8:11pm

Hi Andreas,

This is definitely true of the Kafka Connector. Publication order is only guaranteed for documents in the same vbucket. I suspect XDCR works the same way since it watches for database changes using the same protocol as the connector. I’m not certain though.

If you decide you want related documents to be in the same vbucket, there’s some code in the Elasticsearch connector that might help with that.

Thanks,
David

andreas.nilsson · April 23, 2019, 6:59am

Thanks for the answers, both of you! This certainly brings some additional clarity to the issue at hand.

I believe the short term solution for us at this time is to be able to detect and fallback / retry in these situations. I don’t however think it can ever be a long term solution… documents missing in the document storage is a too general error case to distinguish from other errors.

For now, however, I believe I got answers to my questions. Thanks!

Regards,
Blasphemic

andreas.nilsson · April 24, 2019, 7:33am

Hi again!

A follow-up question has come up.

The document key example from the Elasticsearch connector referred above takes the approach of force-generating the document ID to be CRC32-hashed (the vbucket algorithm) into a certain vbucket, but unfortunately it’s a brute-force approach that can actually (even if rarely) fail as well as hog a lot of CPU resources if you are generating a lot of IDs. You might also have to change the structure of your document keys a bit.

Is there a way (supported) to do it the other way around instead, meaning actually replace the hashing algorithm being used? I believe this is a pure client operation in Couchbase and that the server only supply the node-map. Of course, you then run into the issue of needing every Couchbase client (every one of our application clients as well as the Couchbase Admin UI itself, perhaps?) to be using the same algorithm, but it’s good to know if it’s an option.

I’ve seen one or two references to this approach in other forums, but never any concrete information.

Thanks in advance!

Regards,
Blasphemic

david.nault · April 24, 2019, 7:10pm

Hi Blasphemic,

You are correct, the client is responsible for mapping a key to a vbucket (server doesn’t care, just stores what it’s given). Also correct, the admin web UI fails when a document is not in the expected vbucket. I suspect it would also impact the command-line tools.

There’s currently no supported way to tell the SDKs to use a custom hashing algorithm. But I agree that would be more elegant than the brute force kludge I linked to above.

Thanks,
David