- "When comparing these values, it's important to interpret them as unsigned 64-bit integers." - In the Kafka connector I get this as a long anyway. I hope I can directly use the long value for comparison.
If you’re using Java, you’ll want to use Long.compareUnsigned(x, y) to see which of two sequence numbers is greater. If the language you’re working with has native support for unsigned 64-bit integers, then this isn’t an issue at all.
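To make the pitfall concrete, here is a small sketch (the class and method names are illustrative, not part of the connector API). A signed long comparison gives the wrong answer once a sequence number has its high bit set, while Long.compareUnsigned handles the full unsigned 64-bit range:

```java
public class SeqCompare {
    /** True if seqA comes before seqB when both are read as unsigned 64-bit values. */
    static boolean isEarlier(long seqA, long seqB) {
        return Long.compareUnsigned(seqA, seqB) < 0;
    }

    public static void main(String[] args) {
        long small = 5L;
        long huge = -1L; // bit pattern of 2^64 - 1, the largest unsigned value

        System.out.println(small < huge);           // false: signed compare is wrong here
        System.out.println(isEarlier(small, huge)); // true: unsigned compare is correct
    }
}
```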
- I get the fact that the same sequence numbers can be reassigned in the case of a fail over scenario (when vBucket moves from one node to another), but you mentioned that it might not be directly comparable. Is there any alternative or it has to be in the application logic?
Without persistence polling, the application logic would need to consult the failover log. It can get complicated, which is why I tried to gloss over it.
When persistence polling is enabled you should be able to simply compare the sequence numbers.
3.1. What do you mean by persistence polling? I don’t see any reference to it in the connector documentation. How do I enable it?
Persistence polling is a rollback mitigation strategy where the DCP client waits for changes to be persisted to all replicas before telling the connector about the change. It’s enabled by setting the couchbase.persistence_polling_interval connector config property to a non-zero duration.
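For example, the connector config might include something like the following (the duration value shown here is illustrative; choose one appropriate for your latency requirements):

```
# Enable persistence polling by setting a non-zero duration.
couchbase.persistence_polling_interval=100ms
```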
3.2. And does this mean the connector has a built-in way to adjust the sequence numbers so that ordering is maintained and we can directly compare them?
That’s persistence polling, yes.
- While reading the DCP link, I came across snapshots. If my understanding is correct, the connector becomes resilient to Connect cluster failure only if its use_snapshots setting is set to true.
The Kafka connector’s use_snapshots config property doesn’t do anything except cause OutOfMemoryErrors. It will be removed in a future release; in the meantime I’d recommend setting this to false.
When the same Couchbase document is modified twice, the DCP protocol allows the server to “de-duplicate” the event stream and send only the second version of the document. For example, let’s say an application creates document A, then document B, and finally updates document A. The “real” sequence of events looks like this:

A1 B1 A2
The DCP protocol allows the server to de-duplicate the modifications to document A and send this instead:

B1 A2
If you’re reading the stream one event at a time, there’s a period when you would know about only document B, even though document A was created first. Snapshots are a way to retain a consistent view of all documents. In this case, the server presents B1 A2 in the same snapshot. The idea is that if you process an entire snapshot at once, you know you’re looking at documents that all existed together at the same point in time.
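The de-duplication behavior described above can be illustrated with a toy sketch (the class, method, and event format are made up for the example): for each document key, only the latest mutation in the stream survives, positioned where that latest mutation occurred.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

public class DedupSketch {
    // Toy model of DCP de-duplication. Events are strings like "A1" where the
    // first character is the document key and the rest is the mutation number.
    static List<String> deduplicate(List<String> events) {
        LinkedHashMap<String, String> latest = new LinkedHashMap<>();
        for (String event : events) {
            String key = event.substring(0, 1);
            latest.remove(key);        // drop any earlier mutation of this document
            latest.put(key, event);    // re-insert at the end, in stream order
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        // The "real" stream A1 B1 A2 de-duplicates to B1 A2.
        System.out.println(deduplicate(List.of("A1", "B1", "A2"))); // [B1, A2]
    }
}
```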
For the Kafka connector, snapshots don’t provide any value, since we send the events to the topic one at a time. The only thing the use_snapshots setting does is buffer an entire snapshot into memory before sending the messages to the topic. The messages are still published one at a time, without any indication that they belong to the same DCP snapshot.
Incidentally, there’s an open enhancement request MB-26908 to allow disabling de-duplication (and eliminating the need for snapshots). This would be a boon for the connectors, but it’s not clear whether a high-performance solution is feasible.