Is there any bound to Mutation deduping

Ok I get it: https://issues.couchbase.com/browse/KAFKAC-237

DCP will dedup redundant mutations to the same key.

Is there any bound on this? Is there a minimum time between updates that one could assume would ensure they wouldn't be deduped?

We wanted to use the stream (among other uses) to track slowly changing dimensions. If I knew that changes to the same field happening, say, a day apart would not be deduped, that would accommodate some use cases.

Hi @naftali,

Yes, even a few seconds apart should be enough to track your changes. Current customers already do this, understanding that if they get a burst of 100 updates in a second they will only see the most current document.

Below I will give some interesting alternatives that might be applicable to your architecture, where I use Eventing, a real-time lambda service, to help out.

CASE 1: Avoiding Dedup

Note that if you have control of writing your data, you could always write a unique key, say basename:baseid:timestamp_millis_since_epoch:random_int; in this case dedup would never impact you.
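For illustration, here is a minimal sketch of building such a dedup-proof key on the application side (the function name makeUniqueKey and the 6-digit random range are my own choices, not part of the original scheme):

```javascript
// Hypothetical helper: builds a key of the form
// <basename>:<baseid>:<timestamp_millis_since_epoch>:<random_int>.
// Because every write lands under a brand-new key, DCP dedup never
// collapses two writes together.
function makeUniqueKey(basename, baseid) {
  var millis = Date.now();                                // milliseconds since epoch
  var rand = Math.floor(Math.random() * 900000) + 100000; // 6-digit random_int
  return basename + ":" + baseid + ":" + millis + ":" + rand;
}

var key = makeUniqueKey("customer", "001");
// e.g. "customer:001:1638368667660:999181" (your timestamp and random_int will differ)
```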

Then, to ensure you can easily access the most recent document, you could use an Eventing Function to update a document basename:baseid with the latest unique basename:baseid:timestamp_millis_since_epoch:random_int. For example, say I have a basename of “customer” and the following Eventing Function deployed:

function OnUpdate(doc, meta) {
    // Given unique keys <basename>:<baseid>:<timestamp_millis_since_epoch>:<random_int>
    // we update a common key <basename>:<baseid> with the latest most current update 
    // or mutation.
    var ary = meta.id.split(":");
    if (ary.length !== 4 || ary[0] !== "customer") return;
    var basekey = ary[0] + ":" + ary[1];
    // src_col is an alias to a Bucket binding for the source bucket/collection in r+w mode
    src_col[basekey] = doc;
}

Create document in the source bucket/collection
customer:001:1638368667660:999181{"a": 1}

The basename:baseid document is created in the source bucket
customer:001{"a": 1}

Create another document in the source bucket/collection
customer:001:1638368667661:999182{"b": 2}

The basename:baseid document is updated in the source bucket
customer:001{"b":2}

The final document set in the source bucket/collection

  customer:001{"b":2}
  customer:001:1638368667660:999181{"a": 1}
  customer:001:1638368667661:999182{"b": 2}

So here you have not only the most current document you want as basename:baseid but you also have the history with millisecond timestamps as basename:baseid:timestamp_millis_since_epoch:random_int available.
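To make that history usable, a small sketch of replaying it in order by parsing the timestamp field out of each key (sortHistoryKeys is a name I made up for illustration):

```javascript
// Hypothetical sketch: given history keys in the
// <basename>:<baseid>:<timestamp_millis_since_epoch>:<random_int> scheme,
// sort them by the embedded millisecond timestamp to replay the
// change history in order.
function sortHistoryKeys(keys) {
  return keys.slice().sort(function (a, b) {
    return Number(a.split(":")[2]) - Number(b.split(":")[2]);
  });
}

var ordered = sortHistoryKeys([
  "customer:001:1638368667661:999182",
  "customer:001:1638368667660:999181"
]);
// ordered[0] is the earlier mutation
```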

You could improve this by using a long counter instead of random_int; then you would have correct ordering even when two updates land in the same millisecond. Note the Eventing Function would stay the same.

CASE 2: Approximate Reverse Dedup

As you indicated, a day (or even a second or two) between updates will result in each mutation not being subject to dedup (barring some major issue like a network outage or a long rebalance).

Thus you could do what I call an “Approximate Reverse Dedup,” where you take basename:baseid and in real time use Eventing to create a new basename:baseid:timestamp_millis_since_epoch:random_int document. For example, say I have a basename of “customer” and the following Eventing Function deployed:

function getRandomInt(min, max) {
  min = Math.ceil(min);
  max = Math.floor(max);
  return Math.floor(Math.random() * (max - min) + min); //The maximum is exclusive and the minimum is inclusive
}

function OnUpdate(doc, meta) {
    // ignore all approximate reverse dedup records, process only <basename>:<baseid>
    var ary = meta.id.split(":");
    if (ary.length !== 2 || ary[0] !== "customer") return;
    
    // Given a base key (for this mutation) of <basename>:<baseid> we want to store the 
    // history as <basename>:<baseid>:<timestamp_millis_since_epoch>:<random_int>
    var myMillis = Date.now();
    var myRand = getRandomInt(1000000,9999999);
    var fullkey =  meta.id + ":" + myMillis + ":" + myRand;
    // src_col is an alias to a Bucket binding for the source bucket/collection in r+w mode
    src_col[fullkey] = doc;
}
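As a side note, the length guard at the top of OnUpdate is what keeps this safe: since the function writes into its own source collection, it must skip the history documents it creates. A hypothetical extraction of that filter (shouldProcess is my own name) makes the behavior easy to verify:

```javascript
// Hypothetical extraction of the key filter used in the Eventing Function
// above. Only two-part keys like "customer:001" pass; the history documents
// the function itself writes have four parts, so they are never
// re-processed (no feedback loop).
function shouldProcess(id) {
  var ary = id.split(":");
  return ary.length === 2 && ary[0] === "customer";
}
```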

Create document in the source bucket/collection

customer:001{"a":1}

A new basename:baseid:timestamp_millis_since_epoch:random_int is created in the source bucket (note your timestamp and random_int will differ)

customer:001:1638370428288:8976100{"a":1}

Update the document in the source bucket/collection

customer:001{"b":2}

A new basename:baseid:timestamp_millis_since_epoch:random_int document is created in the source bucket, and of course the original document was updated

customer:001:1638370461488:9534833{"b":2}

The final document set in the source bucket/collection

  customer:001{"b":2}
  customer:001:1638370428288:8976100{"a":1}
  customer:001:1638370461488:9534833{"b":2}

CASE 3: Keep the Last N Documents

If you just want access to the last N versions, you might also consider this design pattern; refer to the Eventing documentation, specifically the Scriptlet Function: Advanced Keep the Last N User Items | Couchbase Docs
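To illustrate the idea behind that pattern, here is a hypothetical in-memory sketch (keepLastN is my own name; the real implementation lives in an Eventing Function as described in the linked docs). Each update appends the new version and trims anything older than the last N:

```javascript
// Hypothetical sketch of the "keep the last N versions" design pattern.
// history is an array of prior document versions, oldest first; each call
// appends the new version and drops anything beyond the most recent n.
function keepLastN(history, doc, n) {
  var next = history.concat([doc]);
  if (next.length > n) next = next.slice(next.length - n);
  return next;
}

var h = [];
h = keepLastN(h, { b: 1 }, 2);
h = keepLastN(h, { b: 2 }, 2);
h = keepLastN(h, { b: 3 }, 2);
// h now holds only the last two versions: [{ b: 2 }, { b: 3 }]
```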

Best

Jon Strabala
Principal Product Manager - Server


I appreciate this @jon.strabala, and to be sure I like Eventing, but I could not use it as the base for a change log of data for our heterogeneous systems. Its fire-and-forget semantics are not solid enough ground to build on.

I get why DCP dedups changes, as it allows for more efficient replication and indexing but it’s a product hole.

It’s becoming increasingly desirable to have a durable change log of the application data state. The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering

I hope Couchbase is considering this on the roadmap. It's getting more and more important.
