CBL Duplicate checking pattern


#1

Hi there!

I was just curious if anyone had developed a working pattern to ensure that they weren’t saving documents with duplicate data?

Right now, I have a data supplier in the form of a piece of hardware, which sends data with no unique ID, and sometimes that data is duplicated - however, since there is no unique ID, I cannot know it’s a duplicate.

That means that I need to stop duplicate records right at the point they are saved - so I feel like this has something to do with my live query, or with maintaining a live list of all records and then searching through it before each item is saved.

At the moment, I’m debating creating a hash of my document (excluding documentID and timestamps), and then keeping the hashes of all my already-persisted documents in a hashmap or list - and before each save, I can run through these lists/hashmaps to see if data is duplicated.
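For illustration, here is a minimal Python sketch of the hash-and-check idea described above. It is not Couchbase Lite code: the field names in `IGNORED_FIELDS`, the helper names, and the plain list standing in for the database are all assumptions made for the example. It also assumes the document can be serialized to JSON, with sorted keys so that key order does not change the hash.

```python
import hashlib
import json

# Hypothetical field names: whatever holds the document ID and timestamps.
IGNORED_FIELDS = {"_id", "created_at", "updated_at"}

# Hashes of every document persisted so far (kept in memory only).
seen_hashes = set()


def content_hash(doc: dict) -> str:
    """Return a SHA-1 hex digest of the document, ignoring ID and timestamps."""
    payload = {k: v for k, v in doc.items() if k not in IGNORED_FIELDS}
    # sort_keys gives the same serialization regardless of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def save_if_new(doc: dict, store: list) -> bool:
    """Append doc to store unless a document with the same content was saved."""
    digest = content_hash(doc)
    if digest in seen_hashes:
        return False  # duplicate content, skip the save
    seen_hashes.add(digest)
    store.append(doc)
    return True
```

In a sketch like this the set lives only in memory, so it would have to be rebuilt from the persisted documents on startup, and it does nothing for writes made elsewhere.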

This feels like a real hack - so I was wondering if anyone else had developed a better solution?

Thanks!
-SJ


#2

If you can create your documents using a deterministic key, it will be easier/lighter than a live query that indexes all documents.


#3

To elaborate on what Laurent said: when you save a record to the database, compute a digest of the data and base the docID on that — something like "hardwaredata-"+hex(sha1(rawData)). Then if you try to save a duplicate record, the save will fail with a conflict error.
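A rough Python sketch of the digest-based docID approach follows. The `InMemoryStore` class and `ConflictError` are stand-ins invented for the example, not the Couchbase Lite API; they only model a store that refuses to create a document whose ID already exists. The `"hardwaredata-"` prefix and the SHA-1 digest come from the post above.

```python
import hashlib


class ConflictError(Exception):
    """Raised when a document with the same ID already exists (hypothetical)."""


class InMemoryStore:
    """Stand-in for a database that rejects creating an existing docID."""

    def __init__(self):
        self._docs = {}

    def create(self, doc_id: str, body: bytes) -> None:
        if doc_id in self._docs:
            raise ConflictError(doc_id)
        self._docs[doc_id] = body


def doc_id_for(raw_data: bytes) -> str:
    """Derive a deterministic docID from the raw hardware payload."""
    return "hardwaredata-" + hashlib.sha1(raw_data).hexdigest()


def save_reading(store: InMemoryStore, raw_data: bytes) -> bool:
    """Return True if saved, False if the same payload was already stored."""
    try:
        store.create(doc_id_for(raw_data), raw_data)
        return True
    except ConflictError:
        return False
```

The point of the design is that the duplicate check becomes a side effect of the save itself, so there is no separate lookup step that could race with another writer.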

If you can’t do this, then you’ll have to do something like what you already described. But you can do it with a view whose map function computes the digest and emits it as the key. Then before saving a record you digest its data and query that single key in the view; if you get a match, the record is redundant.


#4

Thanks @jens and @ldoguin!

My hope was to do it without using deterministic keys (because that means finding deterministic data, or hashing it, and putting it in the keys of the object type that has the largest number of objects).

My hope was as Jens said, where I would emit a hash with the livequery - but I feel like this might lead to race conditions.

However, it looks like there is no pretty way to do that, so I’m looking into namespaced keys…
e.g. user::{22byteUUID}::dataX::{sha1 or some other unique data}


#5

If you put the user ID into the data that you digest, you’ll get unique digests across all users.
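As a small illustration of folding the user ID into the digest, here is a Python sketch. The function name and the length-prefix framing are assumptions for the example; the framing is just one way to keep the boundary between user ID and data unambiguous.

```python
import hashlib


def user_scoped_digest(user_id: str, raw_data: bytes) -> str:
    """SHA-1 hex digest over the user ID followed by the raw data."""
    uid = user_id.encode("utf-8")
    h = hashlib.sha1()
    # Length-prefix the user ID so ("ab", b"c") and ("a", b"bc") hash differently.
    h.update(len(uid).to_bytes(4, "big"))
    h.update(uid)
    h.update(raw_data)
    return h.hexdigest()
```

With this, the same payload from two different users yields two different digests, while a repeated payload from the same user still yields the same one.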


#6

So, I might not use a digest necessarily. But I do need to namespace, as my ‘unique’ information is only unique per person, not globally.

e.g. I have one document per day, per user, so I can use user::suresh::dataX::20160225. Only one document is allowed per user per day, so this is a simple way to avoid all the SHA1 digests for this kinda document. I’m looking to do the same with other doc types.
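A minimal Python sketch of building such a namespaced key is below. The function name and the check for the separator inside a component are assumptions added for the example; the `user::<id>::<type>::<yyyymmdd>` layout follows the key shown above.

```python
from datetime import date

SEPARATOR = "::"


def daily_doc_key(user_id: str, doc_type: str, day: date) -> str:
    """Build a key that allows at most one document per user, type, and day."""
    for part in (user_id, doc_type):
        # A separator inside a component would make the key ambiguous.
        if SEPARATOR in part:
            raise ValueError("key component contains '::': " + part)
    return SEPARATOR.join(["user", user_id, doc_type, day.strftime("%Y%m%d")])


print(daily_doc_key("suresh", "dataX", date(2016, 2, 25)))
# prints "user::suresh::dataX::20160225"
```

Because the key is fully determined by user, type, and day, a second save for the same day targets the same document rather than creating a new one.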


#7

Just to follow up - this pattern seems to be working, and it feels better overall.

In the official CB docs - is there anything on document key naming? I’ve seen stuff in various presentations, just wondering if there is an ‘official’ convention, as that would be great to point some of our developers who have never used CB at. I’m going with the double colon - but no reason, other than I’ve seen it in various examples on the internet.