How to eliminate duplicate processing in batches

aponnath · March 22, 2022, 4:52pm

Here is my issue: I have a large number of docs (3 million+ and growing) and I need to process them by fetching additional data from an external API. Because these are HTTP calls to the API, processing is slow and I need to run multiple instances. My problem now is finding a way to prevent these processes from working on the same docs.

So I was thinking of creating an InProcess doc that holds an array of the DocKeys currently in any running batch, and then adding a condition to the SELECT's WHERE clause so that DocKey is not in that array. Before looping through the result, each batch would push all of its new DocKeys to the array.

The only issue is that I would then also have to clean up the InProcess doc by removing all keys from that batch, and make sure it doesn't grow past 20MB. There is also a performance concern for the initial SELECT if I want to filter out a large range of DocKeys.
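To make the bookkeeping concrete, here is a minimal in-memory sketch of the InProcess idea in Python. The class and method names (`InProcessRegistry`, `claim`, `release`) are made up for illustration, and a plain dict stands in for the InProcess doc. It does not show how concurrent workers would update the real doc safely, which would need some form of atomic update.

```python
class InProcessRegistry:
    """In-memory stand-in for the InProcess doc: batch id -> claimed DocKeys."""

    def __init__(self):
        self.claims = {}

    def claim(self, batch_id, candidate_keys):
        # Collect every key already held by any running batch.
        taken = set()
        for keys in self.claims.values():
            taken.update(keys)
        # Keep only the candidates nobody else is working on, then record them.
        fresh = [k for k in candidate_keys if k not in taken]
        self.claims.setdefault(batch_id, []).extend(fresh)
        return fresh

    def release(self, batch_id):
        # Clean-up step: drop all keys of a finished batch.
        self.claims.pop(batch_id, None)


registry = InProcessRegistry()
batch_a = registry.claim("a", ["doc1", "doc2", "doc3"])
batch_b = registry.claim("b", ["doc2", "doc3", "doc4"])  # only doc4 is still free
registry.release("a")
```

Storing the keys per batch keeps the clean-up to a single removal, but the doc still grows with the number of keys in flight, which is the size concern mentioned above.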

I know I can use OFFSET, but that isn't a real solution because my underlying data changes, so the offsets would shift if new data arrived between the start of two batches.
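A small Python simulation of that offset drift, using a sorted list as a stand-in for an ordered query result. The keys and the `page_by_offset` helper are hypothetical.

```python
def page_by_offset(keys, offset, limit):
    # Mimics ORDER BY key OFFSET offset LIMIT limit over the current data.
    return sorted(keys)[offset:offset + limit]


keys = ["doc02", "doc04", "doc06", "doc08"]

first = page_by_offset(keys, 0, 2)    # batch 1 starts
keys.append("doc01")                  # new data arrives between the two batches
second = page_by_offset(keys, 2, 2)   # batch 2 starts at the "next" offset

# doc04 ends up in both batches because every row shifted by one position.
overlap = set(first) & set(second)
```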

So I am looking for the most flexible and highest-performance approach to batch processing via N1QL.

vsr1 · March 22, 2022, 5:58pm

I would recommend using a covered query and projecting only the document key.
Then use the SDK to fetch the documents and process them, so that the size limit can be avoided.
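A rough Python sketch of that two-step flow. The bucket name `docs`, the `status` field, the index, and the helper names are assumptions for illustration only. `run_query` and `get_doc` are placeholders for whatever the SDK's query and key-value get calls look like in your application; here they are faked with a dict so the example runs on its own.

```python
# Hypothetical index and statement (names are assumptions). Since the query
# touches only `status` and the document key, the index can cover it:
#   CREATE INDEX idx_status ON docs(status);
KEY_QUERY = (
    'SELECT RAW META(d).id FROM docs AS d '
    'WHERE d.status = "pending" LIMIT 1000'
)


def process_batch(run_query, get_doc, enrich):
    """Query returns keys only; each document is then fetched by key."""
    results = {}
    for key in run_query(KEY_QUERY):
        doc = get_doc(key)          # key-value fetch, one document at a time
        results[key] = enrich(doc)  # e.g. the slow external API call
    return results


# Stand-ins so the sketch is runnable without a cluster.
store = {"k1": {"status": "pending"}, "k2": {"status": "pending"}}
out = process_batch(
    run_query=lambda statement: sorted(store),
    get_doc=store.__getitem__,
    enrich=lambda doc: {**doc, "status": "done"},
)
```

Because the query result carries only keys, its size stays small no matter how large the individual documents are.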

Depending on the query, another option is keyset pagination; see Using OFFSET and Keyset in N1QL - The Couchbase Blog
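As a sketch of what keyset pagination could look like here, assuming the document key is the sort key: each page asks for keys greater than the last key already handled, instead of skipping a fixed number of rows. The bucket name `docs`, the parameter names, and the `next_page` helper are hypothetical, and the Python function only imitates the statement in memory.

```python
# Hypothetical N1QL shape (bucket and parameter names are assumptions).
KEYSET_QUERY = (
    "SELECT RAW META(d).id FROM docs AS d "
    "WHERE META(d).id > $last_key "
    "ORDER BY META(d).id LIMIT $page_size"
)


def next_page(keys, last_key, page_size):
    """In-memory equivalent of KEYSET_QUERY over a list of document keys."""
    return sorted(k for k in keys if k > last_key)[:page_size]


keys = ["doc02", "doc04", "doc06", "doc08"]

page1 = next_page(keys, "", 2)
keys.append("doc01")                   # new data arrives between pages
page2 = next_page(keys, page1[-1], 2)  # resumes strictly after the last key seen
```

The pages no longer overlap when rows are inserted, although a key that sorts before the current position (like `doc01` above) is only picked up on a later pass.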