Fastest retrieval of the documents without knowing their keys

mattt · January 11, 2017, 2:34pm

Hi All,

We have a task that periodically retrieves and deletes all documents from a specific bucket.

The bucket holds over 100 million documents, and currently the documents are retrieved via the Python SDK by running a N1QL query in a loop:

        from couchbase.n1ql import N1QLQuery

        # Fetch the next batch of documents together with their keys.
        query = cb.n1ql_query(N1QLQuery(
            'SELECT meta(alias).id, * FROM bucket_name AS alias LIMIT $batch_size',
            batch_size=batch_size))

Each batch is then deleted with remove_multi(keys); batch_size is usually 250k.

As the keys are not known in the script, it’s not possible to use the (faster?) function get_multi(keys).

In our processing this operation is the main bottleneck, and it takes surprisingly long compared to writing the documents into a relational DB (currently writing is 5x faster).

We have 3 nodes on different machines and the script is executed on another machine (and relational DB is also on another machine).

Is there any better / faster way to retrieve a batch of documents without knowing their key?

Thank you a lot for any ideas or hints.

drigby · January 11, 2017, 4:20pm

If you’re willing to “block” normal access to the bucket for the duration of the delete, you can use the Flush operation to delete everything from it.

Alternatively, you could create a Map/Reduce view which just emits the ID of every document and query that; it may or may not be faster than N1QL.
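A minimal sketch of how the view route might look on the client side, assuming the 2.x Python SDK: keys are streamed from the view and handed to remove_multi in fixed-size chunks. The helper names batched and remove_in_batches, the design document "all_docs" and the view "ids" are all made up for illustration, and the Couchbase wiring is only shown in comments because it has not been run against a cluster.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(keys: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size keys from any iterable."""
    it = iter(keys)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def remove_in_batches(keys: Iterable[str],
                      remove_batch: Callable[[List[str]], object],
                      batch_size: int = 10000) -> int:
    """Pass keys to remove_batch chunk by chunk; return how many keys were passed."""
    total = 0
    for batch in batched(keys, batch_size):
        remove_batch(batch)
        total += len(batch)
    return total


# Hypothetical wiring (assumes a view whose map function is
# `function (doc, meta) { emit(meta.id, null); }`):
#   rows = cb.query("all_docs", "ids")   # design/view names are assumptions
#   remove_in_batches((row.docid for row in rows), cb.remove_multi)
```

Because the view only emits IDs, nothing but keys crosses the wire; if the document bodies are needed as well, each chunk of keys could first be passed to get_multi before being removed.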

Thirdly, if you just want all data to disappear after a certain time (and don’t need to read it before it goes away), you could set TTLs (expiries) on your documents.

mattt · January 11, 2017, 5:02pm

Thank you for your reply drigby,

Creating a naïve map and querying it instead of using N1QL sounds like a good alternative to test.

Flush and TTL don’t fit our case: the number and size of the documents require that we process them (read & write) in batches at an undefined moment in the future.

jon.strabala · July 6, 2022, 2:54pm

@mattt,

I know it’s been a while, but the Eventing Service can be used to quickly inspect (and, if needed, modify) all documents in a keyspace, and it can also directly alter TTLs (via advanced keyspace accessors).

Eventing Functions, which can inspect all data via Couchbase’s high-speed Database Change Protocol (DCP), are of the form:

function OnUpdate(doc, meta) {
    log("document metadata", meta);
    log("document value", doc);
}

There are many examples in the documentation (and also here in the forums, like “How can I rename field of document …”), so I am sure one will be close to what you are looking for.

Best

Jon Strabala
Principal Product Manager - Server