Fastest retrieval of documents without knowing their keys

Hi All,

We have a task that periodically retrieves and deletes all documents from a specific bucket.

The bucket holds more than 100 million documents. Currently the documents are retrieved via the Python SDK by running the following N1QL query in a loop:

        query = cb.n1ql_query(N1QLQuery(
            'SELECT meta(alias).id, * FROM bucket_name AS alias LIMIT $batch_size',
            batch_size=batch_size))

After each query, the retrieved documents are deleted with remove_multi(keys).

The batch_size is usually 250k.

As the keys are not known to the script, it’s not possible to use the (faster?) get_multi(keys) function.

In our processing this operation is the main bottleneck, and it takes surprisingly long compared to writing the documents into a relational DB (writing is currently 5x faster).

We have 3 nodes on different machines; the script runs on a separate machine, and the relational DB is on yet another machine.

Is there a better / faster way to retrieve a batch of documents without knowing their keys?

Thanks a lot for any ideas or hints.

If you’re willing to “block” normal access to the bucket for the duration of the delete, you can use the Flush operation to delete everything from it.

Alternatively, you could create a Map/Reduce view which just emits the ID of every document, and query that - it may or may not be faster than N1QL.
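A rough sketch of how the client side of the view-based option might look, in case it helps: once a view yields document IDs, they can be grouped into batches and handed to whatever does the actual work (for example get_multi, the relational write, then remove_multi). The helper names `batched` and `process_ids_in_batches` are made up for illustration, and the sketch assumes each view row exposes the document ID as a `docid` attribute; the SDK wiring in the trailing comment is an untested assumption with hypothetical design document and view names.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items taken in order from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_ids_in_batches(rows, handle_batch, batch_size=1000):
    """Group the document IDs of `rows` into batches and pass each to `handle_batch`.

    `rows` is any iterable of objects with a `docid` attribute (assumed shape
    of a view result row). `handle_batch` is a callable that receives a list
    of keys. Returns the total number of keys handed over.
    """
    total = 0
    for chunk in batched((row.docid for row in rows), batch_size):
        handle_batch(chunk)
        total += len(chunk)
    return total


# Hypothetical wiring (names "ids" / "all_ids" are placeholders, not from the thread):
#   rows = cb.query("ids", "all_ids")
#   def handle_batch(keys):
#       docs = cb.get_multi(keys)   # read the documents by key
#       ...                         # write them to the relational DB
#       cb.remove_multi(keys)       # then delete them
#   process_ids_in_batches(rows, handle_batch, batch_size=10000)
```

Because the rows are consumed lazily, only one batch of keys is held in memory at a time; whether this beats the N1QL loop would still need to be measured on the real cluster.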

Thirdly, if you just want all the data to disappear after a certain time (and don’t need to read it before it goes away), you could set TTLs (expiries) on your documents.

Thank you for your reply, drigby.

Creating a naïve map function and querying it instead of using N1QL sounds like a good alternative to try.

Flush and TTL don’t fit our case: the number and size of the documents require that we process them (read & write) in batches, at an undefined moment in the future.