Delete bulk documents using N1QL

Jijith · January 30, 2023, 6:42pm

I am trying to delete/ cleanup large number of documents (in the order of ~250 Million items) of certain age from a bucket using N1QL. The query looks like →

Delete from bucket1
where meta().id in
(Select b.Id from bucket1 as b
where creationTimeStamp BETWEEN T1 AND T2) .

Query explanation -
Basically, I have a bucket that comprises of lookup documents as well as another kind (type 2) of documents. I want to delete both kind of documents, but for deleting lookup document, I am selecting Ids (not meta().id, but a specific field) from all the type 2 documents that has created time stamp falling in between T1 and T2 and then delete all the docs that has meta().id same as the subquery-selected Id. (The meta().id of lookup docs will be the Id field in type2 documents). Eventually, I would be deleting the type 2 documents as well. But for deleting the lookup docs, the identification method is only from type 2 docs as the lookup docs is not having any other fields other than the type 2 document id and some other field.

I have created a primary index for doing this ONE TIME activity, but it takes long execution times. Is there a better way to handle this using N1QL. Please suggest.

Thanks.

mreiche · February 3, 2023, 6:32pm

Is it the delete that takes long or the select? How long does it take to run just the select?

Select b.Id from bucket1 as b where creationTimeStamp BETWEEN T1 AND T2

How long does it take to run the delete with the select?

If it is the select that is taking a long time, then you can look into ways to improve it. Do you have an index on creationTimeStamp?

Jijith · February 3, 2023, 8:41pm

@mreiche
It is the delete that takes longer time. I can see it leverages the primary index and hence the longer execution time.

Yes, the index is available for creationTimeStamp.

vsr1 · February 3, 2023, 9:27pm

You should explore eventing vs N1QL. cc @jon.strabala

N1QL may take very very long time

create index ix1 ON bucket1(creationTimeStamp, Id);
use covering index

Delete 
from bucket1 USE KEYS (Select RAW  b.Id from bucket1 as b
                       where creationTimeStamp BETWEEN T1 AND T2 LIMIT 5000)

You can use different non overlap times and run statement in parallel. If timeout repeat until no mutations.

mreiche · February 3, 2023, 9:38pm

On the client, you can do something like this (java version). Get the list of id’s using n1ql, then do the deletes using the kv api. The deletes will be done in up to 16 parallel executions.

ReactiveCluster reactiveCluster = cluster.reactive();
ReactiveCollection reactiveCollection = bucket.defaultCollection().reactive();
		
reactiveCluster.query("Select b.Id from bucket1 as b where creationTimeStamp BETWEEN T1 AND T2")
	.flatMapMany( result -> result.rowsAsObject())
        .parallel(16).runOn(Schedulers.parallel())
	.map( r -> reactiveCollection.remove( r.getString("id"))).collectList().block();

Jijith · February 4, 2023, 3:25pm

@vsr1 Thanks, this makes sense as it would eliminate the need for table fetch and the result can be served via index scan itself.

Jijith · February 4, 2023, 3:29pm

@mreiche Thanks, we were more inclined to the N1QL approach rather than introducing an application layer in between, but this looks good for a try as it reduces the execution scope (based on the result of subquery and then direct delete, rather than doing full primary scan for results in subquery).

vsr1 · February 5, 2023, 3:23am

@Jijith You should consider @mreiche approach. If one time clean up, you just need few lines of standalone SDK program vs application layer changes

Jijith · February 6, 2023, 2:57pm

@vsr1 Sure, makes sense.