I have a case where I get a file generated by Spark containing the document keys of cold data that should be deleted. These files usually contain around 6,000,000 records/keys.
I wrote a Python script which reads the file line by line, builds batches of, e.g., 250 lines, and then deletes them using remove_multi(), providing a dictionary with the keys and their expected CAS values.
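For reference, the read-and-batch step looks roughly like this. This is a minimal sketch assuming one document key per line; `read_batches` and `batch_size` are illustrative names, not the actual ones from my script:

```python
import itertools

def read_batches(path, batch_size=250):
    """Yield lists of up to batch_size keys from the exported key file."""
    with open(path) as f:
        while True:
            # islice pulls the next batch_size lines without loading the
            # whole 6,000,000-line file into memory
            batch = list(itertools.islice(f, batch_size))
            if not batch:
                break
            yield [line.rstrip('\n') for line in batch]
```

Each yielded batch is then turned into a {key: cas} dictionary and handed to remove_multi().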
As far as I can see, remove_multi() does not immediately remove the documents. In the Couchbase UI, I can see there are no "deletes per sec." for quite a while; then, after a few minutes, they go up to some thousands per second and the disk write queue fills up too.
Am I missing something like a commit to force the removal after each batch of 250 docs?
Any advice would be very much appreciated.
import logging

# Couchbase Python SDK 2.x
from couchbase.bucket import Bucket
import couchbase.exceptions as cb_errors

def remove_batch_data(rm_dict):
    """Remove a batch of documents by key, checking the expected CAS values.

    Returns the number of successfully removed documents.
    CB_CONNSTR, CB_PASSWORD, dryrun_mode and str2bool() are module-level.
    """
    suc = 0
    cb = Bucket(CB_CONNSTR, password=CB_PASSWORD)
    if len(rm_dict) > 0:
        if str2bool(dryrun_mode):
            # Dry run: only log what would be removed
            for key in rm_dict:
                logging.info('Key: %s - CAS: %s', key, rm_dict[key])
        else:
            try:
                cb.remove_multi(rm_dict, quiet=True)
                suc += len(rm_dict)
            except cb_errors.KeyExistsError as exc:
                # CAS mismatch: the document changed since the keys were exported
                for k, res in exc.all_results.items():
                    if not res.success:
                        logging.warning('Removal failed: %s', k)
                    else:
                        suc += 1
    return suc