Unreferenced binary blobs not getting cleaned up

We are having issues with our buckets growing rapidly on disk because binary data (presumably attachments) is not being cleaned up when the corresponding document is deleted. Our use case involves customers creating documents containing data, and a server component fetching this data and removing it from the database. This means many small documents, each with one or more attachments (image data), which are created and then deleted again shortly afterwards (within a matter of hours) once they have been processed.
Our setup is a Couchbase cluster of 3 nodes, with 3 Sync Gateway instances in front of it that are used by the backend server and mobile clients. We can consistently reproduce this issue as follows:

  • Have a bucket up and running.
  • Run the following N1QL query:
    echo 'SELECT * FROM `bucketname`;' | cbq | grep -cF '\u003cbinary'
    => This will print the number of binary blobs in the database; let's call this number X.
  • Insert some documents with attachments into the bucket.
  • Rerun the above query to get the new number of blobs; let's call this number Y. You will see that Y > X, obviously.
  • Delete the documents
  • Rerun the query
    => We still get Y. I would assume this is because older revisions of the documents still refer to this attachment.
  • Run database compaction on the bucket to get rid of old revisions.
  • Rerun the query
    => I would expect the number to have dropped down back to X but instead we still get Y.
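
The before/after comparison in the steps above can be sketched as a small helper. This is only an illustration, assuming the cbq output of each run has been saved as text; the function names `count_blob_lines` and `leaked_blobs` are hypothetical and not part of any Couchbase tool.

```python
# Literal marker text as it appears in cbq's JSON output, where "<" is
# escaped as \u003c (this is what the grep in the steps above looks for).
BLOB_MARKER = r"\u003cbinary"


def count_blob_lines(cbq_output: str) -> int:
    """Count the output lines that contain the binary marker.

    Mirrors counting matching lines with grep on the saved cbq output.
    """
    return sum(1 for line in cbq_output.splitlines() if BLOB_MARKER in line)


def leaked_blobs(before: str, after_cleanup: str) -> int:
    """Return Y - X: how many blobs remain beyond the starting count.

    `before` is the cbq output captured before inserting documents (X);
    `after_cleanup` is the output captured after deleting the documents
    and running compaction. A result above zero matches the behaviour
    described in this post.
    """
    return count_blob_lines(after_cleanup) - count_blob_lines(before)
```

If cleanup worked as expected, `leaked_blobs` would return 0 once the documents are deleted and compaction has run.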

Is there a way to get the number of blobs down and free up the disk space they are consuming?

The difference you are seeing could be explained by query consistency, but that should no longer be the case if you wait long enough. Maybe @march44 or @traun can help.

Even after waiting several days (and running compact), the query still returns the same result, so I don’t think query consistency is the problem here. We also see the bucket still taking up the same amount of disk space as it did right after adding the attachments, so the blobs really are still there, even after removing the corresponding documents.

It has been well over a week now since the first time I ran the query and the results are still exactly the same. Query inconsistency seems very unlikely. Any other ideas?

Any news on this? Our production servers’ disk usage is growing exponentially, so this will soon become a critical issue for us.

There isn’t currently an automated task to clean up obsolete attachments. Since attachments could be referenced by multiple revisions of a document (or multiple documents), they don’t get purged when a revision gets deleted (tombstoned).
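
To illustrate why deleting a single revision cannot safely remove a blob, here is a minimal in-memory sketch of the reference check a cleanup task would need. It is not Sync Gateway's implementation: the function names are hypothetical, and it assumes each revision is a plain dict that lists its attachments under `_attachments` with a `digest` field, as in the public REST document format.

```python
from typing import Iterable, Set


def referenced_digests(revisions: Iterable[dict]) -> Set[str]:
    """Collect the attachment digests still referenced by live revisions."""
    digests: Set[str] = set()
    for rev in revisions:
        if rev.get("_deleted"):
            # Tombstones are treated here as referencing no attachments.
            continue
        for meta in rev.get("_attachments", {}).values():
            digests.add(meta["digest"])
    return digests


def unreferenced_blobs(stored: Set[str], revisions: Iterable[dict]) -> Set[str]:
    """Return stored blob digests that no remaining revision refers to.

    `stored` is the set of digests physically present in the bucket
    (a hypothetical input; how to enumerate them is not shown here).
    """
    return stored - referenced_digests(revisions)
```

In this model, if two documents share the same digest, tombstoning one of them leaves the blob referenced; it only appears in the result of `unreferenced_blobs` once every revision that pointed at it has been tombstoned or compacted away.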

I’ve filed a ticket for an enhancement (https://github.com/couchbase/sync_gateway/issues/1648) to clean up unreferenced attachments - if you’ve got any additional specific requirements, you could add a note to that ticket.

Shouldn’t obsolete attachments get deleted when all documents that have been referencing them have been deleted and a compact cycle has discarded all but the tombstone revision?

Glad to see there is a GitHub ticket about this now. In the meantime, is there any way we can manually remove the binary blobs to free up some disk space? Or would doing this behind Couchbase’s back cause issues?