Fragmentation high after bulk data load


#1

I am using Couchbase to load some master data from a SQL Server database. One of the issues I am facing is that fragmentation goes very high after the data is loaded. I am not sure about the specifics of the data load since I do not own that process, but it is something that runs from .NET code. After this load is complete, I see that the disk space used by Couchbase goes from 21 GB to 154 GB (numbers based on the Web Console, and I assume the Console only shows data size, not indexes).

As soon as the data load is complete, I run compaction on the bucket and it goes back down to roughly the original size. This indicates to me that data volume is not increasing on a net basis.
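For reference, the manual compaction I run after each load can also be triggered over the REST API. A minimal sketch (the hostname, bucket name and credentials below are placeholders for your own environment):

```python
import base64
import urllib.request

def compact_bucket_request(host, bucket, user, password):
    """Build the REST request that triggers manual compaction of a bucket.

    Endpoint: POST /pools/default/buckets/<bucket>/controller/compactBucket
    """
    url = f"http://{host}:8091/pools/default/buckets/{bucket}/controller/compactBucket"
    req = urllib.request.Request(url, data=b"", method="POST")
    # Couchbase REST endpoints use HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

# To actually kick off compaction you would send the request, e.g.:
# urllib.request.urlopen(compact_bucket_request("localhost", "master-data",
#                                               "Administrator", "password"))
```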

I am looking for suggestions on how to perform the data load into Couchbase efficiently and avoid fragmentation. Are there any guidelines?


#2

I wouldn’t necessarily worry about compaction - assuming you have auto-compaction enabled and at a suitable threshold, the compaction task will run automatically.
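If you do want to adjust the threshold, the cluster-wide auto-compaction setting can be changed over the REST API. A sketch (host and credentials are placeholders, and 30% is just an example threshold):

```python
import base64
import urllib.parse
import urllib.request

def set_autocompaction_request(host, user, password, threshold_pct=30):
    """Build the REST request that sets the cluster-wide auto-compaction
    trigger for database (document) fragmentation.

    Endpoint: POST /controller/setAutoCompaction
    """
    url = f"http://{host}:8091/controller/setAutoCompaction"
    body = urllib.parse.urlencode({
        # Compact once document fragmentation exceeds this percentage.
        "databaseFragmentationThreshold[percentage]": threshold_pct,
    }).encode()
    req = urllib.request.Request(url, data=body, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req
```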

Having said that, for some context, there are two main contributors to fragmentation in Couchstore (given it’s an append-only format):

  1. Replacing existing documents (as the old document value will be present earlier in the file).
  2. Writing data in small batches (as the B-Tree overhead will dominate if you’re only writing a few documents to disk at once).

(2) is mostly a function of your mutation rate compared to your disk speed:

  • If your mutation rate is low, or your disks are fast, then little batching occurs (the Data Service optimises the latency of writes - i.e. it will aggressively flush any outstanding data, even if it’s only one item). As such, you’ll see higher fragmentation.
  • If your mutation rate is high, or your disks are slow(er), then more batching occurs - and hence the B-Tree metadata cost is amortised over more items (and you end up with fewer old/stale B-Tree nodes in the file).

#3

Thank you for explaining that.

So far I had a percentage threshold, but compaction was only allowed to run during a specific time interval. I will try the auto-compaction as you indicated.

I will also explore the possibility of optimizing batches with the developers. Is there a way to monitor the metrics you mentioned - mutation rate, disk speed, and write latency? I am still new to console monitoring.
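From what I can see in the REST API docs, per-bucket stat samples can be pulled from the stats endpoint. A sketch of what I plan to try (host and bucket name are placeholders, and the stat names in the comments are ones I still need to verify against our Couchbase version):

```python
import urllib.request

# Stat names reported to be relevant here (to verify per version):
#   couch_docs_fragmentation - current document fragmentation percentage
#   disk_write_queue         - items waiting to be flushed (batching pressure)
#   avg_disk_commit_time     - time per commit to disk (disk speed/latency)
def bucket_stats_request(host, bucket):
    """Build the GET request for a bucket's stats samples.

    Endpoint: GET /pools/default/buckets/<bucket>/stats
    (add Basic auth headers as for any other Couchbase REST call)
    """
    url = f"http://{host}:8091/pools/default/buckets/{bucket}/stats"
    return urllib.request.Request(url)
```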