How to reduce disk I/O?

janpaulb · March 8, 2019, 6:28pm

My team has been using Couchbase on VMs for a while now, and we recently enabled throttling on disk I/O to stop “noisy neighbor” syndrome for other VMs on the same physical machine. However, this has landed us in heaps of trouble because Couchbase will randomly (and sometimes not-so-randomly) start demanding a very high number of disk I/O operations, get throttled, and then stop serving requests until it can get through its disk write queue.

It’s frustrating because Couchbase seems to use far more disk I/O than the DBs that other teams at our company are using. For example, it is well documented for MongoDB that if the entire working set fits into RAM, disk operations are pretty limited. That particular approach is more difficult with Couchbase, since using XDCR means that each DC needs to cache the working set of every other DC in memory.

Couchbase isn’t as well documented but it seems like there should be similar tips for reducing disk I/O. For example, I read that setting the read_ahead setting on the VM to zero is a good idea for MongoDB. Would you expect that to be helpful for Couchbase as well? Are there any other known ways to reduce the amount of disk I/O that Couchbase uses?

For reference, we’re on community edition 5.1.1. We’re running XDCR in a bidirectional ring between 5 different datacenters around the world. The high disk I/O usage looks to be caused by any number of things, from high read/write volume, to high cache miss ratio, to routine DB operations (I think maybe compaction has caused it?), to high write volume on a different DC that then propagates through XDCR.

Any suggestions are very welcome. Thanks for taking the time.

paolo.morgano · March 9, 2019, 12:38pm

Your scenario seems quite complex. Disk is certainly accessed when data is written and must be persisted and replicated, or when you read data that is not cached (check the bucket statistics, which also show the percentage of data cached versus total bucket data). Indexing also involves disk access, so check whether some of the data you access triggers complex index updates.
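To make the "cached versus total" percentage concrete, here is a minimal sketch of how it is calculated. The function name, parameter names, and item counts are made up for illustration; they are not Couchbase stat names, and the real figure should be read from the bucket statistics.

```python
def resident_ratio(resident_items: int, total_items: int) -> float:
    """Return the percentage of items held in memory (hypothetical helper)."""
    if total_items == 0:
        # An empty bucket has nothing to fetch from disk.
        return 100.0
    return 100.0 * resident_items / total_items


# Hypothetical numbers: 750,000 of 1,000,000 items are cached in RAM.
ratio = resident_ratio(750_000, 1_000_000)
print(f"{ratio:.1f}% of items are resident in memory")  # prints 75.0%
```

The lower this percentage, the more reads have to be fetched from disk in the background, which adds to the I/O load.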

Generally, the documentation suggests using dedicated disks for the data and index services, to avoid contention for I/O.

As for response latency, check your client's write policy regarding persistence: you can choose to be acknowledged as soon as the data is stored in cache (if cache space is available), to wait until the data is persisted, or to wait until it is both persisted and replicated. Obviously, if you wait for persistence or replication, disk I/O becomes an issue for you.
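As a rough sketch of those three acknowledgement levels, the snippet below maps them to `persist_to` / `replicate_to` keyword arguments in the style of the Couchbase Python SDK 2.x `upsert` call. The policy names, the `upsert_with_policy` helper, and the `FakeBucket` stand-in are invented for illustration so the example runs without a cluster. Check the documentation for your SDK version for the exact options.

```python
# Hypothetical policy names mapped to SDK 2.x-style durability arguments.
# persist_to counts nodes that must write to disk; replicate_to counts replicas.
WRITE_POLICIES = {
    "memory": {"persist_to": 0, "replicate_to": 0},
    "persisted": {"persist_to": 1, "replicate_to": 0},
    "persisted_and_replicated": {"persist_to": 1, "replicate_to": 1},
}


def upsert_with_policy(bucket, key, value, policy="memory"):
    """Write a document and wait for the acknowledgement level named by policy."""
    if policy not in WRITE_POLICIES:
        raise ValueError(f"unknown write policy: {policy!r}")
    return bucket.upsert(key, value, **WRITE_POLICIES[policy])


class FakeBucket:
    """Stand-in for a real bucket object; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def upsert(self, key, value, persist_to=0, replicate_to=0):
        self.calls.append((key, persist_to, replicate_to))
        return True


bucket = FakeBucket()
upsert_with_policy(bucket, "user::1", {"name": "a"})               # ack from memory
upsert_with_policy(bucket, "user::2", {"name": "b"}, "persisted")  # ack after disk write
print(bucket.calls)
```

With the "memory" policy, a throttled disk delays persistence but not the client's acknowledgement. With the other two, the client waits for the disk write queue, so throttling shows up directly as write latency.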