PersistTo By Date?

Hi All,
We’ve found that our Couchbase cluster has been running consistently high on CPU and RAM utilization. We were curious whether upgrading infrastructure resources is really the best approach for us, or whether we could/should change our persistence strategy instead. Here is our use case:

-Write ~2-3 million records daily
-Purge ~2-3 million records daily (by setting a 45-day TTL on each document; a minimal sketch of how we set it is below the list)
-Within the first 48 hours we process the data; during that time CRUD performance is important
-After 48 hours, we never touch the data again, except rarely to research issues and of course when it’s purged
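
For context, here’s roughly how we set that TTL today (a minimal sketch using the Java SDK 2.x; the host, bucket name, and document ID are placeholders, not our real ones). One detail worth noting: the server treats expiry values larger than 30 days as an absolute Unix timestamp, so we pass the 45-day TTL as an epoch time:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class TtlWriteSketch {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // placeholder host
        Bucket bucket = cluster.openBucket("events");           // placeholder bucket name

        // Expiry values above 30 days are interpreted by the server as an absolute
        // Unix timestamp, so for a 45-day TTL we pass "now + 45 days" in epoch seconds.
        int expiry = (int) (System.currentTimeMillis() / 1000 + 45L * 24 * 60 * 60);

        JsonObject content = JsonObject.create()
                .put("type", "event")
                .put("createdAt", System.currentTimeMillis());

        // The server removes the document automatically once the TTL passes
        // (no manual purge job on our side).
        bucket.upsert(JsonDocument.create("event::12345", expiry, content));

        cluster.disconnect();
    }
}
```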

We were thinking that if there were a way to keep data that is at least 48 hours old on disk rather than in memory, it would dramatically reduce our resource needs, bearing in mind that performance on data older than 48 hours is not important. I saw the persistTo flag, but I’d prefer to do this at the bucket level, or based on some sort of time frame, rather than having to update each document after a couple of days.

Thoughts?

Thank you in advance!

Couchbase tends to focus on operational use cases, and toward that end it tries to keep a working set in memory. So, for RAM, in many cases you can simply reduce the bucket’s memory quota, and as long as you are happy with response times, the lower amount of memory is fine. You can also consider having less memory and faster disks, for example, as a way to shift the cost.

As to CPU, that depends a bit on where it’s going. You might look at which process is using the most CPU time (top -u couchbase on most Linux systems), and definitely consider using the latest version, as there have been performance improvements along the way.

Couchbase will do exactly what you describe automatically. The persistTo flag isn’t involved in deciding when to persist things; it is a durability requirement that makes the SDK call wait until the write has been persisted before returning. Everything is persisted as fast as the disk allows, regardless. It sounds like you really need to estimate/analyze your working set and adjust from there.
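
To illustrate the distinction, here is a minimal sketch (Java SDK 2.x, with placeholder document IDs). The server persists both writes to disk on its own schedule; PersistTo only changes whether the SDK call waits for that persistence to be observed before returning:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.PersistTo;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class PersistToSketch {
    static void write(Bucket bucket, JsonObject content) {
        // Normal write: returns once the mutation is in memory on the active node;
        // the server persists it to disk asynchronously, as fast as the disk allows.
        bucket.upsert(JsonDocument.create("doc::1", content));

        // Same write with PersistTo.MASTER: the data is persisted no sooner and no
        // later, but the call blocks until the active node reports it has hit disk.
        bucket.upsert(JsonDocument.create("doc::2", content), PersistTo.MASTER);
    }
}
```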

One simple thing you could do is take a few percent of memory away each day and watch the disk fetches. If they don’t go up, you didn’t need that memory. Eventually they’ll go up a little, but your response times will still be fine. That’s your sweet spot.
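
If you’d rather script that experiment than click through the UI each day, something along these lines should work. This is just a sketch against the bucket-edit REST endpoint; the host, credentials, bucket name, and quota value are placeholders you’d need to adjust (and verify against the docs for your version):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BucketQuotaSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust host, credentials, bucket name, and quota for your cluster.
        String host = "http://127.0.0.1:8091";
        String bucket = "events";
        int newQuotaMb = 4096; // step this down a little each day

        URL url = new URL(host + "/pools/default/buckets/" + bucket);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setDoOutput(true);

        // ramQuotaMB edits the per-node memory quota for the bucket.
        String body = "ramQuotaMB=" + newQuotaMb;
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Response code: " + conn.getResponseCode());

        // After each reduction, watch "disk fetches per second" (ep_bg_fetched) in the
        // bucket stats; if it stays flat, the bucket didn't need that memory.
    }
}
```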

We’ve found that we’re pretty consistently using something like 90% of our CPU, but it stays that way practically around the clock, and we’re only doing data reads/writes a few hours a day. We were thinking the working set was too large, and we’re trying to figure out how to lower our utilization. Thinking about it more, though, perhaps Couchbase is spending a lot of time writing to disk, and that’s the issue. I just read that not-recently-used (NRU) documents get ejected from memory automatically; I’m wondering if that’s what’s happening, since none of this data is used past 48 hours.

We’re chasing our tails a little with it, but we’ll try top -u couchbase to see where the processor time is going. Beyond that, we’re concerned because our DBA team told us that due to the high utilization our index had gotten emptied, and we had a multi-day outage while it was rebuilt. My worry is that we’re not actually root-causing the issue, we’re just throwing more resources at it, and that we’ll face the same problem again if we’re not managing our cluster properly.