In the last few days I’ve been experiencing a strange problem.
I am using CB Server 4.0 with a full-eviction bucket.
For some reason, one server or another will occasionally run into a 'Hard OOM' error.
When this happens, I’ve noticed that the memcached process is using the full amount of memory allocated to the data service.
Once it reaches this state, it never leaves it, regardless of the system load.
The only way to fix it is to manually restart the memcached process (which does work).
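For reference, this is roughly how the memory state can be checked against the eviction high water mark over the bucket stats REST endpoint. It's only a sketch: the host, credentials, and bucket name are placeholders, and the exact stat keys (`mem_used`, `ep_mem_high_wat`) may differ between versions.

```python
# Minimal sketch: compare the bucket's current memory use to its eviction high
# water mark via the stats REST endpoint. Host, credentials, and bucket name are
# placeholders; the stat key names are assumed and may vary by server version.
import requests

HOST = "http://cb-node:8091"          # any data-service node (placeholder)
AUTH = ("Administrator", "password")  # admin credentials (placeholder)
BUCKET = "mybucket"                   # the full-eviction bucket (placeholder)

resp = requests.get("{}/pools/default/buckets/{}/stats".format(HOST, BUCKET), auth=AUTH)
resp.raise_for_status()
samples = resp.json()["op"]["samples"]  # per-metric lists of recent samples

mem_used = samples["mem_used"][-1]         # current memory used by the bucket (bytes)
high_wat = samples["ep_mem_high_wat"][-1]  # eviction high water mark (bytes)
print("mem_used = {:.0f} MiB, high water mark = {:.0f} MiB".format(
    mem_used / 2**20, high_wat / 2**20))
```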
My first suspicion was that we were writing new documents faster than they could be persisted and evicted; however, the disk throughput is very low (tens to hundreds of KB per second) compared to what the disks can sustain (striped AWS SSDs). Furthermore, the disk write and replication queues are not significantly high; they tend to spike and return to 0 quite quickly.
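A rough sketch of how the disk write queue can be watched over a short window, to check whether persistence keeps up with the write load. Same placeholder host/credentials/bucket as above; the stat keys (`disk_write_queue`, `ep_diskqueue_drain`) are assumptions and may vary by version.

```python
# Rough sketch: poll the disk write queue depth and drain rate every few seconds
# to see whether persistence is keeping up. Placeholders and assumed stat keys
# as noted above.
import time
import requests

HOST = "http://cb-node:8091"
AUTH = ("Administrator", "password")
BUCKET = "mybucket"

for _ in range(12):  # roughly one minute at 5-second intervals
    samples = requests.get(
        "{}/pools/default/buckets/{}/stats".format(HOST, BUCKET), auth=AUTH
    ).json()["op"]["samples"]
    print("disk_write_queue = {}, drain/s = {}".format(
        samples["disk_write_queue"][-1],      # items waiting to be persisted
        samples["ep_diskqueue_drain"][-1]))   # items drained per second
    time.sleep(5)
```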
I’m wondering what could be the cause of this. Is there a known bug in 4.0 that can cause such an issue?
Also, as more evidence, we didn’t have this problem until we started building GSI indexes; however, the issue also crops up on machines which DON’T host an index.
The image below shows the vbucket status of the affected node at the time.