Couchbase OOM, killed by OS



We are running Couchbase 5.0.1 Community Edition with the following setup:

  • 1 node with Data, Index, Search, Query services
  • 6 nodes with Data service alone

All nodes are i3.xlarge instances on AWS EC2, so 4 CPUs, 30GB RAM, and a 950GB disk each. The swap space (as reported by the free command) is 0.

The memory quota settings are as follows:

  • Data 20000 MB
  • Index 4000 MB
  • Search 2000 MB
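
To make the headroom concrete, here is a rough back-of-the-envelope check of how much RAM these quotas leave for the OS and everything else on each node type. It is only a sketch: treating 30GB as 30 * 1024 MB is an assumption, and the Query service is left out because it has no quota in the list above.

```python
# Figures from the post; the GB -> MB conversion is an assumption.
NODE_RAM_MB = 30 * 1024  # 30GB RAM per i3.xlarge node

# Memory quotas as configured in the cluster settings (MB).
QUOTAS_MB = {"data": 20000, "index": 4000, "search": 2000}


def headroom_mb(ram_mb, quotas):
    """RAM left on a node after subtracting the quotas of the services it runs."""
    return ram_mb - sum(quotas.values())


# The single node that runs Data, Index and Search (plus Query, which has no quota here).
multi_service_headroom = headroom_mb(NODE_RAM_MB, QUOTAS_MB)

# The six nodes that run only the Data service.
data_only_headroom = headroom_mb(NODE_RAM_MB, {"data": QUOTAS_MB["data"]})

print(f"multi-service node headroom: {multi_service_headroom} MB")
print(f"data-only node headroom:     {data_only_headroom} MB")
print(f"quota share on multi-service node: {sum(QUOTAS_MB.values()) / NODE_RAM_MB:.0%}")
```

Under these assumptions the multi-service node has about 4.7GB left outside the quotas, and the data-only nodes about 10.7GB.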

We have about 1.8 billion documents, which occupy about 1TB of disk space.
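
For scale, the same numbers can be broken down per document and per node. This is only a rough sketch under stated assumptions: documents are spread evenly over the 7 nodes running the Data service, replicas are ignored, "1TB" is taken as 10^12 bytes, and the Data quota is taken as 20000 * 1024 * 1024 bytes.

```python
# Figures from the post; unit conversions and even distribution are assumptions.
TOTAL_DOCS = 1_800_000_000               # about 1.8 billion documents
DISK_BYTES = 10**12                      # "about 1TB", assumed decimal terabyte
DATA_NODES = 7                           # 1 multi-service node + 6 data-only nodes
DATA_QUOTA_BYTES = 20000 * 1024 * 1024   # 20000 MB Data quota per node

# Average on-disk footprint of one document.
avg_doc_bytes = DISK_BYTES / TOTAL_DOCS

# Active documents each Data node holds if the distribution is even (replicas ignored).
docs_per_node = TOTAL_DOCS / DATA_NODES

# Bytes of Data quota available per document on each node.
quota_bytes_per_doc = DATA_QUOTA_BYTES / docs_per_node

print(f"average document size on disk: {avg_doc_bytes:.0f} bytes")
print(f"documents per Data node:       {docs_per_node:,.0f}")
print(f"Data quota per document:       {quota_bytes_per_doc:.0f} bytes")
```

With these inputs it comes out to roughly 556 bytes on disk per document and about 82 bytes of Data quota per document on each node, before counting any replicas.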

One day, while we were adding documents via a Spark job, the CB client returned OOM exceptions. We stopped the job, but from that point on it was virtually impossible to query the cluster because of OOM errors. The memory usage reported in the CB dashboard peaked at ~87%. We added a node and tried to rebalance, but each time one of the nodes’ memory usage would spike and it would get killed by the OS due to OOM. This happened a few times before we finally managed to rebalance.

My question is: are we doing something sub-optimal with the settings? Can we take some measures to prevent such OOM problems from happening again?