We have a 28-node Couchbase cluster with 2 replicas and 60 GB of total RAM per server (a total of 1.26 TB in the cluster).
We are running the cluster in DGM mode, and there is 745 GB of free RAM in the cluster.
While rebalancing the cluster from 28 to 27 nodes, we received temp OOM errors, which caused set failures and degraded the response time of gets. Free RAM after the rebalance is 494 GB. The cluster is able to handle normal load with 17 nodes while not rebalancing.
Why do we get OOM errors during the rebalance despite the 494 GB of free RAM? Is there a way to handle the temp OOM errors, or to allocate more of the unused RAM to Couchbase operations?
Couchbase version: 4.0.0-4051
mem_used: 971 GB
as per the cb vbucket monitoring graphs
total user data in RAM: 728 GB
active metadata in RAM: 140 GB
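
To make the figures above easier to compare, here is a rough arithmetic sketch in Python. It uses only the numbers quoted in this post and rests on two assumptions: that 1.26 TB means 1260 GB, and that mem_used stays at 971 GB once the 28th node is removed. The function and key names are made up for illustration. Note that 1260 GB over 28 nodes works out to 45 GB per node, not 60 GB, so the 1.26 TB figure may not be the physical RAM total.

```python
# Figures quoted above, in GB. 1.26 TB is taken as 1260 GB (assumption).
CLUSTER_RAM_GB = 1260.0
MEM_USED_GB = 971.0
USER_DATA_GB = 728.0
ACTIVE_METADATA_GB = 140.0


def summarize(nodes_before: int = 28, nodes_after: int = 27) -> dict:
    """Derive per-node and post-rebalance figures from the quoted totals."""
    ram_per_node = CLUSTER_RAM_GB / nodes_before
    ram_after = ram_per_node * nodes_after
    return {
        "ram_per_node_gb": ram_per_node,
        "cluster_ram_after_gb": ram_after,
        "used_fraction_before": MEM_USED_GB / CLUSTER_RAM_GB,
        # Assumes mem_used is unchanged once the node is removed.
        "used_fraction_after": MEM_USED_GB / ram_after,
        # Part of mem_used that the two itemized figures do not cover.
        "mem_used_not_itemized_gb": MEM_USED_GB - USER_DATA_GB - ACTIVE_METADATA_GB,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

Under those assumptions, mem_used is roughly 77% of the 1.26 TB figure with 28 nodes and roughly 80% with 27 nodes, and about 103 GB of mem_used is not covered by the user data and active metadata numbers.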