I have a cluster of 4 nodes, each with 30 GB of memory (Amazon EC2), plus an XDCR backup node.
The Couchbase GUI shows memory usage of 24-28%.
The cluster memory status:
Total Allocated: 32.3 GB
Total in Cluster: 70.3 GB
In Use: 17.2 GB
Unused: 15 GB
Unallocated: 37.9 GB
The quota per node is 18 GB.
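For reference, the configured quota can be read back from the REST API; a minimal check (Administrator/password are placeholders for the real credentials):

nodeA $ curl -s -u Administrator:password http://localhost:8091/pools/default \
    | python -c 'import json,sys; print(json.load(sys.stdin)["memoryQuota"])'
# prints the per-node quota in MB (18 GB -> 18432)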
Despite there being plenty of free memory, memory usage periodically spikes and the OOM killer fires.
It usually kills beam.smp (the cluster “works” after that):
Jul 15 07:08:41 couch01 kernel: [9214515.877193] Out of memory: Kill process 6639 (beam.smp) score 786 or sacrifice child
Jul 15 07:08:41 couch01 kernel: [9214515.881859] Killed process 6639 (beam.smp) total-vm:27077936kB, anon-rss:24227172kB, file-rss:0kB
Jul 15 07:08:41 couch01 kernel: [9214515.892273] memcached invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jul 15 07:08:41 couch01 kernel: [9214515.892275] memcached cpuset=/ mems_allowed=0
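Converting the kernel's numbers by hand shows how far past the quota beam.smp got (plain shell arithmetic, nothing Couchbase-specific):

$ echo $((24227172 / 1024 / 1024))   # anon-rss in GB, vs. the 18 GB per-node quota
23
$ echo $((27077936 / 1024 / 1024))   # total-vm in GB
25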
Today it killed memcached and the node went down; the cluster attempted a rebalance, but that failed too (a second node also crashed):
Jul 15 11:05:31 couch01 kernel: [9228726.790765] Out of memory: Kill process 117040 (memcached) score 164 or sacrifice child
Jul 15 11:05:31 couch01 kernel: [9228726.796414] Killed process 117040 (memcached) total-vm:5399148kB, anon-rss:5067456kB, file-rss:0kB
Why does this happen? I seem to have a lot of free memory.
The affected node currently shows a lot of free memory:
nodeA $ free -m
             total       used       free     shared    buffers     cached
Mem:         30147       8618      21528          0        161       1015
-/+ buffers/cache:        7440      22706
Swap:            0          0          0
My guess is that some task starts from time to time and performs a computation that requires the entire “Total Allocated” memory on a single node?
How can I limit per-node memory usage?
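If I understand correctly, the RAM quota only bounds the data service (memcached), not beam.smp, so lowering it may not be enough. Still, for the record, a sketch of changing the cluster-wide quota with couchbase-cli (Administrator/password are placeholders; --cluster-ramsize is in MB):

nodeA $ /opt/couchbase/bin/couchbase-cli cluster-edit -c localhost:8091 \
    -u Administrator -p password --cluster-ramsize=15000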
One idea we have is that the cause might be slow EBS disks (600/3000 IOPS) on our cluster nodes.
The XDCR backup node, which is in another availability zone, might have network lag (though I haven’t noticed anything like that) or other performance issues.
(This does not help: http://docs.couchbase.com/admin/admin/Misc/Trbl-beam-smp-issue.html)
Still, that shouldn’t result in using more memory than is allowed (per-node quota: 18 GB out of 30 GB), so the OOM killer shouldn’t fire.
1x node:
  Version: 3.0.2-1603 Enterprise Edition (build-1603-rel)
  Ubuntu: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
3x node:
  Version: 3.0.1-1444 Community Edition (build-1444-rel)
  Ubuntu: 3.13.0-48-generic #80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
+ backup node (XDCR):
  Version: 3.0.2-1603-rel
  Linux-x86_64 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
No swap; vm.overcommit_memory is at its default (0 = heuristic overcommit, so not strictly off):
$ cat /proc/sys/vm/overcommit_memory
0
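Strict overcommit accounting would look like the sketch below; with no swap it caps committed memory at overcommit_ratio percent of RAM, though I haven’t tested whether Couchbase behaves well under it:

$ sysctl -w vm.overcommit_memory=2   # strict: allocations fail instead of OOM-killing later
$ sysctl -w vm.overcommit_ratio=80   # commit limit = swap + 80% of RAM (~24 GB here)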
This looks like a configuration problem (or a Couchbase bug).