OOM kills beam.smp: unknown Couchbase failure

https://groups.google.com/forum/#!topic/couchbase/bw7SAHft8Vs

Hi

I have a cluster of 4 nodes, each with 30GB of memory (Amazon EC2), plus an XDCR backup node.
The Couchbase GUI shows memory usage of 24-28%.

The cluster memory status:

Total Allocated (32.3 GB)     Total in Cluster (70.3 GB)
In Use (17.2 GB)  Unused (15 GB)   Unallocated (37.9 GB)

The quota per node is 18GB

Despite the fact that there is plenty of free memory, memory usage periodically spikes and the OOM killer fires.
It usually kills beam.smp (the cluster “works” after that).

Jul 15 07:08:41 couch01 kernel: [9214515.877193] Out of memory: Kill process 6639 (beam.smp) score 786 or sacrifice child
Jul 15 07:08:41 couch01 kernel: [9214515.881859] Killed process 6639 (beam.smp) total-vm:27077936kB, anon-rss:24227172kB, file-rss:0kB
Jul 15 07:08:41 couch01 kernel: [9214515.892273] memcached invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jul 15 07:08:41 couch01 kernel: [9214515.892275] memcached cpuset=/ mems_allowed=0

Today it killed memcached and the node failed; the cluster tried to rebalance, but that failed too (a second node also crashed).

Jul 15 11:05:31 couch01 kernel: [9228726.790765] Out of memory: Kill process 117040 (memcached) score 164 or sacrifice child
Jul 15 11:05:31 couch01 kernel: [9228726.796414] Killed process 117040 (memcached) total-vm:5399148kB, anon-rss:5067456kB, file-rss:0kB
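
For reference, the full history of OOM events can be pulled out of the kernel log to see which process got killed and when; on Ubuntu the messages end up in /var/log/kern.log and /var/log/syslog:

grep -iE "out of memory|killed process|invoked oom-killer" /var/log/kern.log /var/log/syslog
# or, from the kernel ring buffer:
dmesg | grep -iE "out of memory|killed process"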

Why does this happen? I seem to have a lot of free memory.
The current node shows a lot of free memory:

nodeA $ free -m
             total       used       free     shared    buffers     cached
Mem:         30147       8618      21528          0        161       1015
-/+ buffers/cache:       7440      22706
Swap:            0          0          0

I guess that from time to time some task starts and does computations that require all of the “Total Allocated” memory on a single node?
How can I limit per-node memory usage?
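
To see which process is actually growing right before the OOM killer fires, a rough sampling loop like this should be enough (the 30-second interval and log path are arbitrary):

while true; do
  date
  free -m | head -n 2
  ps -eo pid,comm,rss --sort=-rss | head -n 5
  echo
  sleep 30
done >> /var/tmp/mem-watch.log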


We have a theory that this might be caused by slow EBS disks (600/3000 IOPS) on our cluster nodes.
Or the backup XDCR node, which is in another zone, might have some network lag (I haven't noticed anything like that, though) or other performance issues.
(This does not help http://docs.couchbase.com/admin/admin/Misc/Trbl-beam-smp-issue.html)

Still, that shouldn't result in using more memory than is allowed (quota per node: 18GB out of 30GB), so the OOM killer shouldn't fire :confused:

1x node: Version: 3.0.2-1603 Enterprise Edition (build-1603-rel)
Ubuntu: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

3x node: Version: 3.0.1-1444 Community Edition (build-1444-rel)
Ubuntu: 3.13.0-48-generic #80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

+ backup node (XDCR)
   3.0.2-1603-rel Linux-x86_64
   3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

No swap, over-commit memory off:

cat /proc/sys/vm/overcommit_memory
0

This looks like a configuration problem (or a Couchbase bug).
Any clues?

A value of zero actually means overcommit in heuristic (auto) mode. From man 5 proc:

0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
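
To check the current settings and how much memory the kernel considers committed, something like this on each node will do (standard sysctl keys and /proc/meminfo fields):

sysctl vm.overcommit_memory vm.overcommit_ratio vm.swappiness
# CommitLimit is only enforced when overcommit_memory = 2
grep -E "^(CommitLimit|Committed_AS|MemTotal|MemFree)" /proc/meminfo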

Your beam.smp process getting up to 24GB RSS (anon-rss:24227172kB) sounds like something is wrong - that’s unusually high.

You probably also don't want to run with a mix of CE and EE nodes, and certainly not at different versions; I don't know how well those will play together.

First up I'd recommend getting on a consistent edition and version, and seeing if that improves the resource usage.
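
The per-node versions are also reported by the cluster REST API, so you can confirm what each node is actually running; assuming jq is available, something along these lines (credentials and host are placeholders):

curl -s -u Administrator:password http://localhost:8091/pools/default \
  | jq '.nodes[] | {hostname, version}'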

Hello!

I'm hijacking this thread because I seem to have something similar. My cluster is a bit smaller, using couchbase-server-3.0.3-1716.x86_64 on CentOS 6 (Linux *** 2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22 06:48:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux).

Servers have 8GB of RAM and the node quota is 4.7GB, with 2 buckets at a 1GB bucket quota each. As I understand it, we should have enough memory to accommodate the data. However, we randomly see OOM-triggered kills of the beam.smp process (the one with the long command line).

We realized that pausing XDCR removes the problem. As soon as we resume it, we see a large spike of disk I/O along with memory consumption that goes beyond the server's capacity and ends in an OOM kill.

Is there a known issue with XDCR, or something that we missed in the XDCR setup?
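
In case it helps, one way to catch what happens at resume time is to sample beam.smp's memory and overall I/O while flipping the replication back on (interval and log paths are arbitrary):

# sample beam.smp memory every 5 seconds, plus overall I/O via vmstat
while true; do
  date
  ps -C beam.smp -o pid,rss,vsz,comm
  sleep 5
done >> /var/tmp/beam-rss.log &
vmstat 5 >> /var/tmp/vmstat.log &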

thanks!
Erick

I'd like to add that resetting vm.swappiness to a higher value (60), instead of the recommended 0, seems to solve the problem.
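
For anyone else trying this, the change can be applied at runtime and persisted across reboots (run as root; 60 is simply the kernel default, not a Couchbase recommendation):

sysctl -w vm.swappiness=60                      # takes effect immediately
echo "vm.swappiness = 60" >> /etc/sysctl.conf   # persists across reboots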

Heh, strange, since it is advised not to set vm.swappiness = 0.

In my case, we do not use swap at all.
We did find a solution: we moved the load off Couchbase to another database for now, and we are thinking about what to do next.