Couchbase `erl_child_setup` CPU 100% and the server never boots up

Tested with the Autonomous Operator as well as home-brew Helm charts, across all Couchbase 6+ and 7+ versions, both Enterprise and Community. This is the process list inside the container:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2532   736 ?        Ss   13:52   0:00 runsvdir -P /etc/service log: .............................
root          41  0.0  0.0   2380   664 ?        Ss   13:52   0:00 runsv couchbase-server
couchba+      42  0.0  0.0   4116  3380 ?        S    13:52   0:00 bash /opt/couchbase/bin/couchbase-server -- -kernel global_
couchba+      54  1.1  0.8 11656800 1621012 ?    Sl   13:52   0:01 /opt/couchbase/lib/erlang/erts-10.7.2.7/bin/beam.smp -- -ro
couchba+      61  0.0  0.0   3932   112 ?        S    13:52   0:00 /opt/couchbase/lib/erlang/erts-10.7.2.7/bin/epmd -daemon
couchba+      65  100  0.0   2368   580 ?        Rs   13:52   1:46 erl_child_setup 1073741816
root         190  0.1  0.0   2640   564 pts/0    Ss   13:54   0:00 sh -c clear; (bash || ash || sh)
root         197  0.0  0.0   2640   148 pts/0    S    13:54   0:00 sh -c clear; (bash || ash || sh)
root         198  0.0  0.0   4404  3700 pts/0    S    13:54   0:00 bash
root         203  0.0  0.0   5928  3000 pts/0    R+   13:54   0:00 ps aux

The following process eats 1 core at 100% until the container is restarted:

couchba+      65  100  0.0   2368   580 ?        Rs   13:52   1:46 erl_child_setup 1073741816

Limits are as follows:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 772388
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1073741816
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I also followed all the deployment guidelines. [1]

[1] Deployment Guidelines | Couchbase Docs

The problem turned out to be that some containerd distributions set the open files limit to unlimited, or to a very large number such as 1073741816.
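To see which open files limit a process has actually inherited, the kernel exposes it under /proc. This is a minimal sketch, assuming a Linux container with /proc mounted; the helper name soft_nofile_from_limits is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the soft "Max open files" value from a
# /proc/<pid>/limits style file (columns: name, soft limit, hard limit, units).
soft_nofile_from_limits() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Example: the limit inherited by processes started from this shell.
if [ -r /proc/self/limits ]; then
    soft_nofile_from_limits /proc/self/limits
fi
```

Inside the container, the beam.smp process from the listing above (PID 54) could be checked with soft_nofile_from_limits /proc/54/limits.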

The erl_child_setup process is started with that limit as its argument (erl_child_setup 1073741816) and goes into a never-ending computation. After reducing the value, for example to 104583, Couchbase can boot up again.

ulimit -n 104583
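If the limit cannot be changed on the host, one option is to cap it in a small wrapper before the server starts. The following is only a sketch under assumptions: the helper name capped_nofile is made up, the cap of 104583 is simply the value that worked here and not an official recommendation, and the wrapper has to run in the same shell that launches the server. My understanding, not confirmed in this thread, is that erl_child_setup walks every possible descriptor up to the number it is given, which would explain why a smaller limit helps.

```shell
#!/bin/sh
# Cap taken from this thread; not an official Couchbase recommendation.
MAX_NOFILE=104583

# Hypothetical helper: print the limit to apply. Non-numeric values such as
# "unlimited" and anything above the cap are replaced by the cap.
capped_nofile() {
    current="$1"
    cap="$2"
    case "$current" in
        ''|*[!0-9]*) echo "$cap" ;;
        *)
            if [ "$current" -gt "$cap" ]; then
                echo "$cap"
            else
                echo "$current"
            fi
            ;;
    esac
}

# Lower the limit for this shell and everything it starts afterwards.
ulimit -n "$(capped_nofile "$(ulimit -n)" "$MAX_NOFILE")"

# The server would then be started from this same shell, for example
# (arguments omitted):
# exec /opt/couchbase/bin/couchbase-server
```

A limit that is already at or below the cap is left unchanged, so the wrapper is harmless on hosts that do not have the problem.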

It took me more than 24 hours to figure it out.

I think

  • the Couchbase boot process could be substantially improved to make cases like this easier to debug;
  • the parameters of /opt/couchbase/bin/couchbase-server could be documented, so that end users can enable debugging to better reveal problems and bugs;
  • the Couchbase documentation should mention that large values of ulimit -n will break the boot process.