AWS Cluster failing & sizing

We have a couchbase cluster that has brought us great pain, mostly during rebalancing where nodes start failing one after another.

Main facts:

  • Hosted on 6 AWS r3.xlarge instances
    • 6 * 30.5 = 183GB RAM
    • 6 * 4 = 24 vCPUs
    • Each node has two 500GB io1 1000IOPS EBS SSD volumes combined with LVM, for a cluster total of 6 * 2 * 500GB = 6TB
  • One big bucket with:
    • 6 * 28 = 168GB bucket quota total
    • 2.55TB disk usage out of the 6TB total
    • 20 million items
    • 2-day document expiration (TTL)
    • 2 replicas
    • 600 ops per second
      • 300 gets
      • 300 sets

The problem is that every once in a while (since we’re on the cloud) we lose nodes. When this happens we reboot the failing node (which changes the underlying host), add it back, and rebalance. This rebalance takes about 6 hours. That is a problem on its own, but most of the time the rebalance doesn’t even complete, because in the meantime we start losing more nodes and the rebalance stops. In the end we accept that we’ll lose all the data, drop the bucket, rebalance the rest of the cluster in a minute, recreate the bucket, and start clean.
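
For context, the add-back-and-rebalance step is essentially the following (a rough couchbase-cli sketch; the host names and credentials are placeholders and the exact flags depend on the server version, so treat this as an assumption rather than our exact procedure):

$ couchbase-cli server-add -c surviving-node:8091 -u Administrator -p '********' \
    --server-add=recovered-node:8091 \
    --server-add-username=Administrator --server-add-password='********'
$ couchbase-cli rebalance -c surviving-node:8091 -u Administrator -p '********'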

In the monitoring tab there are a few metrics that seem “dangerous”:

  • 1.15% active docs resident
  • Disk write queue peaks at 10K+ during high-traffic hours

But our cache miss ratio is 5% which is decent enough for our use.

Also, the healthcheck utility complains about even more things, like:

  • Average item loaded time ‘7.545 ms’ is slower than ‘500 us’
  • Replica resident item ratio ‘0.99%’ is below ‘20.00%’
  • Number of backlog items ‘5.89 quadrillion’ is above threshold ‘100 thousand’
  • Number of backlog item to active item ratio ‘2186457269979.22%’ is above threshold ‘30.0%’
  • Total memory fragmentation ‘5.732 GB’ is larger than ‘2 GB’

To tackle the problem we’re trying to follow the sizing guidelines mentioned in the documentation, but we’re at a loss even there. The first number we have to calculate is the value size, which (since we cannot easily calculate it from the client app) we approximate as the total disk size used divided by the number of items (including replicas): roughly 2.55TB / 60M, which is about 40KB. The next controversial number is the working_set_percentage. Based on the client app’s usage we’re OK with 1%, since that’s the part of the data that’s accessed 90% of the time. The rest falls on the long tail and we don’t mind if it’s a bit slower.
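
Spelled out (the 60M figure being our 20M items times 3 copies), that estimate is just:

$ echo '2.55 * 10^12 / (60 * 10^6)' | bc -l
42500.00000000000000000000

i.e. about 42KB per item, which is in the same ballpark as the 46,080-byte value_size we plugged into the calculator below.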

Plugging this data in we get the following:

documents_num                     19,108,793.00
ID_size                                   36.00
value_size                            46,080.00
number_of_replicas                         2.00
working_set_percentage                      1%
per_node_ram_quota            30,064,771,072.00
Metadata per document                     56.00
SSD or Spinning                           SSD
headroom                                   30%
High Water Mark                            85%
no_of_copies                               3.00
total_metadata                 5,274,026,868.00
total_dataset              2,641,599,544,320.00
working_set                   26,415,995,443.20
Cluster RAM quota required    48,467,092,946.54
number of nodes                            1.61

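For reference, the derived rows above follow from the sizing formulas as we read them in the docs; here is the arithmetic as a quick awk sketch (the formulas themselves are our interpretation, so treat them as an assumption):

$ awk 'BEGIN {
    docs = 19108793; id_size = 36; meta_size = 56; value_size = 46080;
    copies = 3; ws_pct = 0.01; headroom = 0.30; high_water_mark = 0.85;
    per_node_quota = 30064771072;
    total_metadata = docs * (id_size + meta_size) * copies;   # 5,274,026,868
    total_dataset  = docs * value_size * copies;               # 2,641,599,544,320
    working_set    = total_dataset * ws_pct;                   # 26,415,995,443.20
    ram_required   = (total_metadata + working_set) * (1 + headroom) / high_water_mark;
    printf "cluster RAM quota required: %.2f\n", ram_required;        # ~48,467,092,946.54
    printf "number of nodes: %.2f\n", ram_required / per_node_quota;  # ~1.61
  }'

Rerunning the same sketch with ws_pct = 0.20 is what pushes the node count to roughly 27.
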
So… that says we should use 2 nodes??? On the other hand, maybe aiming for a 1% working set percentage is not a good idea? If, as the healthcheck suggests, we have to aim for 20%, that pushes the number of nodes to 27, which is outside our price range. So maybe Couchbase is not made for our use case? Or maybe memory is not the issue here after all, but CPU and disk I/O?

What we plan to do now is to try to temporarily improve the situation a bit by moving to double the number of nodes on half-size instances (r3.large), dropping to 1 replica, and then revisiting the situation given any feedback we get here.

Thanks and sorry for the long post!

WOW … Couchbase, if done correctly, should be up and available and be the most reliable part of your deployment.

OK … let’s see if you have the basics covered.

Could you log in to the nodes as the user that is running CB, run the commands below, and paste the output in a post?

Check ulimit of user

#ulimit -a

Check the THP status

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag       

On Red Hat and some Red Hat variants, you might have to check these instead:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag        

Check Swappiness

#cat /proc/sys/vm/swappiness

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 245387
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 245387
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
$ cat /proc/sys/vm/swappiness
0

An extra note regarding the swappiness: we set it to zero ourselves to avoid any unnecessary swapping, but maybe that wasn’t a good idea? We have 80GB of swap (we use the whole AWS instance store disk for it, since we boot from EBS).

$ free -m
             total       used       free     shared    buffers     cached
Mem:         30425      30166        259          0        127       1541
-/+ buffers/cache:      28497       1928
Swap:        76799        649      76150
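
(If we do decide to back off from 0 to 1, which some guides suggest when swap is present, the change itself would just be the following; the sysctl line takes effect immediately and the sysctl.conf line persists it across reboots:)

$ sudo sysctl vm.swappiness=1
$ echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf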

OK, so as http://developer.couchbase.com/documentation/server/current/install/thp-disable.html says, we need to set both of these to never. We’ll try that and see.
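
For the record, the runtime change is just the two lines below; since the sysfs setting doesn’t survive a reboot, it also has to be reapplied at boot (e.g. from rc.local or an init script), as the linked page describes:

$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag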

You’re good here. :+1:

You want “unlimited” there; you’re restricting an in-memory DB to locking only 64 KB of memory at a time. :slight_frown:

You need more file descriptors.
You’ll find the exact number in /etc/init.d/couchbase-server … it should be about 50,000 to 60,000.
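
If you raise them at the OS level, a minimal sketch is a few entries in /etc/security/limits.conf for the service user (assuming the user is called couchbase and that 60000 matches what the init script expects):

couchbase soft nofile 60000
couchbase hard nofile 60000
couchbase soft memlock unlimited
couchbase hard memlock unlimited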

You want them both set to [never]

Here is our documentation about THP and how to change it.

AFTER CHANGES
After you make the changes you will have to restart the Couchbase service for the limits to take hold. Also make sure the nodes retain the same values after a reboot.

Here is a blog on how to do so if your Linux admins are not sure how.

This is strange advice to give, considering that Couchbase sets the required limits in the init.d script, as you said yourself. So even though you see a low value for the open-files limit on the user, the couchbase process is in fact running with the values from the init.d script. You can check the actual limits with the command:

cat /proc/<couchbase_pid>/limits

In general, changing limits on a per-user basis is not a good idea.

Also, I’d be very careful with disabling Transparent Huge Pages (THP), as that might affect the performance of other services running on the same machine (if you have an RDBMS there, for example, or something else).

OK, so I checked the limits for all the processes run by the couchbase user and it seems you’re right, dmitryb. The init script did override the global values.

Now, about disabling THP: the nodes are dedicated to Couchbase and don’t run anything else, so we’ll try that. We’re going to run load tests on the new cluster, so we’ll (hopefully) know whether it all works OK.

$ for pid in $(ps -u couchbase | tail -n +2 | awk '{print $1}'); do echo $pid; cat /proc/$pid/limits | egrep "(Max open files|Max locked memory)"; done
4046
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25556
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25580
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25607
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25637
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25639
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25640
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25641
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25643
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25644
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25648
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25691
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25692
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25693
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25697
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
25712
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
29944
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
31185
Max open files            10240                10240                files
Max locked memory         unlimited            unlimited            bytes
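
As for those load tests, we’ll probably drive them with cbc-pillowfight from libcouchbase, roughly along these lines (the connection string, bucket name and thread count are placeholders, and the value sizes are matched to our ~40KB documents):

$ cbc-pillowfight -U couchbase://node1.example.com/mybucket \
    --num-items 20000000 --min-size 40960 --max-size 40960 --num-threads 4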

I have found that not everybody has sudo or root access when deploying Couchbase… so many times you have to get the Linux admin to change the ulimits for you.