Couchbase 3.0.1 auto failover every week or so

We have had a customer in production since June with a 4-node cluster.
On each node we have a Couchbase instance and two Tomcat instances running a Java application (our frontends).

Everything was fine until the end of September, when we increased the load on the cluster by going live with another site.

Before October we were doing 6k-7k operations/sec (cluster-wide).
Now we do roughly 34k-35k operations/sec (cluster-wide).

It seems that every 5-7 days one of the nodes stops being heard from by the other nodes and so, after 300 seconds, it is automatically failed over.
To recover I only need to restart Couchbase on that node and click Rebalance in the GUI (I usually use Delta Recovery).
Sometimes the command /etc/init.d/couchbase-server stop does not work and I need to manually kill some of the remaining Couchbase processes…
Once I have restored the cluster state with all the nodes, another node goes down after 20-30 minutes and I have to repeat the process.
It seems like there is some kind of leak in the processes that the weekly restart then resolves…
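For reference, the recovery steps on the failed node boil down to something like this (assuming the server processes run as the couchbase user; the pkill is only a fallback for when the init script hangs):

    sudo /etc/init.d/couchbase-server stop
    pgrep -lu couchbase                    # list any leftover beam.smp / memcached processes
    sudo pkill -9 -u couchbase             # last resort if the stop above did not finish
    sudo /etc/init.d/couchbase-server start
    # then Delta Recovery + Rebalance from the web console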

Since I initially suspected network problems, I set up a permanent ping every 5 seconds between each node, and from another server to all the nodes. Even though they sometimes show the latency between the nodes increasing, it is almost always below 2ms, and at the last occurrence of the problem it was <1ms (0.500ms on average).

Has anyone ever seen this kind of behaviour?

This is the email being sent for the last failover:

Node ('ns_1@web2.customer.local') was automatically failovered. [down,stale, {last_heard,{1448,452853,198762}}, {now,{1448,452853,189173}}, {active_buckets,["comments","video","poll","cmbucket"]}, {ready_buckets,["comments","video","poll","cmbucket"]}, {status_latency,8769}, {outgoing_replications_safeness_level, [{"cmbucket",green},{"poll",green},{"video",green},{"comments",green}]}, {incoming_replications_conf_hashes, [{"cmbucket", [{'ns_1@web1.customer.local',22178425}, {'ns_1@web3.customer.local',56464530}, {'ns_1@web4.customer.local',75568021}]}, {"poll", [{'ns_1@web1.customer.local',128744204}, {'ns_1@web3.customer.local',118225207}, {'ns_1@web4.customer.local',77926089}]}, {"video", [{'ns_1@web1.customer.local',14818100}, {'ns_1@web3.customer.local',118225207}, {'ns_1@web4.customer.local',53675363}]}, {"comments", [{'ns_1@web1.customer.local',14818100}, {'ns_1@web3.customer.local',118225207}, {'ns_1@web4.customer.local',53675363}]}]}, {local_tasks,[]}, {memory, [{total,602146168}, {processes,265593048}, {processes_used,264888368}, {system,336553120}, {atom,686993}, {atom_used,669601}, {binary,37829456}, {code,16371821}, {ets,269075200}]}, {system_memory_data, [{system_total_memory,33733103616}, {free_swap,16928206848}, {total_swap,16928206848}, {cached_memory,13174239232}, {buffered_memory,209772544}, {free_memory,466776064}, {total_memory,33733103616}]}, {node_storage_conf [{db_path,"/opt/couchbase"}, {index_path,"/opt/store/couchbase"}]}, {statistics, [{wall_clock,{215464559,5000}}, {context_switches,{1130090132,0}}, {garbage_collection,{263035644,1131734052301,0}}, {io,{{input,246185236379},{output,533850013308}}}, {reductions,{454055570955,7607743}}, {run_queue,0}, {runtime,{54159680,930}}, {run_queues,{0,0,0,0,0,0,0,0}}]}, {system_stats, [{cpu_utilization_rate,34.60076045627376}, {swap_total,16928206848}, {swap_used,0}, {mem_total,33733103616}, {mem_free,13851561984}]}, {interesting_stats, [{cmd_get,13701.0}, {couch_docs_actual_disk_size,1668786948}, {couch_docs_data_size,200836350}, {couch_views_actual_disk_size,74422496}, {couch_views_data_size,32329225}, {curr_items,114528}, {curr_items_tot,229222}, {ep_bg_fetched,0.0}, {get_hits,446.0}, {mem_used,335254192}, {ops,13701.0}, {vb_replica_curr_items,114694}]}, {per_bucket_interesting_stats, [{"comments", [{cmd_get,436.0}, {couch_docs_actual_disk_size,155695343}, {couch_docs_data_size,106937230}, {couch_views_actual_disk_size,70288793}, {couch_views_data_size,28225399}, {curr_items,79984}, {curr_items_tot,159978}, {ep_bg_fetched,0.0}, {get_hits,436.0}, {mem_used,156535344}, {ops,436.0}, {vb_replica_curr_items,79994}]}, {"video", [{cmd_get,0.0}, {couch_docs_actual_disk_size,78650103}, {couch_docs_data_size,61343873}, {couch_views_actual_disk_size,4133703}, {couch_views_data_size,4103826}, {curr_items,32559}, {curr_items_tot,65265}, {ep_bg_fetched,0.0}, {get_hits,0.0}, {mem_used,139843320}, {ops,0.0}, {vb_replica_curr_items,32706}]}, {"poll", [{cmd_get,0.0}, {couch_docs_actual_disk_size,30937326}, {couch_docs_data_size,28621824}, {couch_views_actual_disk_size,0}, {couch_views_data_size,0}, {curr_items,0}, {curr_items_tot,0}, {ep_bg_fetched,0.0}, {get_hits,0.0}, {mem_used,18708304}, {ops,0.0}, {vb_replica_curr_items,0}]}, {"cmbucket", [{cmd_get,13265.0}, {couch_docs_actual_disk_size,1403504176}, {couch_docs_data_size,3933423}, {couch_views_actual_disk_size,0}, {couch_views_data_size,0}, {curr_items,1985}, {curr_items_tot,3979}, {ep_bg_fetched,0.0}, {get_hits,10.0}, {mem_used,20167224}, {ops,13265.0}, 
{vb_replica_curr_items,1994}]}]}, {processes_stats, [{<<"proc/(main)beam.smp/cpu_utilization">>,0}, {<<"proc/(main)beam.smp/major_faults">>,0}, {<<"proc/(main)beam.smp/major_faults_raw">>,22}, {<<"proc/(main)beam.smp/mem_resident">>,4249673728}, {<<"proc/(main)beam.smp/mem_share">>,9895936}, {<<"proc/(main)beam.smp/mem_size">>,949330685952}, {<<"proc/(main)beam.smp/minor_faults">>,772}, {<<"proc/(main)beam.smp/minor_faults_raw">>,502473658}, {<<"proc/(main)beam.smp/page_faults">>,772}, {<<"proc/(main)beam.smp/page_faults_raw">>,502473680}, {<<"proc/beam.smp/cpu_utilization">>,0}, {<<"proc/beam.smp/major_faults">>,0}, {<<"proc/beam.smp/major_faults_raw">>,0}, {<<"proc/beam.smp/mem_resident">>,27287552}, {<<"proc/beam.smp/mem_share">>,2519040}, {<<"proc/beam.smp/mem_size">>,975695872}, {<<"proc/beam.smp/minor_faults">>,0}, {<<"proc/beam.smp/minor_faults_raw">>,9195}, {<<"proc/beam.smp/page_faults">>,0}, {<<"proc/beam.smp/page_faults_raw">>,9195}, {<<"proc/inet_gethost/cpu_utilization">>,0}, {<<"proc/inet_gethost/major_faults">>,0}, {<<"proc/inet_gethost/major_faults_raw">>,1}, {<<"proc/inet_gethost/mem_resident">>,430080}, {<<"proc/inet_gethost/mem_share">>,344064}, {<<"proc/inet_gethost/mem_size">>,7630848}, {<<"proc/inet_gethost/minor_faults">>,0}, {<<"proc/inet_gethost/minor_faults_raw">>,708}, {<<"proc/inet_gethost/page_faults">>,0}, {<<"proc/inet_gethost/page_faults_raw">>,709}, {<<"proc/memcached/cpu_utilization">>,0}, {<<"proc/memcached/major_faults">>,0}, {<<"proc/memcached/major_faults_raw">>,64}, {<<"proc/memcached/mem_resident">>,543154176}, {<<"proc/memcached/mem_share">>,6148096}, {<<"proc/memcached/mem_size">>,880947200}, {<<"proc/memcached/minor_faults">>,0}, {<<"proc/memcached/minor_faults_raw">>,323925}, {<<"proc/memcached/page_faults">>,0}, {<<"proc/memcached/page_faults_raw">>,323989}]}, {cluster_compatibility_version,196608}, {version, [{lhttpc,"1.3.0"}, {os_mon,"2.2.14"}, {public_key,"0.21"}, {asn1,"2.0.4"}, {couch,"2.1.1r-432-gc2af28d"}, {kernel,"2.16.4"}, {syntax_tools,"1.6.13"}, {xmerl,"1.3.6"}, {ale,"3.0.1-1444-rel-community"}, {couch_set_view,"2.1.1r-432-gc2af28d"}, {compiler,"4.9.4"}, {inets,"5.9.8"}, {mapreduce,"1.0.0"}, {couch_index_merger,"2.1.1r-432-gc2af28d"}, {ns_server,"3.0.1-1444-rel-community"}, {oauth,"7d85d3ef"}, {crypto,"3.2"}, {ssl,"5.3.3"}, {sasl,"2.3.4"}, {couch_view_parser,"1.0.0"}, {mochiweb,"2.4.2"}, {stdlib,"1.19.4"}]}, {supported_compat_version,[3,0]}, {advertised_version,[3,0,0]}, {system_arch,"x86_64-unknown-linux-gnu"}, {wall_clock,215464}, {memory_data,{33733103616,33252155392,{<14497.1546.0>,41050832}}}, {disk_data, [{"/",4787516,12}, {"/sys/fs/cgroup",4,0}, {"/dev",16459992,1}, {"/run",3294252,1}, {"/run/lock",5120,0}, {"/run/shm",16471240,1}, {"/run/user",102400,0}, {"/boot",240972,31}, {"/home",3869352,10}, {"/opt",4787516,20}, {"/usr",7742856,26}, {"/tmp",3869352,1}, {"/var",4787516,20}, {"/opt",51475068,43}, {"/opt/store",381754588,53}]}, {meminfo, <<"MemTotal: 32942484 kB\nMemFree: 455836 kB\nBuffers: 204856 kB\nCached: 12865468 kB\nSwapCached: 0 kB\nActive: 23031524 kB\nInactive: 8294088 kB\nActive(anon): 16519580 kB\nInactive(anon): 1791864 kB\nActive(file): 6511944 kB\nInactive(file): 6502224 kB\nUnevictable: 0 kB\nMlocked: 0 kB\nSwapTotal: 16531452 kB\nSwapFree: 16531452 kB\nDirty: 220 kB\nWriteback: 0 kB\nAnonPages: 18255292 kB\nMapped: 615604 kB\nShmem: 56156 kB\nSlab: 662580 kB\nSReclaimable: 501436 kB\nSUnreclaim: 161144 kB\nKernelStack: 85248 kB\nPageTables: 118444 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nWritebackTmp: 0 
kB\nCommitLimit: 33002692 kB\nCommitted_AS: 37566348 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 340092 kB\nVmallocChunk: 34359386136 kB\nHardwareCorrupted:! 0 kB\nAnonHugePages: 18432 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugePages_Surp: 0\nHugepagesize: 2048 kB\nDirectMap4k: 61312 kB\nDirectMap2M: 33492992 kB\n">>}]

We do have 4 buckets but use only 3 of them: one is empty, one is doing less than 1 op/sec, one is below 2k ops/sec, and the other one is doing all the operations.

Without seeing the logs, there are two things that come to mind that I have seen cause this multiple times: 1) Transparent Huge Pages (THP) is on, but should be off. 2) vm.swappiness is not set to 0.

Usually when this kind of thing happens, the cluster manager is being starved of some resource. The THP setting can certainly cause this, as the OS shuffles memory pages around to create huge pages. Databases do not like having their resources moved, and this is not unique to Couchbase if you do some searching around on Google.
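Roughly, the settings I mean are the following (a minimal sketch for a typical Linux box; these need to be reapplied at boot, e.g. via rc.local or sysctl.conf, to survive a restart):

    # disable Transparent Huge Pages for the running system
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

    # tell the kernel not to swap out Couchbase memory
    sysctl -w vm.swappiness=0
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf   # keep it after a reboot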

THP and swappiness have been correctly set (I followed the advice found in http://blog.couchbase.com/often-overlooked-linux-os-tweaks).
I also found out that the Ubuntu Apport service was using 100% CPU when a node crashed, so I disabled it (following http://howtoubuntu.org/how-to-disable-stop-uninstall-apport-error-reporting-in-ubuntu).
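On Ubuntu, disabling Apport boils down to something like this (exact files and commands may vary by release):

    sudo service apport stop
    # and set enabled=0 in /etc/default/apport so it stays off after a reboot
    sudo sed -i 's/enabled=1/enabled=0/' /etc/default/apport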

Some more data on the issue, since it may help someone else. In our cluster we had 3 buckets: A with 20K ops/sec, B with 400 ops/sec and C with 1 op/sec.

Since B is accessed primarily through views, it was my first suspect, so we decided to create another cluster and move this bucket there.

So far we have had a couple of crashes on the new cluster and no downtime on the old one, so it seems the B bucket really was the faulty one.

After analyzing the code that accesses the B bucket I found that in some places we had:

final CouchbaseCluster cluster = CouchbaseCluster.create(environment, nodes);

without a disconnect and without caching the Cluster instance.
That means each time this code was executed we increased the number of connections to the nodes.
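The fix we are deploying is essentially to hold a single Environment/Cluster/Bucket per JVM and disconnect it only on shutdown. A minimal sketch of that pattern (the holder class is made up for illustration; node and bucket names are just taken from the logs above):

    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.env.CouchbaseEnvironment;
    import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

    // Hypothetical holder: one Environment/Cluster/Bucket per JVM instead of one per request.
    public final class CouchbaseHolder {

        // Environment and Cluster are expensive to create, thread-safe and meant to be shared.
        private static final CouchbaseEnvironment ENV = DefaultCouchbaseEnvironment.create();
        private static final Cluster CLUSTER =
                CouchbaseCluster.create(ENV, "web1.customer.local", "web2.customer.local");
        private static final Bucket BUCKET = CLUSTER.openBucket("comments");

        private CouchbaseHolder() {}

        public static Bucket bucket() {
            return BUCKET;
        }

        // Call once on application shutdown (e.g. from a ServletContextListener).
        public static void shutdown() {
            CLUSTER.disconnect();
            ENV.shutdown();
        }
    }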

As soon as we deploy the fix we will see whether the number of crashes goes down or not.

Sounds good. Keep us posted.

As a side note, what’s the ulimit on file descriptors for the Couchbase user?

We set the limit to 10240 (as explained in http://docs.couchbase.com/admin/admin/Misc/Trbl-commonErrors.html).
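For reference, one way to check what limit the running processes actually got, and to persist the setting (assuming the service runs as the couchbase user; the usual file is /etc/security/limits.conf):

    # check the limit of the running memcached process
    grep 'open files' /proc/$(pgrep -u couchbase memcached)/limits

    # lines to persist in /etc/security/limits.conf
    couchbase soft nofile 10240
    couchbase hard nofile 10240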

Right now, a week after the code change explained above, we have not had any other Couchbase node failures, so I think that really was the problem. If that's true, the SDK documentation should be modified to warn users, and some checks could be added (like the one that has been added for the Environment class) to warn you when multiple Cluster objects are detected.

Thanks for the feedback.

On the SDK documentation you mention, can you please click the link at the bottom right of that page that says “feedback on this page” and write a sentence or two about what you think? That will get it to our docs team directly.