Continuous Rebalance Failure, Memcached Taking Very High CPU

Sanjay1313 · March 2, 2014, 2:27pm

Hi,
Below are the steps I followed:

1) Install Couchbase 2.1 on machine A
2) Create 4 buckets
3) Install Couchbase 2.1 on machine B
4) Add Couchbase on machine B to the cluster
5) Rebalance

Initially the rebalance is fine, but after some time the memcached CPU usage on one of the machines becomes VERY HIGH.
Partway through, the rebalance fails with the log below:

“Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.5427.6>,
{badmatch,
[{'EXIT',
{{badmatch,{error,closed}},
{gen_server,call,
[<12900.2331.4>,had_backfill,30000]}}}]}}}”

-> If I retry the rebalance, it fails again every time.
-> If we run netstat on the memcached process, it shows 600+ connections to the beam.smp process, and most of them are in the CLOSE_WAIT state.
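
For anyone who wants to put numbers on the connection states, here is a rough Python sketch that tallies TCP states per owning program. It assumes output in the usual Linux `netstat -tanp` column layout. The sample lines, ports and the helper name `count_tcp_states` are made up for illustration and are not taken from the affected machines.

```python
from collections import Counter

# Made-up sample in the column layout of `netstat -tanp` on Linux:
# Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program.
SAMPLE = """\
tcp        0      0 0.0.0.0:11210       0.0.0.0:*           LISTEN      6112/memcached
tcp        1      0 127.0.0.1:11209     127.0.0.1:40112     CLOSE_WAIT  6112/memcached
tcp        1      0 127.0.0.1:11209     127.0.0.1:40113     CLOSE_WAIT  6112/memcached
tcp        0      0 127.0.0.1:11209     127.0.0.1:40114     ESTABLISHED 6112/memcached
tcp        0      0 127.0.0.1:40114     127.0.0.1:11209     ESTABLISHED 5990/beam.smp
"""


def count_tcp_states(netstat_output, program="memcached"):
    """Count TCP states for rows whose PID/Program column names `program`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a complete TCP row.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        state, owner = fields[5], fields[6]
        if owner.endswith("/" + program):
            states[state] += 1
    return states


counts = count_tcp_states(SAMPLE)
print(dict(counts))
```

To use it on a real node, paste the saved output of `netstat -tanp` (run as root so the PID/Program column is filled in) in place of `SAMPLE`. A CLOSE_WAIT count that keeps growing means the peer has closed its end and the local process has not yet closed the socket.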

CPU usage of memcached (top command):

6112 couchbas 20 0 752m 214m 3480 S 405.3 2.8 11712:14 memcached

Hardware/OS details:

OS: CentOS 6.2 on both machines
CPU: Intel® Core™ i7-2600 CPU @ 3.40GHz (8 cores) on both machines
RAM: 8GB on both machines

What can cause this rebalance failure? Please let us know the probable cause.

Thanks

househippo · March 3, 2014, 8:41am

Rebalance is very disk I/O heavy, especially on the node being added.
In the Web Admin GUI, go to the “SERVER NODES” tab, find the server that is being added back in and click on the blue arrow. It should tell you which bucket is currently being rebalanced. From there go to the “DATA BUCKETS” tab => Disk Queues => Average Age Active. That tells you how long, in seconds, it takes to write your items to disk. Click “show by server” to see each server's time. Is the time for the server you are adding back in the same as for the others?
Also, what is your swappiness set to (the default is 60)?
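
If it helps, the current swappiness can be read straight from procfs. A minimal Python sketch, assuming a Linux host where the value lives in `/proc/sys/vm/swappiness`; the helper name `read_swappiness` is only for illustration.

```python
from pathlib import Path


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return vm.swappiness as an int, or None if the file is not present."""
    p = Path(path)
    if not p.is_file():
        return None
    return int(p.read_text().strip())


value = read_swappiness()
print("vm.swappiness:", value if value is not None else "not available on this system")
```

The value is a plain number from 0 to 100 rather than a percentage of anything on disk, so 60 here means the kernel default has not been changed.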

Sanjay1313 · March 3, 2014, 6:03pm

Hi, thanks for the reply.
The thing is, there are no items in any of the Couchbase buckets. This is the initial rebalance, done right after adding the new node to the cluster. So when I look at the Disk Queue option, the average age always shows 0.
Does this issue have to do with the hardware, the hard disk in particular?
The hardware seems to exceed the minimum recommended for installing Couchbase.

Memcached is taking 400-500% of the CPU and has 600+ connections to beam.smp (on one of the machines), even when rebalancing 0 elements. (The rebalance fails over and over again.)

When we check the pstack output of the memcached process, 3 of the threads seem to be taking more CPU than the rest.
Below is the backtrace of the threads that are taking more CPU:
Thread 40 (Thread 0x7f252618d700 (LWP 13676)):
#0 0x00007f252d77e435 in KVShard::getVBuckets() () from /opt/couchbase/lib/memcached/ep.so
#1 0x00007f252d777820 in Flusher::getNextVb() () from /opt/couchbase/lib/memcached/ep.so
#2 0x00007f252d7783dd in Flusher::step(unsigned long) () from /opt/couchbase/lib/memcached/ep.so
#3 0x00007f252d782b59 in ExecutorThread::run() () from /opt/couchbase/lib/memcached/ep.so
#4 0x00007f252d7831fd in launch_executor_thread () from /opt/couchbase/lib/memcached/ep.so
#5 0x0000003e824077f1 in start_thread () from /lib64/libpthread.so.0
#6 0x0000003e820e570d in clone () from /lib64/libc.so.6
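
One way to confirm which functions are hot is to take several pstack samples a second or so apart and count how often each function shows up as frame #0. Below is a rough Python sketch of that idea. It assumes the pstack/gdb text layout shown above; the helper names `innermost_frames` and `hot_functions` are hypothetical, and the sample is just the first two frames of the backtrace above.

```python
import re
from collections import Counter

THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Frame #0 line; the "0x... in " address part is optional in gdb-style output.
FRAME0_RE = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+ in )?(.*?) \(")

SAMPLE = """\
Thread 40 (Thread 0x7f252618d700 (LWP 13676)):
#0 0x00007f252d77e435 in KVShard::getVBuckets() () from /opt/couchbase/lib/memcached/ep.so
#1 0x00007f252d777820 in Flusher::getNextVb() () from /opt/couchbase/lib/memcached/ep.so
"""


def innermost_frames(pstack_text):
    """Map each thread number to the function named in its frame #0."""
    frames = {}
    current = None
    for line in pstack_text.splitlines():
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = FRAME0_RE.match(line)
        if m and current is not None:
            frames[current] = m.group(1)
    return frames


def hot_functions(samples):
    """Count frame #0 functions across several pstack samples."""
    counts = Counter()
    for text in samples:
        counts.update(innermost_frames(text).values())
    return counts


print(innermost_frames(SAMPLE))
```

Feeding `hot_functions` a list of saved pstack outputs gives a crude profile: functions that appear in frame #0 in most samples are where the threads are spending their time.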

Please let me know if you need any more info

Thanks

Sanjay1313 · July 29, 2014, 5:52am

Hi,
I could not find the statistic you asked about in the GUI.

Swappiness was set to 60. I changed it to 0, but that did not help either.
After failing continuously, the memcached process is in a bad state. It takes a lot of CPU (>600%)
and has a lot of TCP connections, 800+ on one node.

Is there any workaround or fix for this issue? The rebalance of empty buckets is causing this, and it is taking a lot of time as well.
Does this depend on the network setup or the machine hardware? We are seeing this issue even with 2 identical hardware configurations.

Thanks

gluz · February 1, 2015, 8:22am

Hello.
Got the same problem with Couchbase 3.
Couchbase is failing and the memcached service takes 100% CPU (we don’t have memcached buckets!!!).
Also, a server restart didn’t fix the issue.
We are running a 4-node cluster on AWS EC2 with r3.large servers.

Is there any fix or some knowledge gathered about this problem?

Thanks

javieramadi · February 26, 2015, 2:35pm

Any update on this? I have the same problem after upgrading to version 3 on 2 nodes.

dauger · March 3, 2015, 6:52pm

I’m seeing this too with Couchbase 3.0.2 on Windows Server 2012. The memcached process maxes out the CPU when trying to rebalance with a combination of empty and almost empty buckets.

nico_ad · December 13, 2015, 3:54pm

We have a similar issue; we are on CentOS 6.5.

Did you manage to solve your issue?

househippo · January 21, 2016, 10:01pm

@nico_ad, could you check the ulimit settings for the user that is running Couchbase?

#ulimit -a

and paste the output.
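
If a script is easier than pasting shell output, the two limits that matter most with many connections can also be read from Python. A minimal sketch using the standard `resource` module (Unix only); the helper names `fmt` and `current_limits` are just for illustration, and the script reports the limits of whichever user runs it, so run it as the Couchbase user.

```python
import resource

# The two limits most relevant when a process holds hundreds of sockets.
LIMITS = {
    "open files (ulimit -n)": resource.RLIMIT_NOFILE,
    "max user processes (ulimit -u)": resource.RLIMIT_NPROC,
}


def fmt(value):
    """Render a raw rlimit value the way ulimit prints it."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the current process."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```

The limits of an already running memcached process may differ from these; on Linux they can be read from `/proc/<pid>/limits`.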

Thanks