Connection timeouts during statistics

Hi,

Every few minutes, I get connection timeouts on my Couchbase installation. The timeouts are very regular (exactly every two minutes) and seem to happen during some statistics operation (I found this by looking at the /opt/couchbase/var/lib/couchbase/logs/stats log file).

Is this a known issue? Is there any way to just disable the statistics to see if it corrects the problem?

Thank you! :slight_smile:

What kind of statistics operations? Do you mean stats in the Couchbase Web Console?

Actually, I don’t know exactly what the server is doing at the moment of the timeouts. But, as I said, the timeouts occur exactly when statistics are logged in “/opt/couchbase/var/lib/couchbase/logs/stats”. I get something like :

[ns_doctor:debug,2014-10-06T15:23:39.054,ns_1@:ns_doctor<0.15098.404>:ns_doctor:handle_info:167]Current node statuses:
[{'ns_1@,
[{last_heard,{1412,623414,52371}},
{outgoing_replications_safeness_level,
[{,green},
[…]

and the timeouts happen at the exact same time. I upgraded the servers where I have these problems (more RAM and CPUs) and the problems are a lot less frequent. But I still get timeouts every once in a while.

I figured that if I had a way to disable these statistics (which I don’t use), maybe the timeouts would completely disappear, or I would at least have something to work on.

Also, I have these every once in a while in the same logfile :

[stats:warn,2014-10-06T15:25:25.680,ns_1@:<0.26253.449>:stats_collector:latest_tick:240]Dropped 1 ticks

Maybe that’s completely normal though…
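For anyone who wants to check how regular these entries really are, here is a rough Python sketch that pulls the timestamps out of the log headers and prints the gaps between entries from one module. The header pattern is only inferred from the two lines quoted above, so treat it as an assumption rather than a documented format; `parse_header`, `entry_times` and `gaps_seconds` are names I made up for illustration.

```python
import re
from datetime import datetime

# Assumed header shape, inferred from the quoted lines:
# [module:level,YYYY-MM-DDTHH:MM:SS.mmm,node:...]message
HEADER = re.compile(
    r"^\[(?P<module>\w+):(?P<level>\w+),"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}),"
)

def parse_header(line):
    """Return (module, level, timestamp) or None for continuation lines."""
    match = HEADER.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
    return match.group("module"), match.group("level"), ts

def entry_times(lines, module):
    """Collect the timestamps of all entries logged by one module."""
    times = []
    for line in lines:
        parsed = parse_header(line)
        if parsed is not None and parsed[0] == module:
            times.append(parsed[2])
    return times

def gaps_seconds(times):
    """Seconds between each pair of consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

if __name__ == "__main__":
    # Path as mentioned earlier in this thread; adjust for your install.
    path = "/opt/couchbase/var/lib/couchbase/logs/stats"
    with open(path, errors="replace") as handle:
        print(gaps_seconds(entry_times(handle, "ns_doctor")))
```

If the gaps cluster around one value and your client-side timeouts land on the same timestamps, that at least confirms the correlation before you change any hardware.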

@markgaudreau Did you ever figure out a solution to this problem, other than upgrading the RAM/CPU of your server? It appears I am having the exact same problem.

Actually, upgrading RAM/CPU and reinstalling everything (after the upgrade) solved everything for me. I don’t have these problems anymore.

We upgraded our nodes from m3.medium (1 vCPU) to c3.xlarge (4 vCPU) and the issue went away as well. Our average response time is now 10-20ms.

My takeaway from this:

Couchbase doesn’t do well with a small number of vCPU.

I think I’m currently hitting this problem. It manifests as Cloudflare giving me a 524 error, but after a LOT of digging (weeks), I found that every 2 minutes, CB hits 100% CPU for about 2 seconds - during that 2 seconds, any requests get a timeout.

Is it just increasing the number of CPUs? Could it be anything else?

EDIT: It seems to be coming from beam.smp specifically.

After digging in, we found two things that were contributing. View indexing and Stats collection. If you are seeing spikes every 2m exactly, it is almost certainly the stats collection.

For our situation, increasing the number of CPUs was all it took.
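If you want a quick way to test the "every 2m exactly" pattern against your own timeout timestamps, something like the sketch below might help. It is only an illustration: `looks_periodic` is a made-up helper, the 120 second period comes from this thread, and the 3 second tolerance is an arbitrary assumption. It accepts gaps that are whole multiples of the period, since not every stats run necessarily causes a visible timeout.

```python
def looks_periodic(times_s, period_s=120.0, tolerance_s=3.0):
    """Return True when every gap between consecutive events is close to
    a whole multiple of period_s.

    times_s: event times in seconds, sorted in ascending order.
    """
    if len(times_s) < 3:
        # Too few events to say anything about regularity.
        return False
    for earlier, later in zip(times_s, times_s[1:]):
        gap = later - earlier
        multiples = round(gap / period_s)
        if multiples < 1:
            # Events much closer together than one period.
            return False
        if abs(gap - multiples * period_s) > tolerance_s:
            return False
    return True

if __name__ == "__main__":
    # Hypothetical timeout times (seconds since the first one).
    print(looks_periodic([0.0, 120.5, 241.0, 480.2]))
    print(looks_periodic([0.0, 50.0, 130.0, 300.0]))
```

Timeouts that fail this check are more likely to come from something irregular, such as view indexing, than from the stats collection.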

Okay, I’ve doubled the number of cores, so here’s hoping!

Thanks!