Time spent with garbage collection


#1

How can I tell how much time is spent during garbage collection?
I’ve been looking at ns_server.stats.log and found this piece of information:

{statistics,
[{wall_clock,{36421279,1624}},
{context_switches,{9701483,0}},
{garbage_collection,{1800886,10515982029,0}},
{io,{{input,1733995406},{output,25540467904}}},
{reductions,{1736523859,172750}},
{run_queue,0},
{runtime,{842940,1160}},
{run_queues,{0,0,0,0,0,0,0,0}}]},
{system_stats,
[{cpu_utilization_rate,19.79949874686717},
{swap_total,1073741824},
{swap_used,72876032},
{mem_total,17179869184},
{mem_free,5512970240}]},

I’m guessing that numbers have something to do but not sure about their meaning.
Does anyone have a clue?
Should I be looking somewhere else?


#2

I wouldn’t worry too much about that as ns_server is not in the critical path for most operations. The cluster manager is intentionally out of the way, for the most part, but comes into play for monitoring and failover.

Is it cluster management garbage collection that you’re concerned with, or affecting something else?


#3

I’ve been getting these type of errors:

Caused by: net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 5,000 ms. Node status: Connection Status { memcache03.mri.tools/15.22.1.5:11210 active: false, authed: true, last read: 5,115 ms ago memcache06.mri.tools/15.22.1.8:11210 active: false, authed: true, last read: 5,334 ms ago dai-memcache20.dai/15.22.2.110:11210 active: false, authed: true, last read: 5,417 ms ago memcache01.mri.tools/15.22.1.3:11210 active: false, authed: true, last read: 5,180 ms ago dai-memcache19.mri.tools.dai/15.22.2.109:11210 active: false, authed: true, last read: 5,009 ms ago dai-memcache18.dai/15.22.2.108:11210 active: false, authed: true, last read: 5,262 ms ago memcache02.mri.tools/15.22.1.4:11210 active: false, authed: true, last read: 5,161 ms ago memcache04.mri.tools/15.22.1.6:11210 active: false, authed: true, last read: 5,028 ms ago }
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1241)
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1257)
at com.ezrez.common.memcached.MemcachedServiceCouchbaseBase.primitiveGet(MemcachedServiceCouchbaseBase.java:243)
… 144 more

From other posts, the suggestion is that it could be related to garbage collection, that’s why I’m trying to find out if that’s the case.


#4

The garbage collection in this case is the Java JVM one at the application server side and not the Couchbase Server side.

Timeout can be a trick problem to debug as the problem can be any where in the stack: Application/network/Couchbase Server/ Hardware. From the backtrace provide it looks like get operations are timing out after 5 seconds. Get requests go directly to the Data service (memcached process). I would recommend looking at the memcached.log. On the newer versions of Couchbase Server, the data service will log any operation that took over 500ms. For example they look like this:

2016-06-20T15:39:25.744852+01:00 WARNING 132: Slow GET operation on connection: 4421 ms ([ 10.0.0.2:35934 - 10.0.0.4:11210 ])

If you do see those messages on the server side then you know that the problem is on the server side, again this could be a sizing issue or a hardware problem. The next steps in this case would be to check the resident ratio of the bucket and the disk utilization.

If you do not see those messages I would start looking at the network and application.


#5

Thanks for the help.
Indeed it seems to be environment related.