Random node failures, core dump generated

StephenHenderson · June 3, 2014, 1:47pm

Hi,

We’re running a 12-node Couchbase cluster in production (2.2.0 community edition, build-837) and we’ve recently been seeing nodes randomly lock up and generate large core dump files. When the problem occurs, the affected node stops responding to user requests, and we have to force-kill the process and restart it.

It doesn’t seem to correspond to any noticeable change in traffic or query load and it’s a different node each time.

We have a single couchbase bucket and records are accessed by simple key-value pair lookups. No views or complex queries.

I can see there are errors in the info.log and I’ve put a gist of the relevant time period here: https://gist.github.com/stephenhenderson/813e495dedcbf1727793 (the core dump was generated around 14:35 on 2014-05-28 in this case)

I can provide additional logs if needed, though there doesn’t seem to be anything significant around the time of the problem.

Any help would be appreciated. We’re starting to see this happen every 3-4 days.

Thanks,
Stephen

pvarley · June 5, 2014, 12:34am

Which process is segfaulting?
What OS are you using?
What is the segfault message?

StephenHenderson · June 5, 2014, 9:58am

Hi,

Thanks for replying.

It looks like it was the memcached process which generated the core dump (I only just realised the core dumps had a pid associated with them). Where would I find the segfault message? I can’t see anything like that in any of the couchbase logs or /var/log/messages.

We’re running CentOS 6.4 on amazon EC2 instances (m3.large).

StephenHenderson · June 5, 2014, 1:11pm

We had another core dump on a node last night and again it looks like it came from the memcached process. However this time it seemed to recover by itself.

The full error log from the time period is here: https://gist.github.com/stephenhenderson/813e495dedcbf1727793 (the timestamp on the core.dump is 21:50 just after the first error).

It includes this crash report:
[error_logger:error,2014-06-04T21:49:08.606,ns_1@vp4.us-east.xxxxxx.net:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: couch_stats_reader:init/1
pid: <0.15680.1252>
registered_name: 'couch_stats_reader-visitorprofilestore'
exception exit: {timeout,
{gen_server,call,
[dir_size,
{dir_size,
"/opt/couchbase/var/lib/couchbase/data/visitorprofilestore"}]}}
in function gen_server:terminate/6
ancestors: ['single_bucket_sup-visitorprofilestore',<0.8661.0>]
messages: [refresh_stats]
links: [<0.8662.0>,<0.298.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6765
stack_size: 24
reductions: 14182742003
neighbours:

pvarley · June 5, 2014, 1:58pm

You should see something like this in /var/log/messages:
Jun 5 12:02:38 patrick-laptop kernel: memcached[12682]: segfault at 0 ip (null) sp 00007fffe5828c08 error 14 in memcached[400000+20000]

With that I should be able to debug the problem.
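
For anyone searching a large /var/log/messages for lines of the shape quoted above, here is a minimal Python sketch. It assumes the kernel message follows the same layout as the sample line; the regular expression and the function names (parse_segfault, find_segfaults) are illustrative and not part of any Couchbase tooling.

```python
import re

# Pattern modelled on the sample kernel line quoted in this thread.
# The trailing "in <object>[base+size]" part is treated as optional.
SEGFAULT_RE = re.compile(
    r"(?P<process>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<address>\S+) "
    r"ip (?P<ip>\S+) sp (?P<sp>\S+) error (?P<error>\d+)"
    r"(?: in (?P<object>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_segfault(line):
    """Return a dict of fields for a segfault line, or None if it does not match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["error"] = int(fields["error"])
    return fields


def find_segfaults(path="/var/log/messages"):
    """Yield parsed segfault records from a syslog file (path is an assumption)."""
    with open(path, errors="replace") as log:
        for line in log:
            parsed = parse_segfault(line)
            if parsed is not None:
                yield parsed
```

Calling find_segfaults() on each node and comparing the reported pid with the pid in the core file name is one way to confirm which process crashed. Reading /var/log/messages normally requires root.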

StephenHenderson · June 5, 2014, 3:18pm

I can’t see any segfault errors but last night’s problem did generate some hung-task errors in /var/log/messages. I’ve put the gist here: https://gist.github.com/stephenhenderson/91c9e431d32e1f6a3b7b (there were no other entries for over 10 minutes either side)

I think a problem in the past has been the core.dump filling up the root partition so errors might not have been written to disk successfully. This time the dump file was a bit smaller. As a side note, do you know if it’s possible to configure couchbase to specify where the memcached dump file should be written so we can direct it to another partition with more space available?

Thanks,
Stephen
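
On the side question about where the dump is written: on Linux the kernel's core_pattern setting (readable at /proc/sys/kernel/core_pattern) decides where core files go, independently of the application. Whether Couchbase 2.2.0 has its own setting for this is not established in this thread. The Python sketch below only inspects the current situation: it works out which directory a given pattern would write to and how much space is free there. The function names, and the example paths used with them, are hypothetical.

```python
import os
import shutil


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the kernel's current core_pattern (Linux only)."""
    with open(path) as f:
        return f.read()


def core_dump_directory(core_pattern, process_cwd):
    """Return the directory a core file would land in, or None if piped.

    process_cwd is the crashing process's working directory; it must be
    supplied by the caller because it is not known from the pattern alone.
    """
    pattern = core_pattern.strip()
    if pattern.startswith("|"):
        # The kernel pipes the dump to a helper program instead of a file.
        return None
    directory = os.path.dirname(pattern)
    if os.path.isabs(pattern):
        return directory
    # A relative pattern is resolved against the process's working directory.
    return os.path.normpath(os.path.join(process_cwd, directory))


def free_bytes(directory):
    """Free space, in bytes, on the filesystem holding the given directory."""
    return shutil.disk_usage(directory).free
```

If the reported directory sits on the root partition, pointing core_pattern at an absolute path on a larger partition (a sysctl change made as root) is the usual way to redirect dumps. That should be tried on a single node first, since it affects every process on the host.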