We’re running a 12-node Couchbase cluster in production (2.2.0 community edition, build-837), and we’ve recently been seeing nodes randomly lock up and generate large core dump files. When the problem occurs, the affected node stops responding to user requests, and we have to force-kill the process and restart it.
The lock-ups don’t seem to correspond to any noticeable change in traffic or query load, and it’s a different node each time.
We have a single Couchbase bucket, and records are accessed by simple key-value lookups. We don’t use views or complex queries.
I can see errors in info.log, and I’ve put a gist of the relevant time period here: https://gist.github.com/stephenhenderson/813e495dedcbf1727793 (in this case the core dump was generated around 14:35 on 2014-05-28).
I can provide additional logs if needed, though there doesn’t seem to be anything significant around the time of the problem.
Any help would be appreciated. We’re starting to see this happen every 3-4 days.