Server failed, data corrupted, and views intermittent



I’ve filed a bug report here:

Some of the logs generated by cbcollect_info indicate that the following caused the server to fail:

“State machine mb_master”
“Generic server ns_cookie_manager”
“Generic server ns_node_disco”
“Generic server remote_clusters_info”

I’m not sure what these are, or why the processes would be terminating, but the timestamps match up with when the server automatically failed over. The Munin graphs show a sharp rise in swap activity at around that time, but I don’t know whether the swapping was a cause or a result of these processes failing, or whether it’s unrelated. Either way, I can’t find the root cause of either issue in the logs, despite hours and hours of reading through them, and there’s nothing evidently wrong in the syslog, kern.log etc. files either.

The server automatically failed over at 17:48:47 and was available on :8091 again at 17:50:42, so that side of things cleaned itself up OK, but I’d still like to find the root cause so I can put preventative measures in place for the future, if possible/applicable.

I’m also concerned that (at least) one fairly critical document was corrupted. This document stores a large array (the equivalent of a MySQL table with a thousand or so entries) compressed with bzcompress() in PHP. When decompressed and passed through print_r(), it output part of the array and then a load of garbage. After deleting the key from Couchbase, forcing it to rebuild the data in the cache, the array was complete as expected. I’m assuming this isn’t a common occurrence, but surely it shouldn’t happen at all? How can I find the root cause of this?
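For what it’s worth, since the value is already compressed before being stored, one cheap guard against this kind of silent corruption is to prepend a checksum so a bad read is detected immediately instead of surfacing as garbage from print_r(). The app in question is PHP, but here is a minimal language-neutral sketch of the idea in Python; the pack/unpack names and the 4-byte CRC layout are my own invention, not anything Couchbase provides:

```python
import bz2
import zlib

def pack(payload: bytes) -> bytes:
    """Compress the payload and prepend a CRC32 of the compressed bytes."""
    body = bz2.compress(payload)
    return zlib.crc32(body).to_bytes(4, "big") + body

def unpack(blob: bytes) -> bytes:
    """Verify the stored CRC32 before decompressing; raise on corruption."""
    stored = int.from_bytes(blob[:4], "big")
    body = blob[4:]
    if zlib.crc32(body) != stored:
        # Treat the cached value as lost: delete the key and let the app
        # rebuild it, exactly as was done by hand above.
        raise ValueError("cached value failed checksum; rebuild it")
    return bz2.decompress(body)
```

That wouldn’t explain the root cause, but it would turn a confusing partial-array symptom into an explicit error at read time.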

When the node was available again, a rebalance was run, and during that time view queries were only intermittently successful. These view queries were run with “stale=false”, so the index will have been rebuilt each time, but according to the documentation, view queries should run as normal during a rebalance. Why is/was this not the case, and is there somewhere I can look for errors relating to this? I’ve searched through the “ns_server.views.log” file compiled by cbcollect_info and it has no mention at all of the view/design document in question.
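In the meantime, since individual stale=false requests can apparently fail transiently while the rebalance is in flight, one pragmatic client-side mitigation is to retry with backoff rather than treating the first error as fatal. A rough sketch in Python (the helper name, attempt count, and delays are my own choices; the commented-out URL assumes the usual :8092 view endpoint and hypothetical bucket/design-document names):

```python
import time
import urllib.error
import urllib.request

def query_view_with_retry(url: str, attempts: int = 4,
                          base_delay: float = 0.5) -> bytes:
    """Fetch a URL, retrying with exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error to the caller
            time.sleep(base_delay * (2 ** attempt))

# Example (hypothetical host, bucket, and design doc):
# body = query_view_with_retry(
#     "http://127.0.0.1:8092/default/_design/mydocs/_view/by_id?stale=false")
```

This doesn’t answer why the queries failed during rebalance, but it would smooth over the intermittent errors from the application’s point of view.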

Thank you in advance for any feedback/suggestions.



It looks like the whole discussion has continued in the JIRA: