Data loss with 2.5.1

We definitely have some percentage of data loss, and it is not random: it affects only certain keys. We are making observe calls, so maybe it is a problem with some vbuckets. I even have the _id’s of the missing documents.

What is the proper way to debug this kind of bug, to understand whether the data is not getting written to disk from memcached, whether it is data corruption, or whether it is some other Couchbase issue?

Hi there, these issues can be tough to debug given that the problem can be in the app or on the server side. On the server side, there is a possibility of data loss if you lose all replicas, or if you fail over node(s) and have not been using the replicate-to or persist-to options for mutations. If you restored a backup, that also takes you back in time, which means data loss. The logs in the Couchbase console can tell you if any of these happened. Did you check those already?
thanks
-cihan

We are using the :persist => 1 option, and it returns no errors. Also, the data seems to be written and accessible for some time (a few minutes, sometimes less) and then gets deleted. The data loss happens throughout the day, and we don’t use backups.
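
For context, the write path looks roughly like this (a simplified sketch: the host, bucket name, and document key are made up; :persist => 1 is the option exactly as used in our code, while the replicate-to requirement cihan mentioned is only noted in a comment, since its exact option name depends on the client version):

    require 'couchbase'

    # Simplified sketch of the write path. Host, bucket name, and key are
    # placeholders. :persist => 1 is the option we pass today; per cihan's
    # note, a replicate-to requirement should be added too, but its exact
    # option name depends on the client version, so it is not shown here.
    bucket = Couchbase.connect(:hostname => "127.0.0.1", :bucket => "default")

    doc_id = "user::42"  # hypothetical document _id
    bucket.set(doc_id, { "name" => "test" }, :persist => 1)

    # Reading it back right after the write succeeds; the documents only
    # disappear some minutes later.
    puts bucket.get(doc_id).inspect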

Also note that we have problems with replication. I tried to add 2 new servers and start a rebalance, and it just can’t complete. First, the Erlang processes crashed a few times (I’m still trying to get a core dump and will upload it when I have it). And now the replication is just stuck: only 2 servers out of 4 are replicating, and the other 2 are stuck at 75% forever. After I manually stop the replication, it says that it finished and shows no errors saying the data is uneven and needs replication. But the number of active items differs a lot between nodes; it’s not evenly distributed.

https://www.evernote.com/shard/s9/sh/953a6824-ed22-4d66-aac8-ba382897ac86/c9736acf6f2c29fd999955cd4f7e70ac

So I still have a feeling that some vbuckets have problems, but I have no idea how to debug it. I can’t find any meaningful information in the logs.

It segfaulted again, and I have a core dump: https://www.dropbox.com/s/fca53un6ievswzr/core.beam.smp.tgz (big file, 400+ MB gzipped; still uploading, should be ready in 30 minutes)

Also the erl_crash.dump: https://www.dropbox.com/s/eezxba2mynj5cs8/erl_crash.dump.1396349027.6664

Right now I just can’t finish the replication…

I hope these crashes are related to our data loss.

Also adding the cbcollect_info report: https://www.dropbox.com/s/7a8luzdbjvo9kb6/report.zip
And the cbhealthchecker output for the day and the hour: https://www.dropbox.com/s/eonfvwz4hfn5u64/day.txt https://www.dropbox.com/s/52wh2vkaqaoo9xb/hour.txt

The healthchecker shows errors for a few checks: Active Resident Ratio, Replica Resident Ratio, and one very interesting message:

“Number of active vBuckets ‘1022’ is less than ‘1024’ per node”, so where are the other 2 vbuckets? I guess this is why I’m losing some data? And if I can’t run a rebalance (because it segfaults), how do I restore those 2 vbuckets?

It’s worth adding that it started segfaulting even without a rebalance… Last time, 2 of the 4 nodes went down simultaneously…

Is there a way to check, for a given document _id, which vbucket it will use? Some sort of mapping? Then I could confirm the theory that all the missing _ids are inside these 2 missing vbuckets.
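
For reference, the key-to-vbucket mapping is just a CRC-32 hash of the key; below is a minimal sketch of what vbuckettool computes, assuming the default 1024 vbuckets and the standard zlib CRC-32 used by libvbucket:

    require 'zlib'

    # Sketch of the client-side key -> vbucket mapping (as in libvbucket):
    # take the standard CRC-32 of the key, keep the upper 16 bits, and mask
    # by the number of vbuckets (1024 by default).
    def vbucket_for_key(key, num_vbuckets = 1024)
      crc = Zlib.crc32(key)
      ((crc >> 16) & 0x7fff) & (num_vbuckets - 1)
    end

    puts vbucket_for_key("user::42")  # hypothetical document _id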

Using vbuckettool I found that the missing keys are in random vbuckets, so it is something different…

Hi there, it seems like you may be experiencing a number of issues. Do you have a support contract with us? Could I ask you to open a support case so we can take a deeper look at this?
thanks
-cihan

Nope, not yet. We are still in the evaluation period.

We started with the Community version; after a few weeks of use we started getting lots of errors like this, and a week ago we decided to try the 2.5.1 Enterprise version. But as you can see, it is not much better…

If this is something that requires a lot of work, at least point me at what to check first. We really liked how Couchbase worked before these issues started happening…

I haven’t had a chance to review your logs yet, but one of the obvious things to do first is to ensure that all nodes are up and healthy: check that all vbuckets are present (1024 active and 1024 replica; in the Couchbase console > bucket name > vBucket Resources, the first row shows the vbucket count).
If you are certain that your app correctly handled all errors and you were able to write the value successfully and read it back, the usual suspect is HW, or in your case VM, health. I would check whether there are any drive or HW faults reported on the nodes.
thanks
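
If it helps, the same vbucket counts can be pulled from the bucket stats REST endpoint instead of the console; a rough sketch, assuming the /pools/default/buckets/<bucket>/stats endpoint and the vb_active_num / vb_replica_num samples (host, credentials, and bucket name are placeholders):

    require 'net/http'
    require 'json'
    require 'uri'

    # Rough sketch: fetch bucket stats from the REST API and print the most
    # recent active/replica vbucket counts. Host, credentials, and bucket
    # name are placeholders; a healthy cluster should report 1024 for both.
    uri = URI('http://127.0.0.1:8091/pools/default/buckets/default/stats')
    request = Net::HTTP::Get.new(uri)
    request.basic_auth('Administrator', 'password')

    response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
    samples = JSON.parse(response.body)['op']['samples']

    %w[vb_active_num vb_replica_num].each do |stat|
      puts "#{stat}: #{samples[stat].last}"
    end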

It looks like our data loss happened because of temp OOM errors, but this issue is quite strange too: http://www.couchbase.com/communities/q-and-a/ram-stats-do-not-reconcile-and-evictions-seems-be-not-working-expected
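
Since temp OOM is a transient error, the application has to retry those writes rather than treat them as stored; here is a small sketch of a retry wrapper, assuming the Ruby client raises Couchbase::Error::TemporaryFail for a temp-OOM response (the exception class name may differ between client versions):

    require 'couchbase'

    # Sketch of retrying a write on temporary OOM, assuming the server's
    # temp-OOM response surfaces as Couchbase::Error::TemporaryFail in the
    # Ruby client (verify the exact error class for your client version).
    def set_with_retry(bucket, key, value, opts = {}, attempts = 5)
      delay = 0.1
      begin
        bucket.set(key, value, opts)
      rescue Couchbase::Error::TemporaryFail
        attempts -= 1
        raise if attempts <= 0
        sleep(delay)
        delay *= 2  # back off while the server works down its memory pressure
        retry
      end
    end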

And I have a feeling that some of our segfaults may be related to IO problems. From what I’ve read about Couchbase and Erlang, it is very sensitive to IO spikes, and high read/write queues can cause segfaults. After upgrading the servers it started working slightly better.