We definitely have some percentage of data loss, and it is not random: it only affects certain keys. We are making observe calls, so maybe it is a problem with some vBuckets. I even have a list of the missing document _ids.
What is the proper way to debug such bugs, to understand whether the data never gets written to disk from memcached, whether it is data corruption, or whether it is some other Couchbase issue?
Hi there, these issues can be tough to debug given that the cause can be in the app or on the server side. On the server side, there is a possibility of data loss if you lose all replicas, or if you fail over a node (or nodes) and you have not been using the replicate-to or persist-to options for mutations. Restoring from a backup will also take you back in time, which means data loss. The logs in the Couchbase console can tell you if any of these happened. Did you check them already?
thanks
-cihan
We are using the :persist => 1 option, and it returns no errors. Also, the data seems to be written and accessible for some time (a few minutes, sometimes less) before it gets deleted. The data loss happens throughout the day, and we don’t use backups.
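For context, our write path follows the usual write-then-confirm-persistence pattern. Here is a minimal, self-contained sketch of that pattern (the client class and method names are stand-ins, not the real SDK API):

```python
import time

class StubClient:
    """Stand-in for a Couchbase client (hypothetical API, for illustration)."""
    def __init__(self):
        self._memory = {}
        self._disk = {}

    def set(self, key, value):
        # Writes land in memory first; the flusher persists them later.
        self._memory[key] = value

    def observe_persisted(self, key):
        # Simulate the disk flusher catching up, then report disk state.
        if key in self._memory:
            self._disk[key] = self._memory[key]
        return key in self._disk

def set_with_persist(client, key, value, timeout=5.0, interval=0.1):
    """Write a value, then poll observe until it is confirmed on disk."""
    client.set(key, value)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if client.observe_persisted(key):
            return True
        time.sleep(interval)
    raise RuntimeError("persistence not confirmed for %r" % key)
```

The point is that our writes only return success after persistence is confirmed, which is why the later disappearance of the data surprised us.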
Also note that we have problems with rebalancing. I tried to add 2 new servers and start a rebalance, and it just can’t complete. First, the Erlang processes crashed a few times (I’m still trying to get a core dump and will upload it when I have it). Now replication is just stuck: only 2 servers out of 4 are replicating, and the other 2 are stuck at 75% forever. After I manually stop the rebalance, it says that it finished and shows no errors indicating that data is unevenly distributed and needs rebalancing. But the number of active items differs a lot between nodes; it is not evenly distributed.
The healthchecker shows errors for a few checks: Active Resident Ratio, Replica Resident Ratio, and one very interesting message:
“Number of active vBuckets ‘1022’ is less than ‘1024’ per node”. So where are the other 2 vBuckets? I guess this is why I’m losing some data? And if I can’t run a rebalance (because it segfaults), how do I restore those 2 vBuckets?
Is there a way to check, for a given document _id, which vBucket it will use? Some sort of mapping? That way I could confirm the theory that all the missing _ids live in those 2 missing vBuckets.
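To test that theory myself, here is a sketch of the key-to-vBucket mapping as I understand it from the open-source client code (libvbucket hashes the key with CRC32 and maps the upper bits into the vBucket space; please verify against your client library version before relying on it):

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase Server default on Linux

def vbucket_for_key(key):
    # CRC32 the key, keep bits 16..30, then map into the vbucket space.
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return ((crc >> 16) & 0x7FFF) % NUM_VBUCKETS
```

Running this over the missing _ids and checking whether they all land in the same couple of vBuckets would confirm or rule out the missing-vBucket theory.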
Hi there, it seems like you may be experiencing a number of issues. Do you have a support contract with us? Could I ask you to open a support case so we can take a deeper look at this?
thanks
-cihan
We started with the Community edition; after a few weeks of use we started getting lots of errors like this, and a week ago we decided to try the 2.5.1 Enterprise edition. But as you can see, it is not much better…
If this is something that requires a lot of work, at least point me at what to check first. We really liked how Couchbase worked before these issues started happening…
Haven’t had a chance to review your logs yet, but one of the obvious first steps is to ensure that all nodes are up and healthy: make sure all vBuckets are present (1024 active and 1024 replica; in the Couchbase console > bucket name > vBucket Resources, the first row shows the vBucket count).
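If you prefer to check this programmatically rather than in the console, the bucket configuration exposes a vBucket-to-node map that can be audited directly. A rough sketch of counting active and replica vBuckets per node from such a map (the map shape is assumed here: one entry per vBucket, listing the active node index first, then replica node indexes, with -1 meaning unassigned; adjust to what your cluster actually returns):

```python
from collections import Counter

def audit_vbucket_map(vbucket_map):
    """Count active/replica vBuckets per node and report unassigned ones.

    vbucket_map: list of [active_node, replica_node, ...] entries,
    one per vBucket; -1 means no node is assigned to that slot.
    """
    active = Counter()
    replica = Counter()
    missing = []
    for vb_id, chain in enumerate(vbucket_map):
        if not chain or chain[0] == -1:
            missing.append(vb_id)  # vBucket has no active copy
            continue
        active[chain[0]] += 1
        for node in chain[1:]:
            if node != -1:
                replica[node] += 1
    return {
        "active_per_node": dict(active),
        "replica_per_node": dict(replica),
        "missing_active": missing,
        "total_active": sum(active.values()),
    }
```

On a healthy bucket with 1024 vBuckets, total_active should be exactly 1024 and missing_active empty; anything else matches the kind of “1022 of 1024” warning you are seeing.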
If you are certain that your app correctly handled all errors, and you were able to write the value successfully and read it back, the usual suspect is hardware or, in your case, VM health. I would check whether any drive or hardware faults are reported on the nodes.
thanks
And I have a feeling that some of our segfaults may be related to I/O problems. From what I have read about Couchbase and Erlang, it is very sensitive to I/O spikes, and high read/write queues can cause segfaults. After upgrading the servers it started working slightly better.