Data loss problem (4.0.0-4051)

We are running a 3-node cluster of Couchbase Community Edition 4.0.0-4051.

Since today, some documents seem to have disappeared. While examining the cluster nodes, I found a problem with one of them.

The memcached.log is rotating like crazy, showing thousands of entries like:

2017-03-28T21:32:37.612737+02:00 WARNING (BUCKET) Fatal error in persisting SET ``DOC-ID-1'' on vb 6!!! Requeue it...
2017-03-28T21:32:37.612804+02:00 WARNING (BUCKET) Fatal error in persisting SET ``DOC-ID-134'' on vb 6!!! Requeue it...
2017-03-28T21:32:37.612872+02:00 WARNING (BUCKET) Fatal error in persisting SET ``DOC-ID-14'' on vb 6!!! Requeue it...
2017-03-28T21:32:37.612957+02:00 WARNING (BUCKET) Fatal error in persisting SET ``DOC-ID-341'' on vb 6!!! Requeue it...
2017-03-28T21:32:37.613035+02:00 WARNING (BUCKET) Fatal error in persisting SET ``DOC-ID-94'' on vb 6!!! Requeue it...
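To check whether the failures are confined to a single vBucket (the excerpt above shows only vb 6), a small Python sketch can tally the persistence warnings per vBucket. The regular expression is modelled only on the log lines quoted here and is an assumption. Other Couchbase versions may word the message differently. The log path is passed on the command line, and the script name in the usage comment is hypothetical.

```python
import re
import sys
from collections import Counter

# Pattern modelled on the quoted memcached.log lines (an assumption, not a spec):
#   ... Fatal error in persisting SET ``DOC-ID-1'' on vb 6!!! Requeue it...
PERSIST_FAILURE = re.compile(r"Fatal error in persisting (\w+) ``(.+?)'' on vb (\d+)")


def count_failures_by_vbucket(lines):
    """Return a Counter mapping vbucket id -> number of persistence failures."""
    counts = Counter()
    for line in lines:
        match = PERSIST_FAILURE.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python count_persist_failures.py <path-to-memcached-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for vb, n in sorted(count_failures_by_vbucket(log).items()):
            print(f"vb {vb}: {n} persistence failures")
```

Fed only the five lines above, it reports vb 6 with 5 failures. If every warning points at the same vBucket, that vBucket's data file is the natural first suspect.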

Trying to run cbtransfer results in the following error:

error: could not read couch store file: /data/couchbase/data/BUCKET/6.couch.14; exception: malformed data in file

This particular file is a JSON document:

{
   "ep_max_checkpoints" : "2",
   "ep_tap_queue_fill" : "0",
   "ep_flushall_enabled" : "1",
   "ep_tap_backlog_limit" : "5000",
   "mem_used" : "8752200",
   "ep_tap_queue_backfillremaining" : "0",
   "ep_chk_persistence_timeout" : "10",
   "vb_pending_queue_size" : "0",
   "vb_pending_ops_create" : "0",
   "ep_dcp_count" : "0",
   "ep_alog_sleep_time" : "1440",
...
   "ep_item_eviction_policy" : "value_only",
   "ep_vb_total" : "0",
   "ep_total_new_items" : "0",
   "vb_replica_meta_data_memory" : "0",
   "vb_replica_ops_create" : "0",
   "ep_tap_bg_fetched" : "0",
   "vb_replica_queue_fill" : "0",
   "ep_diskqueue_fill" : "0",
   "ep_max_num_workers" : "3"
}

All other files with this naming scheme are binary data files. The file’s date corresponds to a reboot of the cluster.

What is going on here? Is there a way to fix this problem?

Thanks in advance!
Stefan

You’ve encountered some form of filesystem or hardware-related error. I suggest you check your OS logs for any reported issues. Running fsck (or an equivalent check) on the filesystem is also recommended.

Having said all that, if one or more of your X.couch.Y files shows up as JSON, then it’s pretty badly corrupted (they should be couchstore files), and you’ll likely need to recover from your last backup.
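To see whether any other data files share the problem, the bucket directory can be scanned for files that look like JSON text instead of binary couchstore data. The Python sketch below is a heuristic under stated assumptions, not a couchstore parser. It assumes data files are named in the vbucket.couch.revision form seen in 6.couch.14, and it flags a file only when its first bytes (after whitespace) are a {, on the assumption that a healthy couchstore file never starts that way. The script name in the usage comment is hypothetical.

```python
import re
import sys
from pathlib import Path
from typing import Iterator, Optional, Tuple

# Assumed naming scheme "<vbucket>.couch.<revision>", as in 6.couch.14 above.
COUCH_FILE = re.compile(r"(\d+)\.couch\.(\d+)")


def parse_couch_name(name: str) -> Optional[Tuple[int, int]]:
    """Return (vbucket, revision) for a data-file name, or None if it doesn't match."""
    match = COUCH_FILE.fullmatch(name)
    return (int(match.group(1)), int(match.group(2))) if match else None


def starts_like_json(head: bytes) -> bool:
    """Heuristic only: couchstore files are binary, so a leading '{' is suspicious."""
    return head.lstrip().startswith(b"{")


def find_suspect_files(bucket_dir: str) -> Iterator[Tuple[int, int, Path]]:
    """Yield (vbucket, revision, path) for data files whose first bytes look like JSON."""
    for path in sorted(Path(bucket_dir).iterdir()):
        parsed = parse_couch_name(path.name)
        if parsed is None or not path.is_file():
            continue
        with open(path, "rb") as f:
            if starts_like_json(f.read(64)):
                yield parsed[0], parsed[1], path


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python find_json_couch_files.py /data/couchbase/data/BUCKET
    for vb, rev, path in find_suspect_files(sys.argv[1]):
        print(f"vb {vb}, revision {rev}: {path} starts like JSON, not a couchstore file")
```

A clean result does not prove the other files are intact. It only rules out this particular kind of damage, so the OS log and fsck checks are still worth doing.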