URGENT - The number of documents in my bucket dropped by 300,000

Sure!

We’re running Couchbase Server on a single node, on an AWS m3.medium instance (which I’m aware is well below the minimum requirements), and Sync Gateway on a separate m3.medium AWS instance.
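
For reference, the document counts below come from the Documents tab of the Couchbase console; the same number can be pulled from the bucket stats REST endpoint. A minimal sketch (host and credentials are placeholders for our setup, and it assumes jq is installed):

    # Read the current item count for the bucket from the Couchbase REST API.
    # Host and credentials are placeholders; the bucket name is ours.
    curl -s -u Administrator:password \
      http://couchbase-host:8091/pools/default/buckets/maisha-meds-sg \
      | jq '.basicStats.itemCount'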

First, to clarify the above, here is a clearer explanation of the sequence of events:

7:30 pm - I noticed strange Sync Gateway behaviour: changes made on one device were either only partially replicated to other devices, or not replicated at all. This seems to correspond with the memcached failure described below.
9:30 pm - After restarting Sync Gateway, I decided to check whether the changes were being reflected on the server.
9:35 pm - I initiated a backup, using a backup script that had been used successfully before (a sketch of it follows this timeline).
9:40 pm - While I was on the Documents tab of the Couchbase console, the document count dropped from 1,150,000 to 880,000.
9:45 pm - The Couchbase server crashed.
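
The backup script is essentially a wrapper around cbbackup (which matches the eq_dcpq:cbbackup-… DCP stream names in the logs below). Roughly, with the host, credentials, and backup path as placeholders:

    #!/bin/sh
    # Stream all documents in the bucket out via cbbackup.
    # Host, credentials, and backup path are placeholders.
    cbbackup http://couchbase-host:8091 /backups/$(date +%F) \
      -u Administrator -p password \
      -b maisha-meds-sg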

We’ve done a ton of digging through the logs, and here’s what appears to have happened:

  • memcached failed at roughly 11:15 GMT:

    Service 'memcached' exited with status 137. Restarting. Messages: 2017-10-09T10:42:20.087445Z WARNING 43: Slow STAT operation on connection: 703 ms ([ 127.0.0.1:51593 - 127.0.0.1:11209 (Admin) ])

  • We didn’t notice memcached had failed until I saw that replications weren’t working properly (I’m assuming the memcached failure would cause this?)

  • I went to diagnose the problem and initiated a backup, which started to pull documents into memory

  • Either just by coincidence, or because of the extra load on the system, memcached failed again at the same time as compaction failed:

    Service 'memcached' exited with status 134. Restarting. Messages: 2017-10-09T18:49:21.232229Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 379) Scheduling backfill from 1 to 458, reschedule flag : False
    2017-10-09T18:49:21.232360Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 378) Creating stream with start seqno 0 and end seqno 7
    2017-10-09T18:49:21.232411Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 378) Scheduling backfill from 1 to 7, reschedule flag : False
    2017-10-09T18:49:21.232527Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 829) Creating stream with start seqno 0 and end seqno 10
    2017-10-09T18:49:21.232581Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 829) Scheduling backfill from 1 to 10, reschedule flag : False

    Compactor for database maisha-meds-sg (pid [{type,database},
    {important,true},
    {name,<<"maisha-meds-sg">>},
    {fa,
    {#Fun<compaction_new_daemon.4.102846360>,
    [<<"maisha-meds-sg">>,
    {config,
    {30,undefined},
    {30,undefined},
    undefined,false,false,
    {daemon_config,30,131072,
    20971520}},
    false,
    {[{type,bucket}]}]}}]) terminated unexpectedly: {{{badmatch,

Our best guess for what happened is that the compactor was deleting and re-creating batches of documents, and the server crash interrupted it somewhere between a delete and the corresponding re-create, which led to the document loss.
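
For what it’s worth, the {config,{30,undefined},{30,undefined},...} term in the compactor log above looks like the auto-compaction thresholds (30% fragmentation triggers plus daemon defaults). Those settings can be read back from the REST API to confirm (host and credentials are placeholders):

    # Read the cluster-wide auto-compaction settings.
    curl -s -u Administrator:password \
      http://couchbase-host:8091/settings/autoCompaction | jq .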

Does that sound feasible?