CB Community Ed 5.1.1 data loss

In one particular environment (and only there) we’re experiencing loss of data.

I know our environment is not officially supported by Couchbase, but please let me ask anyway:

CB Community Ed 5.1.1
OS: OpenBSD 6.3 (we ported CB ourselves)
Number of objects in DB: < 300,000
Average object size: <1k
2 indexes
12 views
4 CPUs
8GB RAM

We have not yet found any trigger for the problem. When it occurs, CB first seems to put a high load on the machine, and then the data loss follows. Afterwards, we find the following in the logs:

######################################## memcached.log.000014.txt: unmodified: line 70309

2019-07-08T07:11:26.090852+02:00 WARNING (BUCKET) Slow runtime for 'Connection Manager' on thread nonIO_worker_0: 1528 ms
2019-07-08T07:11:29.657490+02:00 WARNING (BUCKET) Slow runtime for 'Updating stat snapshot on disk' on thread writer_worker_1: 6052 ms
2019-07-08T07:11:32.481294+02:00 NOTICE 81: (BUCKET) DCP (Producer) eq_dcpq:secidx:getseqnos-10:8a:42:16:94:8f:c7:37 - Removing connection 0xcf45c920b78
2019-07-08T07:11:41.377124+02:00 NOTICE 81: DCP connection opened successfully. flags:{} [ 127.0.0.1:34215 - 127.0.0.1:11210 (@index) ]
2019-07-08T07:11:44.193807+02:00 WARNING (BUCKET) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 559 ms

######################################## indexer.log: unmodified: line 1552

2019-07-10T13:16:13.467+02:00 [ERRO][FDB] docio_read_doc_key error: read failure on offset 22220800 in a database file '/opt/couchbase/var/lib/couchbase/data/@2i/BUCKET_#primary_295915238280948878_0.index/data.fdb.432' : FDB status -5, lastbid 0x1531, curblock 0xffffffffffffffff, curpos 0x0

(hex) 0x240277000, 4096 (0x1000) bytes
0000 08 08 00 01 00 01 00 5b ff ff ff ff ff ff ff ff …[…
0010 00 00 00 00 00 00 01 90 00 00 00 00 01 90 41 a3 …A.
0020 00 00 00 00 00 00 01 91 00 00 00 00 01 70 01 1e …p…
0030 00 00 00 00 00 00 01 92 00 00 00 00 01 72 b1 12 …r…

######################################## babysitter.log.1: unmodified: line 3123

** Reason for termination ==
** {abnormal,131}

[error_logger:error,2019-07-08T07:11:12.521+02:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.648.420>
registered_name:
exception exit: {abnormal,131}
in function gen_server:terminate/6 (gen_server.erl, line 744)
ancestors: [<0.645.420>,<0.647.420>,ns_child_ports_sup,ns_babysitter_sup,<0.55.0>]
messages: [{'EXIT',<0.654.420>,normal}]
links: [<0.645.420>]
dictionary:
########################################

After this, we’ve lost a substantial amount of data.

Any hints on what we could do to examine and work around this?

Cool to know you’ve ported to OpenBSD!

So the error you highlighted is in the indexer (secondary indexing), which holds data derived from the Data Service. While you might have lost index definitions, you shouldn’t have lost any of the underlying data - deleting and rebuilding the indexes should result in the same state.
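
For example, dropping and recreating a named GSI index can be driven through the query service. A rough sketch using curl against the N1QL REST endpoint (bucket, index and field names below are placeholders; adjust host and credentials for your cluster):

  # Drop a (placeholder) secondary index, then recreate it; the Data Service
  # streams the documents back to the indexer so it rebuilds from scratch.
  curl -s -u Administrator:password http://localhost:8093/query/service \
       --data-urlencode 'statement=DROP INDEX `BUCKET`.`idx_example` USING GSI'
  curl -s -u Administrator:password http://localhost:8093/query/service \
       --data-urlencode 'statement=CREATE INDEX idx_example ON `BUCKET`(some_field) USING GSI'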

In terms of what the problem is here, it’s very hard to say. Given it’s in the low-level indexer kvstore (ForestDB), which I assume you also ported to OpenBSD, it could be an existing bug (already fixed upstream - note that 5.1.1 is very old now), or some issue in your port.

I can’t really give much concrete advice, but if you’re already building from source I’d certainly suggest upgrading to at least the latest 5.5 release, or even the current release (6.0).

Thanks a lot for your reply!

“deleting and rebuilding the indexes should result in the same state”

It looks like we can’t do that. The indexer process dies and gets restarted every second or so; in the process list we can see it gets a new process ID at least every second. On the CB web interface we get: “Warning: Cannot communicate with indexer process. Information on indexes may be stale. Will retry.”

We tried issuing “DROP PRIMARY INDEX ON BUCKET;”, but with the indexer not running it was no surprise that we didn’t get a result.

Any idea how we could examine this further?

The indexer isn’t really my area of expertise. You could try moving the indexer’s data files (the ones mentioned in the error message above) aside. I only say don’t delete them in case you want to try to recover them later, or in case the indexer gets upset when the files are just moved away.

I’d start with moving /opt/couchbase/var/lib/couchbase/data/@2i/BUCKET_#primary_295915238280948878_0.index to some other location.

(As you can probably tell, everything under /opt/couchbase/var/lib/couchbase/data/@2i/ is secondary index data.)
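
Roughly along these lines - a sketch only, since I don’t know how your OpenBSD port stops and starts the server, and the destination directory is just an example:

  # 1. Stop the Couchbase service (using whatever rc/init mechanism your port
  #    provides) so the indexer isn't holding the files open.
  # 2. Move the suspect index directory aside instead of deleting it:
  mkdir -p /var/tmp/2i-aside
  mv '/opt/couchbase/var/lib/couchbase/data/@2i/BUCKET_#primary_295915238280948878_0.index' \
     /var/tmp/2i-aside/
  # 3. Start the service again and watch indexer.log to see whether the
  #    indexer still crash-loops.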

“as mentioned in the error message above”

I don’t understand. Which error message are you referring to?

So I tried what you proposed:

  1. stopped CB
  2. deleted all files in @2i
  3. started CB
  4. CREATE PRIMARY INDEX ON BUCKET USING GSI;

And I could successfully access a document that seemed to be unavailable when the machine was in the error state! That’s great! Thank you so much!!! I haven’t checked yet whether all the data is back and nothing is lost, but we’ll keep examining this on Monday.
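
To check for loss, we’ll probably compare the bucket’s item count from the REST API with a COUNT over the freshly built primary index - a rough sketch (credentials and bucket name are placeholders for our setup):

  # Item count as the Data Service reports it (ns_server REST API):
  curl -s -u Administrator:password \
       http://localhost:8091/pools/default/buckets/BUCKET | grep -o '"itemCount":[0-9]*'
  # Count of documents reachable through the new primary index:
  curl -s -u Administrator:password http://localhost:8093/query/service \
       --data-urlencode 'statement=SELECT COUNT(*) AS cnt FROM `BUCKET`'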

The only thing left to do now is to detect when the machine gets into this error state, immediately stop our backend (so that no false “document not known” messages are sent out), stop CB, delete the index files, start CB, re-create the indexes, and restart our backend. That all sounds feasible. I don’t know yet how to reliably detect that the machine has entered this error state, but I think we should be able to figure something out. We have a snapshot of the machine in its error state. Worst case, we’ll write another watchdog that checks the logs for repeated indexer crashes…
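
A rough sketch of what such a watchdog might look like (entirely our own idea, nothing official; the indexer path, threshold and interval are guesses we would still have to tune):

  #!/bin/sh
  # Flag the error state when the indexer PID keeps changing between samples,
  # i.e. the process is stuck in a restart loop.
  prev=""
  churn=0
  while true; do
      pid=$(pgrep -f /opt/couchbase/bin/indexer | head -n 1)
      if [ -n "$prev" ] && [ -n "$pid" ] && [ "$pid" != "$prev" ]; then
          churn=$((churn + 1))
      else
          churn=0
      fi
      prev="$pid"
      if [ "$churn" -ge 5 ]; then
          logger -t cb-watchdog "indexer restart loop detected"
          # here: stop backend, stop CB, clear @2i, start CB,
          # re-create indexes, restart backend (as described above)
          churn=0
      fi
      sleep 10
  done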

Oh, did I say “Thank you”?
Cannot thank you enough!
T.