Data lost after node stop


#1

Hello,

We have the following scenario:
Two Couchbase Server nodes (2.2.0-837-rel-community) running on EC2, with the data stored on instance store volumes. The cluster has one bucket with a replica factor of 1.

So we start with the following number of items:
x.x.x.1 28.8 M/28.8 M
x.x.x.2 28.8 M/28.8 M

We stopped one of the nodes (x.x.x.2), which destroys its instance store volume. After some time we started the server again and recreated the instance store volumes.

After the start, the cluster automatically begins the recovery process, but it immediately crashes with:
Port server ns_server on node 'babysitter_of_ns_1@127.0.0.1' exited with status 1. Restarting. Messages:
Apache CouchDB 1.2.0a-01dda76-git (LogLevel=info) is starting.
/opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.
Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,ns_server,{shutdown,{ns_server,start,[normal,[]]}}}"}

Crash dump was written to: erl_crash.dump.1408964317.1055
Kernel pid terminated (application_controller) ({application_start_failure,ns_server,{shutdown,{ns_server,start,[normal,[]]}}})

After this, the replica data on the node that was not restarted is lost. Since the active data on the restarted node was already lost when its instance store volume was destroyed, half of the data goes missing. We end up with:
10.x.x.1 1 /28.8 M
10.x.x.2 28.8 M/1
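
To make the "half the data" arithmetic explicit, here is a rough sketch in Python. It is only a toy model, not anything Couchbase runs: the function name and parameters are made up for illustration, and it assumes that in a two-node cluster with one replica, each node's replica set mirrors the other node's active set.

```python
def fraction_lost(items_per_node, node2_active_lost, node1_replica_lost):
    """Fraction of the bucket's unique items with no surviving copy.

    Hypothetical helper for illustration only. Assumes two nodes, one
    replica, and that node 1's active items are untouched.
    """
    total = 2 * items_per_node
    # Node 2's active set has two copies: the active copy on node 2 and
    # the replica copy on node 1. It is gone only if both copies are lost.
    node2_set_gone = node2_active_lost and node1_replica_lost
    lost = items_per_node if node2_set_gone else 0.0
    return lost / total


# Counts in millions of items, taken from the numbers above.
print(fraction_lost(28.8, node2_active_lost=True, node1_replica_lost=True))
print(fraction_lost(28.8, node2_active_lost=True, node1_replica_lost=False))
```

The first call models what we observed (both copies of node 2's active set gone); the second models what we expected, where the replica on the surviving node is kept and nothing is lost.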

What is strange is that the active items on the remaining node are replicated again, but the data in its replica is not copied back to the active side on the other node.

Has anyone else seen this behavior? Is this normal? I expect that a node failover would prevent this, but I don’t know whether it is expected that data flows from active to replica while data in a replica does not flow back to the active side on the other node.

Thank you