Clearing replication state safely

When replicating two instances of SG with each other, in both directions, how is the state of these replications recorded? I’m wondering what state is kept where, in particular for this situation:

If db A is pulling from and pushing to db B, how safe is it to stop A, flush all its buckets and start it up again? Will this confuse B, which is holding some record of the replications? Or should it be fine, with the new A just catching up with B as before?

I guess I’m asking: where does the replication process at A put its “since” values? Always in the target db? In its own db? In both dbs? Or to a file and not in the buckets at all? And hence what would I need to delete to make this behaviour safe?

Or is this something I shouldn’t hope to be able to do?

Thanks

Paul

@paulharter

The replicator stores a checkpoint doc on the target DB; in Couchbase Server these docs have a _sync:local: name prefix.

If the local checkpoint doc is missing on the target DB when a replication starts, the replication will start from sequence 0.

There is no state stored to local disk on the Sync Gateway instances.
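
If you want to look at those checkpoint docs yourself, something like the sketch below works against the query service. This is just a rough illustration in Python: the Administrator/password credentials, the default ports and the primary index on the bucket are assumptions on my part, not part of any official tooling.

# Rough sketch: list the replicator checkpoint docs held in a bucket.
# Assumes the query service on its default port (8093), a primary index
# on the bucket, and Administrator/password credentials.
import requests

stmt = ("SELECT META(b).id AS id, b.lastSequence "
        "FROM `bucket-2` AS b "
        "WHERE META(b).id LIKE '_sync:local:%'")

resp = requests.post("http://localhost:8093/query/service",
                     json={"statement": stmt},
                     auth=("Administrator", "password"))
resp.raise_for_status()

for row in resp.json().get("results", []):
    print(row["id"], "lastSequence =", row.get("lastSequence"))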

I ran a scenario similar to yours using a single Sync Gateway instance with two buckets. I did not see any issues in this test; here is my config:

{
    "log": ["*"],
    "adminInterface":"0.0.0.0:4985",
    "replications": [
        {"source":"http://localhost:4985/source/", "target":"http://localhost:4985/target/", "continuous":true, "replication_id":"continuousA-B"},
        {"source":"http://localhost:4985/target/", "target":"http://localhost:4985/source/", "continuous":true, "replication_id":"continuousB-A"}
    ],
    "databases": {
        "source": {
            "server": "http://localhost:8091",
            "bucket":"bucket-1",
            "users": {
                "GUEST": {"disabled": false, "admin_channels": []}
            }
        },
        "target": {
            "server": "http://localhost:8091",
            "bucket":"bucket-2",
            "users": {
                "GUEST": {"disabled": false, "admin_channels": []}
            }
        }
    }
}

I flushed both buckets before starting SG.

I added two documents to DB A and two documents to DB B. All 4 docs were replicated to both DBs A and B.

In each CBS bucket there was a single _sync:local: doc with the following content:

{
  "_rev": "0-5",
  "lastSequence": "5"
}

I shut down SG and flushed bucket A.

After restarting SG, DB A contained all 4 docs.

In the CBS bucket for SG DB B the _sync:local: doc was unchanged; in the CBS bucket for SG DB A the _sync:local: doc content was:

{
  "_rev": "0-1",
  "lastSequence": "5"
}

Hi @andy,

Thanks for this. It’s exactly what I’ve been wondering about.

So each target db holds a checkpoint which records the last replicated sequence from another db. That means a checkpoint doc in the target for every source db, each referencing a sequence number in that source db.

This makes sense and I see how after flushing SG DB A would recover all 4 docs.

But DB B is left holding a checkpoint doc for DB A that doesn’t correspond to DB A’s new sequence numbers. If this checkpoint is higher than the current sequence number, new documents added to A will not replicate to B until the old checkpoint is surpassed. I have seen something like this happen.
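
As a sanity check for that case, I can compare the checkpoint’s lastSequence in B’s bucket (bucket-2 in the config above) against A’s current sequence from the admin API, along these lines (Python again, with the same assumptions about ports, credentials and a primary index as the sketch above):

# Compare the checkpoint held in the target bucket with the source DB's
# current sequence as reported by the Sync Gateway admin API.
import requests

AUTH = ("Administrator", "password")

# Current sequence of the source DB (SG admin port).
source_seq = int(requests.get("http://localhost:4985/source/").json()["update_seq"])

# lastSequence recorded in the checkpoint doc(s) held in the target bucket.
stmt = ("SELECT META(b).id AS id, b.lastSequence FROM `bucket-2` AS b "
        "WHERE META(b).id LIKE '_sync:local:%'")
rows = requests.post("http://localhost:8093/query/service",
                     json={"statement": stmt},
                     auth=AUTH).json().get("results", [])

for row in rows:
    if int(row["lastSequence"]) > source_seq:
        print(row["id"], "is ahead of the source:",
              row["lastSequence"], ">", source_seq)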

So lots more questions.

How are the checkpoint docs tied to each source DB? Is there an identifier in the source bucket that matches the target checkpoint? If so, there would be no problem, since once flushed they wouldn’t match; but if something like a machine id is used, then replications from A to B will break.

Maybe the solution is to be able to delete the correct checkpoint doc?

I will do some tests.

Thanks

Paul

Hi @andy

I’ve done a couple of tests.

It looks like the problem I was concerned about doesn’t happen. The flushed source database will successfully create a new checkpoint with the correct lower seq number.

However, any docs that are added to A after it is flushed, but before a new successful replication to B, will not then be replicated to B. If the checkpoint in B is deleted at the same time as flushing A, then all documents are replicated correctly.

So, although not for the reason I thought, it is safer to delete the target’s checkpoint at the same time as flushing the source DB.
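
For the record, the manual workflow this implies looks roughly like the sketch below (Python, reusing the bucket names, ports and credentials from the config earlier in the thread; the flush endpoint needs flush enabled on the bucket, and the delete assumes a primary index on the target bucket):

# Rough sketch: flush the source bucket and drop the stale checkpoint docs
# from the target bucket, with Sync Gateway already stopped.
import requests

AUTH = ("Administrator", "password")

# 1. Flush the source bucket (flush must be enabled on the bucket).
requests.post("http://localhost:8091/pools/default/buckets/bucket-1/controller/doFlush",
              auth=AUTH).raise_for_status()

# 2. Delete the replication checkpoint docs held in the target bucket.
stmt = ("DELETE FROM `bucket-2` "
        "WHERE META().id LIKE '_sync:local:%' "
        "RETURNING META().id")
resp = requests.post("http://localhost:8093/query/service",
                     json={"statement": stmt}, auth=AUTH)
resp.raise_for_status()
print("Deleted checkpoints:", resp.json().get("results", []))

# 3. Restart Sync Gateway so both replications start from sequence 0.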

I will raise this as an issue, as it doesn’t seem ideal.

Thanks

P