CBL .Net v1.4.0 not retrieving latest revision

As part of a test for our client-side conflict resolution I’ve been seeing an issue with pull replications not getting the latest revision from Sync Gateway (this is with Couchbase Lite .Net versions 1.3.1 and 1.4.0 and Sync Gateway v1.2.1). It’s somewhat inconsistent, but I was able to capture a Fiddler log that seems to demonstrate the issue.

Test steps:

  1. Create a local document
  2. Replicate document to the server
  3. Stop replication
  4. GET the document directly from the Sync Gateway, make a change, and PUT it back to the gateway
  5. Repeat step 4 a random number of times
  6. Meanwhile, update the local document a random number of times
  7. Restart Push and Pull replications
  8. Wait for replications to go Idle and conflict resolution code to complete
  9. Verify conflicts were resolved as expected.
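For reference, the update cycle in step 4 can be sketched like this in Python; the base URL is the gateway address that appears in the request later in this post, while the document ID, body, and field names are made up for illustration:

```python
# Sketch of step 4: GET the current revision from Sync Gateway, change it,
# and PUT it back. The gateway rejects the PUT unless the body carries the
# _rev returned by the GET (its MVCC conflict check).
import json
from urllib.parse import quote

BASE = "http://sync-gateway.com:4984/default"  # gateway URL from this post

def put_request(doc_id, body):
    """Build the URL and JSON payload for writing an updated body back."""
    url = f"{BASE}/{quote(doc_id, safe='')}"
    return url, json.dumps(body)

# Pretend a GET just returned this body for revision 2 (illustrative values).
doc = {"_id": "doc1", "_rev": "2-abc", "counter": 1}
doc["counter"] += 1                 # the change made in step 4
url, payload = put_request(doc["_id"], doc)
```

A real run would loop this GET/modify/PUT cycle a random number of times, as in steps 4 and 5.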

Results:
When the replications are restarted, a _revs_diff call is made to the Sync Gateway. Here’s the response:

{
  "patientCareRecord_0e2e0bb0-268c-4bda-8808-35e80b3c4de4_e4ddfcfe-0bb7-4a09-9e55-5f7b585b2b68": {
    "missing": [
      "7-617c09bef0da1fa21b87dbfe663a6140"
    ],
    "possible_ancestors": [
      "4-5f3739734b446eba69a77098bedfb9c3",
      "5-fd6054d5ade9577bc5736bc09abea490",
      "1-b0b23e7616b30fcbf94853b8c78aa33e",
      "2-1f2d6d68dca3e7790a0ce47fcc0defe7",
      "3-1c1f2cfeb660b31b97df8a2ace0ac765"
    ]
  }
}
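For context, that response is the answer to a _revs_diff request of the following shape, built by the pusher; this is a minimal sketch using the document ID and revision from the response above:

```python
# The pusher POSTs a map of document ID -> revision IDs it wants to push,
# and the server answers with the ones it is missing plus possible_ancestors
# (the response shown above).
import json

doc_id = ("patientCareRecord_0e2e0bb0-268c-4bda-8808-35e80b3c4de4"
          "_e4ddfcfe-0bb7-4a09-9e55-5f7b585b2b68")
request_body = {doc_id: ["7-617c09bef0da1fa21b87dbfe663a6140"]}
payload = json.dumps(request_body)
# This would be POSTed to http://sync-gateway.com:4984/default/_revs_diff
```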

Then a GET call is made to retrieve revision 4-5f3739734b446eba69a77098bedfb9c3 instead of 5-fd6054d5ade9577bc5736bc09abea490 from the server:

http://sync-gateway.com:4984/default/patientCareRecord_0e2e0bb0-268c-4bda-8808-35e80b3c4de4_e4ddfcfe-0bb7-4a09-9e55-5f7b585b2b68?rev=4-5f3739734b446eba69a77098bedfb9c3&revs=true&attachments=true&atts_since=[%223-23720f92999a6568d53982052def39f8%22,%222-19921dd03fc6255775ece132c60d67e5%22,%221-b0b23e7616b30fcbf94853b8c78aa33e%22]

Revision 5 is eventually retrieved, but its changes are discarded by our conflict resolution code, because revision 5 was created before the new conflict-resolution revision.

I realize some changes are bound to get thrown away by the conflict resolution, but it seems like a problem that the replication is pulling an outdated revision. Is this a known issue? Any ideas for workarounds?

Hmmm, this is slightly hard to follow, but the first thing that comes to mind is to run the pull replication first, let it complete, and then run the push.

I’m not sure what is going on here. Could you share more of the conversation from Fiddler? With a title like “patientCareRecord” I imagine this is sensitive information, so that may not be possible, or may need a lot of redaction. I don’t need to know the contents of the revisions, just the conversation that took place in terms of pushing and pulling.

> run the pull replication first and let it complete and then run the push

I can try that, but what it’s trying to simulate is a scenario where our users are working in areas with intermittent connection. When they lose and regain connection, I would expect both replications to resume at the same time.

The documents themselves are just test data at this point, so nothing to redact. Unfortunately the forums won’t let me upload a zip file with the Fiddler archives because I’m a new user.

I had a look at the files I received offline and I see one problem that I will need to address. It doesn’t seem to be interfering with anything, but it’s definitely uncool (a NullReferenceException when closing a pull replicator that is running on a fresh DB).

I notice you are using Sync Gateway 1.2.1, and that your checkpoints indicate you have missed some revisions (is this test running with Sync Gateway backed by Couchbase Server?). As an initial step I’d suggest upgrading Sync Gateway to a newer version and trying this out again; during 1.4.0 testing I ran into issues like this and was told that this area has been greatly improved since 1.2.1.

Also I’d like to comment on the Fiddler conversation I received. I think you might be misunderstanding the purpose of the calls.

_revs_diff is used by the pusher to ask the server which revisions it does not yet have, so that it can then push them. The response you posted indicates that the server is missing the indicated revision, but has some revisions that could be its ancestors. The client uses this information to find the last revision that the client and server have in common. In this case they only share generation 1, so you will see the full history of this branch (the revision IDs, not the bodies) pushed to Sync Gateway in the next _bulk_docs request.
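That ancestor search can be sketched as follows; the revision IDs are shortened stand-ins for the real ones in the log:

```python
# Sketch of what the pusher does with a _revs_diff response: given the
# document's local revision history (newest first) and the server's
# possible_ancestors list, find the newest revision both sides share.

def latest_common_ancestor(local_history, possible_ancestors):
    """Return the newest local revision the server already has, or None."""
    known = set(possible_ancestors)
    for rev in local_history:          # local_history is newest -> oldest
        if rev in known:
            return rev
    return None

local = ["7-aaa", "6-bbb", "5-ccc", "1-ddd"]            # local branch
server = ["4-eee", "5-fff", "1-ddd", "2-ggg", "3-hhh"]  # possible_ancestors
common = latest_common_ancestor(local, server)  # only generation 1 matches
```

Note that generation 5 exists on both sides but with different revision IDs (different branches), so only the generation-1 revision is truly shared.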

This has no effect on the puller, which uses the reverse procedure: it requests all changes starting from a certain point (where it last stopped), and I see it requesting those changes in increasing order in the file you sent me. There is a race between the pusher and the puller. Normally that’s not a problem, but if you resolve the conflicts too early you might run into this. Note that Fiddler won’t capture the change tracker items because they are sent over a web socket instead of regular HTTP, which could be what is confusing about this flow.

The reason you see generation 5 in there is probably that, although it exists on the server, it hasn’t come through on the client’s changes feed yet for processing. Is your conflict resolution code waiting for idle before it executes, or does it run while replication is still in progress? The replicator may go idle at any time if it catches up with all the current changes. If you want to be sure you have all current changes before moving on to conflict resolution, you might be better served by a one-shot replication before resolving conflicts. Unlike continuous replication, which receives changes as they come in, a one-shot replication gets all the current changes in one batch, so when it finishes you can be confident it is caught up.
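The suggested ordering can be sketched with toy objects; this is not the actual CBL .NET API, just the shape of the idea:

```python
# Sketch of "drain all current changes with a one-shot pull, then resolve",
# instead of resolving from inside change notifications while a continuous
# pull is still catching up. These classes are stand-ins, not CBL types.

class OneShotPull:
    """Toy one-shot replication: fetches every pending change in one batch."""
    def __init__(self, pending_changes):
        self.pending = list(pending_changes)
        self.received = []

    def run(self):
        # One-shot gets all current changes and then stops; when run()
        # returns, the local DB is known to be caught up with the server.
        self.received.extend(self.pending)
        self.pending.clear()

def resolve_conflicts(received):
    # Placeholder for the app's conflict-resolution pass.
    return [f"resolved:{change}" for change in received]

pull = OneShotPull(["rev4", "rev5"])
pull.run()                       # both revisions are local before resolving
results = resolve_conflicts(pull.received)
```

With a continuous replication, `received` could still be missing `rev5` at the moment the change notification fires, which matches the behavior described above.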

So actually, I’m not sure there is any explicitly incorrect behavior here. Does the above make sense?

Ok, that helps explain things then.

We are doing the conflict resolution directly from the database change notification event. We could try waiting for the pull replicator to go idle before trying to resolve the conflict, but I’m curious if you have another suggestion for when/where to handle conflict resolutions.
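A toy version of the kind of resolution pass in question, run after the puller settles; the merge rule here (highest generation wins) is purely illustrative, not the actual policy:

```python
# Sketch of a conflict-resolution pass: among the conflicting leaf
# revisions of a document, pick a winner and mark the rest for deletion
# (tombstoning), leaving a single live branch. Revision IDs are stand-ins.

def generation(rev_id):
    """A revision ID is '<generation>-<digest>'; return the generation."""
    return int(rev_id.split("-", 1)[0])

def resolve(leaf_revisions):
    """Return (winner, losers) among conflicting leaf revision IDs."""
    ordered = sorted(leaf_revisions, key=generation, reverse=True)
    return ordered[0], ordered[1:]

winner, losers = resolve(["5-fd6054", "7-617c09", "4-5f3739"])
# The losers would then be deleted so the revision tree has one live branch.
```

The risk described in this thread is exactly that `leaf_revisions` may not yet contain every revision the server knows about when this runs too early.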

For this test itself I added a delay before restarting the replications, and now it seems to be succeeding consistently.