Couchbase Lite Inconsistent Replication (Xamarin/.NET 1.4)

Hey guys,

I have been getting some unexpected behavior with continuous replication in my project lately, and I’m hoping someone can point out an obvious issue with what I am trying to do. Apologies if this gets a bit wordy:

Specs:

  • CB Lite 1.4 for .NET, used in a Xamarin project targeting iOS and Android (I’m pretty sure the behavior occurs on both, but I have been focusing on iOS lately, where it happens pretty consistently)
  • Sync Gateway 1.4.1-3
  • Couchbase Server Community Edition 4.5.1-2845
  • 1 database in continuous PULL replication, 1 database in continuous PUSH/PULL replication
  • basic authentication

Architecture Overview:
The basic layout is a mobile app that works with a middleware via Couchbase. We are doing a sort of CQRS design (command-query responsibility segregation): the mobile app sends a “command document” to Couchbase, the middleware picks it up and processes the command in the backend, and then the middleware updates the appropriate data document in Couchbase, which is synced back to the mobile app. In short, there is a command database that is push/pull replicated to Couchbase, and a document database that is pull replicated to the device.
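To make the flow concrete, here is a minimal in-memory sketch of the command/data round trip in Python. It is only an illustration of the design described above, not our actual middleware: the field names (`state`, `target`, `value`, `generation`) and both function names are made up for the example, and plain dicts stand in for the two databases.

```python
def make_command(command_id, doc_id, new_value):
    """Build a hypothetical command document as the mobile app might push it."""
    return {
        "_id": command_id,
        "type": "command",
        "state": "pending",
        "target": doc_id,      # ID of the data document the command affects
        "value": new_value,
    }


def middleware_process(command_db, data_db):
    """Process every pending command, then update the data document it targets."""
    processed = []
    for command in command_db.values():
        if command["state"] != "pending":
            continue
        # First write: the command document is marked as completed.
        command["state"] = "completed"
        # Second write: the data document is updated from the backend result.
        doc = data_db.setdefault(
            command["target"], {"_id": command["target"], "generation": 0}
        )
        doc["value"] = command["value"]
        doc["generation"] += 1
        processed.append(command["_id"])
    return processed
```

The app only ever writes to `command_db` and only ever reads from `data_db`, which is why one database is push/pull and the other is pull only.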

Issue:
If I test my app against a sandbox data bucket with no middleware processing, replication all seems fine - I can update the data documents and they are all replicated back to the device as intended. In this case I am not sending command documents since there is no middleware to process them; I am just using Postman to update the data documents the way the middleware would and verifying my app replicates the new revisions. However, once I test this on the data bucket that has the middleware hooked up, it seems I begin to lose replications periodically. This screenshot shows the sync gateway log output when this happens:

The app’s username is in blue and the doc ID is in purple. It looks like every time my app sends a command document to Couchbase there is a POST _revs_diff, a POST _bulk_docs, and a PUT _local. Then, when the middleware processes the command, there is a POST _bulk_docs to update the command document to the completed state, followed by a second POST _bulk_docs to update the actual data document from the backend.

In the successful case, there is a replication attempt by my app after the middleware does its second POST, but in some cases there is no replication attempt for the new document revision. You can see in the screenshot above, the first successful attempt results in revision 101 of the data document which my app pulls down, then the second attempt has no revision request, then the third attempt has the replication and we see the data document is actually on revision 103 now. In between these attempts I can verify that revision 102 was successful, as the data was updated in the backend and the document is on revision 102 in Sync Gateway, it just never makes its way to my app until revision 103. Note sometimes this can span multiple revisions, for example going from revision 51 to 55 before my app starts getting the updates.

Question:
Is there any potential issue you guys see here? For example, is the idea of pushing some documents to be processed by a middleware and then very soon after pulling down updated documents flawed? Or maybe the fact that the middleware uses the ADMIN REST API is causing our app’s PUBLIC requests to be lost in some cases, either due to a race condition or just priority of ADMIN over PUBLIC processing. Or is it just bad in general to have multiple users updating Sync Gateway at the same time? I would assume not, since this needs to scale up to multiple users; I’m just curious if there are some “best practices” I might be missing.

Hope that doesn’t come across too lengthy or vague, I’m just really struggling to figure out this replication issue and not sure where to begin troubleshooting.

Thanks!

@camato
The SG logs are only showing part of the story.

Could you set the SG logs to:
"log":["*"]
That way you can see how the _changes feed is processing the data.

PULL: below are the basic steps of how PULL works.

  1. GET _local/checkpoint to read the last seq it processed.
  2. POST _changes?since=seq_from_checkpoint
  3. POST _bulk_get for the docs and revs that CBL wants from the _changes feed.
  4. PUT _local/checkpoint with the newest seq it processed.
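As a rough illustration of those four pull steps, here is a small Python sketch that runs them against an in-memory stand-in for Sync Gateway. Nothing here talks to a real server: `FakePullGateway`, `pull_once`, and the integer revision counter are all invented for the example, and the comments only note which REST call each method is meant to mimic.

```python
class FakePullGateway:
    """In-memory stand-in for the server side of a pull (hypothetical)."""

    def __init__(self):
        self.seq = 0
        self.docs = {}         # doc_id -> (seq, rev, body)
        self.checkpoints = {}  # checkpoint_id -> last processed seq

    def write(self, doc_id, body):
        """Simulate another writer (e.g. the middleware) updating a doc."""
        self.seq += 1
        rev = self.docs.get(doc_id, (0, 0, None))[1] + 1
        self.docs[doc_id] = (self.seq, rev, body)

    def get_checkpoint(self, checkpoint_id):      # mimics GET _local/checkpoint
        return self.checkpoints.get(checkpoint_id, 0)

    def changes(self, since):                     # mimics POST _changes?since=...
        rows = [(seq, doc_id, rev)
                for doc_id, (seq, rev, _) in self.docs.items() if seq > since]
        return sorted(rows)

    def bulk_get(self, wanted):                   # mimics POST _bulk_get
        return {doc_id: (rev, self.docs[doc_id][2])
                for doc_id, rev in wanted if self.docs[doc_id][1] == rev}

    def put_checkpoint(self, checkpoint_id, seq):  # mimics PUT _local/checkpoint
        self.checkpoints[checkpoint_id] = seq


def pull_once(gateway, local_docs, checkpoint_id):
    """Run one pull pass and return the seq stored in the checkpoint."""
    since = gateway.get_checkpoint(checkpoint_id)        # step 1
    rows = gateway.changes(since)                        # step 2
    if not rows:
        return since                                     # nothing new to fetch
    wanted = [(doc_id, rev) for _, doc_id, rev in rows]
    local_docs.update(gateway.bulk_get(wanted))          # step 3
    last_seq = rows[-1][0]
    gateway.put_checkpoint(checkpoint_id, last_seq)      # step 4
    return last_seq
```

With "log":["*"] enabled, each of these steps should show up as its own request in the SG log, so a pull that never issues the _changes or _bulk_get request after a write is the thing to look for.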

PUSH: below is the specific method for how PUSH works.
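Going only by the requests visible in the log description earlier in the thread (POST _revs_diff, POST _bulk_docs, PUT _local), one push pass can be sketched roughly like this in Python. This is an in-memory illustration under that assumption, not the replicator's real code: `FakePushGateway`, `push_once`, and the sample revision IDs are all made up.

```python
class FakePushGateway:
    """In-memory stand-in for the server side of a push (hypothetical)."""

    def __init__(self):
        self.revs = {}         # doc_id -> set of revision IDs already stored
        self.checkpoints = {}  # checkpoint_id -> last pushed local seq

    def revs_diff(self, candidates):               # mimics POST _revs_diff
        """Return only the revisions the gateway does not have yet."""
        missing = {}
        for doc_id, rev_ids in candidates.items():
            known = self.revs.get(doc_id, set())
            unknown = [rev for rev in rev_ids if rev not in known]
            if unknown:
                missing[doc_id] = unknown
        return missing

    def bulk_docs(self, docs):                     # mimics POST _bulk_docs
        for doc_id, rev_id in docs:
            self.revs.setdefault(doc_id, set()).add(rev_id)

    def put_checkpoint(self, checkpoint_id, seq):  # mimics PUT _local/checkpoint
        self.checkpoints[checkpoint_id] = seq


def push_once(gateway, local_changes, checkpoint_id):
    """Push a list of (local_seq, doc_id, rev_id); return how many revs were sent."""
    if not local_changes:
        return 0
    candidates = {}
    for _, doc_id, rev_id in local_changes:
        candidates.setdefault(doc_id, []).append(rev_id)
    missing = gateway.revs_diff(candidates)        # ask what the server lacks
    to_send = [(doc_id, rev) for doc_id, revs in missing.items() for rev in revs]
    gateway.bulk_docs(to_send)                     # upload only those revisions
    gateway.put_checkpoint(checkpoint_id, local_changes[-1][0])
    return len(to_send)
```

Pushing the same changes twice sends nothing the second time, because the _revs_diff step filters out revisions the server already has.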

Also, if you can, try using SG 1.5 BETA2 and CB 5.x BETA2 …

  1. Replication is a lot faster.
  2. There are some known race conditions and edge cases that this build has addressed.

Oh wow, just upgraded to SG 1.5 Beta 2 and CB Server 4.6 DP (couldn’t find a 5.X Beta 2 for the Community Edition). Not only can I not seem to reproduce the issue I described, but all of the replication is noticeably faster as well.

Thanks for the suggestion! Can’t believe I didn’t try that first - I checked for new versions but didn’t see the pre-release downloads for Community Edition.

Thanks again. I’ll update this post if the problem comes back, but so far it’s looking promising.