Replication sometimes not working

I am using Couchbase Lite 1.3.2 (build 6) and am having occasional problems with replication. The following log is the result of a client creating a document (id 2cf3c9e5...), and then a server immediately makes a change to it. The client doesn’t receive the change for some reason. It doesn’t happen often, but is definitely happening once and a while. If I select the document from the client database it shows rev 1-3d618.... Here’s the log from the sync gateway, and below is the document. Any ideas?

23:20:08.921717 2016-11-24T23:20:08.921+11:00 CRUD+: Invoking sync on doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" rev 1-3d618c5092d04e301ab40558086bb068
23:20:08.924250 2016-11-24T23:20:08.924+11:00 CRUD: 	Doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" in channels "{ch_private_559241a1-b27c-484d-b090-093e470d0104, ch_private_6ff22496-9fc0-47d6-8185-221ea4d903aa, ch_private_7f3b03b8-340d-4d41-80f9-d4ebec187216}"
23:20:08.924372 2016-11-24T23:20:08.924+11:00 Cache: SAVING #431351
23:20:08.924768 2016-11-24T23:20:08.924+11:00 CRUD: Stored doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" / "1-3d618c5092d04e301ab40558086bb068"
23:20:08.924814 2016-11-24T23:20:08.924+11:00 HTTP+: #6834:     --> 201   (6.1 ms)
23:20:08.924968 2016-11-24T23:20:08.924+11:00 Events+: Event queue worker sending event Document change event for doc id: 2cf3c9e5-c28f-4662-ba2e-8c0757dac42b to: Webhook handler [http://192.168.0.13:8088/pomochat/webhook]
23:20:08.926983 2016-11-24T23:20:08.926+11:00 Events+: Webhook handler ran for event.  Payload  posted to URL http://192.168.0.13:8088/pomochat/webhook, got status 200 OK
23:20:08.930742 2016-11-24T23:20:08.930+11:00 HTTP:  #6835: POST /sync_gateway/_bulk_get  (as admin)
23:20:08.931277 2016-11-24T23:20:08.931+11:00 HTTP+: #6835:     --> 200   (1.1 ms)
23:20:08.934673 2016-11-24T23:20:08.934+11:00 HTTP:  #6836: POST /sync_gateway/_bulk_docs  (as admin)
23:20:08.935474 2016-11-24T23:20:08.935+11:00 CRUD+: Invoking sync on doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" rev 2-082290ceadbbc2281b13051962cd50e8
23:20:08.936314 2016-11-24T23:20:08.936+11:00 CRUD+: Saving old revision "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" / "1-3d618c5092d04e301ab40558086bb068" (781 bytes)
23:20:08.936523 2016-11-24T23:20:08.936+11:00 CRUD+: Backed up obsolete rev "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b"/"1-3d618c5092d04e301ab40558086bb068"
23:20:08.936647 2016-11-24T23:20:08.936+11:00 Cache: SAVING #431352
23:20:08.936915 2016-11-24T23:20:08.936+11:00 CRUD: Stored doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" / "2-082290ceadbbc2281b13051962cd50e8"
23:20:08.936957 2016-11-24T23:20:08.936+11:00 HTTP+: #6836:     --> 201   (2.8 ms)
23:20:08.937059 2016-11-24T23:20:08.937+11:00 Events+: Event queue worker sending event Document change event for doc id: 2cf3c9e5-c28f-4662-ba2e-8c0757dac42b to: Webhook handler [http://192.168.0.13:8088/pomochat/webhook]
23:20:08.938906 2016-11-24T23:20:08.938+11:00 Events+: Webhook handler ran for event.  Payload  posted to URL http://192.168.0.13:8088/pomochat/webhook, got status 200 OK

The doc:

{
  "owner": "6ff22496-9fc0-47d6-8185-221ea4d903aa",
  "server_timestamp_utc": "2016-11-24T12:20:08.924268540Z",
  "read_receipts_by_owner": {
    "7f3b03b8-340d-4d41-80f9-d4ebec187216": "dda22395-7968-4977-b31a-6547a237baeb",
    "559241a1-b27c-484d-b090-093e470d0104": "1aa1eafe-d04b-49b0-8065-c1329a291c2a"
  },
  "audit": {
    "user_id": "admin",
    "client_timestamp_utc": "2016-11-24T12:20:08.932Z"
  },
  "_sync": {
    "time_saved": "2016-11-24T23:20:08.936535326+11:00",
    "sequence": 431352,
    "rev": "2-082290ceadbbc2281b13051962cd50e8",
    "history": {
      "channels": [
        [
          "ch_private_559241a1-b27c-484d-b090-093e470d0104",
          "ch_private_6ff22496-9fc0-47d6-8185-221ea4d903aa",
          "ch_private_7f3b03b8-340d-4d41-80f9-d4ebec187216"
        ],
        [
          "ch_private_559241a1-b27c-484d-b090-093e470d0104",
          "ch_private_6ff22496-9fc0-47d6-8185-221ea4d903aa",
          "ch_private_7f3b03b8-340d-4d41-80f9-d4ebec187216"
        ]
      ],
      "revs": [
        "1-3d618c5092d04e301ab40558086bb068",
        "2-082290ceadbbc2281b13051962cd50e8"
      ],
      "parents": [
        -1,
        0
      ]
    },
    "channels": {
      "ch_private_6ff22496-9fc0-47d6-8185-221ea4d903aa": null,
      "ch_private_7f3b03b8-340d-4d41-80f9-d4ebec187216": null,
      "ch_private_559241a1-b27c-484d-b090-093e470d0104": null
    },
    "recent_sequences": [
      431351,
      431352
    ]
  },
  "chat_group_id": "661322acc19a10786c0adac0e8e3c52a",
  "state": null,
  "delivery_receipts_by_owner": {
    "7f3b03b8-340d-4d41-80f9-d4ebec187216": "5c4da70c-7476-4899-a68c-94cba50814f9",
    "559241a1-b27c-484d-b090-093e470d0104": "e6588d3b-3141-4e07-a5b1-baa3147d9f2d"
  },
  "type": "chat-message",
  "content": {
    "message": "Sdd",
    "type": "text"
  },
  "attached_document_ids": [],
  "client_timestamp_utc": "2016-11-24T12:20:08.366Z"
}

On the second insert (2-...) I would expect the logs to show which channel the document was mapped to. Something like:

23:20:08.924250 2016-11-24T23:20:08.924+11:00 CRUD: 	Doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" in channels "{ch_private_559241a1-b27c-484d-b090-093e470d0104, ch_private_6ff22496-9fc0-47d6-8185-221ea4d903aa, ch_private_7f3b03b8-340d-4d41-80f9-d4ebec187216}"

The fact that the issues happens sometimes though makes it much harder to pin down. What happens when you query the changes feed as the user that’s meant to receive it?

For example, http://user1:pass@localhost:4984/{db}/_changes. If the document shows up in the response then it’s more likely to have been replicated to the client. If the document doesn’t show up in the response then it’s more likely a sync function issue or maybe a bug.

James

I agree with @jamiltz - testing whether a new _changes request for the user returns the latest rev will help identify what kind of issue this is.

Thanks for the response. Ok, I hit the endpoint http://6ff22496-9fc0-47d6-8185-221ea4d903aa:8b1f5eb2-1509-4f05-87f8-fa4e69a7a400@192.168.0.13:4986/sync_gateway/_changes and can’t see that doc I at all (not even rev 1).

This is the user https://gist.githubusercontent.com/npomfret/8be980752b41286cbb5e07b76ce97fbe/raw/8ab5f734fc25712decd8686ceb3f4602b1f43efd/user

This is the doc https://gist.githubusercontent.com/npomfret/8be980752b41286cbb5e07b76ce97fbe/raw/8ab5f734fc25712decd8686ceb3f4602b1f43efd/doc

I believe it should appear in the users changes feed because it’s in channel ch_private_6ff22496-9fc0-47d6-8185-221ea4d903aa

And this is the changes feed https://gist.githubusercontent.com/npomfret/8be980752b41286cbb5e07b76ce97fbe/raw/8ab5f734fc25712decd8686ceb3f4602b1f43efd/changes

I’ve grepped the sync gateway logs for that doc id and can’t see anything untoward.

I agree - channel assignment looks good based on the user and the doc.

Since the changes request ends before getting to the doc’s sequence (431352), this could potentially be a scenario where Sync Gateway has lost the mutation stream from Couchbase server, and isn’t updating it’s channel cache with new updates. As of SG 1.3, Sync Gateway should detect this scenario and attempt to restart the stream.

If SG is still working the mutation stream as expected, you should see a log entry like:

Cache: #431352 ==> channel "[...channel name...]"

at some point after the log excerpt in your first post to this thread, which represents the document being written to the channel cache. Is that present (or are there any log entries in this format) after the excerpt provided?

Hmm, the plot thickens. It just appeared in my app after several hours (after a restart of the sg). Here’s the sg log now:

23:20:08.921717 2016-11-24T23:20:08.921+11:00 CRUD+: Invoking sync on doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" rev 1-3d618c5092d04e301ab40558086bb068
23:20:08.924250 2016-11-24T23:20:08.924+11:00 CRUD:     Doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" in channels "{ch_private_559241a1-b27c-484d-b090-093e470d0104, ch_private_6ff22496-9fc0-47d6-8185-221ea4d903aa, ch_private_7f3b03b8-340d-4d41-80f9-d4ebec187216}"
23:20:08.924768 2016-11-24T23:20:08.924+11:00 CRUD: Stored doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" / "1-3d618c5092d04e301ab40558086bb068"
23:20:08.924968 2016-11-24T23:20:08.924+11:00 Events+: Event queue worker sending event Document change event for doc id: 2cf3c9e5-c28f-4662-ba2e-8c0757dac42b to: Webhook handler [http://192.168.0.13:8088/pomochat/webhook]
23:20:08.935474 2016-11-24T23:20:08.935+11:00 CRUD+: Invoking sync on doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" rev 2-082290ceadbbc2281b13051962cd50e8
23:20:08.936314 2016-11-24T23:20:08.936+11:00 CRUD+: Saving old revision "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" / "1-3d618c5092d04e301ab40558086bb068" (781 bytes)
23:20:08.936523 2016-11-24T23:20:08.936+11:00 CRUD+: Backed up obsolete rev "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b"/"1-3d618c5092d04e301ab40558086bb068"
23:20:08.936915 2016-11-24T23:20:08.936+11:00 CRUD: Stored doc "2cf3c9e5-c28f-4662-ba2e-8c0757dac42b" / "2-082290ceadbbc2281b13051962cd50e8"
23:20:08.937059 2016-11-24T23:20:08.937+11:00 Events+: Event queue worker sending event Document change event for doc id: 2cf3c9e5-c28f-4662-ba2e-8c0757dac42b to: Webhook handler [http://192.168.0.13:8088/pomochat/webhook]
08:11:12.554200 2016-11-25T08:11:12.554+11:00 Changes+: MultiChangesFeed sending &{Seq:431352 ID:2cf3c9e5-c28f-4662-ba2e-8c0757dac42b Deleted:false Removed:{} Doc:map[] Changes:[map[rev:2-082290ceadbbc2281b13051962cd50e8]] Err:<nil> allRemoved:false branched:false}   (to 6ff22496-9fc0-47d6-8185-221ea4d903aa)

If the root issue was that SG had lost the server’s mutation feed, this would have been fixed when you restarted SG.

The latest SG log provided looks like a mix of the pre-restart logs (23:20:08 timestamps) and the post-restart (the single 08:11:12 timestamp), so it’s not really shedding any more light on the situation.

If you have the full pre-restart logs, it may be useful to see a grep of those logs searching for the sequences in question (431351, 431352). If you want to post those to a gist, I’ll take a look to see if there’s any additional information there.

there’s a bit of noise in there but I think this what you mean: https://gist.github.com/npomfret/49938ffdb99b58c828c826ee5b1c9cd8 -

Thanks for the additional logs. Based on those, it looks like a case where the revisions/sequences in question aren’t being added to the in-memory channel cache. The most probable cause is that Sync Gateway lost it’s connection to the server mutation feed. As I mentioned, SG 1.3 or later should auto-reconnect in that scenario. What SG build are you on?

I’m using SG version 1.3.1-16

That sounds like a scenario where the dropped feed should have been detected, then. Can you file an issue against the SG github repo, if possible including a Sync Gateway log excerpt around the last appearance of “==> channel” in the logs?