Hi,
We use two sync gateway docker containers in a swarm. The issue is, sync gateway does not replicate (to CB Lite) documents when they are assigned an “abandoned” sequence number.
Logs:
2019-02-22T08:28:03.968Z [DBG] Changes+: MultiChangesFeed waiting... (to 1178)
2019-02-22T08:28:03.968Z [DBG] Changes+: No new changes to send to change listener. Waiting for "primary"'s count to pass 0
2019-02-22T08:28:10.604Z [INF] HTTP: #4577: GET /
2019-02-22T08:28:10.604Z [INF] HTTP+: #4577: --> 200 (0.1 ms)
2019-02-22T08:28:12.134Z [DBG] Changes+: heartbeat written to _changes feed for request received (as 1411)
2019-02-22T08:28:12.147Z [INF] Cache: Received #108936 after 261ms ("1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3" / "1-823ad0130266cb4f959f332b878b359b")
2019-02-22T08:28:12.147Z [INF] Cache: **Received unexpected out-of-order change - not in skippedSeqs (seq 108936, expecting 108948) doc "1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3" / "1-823ad0130266cb4f959f332b878b359b**"
2019-02-22T08:28:12.147Z [INF] Cache: Initialized cache for channel "1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3-READ" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
2019-02-22T08:28:12.147Z [INF] Cache: #108936 ==> channel "1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3-READ"
2019-02-22T08:28:12.147Z [DBG] Cache+: LogEntries.appendChange: out-of-order sequence #108936 (last is #108947) - handling as insert
2019-02-22T08:28:12.147Z [INF] Cache: #108936 ==> channel "1340_data"
2019-02-22T08:28:12.147Z [DBG] Cache+: LogEntries.appendChange: out-of-order sequence #108936 (last is #108947) - handling as insert
2019-02-22T08:28:12.147Z [INF] Cache: #108936 ==> channel "*"
Here, this document change “Received unexpected out-of-order change - not in skippedSeqs (seq 108936,” was never replicated to Couchbase lite database. Digging deeper into logs reveal, that this sequence was never picked up before by any document. Logs shows,
Received #108935
and after some time,
2019-02-21T16:20:08.484Z [DBG] CRUD+: Saving doc (seq: #108938, id: 1340.route.2557965f-f00d-4362-adf0-ec9b0a87b989 rev: 2-ec550692d9b8800ea660af9f5baa025f)
2019-02-21T16:20:08.648Z [INF] Cache: Received #108938 after 164ms ("1340.route.2557965f-f00d-4362-adf0-ec9b0a87b989" / "2-ec550692d9b8800ea660af9f5baa025f")
2019-02-21T16:20:08.648Z [INF] Cache: Deferring #108938 (1 now waiting for #108930...#108937)
And as the missed sequence number (#108936) never arrives, that gets abandoned after default 60 minutes clean up task with this log,
2019-02-21T17:37:35.311Z [INF] Cache: Querying ‘channels’ view for “*” (start=#108936, end=#108936, limit=0)
2019-02-21T17:37:35.481Z [WRN] Skipped Sequence 108936 didn't show up in MaxChannelLogMissingWaitTime, and isn't available from the * channel view. If it's a valid sequence, it won't be replicated until Sync Gateway is restarted. -- db.(*changeCache).CleanSkippedSequenceQueue.func1() at change_cache.go:283
It looks easily reproducible when using multiple sync gateway instances. From my debugging, the issue looks like,
Multiple SG instances fighting for next sequence number which updates _sync:seq twice and _addPendingLogs method in change_cache.go
pushes the missed sequence number to skipped queue which gets abandoned if it does not arrive within max_wait_skipped time. As the lastSeq
still holds last abandoned sequence number #108936, that gets picked up when nextSequence
is called.
And it is very closely related to https://github.com/couchbase/sync_gateway/issues/3787.
Way to reproduce:
- Start 2 or more SG instances with “max_wait_skipped” to less than 5 minutes.
- Push documents to SG which triggers TAP DCP changes in both the instances.
- Notice the skip in new sequence numbers.
- As the skipped sequence numbers are pushed to skippedQueue, they will get abandoned after max_wait_skipped time. (I kept it shorter to reproduce)
- The very next document will have this missed sequence number.
Sync gateway : 2.1.1
Couchbase: 5.0.1
Any help guys? @jens @adamf