Sync Gateway assigns an abandoned sequence number

Hi,

We run two Sync Gateway Docker containers in a swarm. The issue is that Sync Gateway does not replicate documents to Couchbase Lite when they are assigned an “abandoned” sequence number.

Logs:

2019-02-22T08:28:03.968Z [DBG] Changes+: MultiChangesFeed waiting... (to 1178)

2019-02-22T08:28:03.968Z [DBG] Changes+: No new changes to send to change listener. Waiting for "primary"'s count to pass 0

2019-02-22T08:28:10.604Z [INF] HTTP: #4577: GET /

2019-02-22T08:28:10.604Z [INF] HTTP+: #4577: --> 200 (0.1 ms)

2019-02-22T08:28:12.134Z [DBG] Changes+: heartbeat written to _changes feed for request received (as 1411)

2019-02-22T08:28:12.147Z [INF] Cache: Received #108936 after 261ms ("1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3" / "1-823ad0130266cb4f959f332b878b359b")

2019-02-22T08:28:12.147Z [INF] Cache: **Received unexpected out-of-order change - not in skippedSeqs (seq 108936, expecting 108948) doc "1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3" / "1-823ad0130266cb4f959f332b878b359b"**

2019-02-22T08:28:12.147Z [INF] Cache: Initialized cache for channel "1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3-READ" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}

2019-02-22T08:28:12.147Z [INF] Cache: #108936 ==> channel "1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3-READ"

2019-02-22T08:28:12.147Z [DBG] Cache+: LogEntries.appendChange: out-of-order sequence #108936 (last is #108947) - handling as insert

2019-02-22T08:28:12.147Z [INF] Cache: #108936 ==> channel "1340_data"

2019-02-22T08:28:12.147Z [DBG] Cache+: LogEntries.appendChange: out-of-order sequence #108936 (last is #108947) - handling as insert

2019-02-22T08:28:12.147Z [INF] Cache: #108936 ==> channel "*"

Here, the document change flagged with “Received unexpected out-of-order change - not in skippedSeqs (seq 108936, …)” was never replicated to the Couchbase Lite database. Digging deeper into the logs reveals that this sequence was never previously assigned to any document. The logs show:

  Received #108935

and after some time,

 2019-02-21T16:20:08.484Z [DBG] CRUD+: Saving doc (seq: #108938, id: 1340.route.2557965f-f00d-4362-adf0-ec9b0a87b989 rev: 2-ec550692d9b8800ea660af9f5baa025f)

 2019-02-21T16:20:08.648Z [INF] Cache: Received #108938 after 164ms ("1340.route.2557965f-f00d-4362-adf0-ec9b0a87b989" / "2-ec550692d9b8800ea660af9f5baa025f")

 2019-02-21T16:20:08.648Z [INF] Cache: Deferring #108938 (1 now waiting for #108930...#108937)

And since the missed sequence number (#108936) never arrives, it gets abandoned by the cleanup task after the default 60 minutes, with this log:

2019-02-21T17:37:35.311Z [INF] Cache: Querying ‘channels’ view for “*” (start=#108936, end=#108936, limit=0)

2019-02-21T17:37:35.481Z [WRN] Skipped Sequence 108936 didn't show up in MaxChannelLogMissingWaitTime, and isn't available from the * channel view. If it's a valid sequence, it won't be replicated until Sync Gateway is restarted. -- db.(*changeCache).CleanSkippedSequenceQueue.func1() at change_cache.go:283

It looks easily reproducible when using multiple Sync Gateway instances. From my debugging, the issue appears to be the following:

Multiple SG instances compete for the next sequence number, which updates _sync:seq twice. The _addPendingLogs method in change_cache.go then pushes the missed sequence number to the skipped queue, where it gets abandoned if it does not arrive within max_wait_skipped time. Since lastSeq still holds the abandoned sequence number #108936, that sequence gets handed out when nextSequence is called.
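My (simplified) mental model of that handoff to the skipped queue, sketched in Go. The type and field names here are invented for illustration, and the pending wait (max_wait_pending) is collapsed for brevity; this is not Sync Gateway's actual _addPendingLogs code:

```go
package main

import "fmt"

// pendingCache is a toy model of the change cache's pending/skipped
// bookkeeping; names and structure are invented, not Sync Gateway's
// actual implementation.
type pendingCache struct {
	nextExpected uint64   // next sequence the cache expects to see
	skipped      []uint64 // sequences pushed to the skipped queue
}

// receive processes an incoming sequence; any gap between the expected
// and received sequence is recorded as skipped.
func (c *pendingCache) receive(seq uint64) {
	for s := c.nextExpected; s < seq; s++ {
		c.skipped = append(c.skipped, s)
	}
	if seq >= c.nextExpected {
		c.nextExpected = seq + 1
	}
}

func main() {
	c := &pendingCache{nextExpected: 108936}
	c.receive(108938) // #108936 and #108937 never arrived
	fmt.Println(c.skipped)      // [108936 108937]
	fmt.Println(c.nextExpected) // 108939
}
```

In this model, once #108936 and #108937 land in the skipped queue, nothing except their eventual arrival (or the cleanup task) removes them.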

And it is very closely related to https://github.com/couchbase/sync_gateway/issues/3787.

Steps to reproduce:

  1. Start 2 or more SG instances with “max_wait_skipped” set to less than 5 minutes.
  2. Push documents to SG, which triggers TAP/DCP changes in both instances.
  3. Notice the skip in new sequence numbers.
  4. As the skipped sequence numbers are pushed to the skippedQueue, they get abandoned after max_wait_skipped time. (I kept it short to reproduce faster.)
  5. The very next document is assigned the missed sequence number.
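For step 1, the interval can be lowered in the database's cache settings. A minimal config sketch, assuming the Sync Gateway 2.x cache property names; the server/bucket values are placeholders and the timeout is in milliseconds:

```json
{
  "databases": {
    "db": {
      "server": "http://couchbase:8091",
      "bucket": "db",
      "cache": {
        "max_wait_skipped": 120000
      }
    }
  }
}
```

Here 120000 ms (2 minutes) keeps the repro window well under the default 60 minutes.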

Sync Gateway: 2.1.1
Couchbase: 5.0.1

Any help guys? @jens @adamf

Multiple Sync Gateway nodes incrementing _sync:seq is routine behaviour - it shouldn’t cause any issues like the one you’re describing. Sync Gateway 2.1.1 should be releasing any unused sequence numbers, so I’d like to better understand the scenario where a sequence is allocated and then doesn’t get used until step 5.

Some questions to help understand what you’re seeing:

  • In your step 2, how many documents are you pushing to each SG node? How are you writing the documents (PUT doc, bulk_docs, etc). Are the writes successful? What are the sequences associated with the new revisions?
  • What is the skipped sequence of the ‘skip’ in step 3?

Thanks.

Thanks for looking into this @adamf!
To answer your questions,

  1. A single document is pushed to SG using POST. It was a new document, and the sequence assigned to its revision was #108938. There was a “Deferring” log (posted in the original thread), and the previous “Received” log showed #108935. This made me believe that #108936 and #108937 were missed and subsequently got pushed to skipped seqs.
  2. Both #108936 and #108937 got skipped.

So after receiving #108938, there was a long gap where SG did not have any activity (no new/changed documents were pushed). During this time, the cleanup task ran and printed:

2019-02-21T17:37:35.481Z [WRN] Skipped Sequence 108936 didn't show up in MaxChannelLogMissingWaitTime, and isn't available from the * channel view. If it's a valid sequence, it won't be replicated until Sync Gateway is restarted. -- db.(*changeCache).CleanSkippedSequenceQueue.func1() at change_cache.go:283

After this long gap, a new document was pushed to SG which showed,

2019-02-22T08:28:12.147Z [INF] Cache: Received #108936 after 261ms ("1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3" / "1-823ad0130266cb4f959f332b878b359b")

2019-02-22T08:28:12.147Z [INF] Cache: **Received unexpected out-of-order change - not in skippedSeqs (seq 108936, expecting 108948) doc "1340.route.cb078380-3b35-40d4-acbd-8e63e8f583c3" / "1-823ad0130266cb4f959f332b878b359b"**

So somehow SG assigned a sequence number that had already been purged by the cleanup task? Could it be that s.last was still holding on to this sequence?
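To illustrate the suspicion: a batching allocator that hands sequences out of a previously reserved block could re-issue #108936 long after the cleanup abandoned it. This is a hypothetical sketch; the field names (last, max) and batch size are invented, and this is not Sync Gateway's actual allocator:

```go
package main

import "fmt"

// seqAllocator sketches a batching sequence allocator: it reserves a block
// of sequences from the shared _sync:seq counter and hands them out one at
// a time. All names here are invented for illustration.
type seqAllocator struct {
	counter *uint64 // stands in for the shared _sync:seq document
	last    uint64  // last sequence handed out by this node
	max     uint64  // top of the block this node has reserved
}

const batchSize = 10 // illustrative only

// nextSequence returns the next sequence, reserving a fresh block from the
// shared counter only when the cached block is exhausted.
func (a *seqAllocator) nextSequence() uint64 {
	if a.last >= a.max {
		*a.counter += batchSize
		a.max = *a.counter
		a.last = a.max - batchSize
	}
	a.last++
	return a.last
}

func main() {
	syncSeq := uint64(108950) // other nodes have long since moved _sync:seq ahead
	// Suppose this node reserved up through #108940 hours ago and then sat
	// idle, long enough for the cleanup task to abandon #108936.
	node := &seqAllocator{counter: &syncSeq, last: 108935, max: 108940}

	// The next write on this node still draws from the stale cached block:
	fmt.Println(node.nextSequence()) // 108936, the abandoned sequence
}
```

If something like this happens, the long idle window between reservation and use would exceed max_wait_skipped, matching the logs above.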

I agree it looks like a scenario where SG allocated a sequence and then attempted to reuse it after a window longer than the skipped sequence expiry, but I’m trying to get to the bottom of the exact update pattern that would result in that behaviour.

Is it possible to post the full logs from both your SG nodes to a gist? That will help me get a better picture of exactly what updates are triggering the sequence allocation, but not triggering the expected releaseSequence.

Thanks.

Sure! Here are the gists,

SG Node 1 - https://gist.github.com/BharathMG/fa486aaa6553595f160f0cc6b61a90f7
SG Node 2 - https://gist.github.com/BharathMG/000e4905f37d0916b571b7345153e699

I removed some lines like **heartbeats, plain 200s, and GET /** to reduce the log size. The log runs from sequence #108928 to sequence #108947.

@adamf Did you get a chance to look at the above logs?

Thanks for sharing. This looks like a potential bug when using 1.x-style replication (bulk_docs) under low load. I’ve filed https://issues.couchbase.com/browse/CBG-253 to track the fix.

Awesome! Thanks a lot for your help @adamf !

I think I have the same issue, but was a bit confused by the cause posted above. In about a 12-hour period I saw about 150 of these messages in my Sync Gateway log:
WARNING: Skipped Sequence 41126734 didn’t show up in MaxChannelLogMissingWaitTime, and isn’t available from the * channel view. If it’s a valid sequence, it won’t be replicated until Sync Gateway is restarted. – db.(*changeCache).CleanSkippedSequenceQueue.func1() at change_cache.go:220

I also saw about 8000 “sync fn rejected” messages from Sync Gateway rejecting invalid docs, but we have many successful writes in between the failed writes and the skipped sequence errors. Based on the explanation above, as long as there is any successful write to the DB by any user within 60 minutes of a failed write, I should not be seeing the “skipped sequence” errors? If so, what else could be causing them?

CB Server Community 4.1.1
Sync Gateway Community 1.3.1