Odd Change Feed Behavior with "Received Unexpected Out-of-Order Change" Log Lines

We have a background process that performs long polls of the _changes feed. It uses the since parameter, feeding it the last_seq returned by the previous poll. The last_seq value is treated opaquely as a string, and we don't pass any filters or other options.
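For reference, below is a minimal sketch of roughly the kind of loop we run (the Sync Gateway URL and handle_change are placeholders; it just issues longpoll requests against the public REST API and carries last_seq forward as an opaque string):

    import requests  # third-party HTTP client

    SG_URL = "http://sync-gateway:4984/db"  # placeholder public Sync Gateway URL

    def handle_change(change):
        # Application-specific processing of a single change entry.
        print(change["id"], change.get("seq"))

    def poll_changes():
        since = None  # opaque last_seq from the previous poll; None means start from the beginning
        while True:
            params = {"feed": "longpoll", "timeout": 60000}
            if since is not None:
                params["since"] = since
            resp = requests.get(SG_URL + "/_changes", params=params, timeout=120)
            resp.raise_for_status()
            body = resp.json()
            for change in body.get("results", []):
                handle_change(change)
            # Carry last_seq forward untouched; it may be a plain integer string
            # or a compound value like "nnnn::mmmm".
            since = body.get("last_seq", since)

    if __name__ == "__main__":
        poll_changes()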

One of our test clusters has been “missing” documents recently, and when we started to dig into the logs we found some lines we hadn’t seen before. Specifically, lines like:

  • 2016-06-20T15:14:42.872Z Cache: Received unexpected out-of-order change - not in skippedSeqs (seq 838056, expecting 848757) doc “-3cgde20IM2tQptJVgC3KyB” / “1-79d850ffae862b7e3d578b756950b62a”

This is happening on documents we create locally as well (not synced from Couchbase Lite). The key point is that once a document is "missed" it never shows up without "rewinding" or restarting the _changes feed from a much earlier starting sequence. We've also noticed that we're getting very large split sequence ranges that stay open for many hours and usually encompass the range of these missed documents; in other words, we always see the "missing" documents/sequences when last_seq is of the form "nnnn::mmmm", where mmmm - nnnn has at times been greater than 10,000.
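A simple way to flag these long-lived split ranges is plain string inspection of last_seq; a rough sketch is below (the "::" format and the 10,000 figure come from what we see in our logs, not from any documentation):

    def inspect_last_seq(last_seq):
        # Warn when last_seq is a compound "low::high" value with a large gap,
        # which is where our "missing" documents always seem to fall.
        text = str(last_seq)
        if "::" not in text:
            return
        low, high = text.split("::", 1)
        try:
            gap = int(high) - int(low)
        except ValueError:
            return  # non-numeric components; nothing useful to report
        if gap > 10000:
            print("WARNING: split sequence range %s has a gap of %d sequences" % (text, gap))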

One last observation: if the last sequence number is, say, x, and we immediately create a document, we find that the sequence number assigned on this misbehaving cluster, y, is much smaller than x. For example, I just pulled _sync:seq and got 848997; I then created a document, and that document was assigned sequence 837889. As a counterpoint, on a "working" cluster I see a _sync:seq value of 651493, and a newly created document picked up sequence number 651494, as you would expect.
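For anyone who wants to check their own cluster, here is a rough sketch of how that comparison could be scripted: read _sync:seq from the bucket (via the SDK or a tool like cbc cat), then create a document and ask Sync Gateway what sequence it assigned. The URLs are placeholders, and the _raw admin endpoint and the shape of its response are assumptions that vary by Sync Gateway version, so adjust for your deployment:

    import json
    import uuid
    import requests

    PUBLIC_URL = "http://sync-gateway:4984/db"  # placeholder public API
    ADMIN_URL = "http://sync-gateway:4985/db"   # placeholder admin API

    def sequence_of_new_doc():
        # Create a throwaway document via the public API...
        doc_id = "seq-probe-" + uuid.uuid4().hex
        resp = requests.put(PUBLIC_URL + "/" + doc_id,
                            data=json.dumps({"probe": True}),
                            headers={"Content-Type": "application/json"})
        resp.raise_for_status()
        # ...then read its metadata via the admin _raw endpoint to see which
        # sequence was allocated to it (field names assumed; check your SG version).
        raw = requests.get(ADMIN_URL + "/_raw/" + doc_id).json()
        return raw.get("_sync", {}).get("sequence")

    # On a healthy cluster this value should be just above _sync:seq as read
    # from the bucket, not thousands of sequences below it.
    print(sequence_of_new_doc())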

We found references in some previous posts to issues that can occur with backup/restore when the last sequence number gets out of whack. We've confirmed that errors like the ones above do not seem to be hitting our other clusters. We use spot instances for our test cluster, and did recently have to rebuild the cluster from backups after AWS terminated one or more spot instances.

Anybody know anything more about that error above, and whether it could actually result in missed _changes when using longpoll? Is there a safe correction that can be applied, or is it a wash at this point? Really interested in understanding what could have gone wrong to get us here.

DanP

PS: These are three-node clusters running Couchbase 4.0.0-4051-community and Sync Gateway 1.2.0 (79;9df63a5). There is a Sync Gateway instance running on each of the nodes in the cluster.

That does sound like an issue - specifically the scenario where new documents are being created with a sequence number less than _sync:seq. It sounds like Sync Gateway has a backlog of unused sequence numbers (potentially from a previous failed bulk operation), and they aren't getting allocated as expected.

Can you file a github issue so that we can look into it further (https://github.com/couchbase/sync_gateway/issues)? Any log information about previous bulk_docs operations on the “bad” Sync Gateway node would be helpful.

It looks like this post hasn't been active for quite some time, but has anything changed related to this issue? I have Couchbase Community Edition 4.5.0 (2 instances) and Sync Gateway 1.3.1 (2 instances), and I see the same problem in the logs pretty often. It breaks replication to mobile devices whenever a new or existing document is updated with a lower sequence number.

@Lohmatyi You shouldn’t be seeing this on a regular basis. As above - can you file a github issue with additional details on your setup, and preferably Sync Gateway logs for both SG nodes showing the problem?

For sure I shouldn't, because replication is broken, but I do see them. I'll try to trim the SG logs a bit so I don't have to upload a huge number of MB.