"_changes" feed not 100% reliable


#1

Sync Gateway version

Couchbase Sync Gateway/1.5.1(4;cb9522c) (multiple instances, dynamically scaled)

Operating system

Ubuntu 14.04.3 LTS

Expected behavior

When repeatedly requesting GET /{db}/_changes from a Sync Gateway database on the admin port with a since parameter set to the last_seq value returned in the previous request, the expectation is that 100% of updated documents will be reliably returned in the responses. Or in other words this strategy ought to result in a guaranteed feed of every single upserted document in a Sync Gateway database over time.

Actual behavior

After extensive investigation following repeated anecdotal reports from users of a production implementation backed by Sync Gateway, it is demonstrably the case that upserted documents are not included in this changes feed, with significant regularity. It would seem that this most frequently occurs when the database sequence transitions from, or to, a compound sequence (stable_seq::seq) representation.

A Typical Example

A recent example that I investigated followed this pattern. The logs from the change monitor application reveal that:

  1. query made with since: 11045496
  2. response contains 1 document (seq: 11045497), and last_seq: 11045497
  3. subsequent query made with since: 11045497.
  4. response contains 2 documents (seq: 11045497::11045499 & 11045497::11045500), and last_seq: 11045497::11045500
  5. subsequent query made with since: 11045497::11045500
  6. response contains 1 document (seq: 11045497::11045501), and last_seq: 11045497::11045501
  7. subsequent query made with since: 11045497::11045501
  8. response contains 26 documents (seq: 11045497::11045502 … 11045497::11045527), and last_seq: 11045497::11045527
  9. subsequent query made with since: 11045497::11045527
  10. response contains 1 document (seq 11045528), and last_seq: 11045528

This is a total of 31 documents between seq: 11045497 and seq: 11045528

Some short time later once the compound sequence situation has resolved, the following manual confirmation is invoked:

  1. query made with since=11045496&limit=32
  2. response contains 32 document (seq: 11045497 … 11045528), and last_seq: 11045528

This is a total of 32 documents between seq: 11045497 and seq: 11045528

Analysis of the returned documents reveals that in the first example the response in step 4 was missing the document returned as seq: 11045498 in the second example. Or in other words, the response to step 3 is missing the expected document seq: 1045497::11045498. (The log files surrounding this snippet include exhaustive data for several days both before and after this sequence, and the document at seq: 11045498 is demonstrably missing from the change monitor.

Observations

  1. It would seem that the occurrence rate of missed document changes increased significantly when we recently introduced replicating clients into our system (previously, the sync gateway database was only accessed by services directly over the REST API). The speculation is that these replicating clients (perhaps by now invoking the previously unused bulk “set” APIs on the Sync Gateway endpoints?) have increased the frequency of whatever root cause leads to the unreliability. Note that this is speculation only.

  2. We are using this changes feed to detect the presence of replicated “transactional” documents, and also to recognise changes to “state” documents within Sync Gateway. Our system is absolutely contingent on the guaranteed detection of such document changes in order to trigger application critical business logic and subsequent flow-on document mutations. The unreliability of this changes feed is a mission critical situation for us, and our production system is currently unhealthy, with no immediately obvious mitigation strategy.

  3. Perhaps most alarmingly: On the assumption that the remote clients (using Couchbase provided clients) essentially utilise the same changes feed mechanism in order to implement pull replication functionality (albeit over the public port, and with user context channel filtering), there is now a significant concern that such pull replication might also be not 100% reliable and might potentially result in missing document mutations replicating down to clients. Again this constitutes a mission critical failure mode for us.

  4. A search of the forum and issue history for Sync Gateway yields various similar sounding issues, which may, or may not be related, for example:

Thanks for your attention to this issue.


#2

Note: also raised as https://github.com/couchbase/sync_gateway/issues/3721


#3

Tracking investigation of this issue on the github ticket (https://github.com/couchbase/sync_gateway/issues/3721).


#4

Same version of SG 1.5.1(4;cb9522c), similar problem here. but we are just using 1 SG and 1 CBS.
When doing pull replication from CBL, CBL does not getting latest revision of documents, but a older revision.

sg config

{
“log”: [""],
“adminInterface”:":4985",
“interface”: “:4984”,
“databases”: {
“db”: {
“allow_empty_password”: false,
“unsupported”: {
“user_views”: {
“enabled”: true
}
},
“server”: “http://CB_host”,
“bucket”: “db”,
“username”: “admin”,
“password”: “Password”,
“enable_shared_bucket_access”: true,
“import_docs”: “continuous”,
“users”: {
“GUEST”: { “disabled”: true },
“admin”: { “password”: “Password”, “disabled”: false, “admin_roles”: [“admin”] }
“user001”: { “password”: “Password”, “disabled”: false, “admin_roles”: [“user”] }
},
“roles”: {
“admin”: { “admin_channels”: ["
"] },
“user”: { “admin_channels”: [“c001”] }
},
“sync”:function(doc, meta) { if (doc.type =='product') requireRole("admin"); channel('c001'); }
},
“CORS”: {
“Origin”: [""],
“LoginOrigin”: ["
"],
“Headers”: [“Content-Type”],
“MaxAge”: 1728000
}
}

  1. we get session of SG with user001 and perform replication on CBL to get latest document.
  2. then update a documents with PUT on SG with admin account
  3. perform the replication on CBL again.

Changes+: Changes POST request. URL: /db/_changes?feed=normal&heartbeat=30000&style=all_docs&since=287, feed: normal, options: {Since:287 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator: HeartbeatMs:30000 TimeoutMs:300000 ActiveOnly:false}, filter: , bychannel: [], docIds: [] (to user001)
Changes+: Int sequence multi changes feed…
Changes: MultiChangesFeed(channels: {*}, options: {Since:287 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator:0xc42164bbc0 HeartbeatMs:30000 TimeoutMs:300000 ActiveOnly:false}) … (to user001)
Changes+: MultiChangesFeed: channels expand to “!:1,c001:27” … (to user001)
> Cache: getCachedChanges("!", 287) --> 0 changes valid from #1
> Changes+: [changesFeed] Found 0 changes for channel !
> Changes+: [changesFeed] Found 0 changes for channel c001
Changes: MultiChangesFeed done (to user001)
HTTP+: #198: --> 200 OK (0.0 ms)

http://sg/db

{
“committed_update_seq”: 288,
“compact_running”: false,
“db_name”: “config”,
“disk_format_version”: 0,
“instance_start_time”: 1535371130018573,
“purge_seq”: 0,
“state”: “Online”,
"update_seq": 288
}

the updated document is not being sync to the CBL

when we get full _changes list from SG. the changes seq is only on 282, which is the seq before we restart the SG.

After, restart sync gateway. CBL can get latest document correctly. It seems SG stop trace the changes after SG restarted. We only found following warning on the log, don’t know whether it is related

WARNING: Error processing DCP stream - will attempt to restart/reconnect: worker connect, server: 172.19.0.2:11210, err: dial tcp 172.19.0.2:11210: getsockopt: connection timed out – base.(*DCPReceiver).OnError() at dcp_feed.go:136
WARNING: Error processing DCP stream - will attempt to restart/reconnect: worker connect, server: 172.19.0.2:11210, err: dial tcp 172.19.0.2:11210: getsockopt: connection timed out – base.(*DCPReceiver).OnError() at dcp_feed.go:136

We use CB long time ago, but never meeting this problem before. it is just happen recently on 1.5.1 when we start a new project.