Pull Replication Stopped/Blocked After Updating to Swift 3


#1

We’re running into a strange issue with swift 3 and CBL 1.4 occasionally stopping/blocking pull replication writing documents to the device.
After enabling:

CBLManager.enableLogging("Sync")
CBLManager.enableLogging("SyncVerbose")
CBLManager.enableLogging("ChangeTracker")
CBLManager.enableLogging("ChangeTrackerVerbose"), 

I was able to see these errors coming through:

Sync: CBLBulkDownloader[…/_bulk_get]: Starting new document; ID="(null)“
Sync: CBLBulkDownloader[/…/_bulk_get]: Finished document
Sync: CBLMultipartDocumentReader[_id=”(null)"]: Finished loading (0 attachments)
Sync: Giving up on {-vRw6KYPEQEij-gZ1LdgoBi #2-541b03f216a084e57e0225c8f68fb485}: CBLHTTP[404, “not_found”]

The logs appeared for several documents, and after viewing a handful of the documents, I could see the revision in the iPad’s log was older than the revision available on the gateway/server.
We’ve encountered the “Giving up on” logs numerous times when this pull issue occurs.

When we encounter this error, it blocks all pull replication from saving any of the pulled documents to the device. However, push replication behaves normally.

This issue is difficult to recreate in our test environment, but we frequently see it at customer locations.

Deleting and reinstalling the database from the iPad solves the issue. However, if we leave the iPads alone long enough, the pull replication eventually kicks back on.

It’s worth noting we also saw this issue in CBL 1.3, but we didn’t start seeing this problem until we switched to swift 3.

Has anyone else encountered this issue, or a similar problem? Push replication continues normally, but pull replication seems to be blocked.

Thanks


Document no update after reactive from background (iOS)
#2

New info regarding this issue:

Recently, we have also seen pull replication stopping without the “giving up on…” messages

Currently, I’m investigating this error message: “CBLWebSocketChangeTracker[…]: Connection error #1, retrying in 2.0 sec: PSWebSocketError[3, “Output stream end encountered”]”

Thanks


#3

What does happen after retrying? It could be just a temporary connection issue that will get resolved after retry or after the connection issue is fixed.


#4

I general I don’t think the issue is related to Swift 3.

Sync: Giving up on {-vRw6KYPEQEij-gZ1LdgoBi #2-541b03f216a084e57e0225c8f68fb485}: CBLHTTP[404, “not_found”]

This is the error sent by SGW as a result of sending _bulk_get to request for the document revision that doesn’t exist anymore on the SGW side.


#5

Thanks for your reply @pasin

I appreciate the information on the “Giving up on” logs
Regarding the “Giving up on” logs, we were only able to capture those logs for one customer, and I believe we may have two separate issues with similar effects (blocking pull)

Yesterday, I was able to capture logs at another location with what appeared to be the same issue (pull blocked, reinstall solved the problem). However, the logs did not contain the “Giving up on…” error.
Currently, I believe the majority of problems we’re dealing with may have to do with the "Output stream end encountered\" error.

After that error,
I see logs like this:

Sync: CBLRestPuller[http://...]: postProgressChanged (0/0, active=0 (batch=0, net=0), lastSeq=11116025::11136612, online=1, error=(null))
Sync: CBLRestPuller[http://...] STOPPED
Replication: CBLRestPuller[http://...] took 260.459 sec; error=(null)
Sync: CBLRestPuller[http://...]: postProgressChanged (0/0, active=0 (batch=0, net=0), lastSeq=11116025::11136612, online=1, error=(null))

Interestingly, the lastSeq=11116025::11136612
is in all but the first CBLRestPuller[http://…]: postProgressChanged log.
In other words, the sequence never gets past 11116025

Would that be related to the blocked pull replication?

Thanks again


#6

@pasin
I’m wondering if it could be related to this issue:

Would you know if that was ever resolved?

Thanks


#7

@Caroline “Output stream end encountered” error indicates that client received end of websocket stream for some reasons such as connectivity issue or server error. If the pull replicator can recover after retry, it should be fine in general.

Can you cross check with the Sync Gateway log during the same timeframe to see if there is an error on the SGW side or not?

I couldn’t confirm if it’s the same issue as Document no update after reactive from background (iOS) or not. It could be the same or different.


#8

Thanks @pasin,

I don’t see any error… But, I’ve attached the logs i saw roughly around the time of the error on the iPad.

_time=2017-05-16T13:52:00.664-05:00 _level=INFO _msg=go-couchbase: call to Do("_sync:user:***") in github.com/couchbase/go-couchbase.(*Bucket).casNext took 1.497782806s

_time=2017-05-16T13:52:01.219-05:00 _level=INFO _msg=go-couchbase: call to Do("_sync:user:***") in github.com/couchbase/go-couchbase.(*Bucket).casNext took 1.429551467s

2017-05-16T13:52:02.489-05:00 HTTP: #287: GET /…/_changes?feed=websocket (as ***)
2017-05-16T13:52:02.489-05:00 HTTP+: #287: --> 101 Upgraded to WebSocket protocol (0.0 ms)

2017-05-16T13:52:02.604-05:00 HTTP: #290: GET /…/_changes?feed=websocket (as ***)
2017-05-16T13:52:02.604-05:00 Changes+: Int sequence multi changes feed…
2017-05-16T13:52:02.604-05:00 HTTP+: #290: --> 101 Upgraded to WebSocket protocol (0.0 ms)

2017-05-16T13:52:03.191-05:00 Changes: MultiChangesFeed({******}, {Since:1055 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:true Terminator:0xc423e156e0 HeartbeatMs:300000 TimeoutMs:300000 ActiveOnly:false}) … (to ***)
2017-05-16T13:52:03.206-05:00 Changes+: Int sequence multi changes feed…
2017-05-16T13:52:03.206-05:00 Changes: MultiChangesFeed({***
***}, {Since:1055 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:true Terminator:0xc423e15b00 HeartbeatMs:300000 TimeoutMs:300000 ActiveOnly:false}) … (to ***)

2017-05-16T13:52:03.299-05:00 HTTP: #263: GET /…/_changes?feed=websocket (as ***)
2017-05-16T13:52:03.299-05:00 HTTP+: #263: --> 101 Upgraded to WebSocket protocol (0.0 ms)

2017-05-16T13:52:03.405-05:00 HTTP: #264: POST /…/changes?feed=longpoll&heartbeat=300000&style=all_docs&since=11041288%3A%3A11054109&filter=sync_gateway%2Fbychannel (as ***)
2017-05-16T13:52:03.406-05:00 Changes+: Int sequence multi changes feed…
2017-05-16T13:52:03.406-05:00 Changes: MultiChangesFeed({***
***170515, ******170516, ******170517, ******_vital}, {Since:11041288::11054109 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc4258541e0 HeartbeatMs:300000 TimeoutMs:300000 ActiveOnly:false}) … (to ***)

time=2017-05-16T13:52:03.574-05:00 level=INFO msg=go-couchbase: call to Do(“sync:user:***") in github.com/couchbase/go-couchbase.(Bucket).casNext took 368.59968ms
2017-05-16T13:52:03.575-05:00 Changes+: MultiChangesFeed: channels expand to "
**
***:526” … (to ***)
2017-05-16T13:52:03.575-05:00 Cache: Initialized cache for channel "***
***" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
2017-05-16T13:52:03.575-05:00 Cache: getCachedChanges("***
***", 1055) --> 0 changes valid from #1069
2017-05-16T13:52:03.575-05:00 Cache: Querying ‘channels’ view for "***
***" (start=#1056, end=#1069, limit=0)

Hopefully this helps.

We also encountered the error again a few minutes ago. Again, I was able to see the "Output stream end encountered" error

Thanks


#9

Can you post the log (both client log and SGW log) of the recent incident that the issue just happened (missing doc)?


#10

I can post the logs of the most recent occurrence (it’ll take a moment to organize)

However, I believe there’s a little confusion. We aren’t seeing a missing doc,
All we’re seeing with these last 3 iPad issues is the CBLWebSocketChangeTracker[0x1781c4380 ...]: Connection error #1, retrying in 2.0 sec: PSWebSocketError[3, \"Output stream end encountered\"] error message

So, there’s no missing document, but we are seeing pull replication stop (but push replication continues normally)

Thanks


#11

Also could you please confirm what happened after retrying in 2.0 sec? If the replicator can recover, I’m not sure what the issue we are facing here.


#12

@pasin
I’ll look into that now. What message should I be looking for on the retry?

Thanks


#13

From the log below as an example, you can see Connection error #1, retrying in 2.0 sec message which indicates that there was a connection error, and the change tracker is going to retry in 2.0 second. The #1 means the number of times that the error has been occurred. After that you can see that, the change tracker can restart and work correctly.

2017-05-17 02:00:17 +0000Sync: CBLRestPuller[http://xxxx/aaaa/]: postProgressChanged (0/0, active=0 (batch=0, net=0), lastSeq=1010, online=1, error=(null))
: CBLWebSocketChangeTracker[0x1781c4380 auth]: Connection error #1, retrying in 2.0 sec: PSWebSocketError[3, \"Output stream end encountered\"]
2017-05-17 02:00:17 +0000: CBLWebSocketChangeTracker[0x1781c4380 auth]: Connection error #1, retrying in 2.0 sec: PSWebSocketError[3, \"Output stream end encountered\"]
: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Connection error #1, retrying in 2.0 sec: PSWebSocketError[3, \"Output stream end encountered\"]
2017-05-17 02:00:17 +0000: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Connection error #1, retrying in 2.0 sec: PSWebSocketError[3, \"Output stream end encountered\"]
ChangeTracker: CBLWebSocketChangeTracker[0x1741c5be0 bucket2]: WebSocket opened
2017-05-17 02:00:17 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1741c5be0 bucket2]: WebSocket opened
Sync: CBLReplication[from http://xxxx/aaaa/]: idle, progress = 0 / 0, err: (null)
2017-05-17 02:00:17 +0000Sync: CBLReplication[from http://xxxx/aaaa/]: idle, progress = 0 / 0, err: (null)
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Starting...
2017-05-17 02:00:19 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Starting...
Sync: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: GET //xxxx/aaaa/_changes?feed=websocket
2017-05-17 02:00:19 +0000Sync: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: GET //xxxx:4984/aaaa/_changes?feed=websocket
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: GET //xxxx:4984/aaaa/_changes?feed=websocket {
    Authorization = \"Basic xxxxxxx\";
    Host = \"xxxx:4984\";
    \"User-Agent\" = \"CouchbaseLite/1.3 (iOS 1.4.0/05977be0b6d778c256f3ba45548b43f08da2ee75)\";
}
2017-05-17 02:00:19 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: GET //xxxx:4984/aaaa/_changes?feed=websocket {
    Authorization = \"Basic xxxxx\";
    Host = \"xxxx:4984\";
    \"User-Agent\" = \"CouchbaseLite/1.3 (iOS 1.4.0/05977be0b6d778c256f3ba45548b43f08da2ee75)\";
}
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Started... <http://xxxx:4984/aaaa/_changes?feed=websocket>
2017-05-17 02:00:19 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Started... <http://xxxx:4984/aaaa/_changes?feed=websocket>
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: WebSocket opened
2017-05-17 02:00:20 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: WebSocket opened
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4380 auth]: WebSocket opened
2017-05-17 02:00:20 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4380 auth]: WebSocket opened
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Got a message: []
2017-05-17 02:00:20 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: Got a message: []
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: read 2 bytes
2017-05-17 02:00:20 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: read 2 bytes
ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: caught up!
2017-05-17 02:00:20 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1781c4fb0 aaaa]: caught up!

#14

Thanks @pasin

I see it trying to recover:

2017-05-17 16:43:19 +0000: ChangeTracker: CBLWebSocketChangeTracker[0x1703c3570 ***]: Starting...

2017-05-17 16:43:19 +0000: Sync: CBLWebSocketChangeTracker[0x1703c3570 ***]: GET //...:4984/***/_changes?feed=websocket\

    2017-05-17 16:43:19 +0000: ChangeTracker: CBLWebSocketChangeTracker[0x1703c3570 ***]: GET //...:4984/***/_changes?feed=websocket \{
        Authorization = "Basic xxx";
        Host = "...:4984";
        "User-Agent" = "CouchbaseLite/1.3 (iOS 1.4.0/05977be0b6d778c256f3ba45548b43f08da2ee75)";
    }

    2017-05-17 16:43:19 +0000: ChangeTracker: CBLWebSocketChangeTracker[0x1703c3570 ***]: Started... <http://...:4984/***/_changes?feed=websocket>


    2017-05-17 16:43:19 +0000: : CBLWebSocketChangeTracker[0x1703c3570 ***]: Connection error #2, retrying in 4.0 sec: NSURLError[-1004, "Connection refused", <http://...:4984/***/_changes?feed=websocket>]<--NSPOSIXError[61, "Connection refused"]

    2017-05-17 16:43:23 +0000: ChangeTracker: CBLWebSocketChangeTracker[0x1703c3570 ***]: Starting...

    2017-05-17 16:43:23 +0000: Sync: CBLWebSocketChangeTracker[0x1703c3570 ***]: GET //...:4984/***/_changes?feed=websocket
    
    2017-05-17 16:43:23 +0000: ChangeTracker: CBLWebSocketChangeTracker[0x1703c3570 ***]: GET //...:4984/***/_changes?feed=websocket \{
        Authorization = "Basic xxx";
        Host = "...:4984";
        "User-Agent" = "CouchbaseLite/1.3 (iOS 1.4.0/05977be0b6d778c256f3ba45548b43f08da2ee75)";\
    \}
    
    2017-05-17 16:43:23 +0000: ChangeTracker: CBLWebSocketChangeTracker[0x1703c3570 ***]: Started... <http://...:4984/***/_changes?feed=websocket>\
    
    2017-05-17 16:43:27 +0000: ChangeTracker: CBLWebSocketChangeTracker[0x1703c3570 ***]: WebSocket opened\

You’ll notice, it seems to succeed on the second attempt (after 4 sec), and it pulls documents after that. However, Later, as the program ran, we encountered the pull simply stopping, and these are the last logs we see from the puller:

2017-05-17 20:09:19 +0000: Sync: CBLRestPuller[http://...:4984/***] saved remote checkpoint '11258268::11274617' (_rev=0-159)\
\
2017-05-17 20:09:19 +0000: Sync: CBLRestPuller[http://...:4984/***] checkpointing sequence=11258268::11274707\
\
2017-05-17 20:09:19 +0000: Sync: <CBLRemoteSession: 0x17088efb0>: PUT /***/_local/xxxxxxxxxxx\
\
2017-05-17 20:09:19 +0000: Sync: CBLRestPuller[http://...:4984/***] saved remote checkpoint '11258268::11274707' (_rev=0-160)\
\
2017-05-17 20:09:19 +0000: Sync: CBLRestPuller[http://...:4984/***] Progress: set active = 0\
\
2017-05-17 20:09:19 +0000: Sync: CBLRestPuller[http://...:4984/***]: postProgressChanged (18/18, active=0 (batch=0, net=0), lastSeq=11258268::11274707, online=1, error=(null))\
\
2017-05-17 20:09:19 +0000: Sync: CBLRestPuller[http://...:4984/***] STOPPED\
\
2017-05-17 20:09:19 +0000: : Replication: CBLRestPuller[http://...:4984/***] took 231.540 sec; error=(null)\

Also, is it expected behavior for the CBLRestPuller to output “…took 231.540 sec; error=(null)” after the CBLRestPuller has already stopped?

Thanks


#15

Also something to mention,

The device (Device A) is currently in this state (push replication works correctly, but I don’t see any pulling logs)
For example I can create a document on Device A, and see it replicate to the other devices. However, if I create a document on Device B, Device A will never pull it. but every other device pulls that document correctly.

Thanks


#16

The “…took 231.540 sec” means that the replicator has been run for 231.540 seconds from start until stop.

Regarding the fact that the replicator is stopped, can you track from that point up in the log to see if you see any errors OR something like “STOPPING…” which means that the replicator was stopped from calling replicator.stop() method by the application code. From the log that I have, I have seen the following.

Sync: CBLRestPuller[http://xxxxx/bbbbbb2] STOPPING...
2017-05-17 02:00:39 +0000Sync: CBLRestPuller[http://xxxxx/bbbbbb2] STOPPING...
ChangeTracker: CBLWebSocketChangeTracker[0x1741c5be0 bbbbbb2]: stop
2017-05-17 02:00:39 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1741c5be0 bbbbbb2]: stop
ChangeTracker: CBLWebSocketChangeTracker[0x1741c5be0 bbbbbb2]: stop
2017-05-17 02:00:39 +0000ChangeTracker: CBLWebSocketChangeTracker[0x1741c5be0 bbbbbb2]: stop
Sync: CBLRestPuller[http://xxxxx/bbbbbb2] Progress: set active = 0
2017-05-17 02:00:39 +0000Sync: CBLRestPuller[http://xxxxx/bbbbbb2] Progress: set active = 0
Sync: CBLRestPuller[http://xxxxx/bbbbbb2]: postProgressChanged (0/0, active=0 (batch=0, net=0), lastSeq=11116025::11136612, online=1, error=(null))
2017-05-17 02:00:39 +0000Sync: CBLRestPuller[http://xxxxx/bbbbbb2]: postProgressChanged (0/0, active=0 (batch=0, net=0), lastSeq=11116025::11136612, online=1, error=(null))
Sync: CBLRestPuller[http://xxxxx/bbbbbb2] STOPPED
2017-05-17 02:00:39 +0000Sync: CBLRestPuller[http://xxxxx/bbbbbb2] STOPPED
: Replication: CBLRestPuller[http://xxxxx/bbbbbb2] took 834.072 sec; error=(null)