How to address the issue caused by initial One-Shot Pull skipping Iteration

cheng.ustc · March 17, 2017, 5:09pm

Hi there,

I am having an issue caused by the one-shot pull skipping an iteration. Below is the Sync Gateway log:

2017-03-15T13:57:06.235-06:00 HTTP: #062: POST /data/_changes?feed=normal&heartbeat=30000&style=all_docs&active_only=true&filter=sync_gateway%2Fbychannel (as userbPQNF4Ryy3_7Wbv13YXH7Agu)
2017-03-15T13:57:06.235-06:00 Changes+: Int sequence multi changes feed…
2017-03-15T13:57:06.235-06:00 Changes: MultiChangesFeed({rwhRT1xyBFrfbzmwol5jzuLbt-}, {Since:0 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator:0xc4200cf380 HeartbeatMs:30000 TimeoutMs:300000 ActiveOnly:true}) … (to userbPQNF4Ryy3_7Wbv13YXH7Agu)
2017-03-15T13:57:06.235-06:00 Changes+: MultiChangesFeed: channels expand to “rwhRT1xyBFrfbzmwol5jzuLbt-:2471” … (to userbPQNF4Ryy3_7Wbv13YXH7Agu)
2017-03-15T13:57:06.235-06:00 Changes+: Grant for channel [rwhRT1xyBFrfbzmwol5jzuLbt-] is after the current sequence - skipped for this iteration. Grant:[2471] Current:[2470] (to userbPQNF4Ryy3_7Wbv13YXH7Agu)
2017-03-15T13:57:06.235-06:00 Changes+: Found sequence later than stable sequence: stable:[2470] entry:[2471] (_user/userbPQNF4Ryy3_7Wbv13YXH7Agu)
…
2017-03-15T13:57:06.756-06:00 HTTP: #064: POST /data/_changes?feed=normal&heartbeat=30000&style=all_docs&active_only=true&filter=sync_gateway%2Fbychannel (as userbPQNF4Ryy3_7Wbv13YXH7Agu)
2017-03-15T13:57:06.756-06:00 Changes+: Int sequence multi changes feed…

The skipped iteration stops the replicator rather than raising an error. As a result, the client considers the initial replication done even though it isn’t.

This doesn’t happen in SG 1.3.1 or SG 1.4-98, only in SG 1.4-2.

Please help, thanks!

adamf · March 17, 2017, 10:32pm

The logging message mentioned - “Grant for channel … is after the current sequence” means that the document doing the channel grant hasn’t yet been buffered and added to the channel cache. The channel backfill associated with that new channel grant won’t be processed until that happens.

A one-shot replication is just a single iteration, so what this message means in this scenario is that you’ve issued the one-shot replication before Sync Gateway has received the document triggering the channel grant from Couchbase Server’s mutation feed. This is the expected behaviour - it’s required to ensure that the new channel backfill doesn’t skip the user’s last_seq value ahead and potentially miss data.
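The skip can be pictured with a toy model. This is only a sketch, not Sync Gateway's actual code: the function `one_shot_changes`, its parameters and the channel name `rw-channel` are hypothetical, and the document sequences are made up. Only the grant sequence (2471) and stable sequence (2470) come from the log in the first post.

```python
def one_shot_changes(channel_docs, grants, stable_seq):
    """Toy model of a single changes iteration for one user.

    channel_docs: {channel: [sequence, ...]} documents already in each channel
    grants:       {channel: sequence at which the user was granted access}
    stable_seq:   highest sequence the gateway has buffered so far
    """
    results = []
    for channel, grant_seq in grants.items():
        if grant_seq > stable_seq:
            # Grant not buffered yet: skip this channel for this iteration.
            continue
        for seq in channel_docs.get(channel, []):
            if seq <= stable_seq:
                results.append((channel, seq))
    return sorted(results, key=lambda r: r[1])


docs = {"rw-channel": [2101, 2250, 2400]}   # made-up document sequences
grants = {"rw-channel": 2471}               # grant sequence from the log

# Request arrives while the gateway has only buffered up to 2470.
early = one_shot_changes(docs, grants, stable_seq=2470)
# The same request after sequence 2471 has been buffered.
later = one_shot_changes(docs, grants, stable_seq=2471)
print(early)  # []
print(later)  # [('rw-channel', 2101), ('rw-channel', 2250), ('rw-channel', 2400)]
```

In this model the early request is indistinguishable from a request against an empty channel, which is the ambiguity discussed in the rest of the thread.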

cheng.ustc · March 17, 2017, 10:49pm

Ok, if this is expected behavior, how do we code around it? The question comes down to: how do we differentiate between an empty channel and an unready channel?

Right now, in either case, the replication ends with changesCount = 0, completedChangesCount = 0 and Trigger=STOP_IMMEDIATE. I have no way to tell that it was caused by an unready channel, in which case I should wait for a certain amount of time and kick off the one-shot replicator again.

adamf · March 17, 2017, 11:18pm

How are you determining whether to kick off the one-shot replicator in general?

cheng.ustc · March 20, 2017, 3:58pm

I don’t think we do; we kick it off after setting up the replicators. Are you suggesting there is a way to check whether the channel is ready for one-shot replication?

UPDATE: our use case is that whenever a user opens a document from the client, we need to pull down all the data of that document via the one-shot replicator. Once the initial sync is over, we switch the replicator to continuous.

adamf · March 20, 2017, 4:27pm

This is a case of a race between the channel grant (the update to the user doc), and the one-shot replication. If the one-shot replication happens before notification of channel grant reaches Sync Gateway, it’s expected that the one-shot replication won’t return any data. This would always be the case for one-shot replications - you’re just getting a point in time.

I can’t tell what’s triggering your channel grant in this case. It looks like an admin REST API-based grant, so I assume it’s not coming from a push replication on the client. I’m assuming the client knows something about this grant (which is why you’re expecting results) - this sounds like a matter of modifying the handling to account for potential latency in the grant being processed.

Using longpoll (continuous) replication would probably be the easiest way to handle this - Sync Gateway will go into wait mode, and return the results whenever the channel grant arrives.
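For reference, a longpoll request differs from the one-shot request in the log mainly in its `feed` and `since` parameters. The sketch below only builds the URL. The host and port (`http://localhost:4984`), the channel name and the helper `changes_url` are assumptions for illustration; the parameter names mirror the requests shown in the logs in this thread, plus the `channels` parameter used with the `sync_gateway/bychannel` filter. Couchbase Lite normally issues this request for you.

```python
from urllib.parse import urlencode


def changes_url(base_url, db, channels, since=0, feed="longpoll"):
    """Build a _changes URL filtered by channel (hypothetical helper)."""
    query = urlencode({
        "feed": feed,                      # "normal" is the one-shot form
        "heartbeat": 30000,
        "style": "all_docs",
        "since": since,
        "filter": "sync_gateway/bychannel",
        "channels": ",".join(channels),
    })
    return f"{base_url}/{db}/_changes?{query}"


url = changes_url("http://localhost:4984", "data", ["rw-channel"], since=2470)
print(url)
```

With `feed=longpoll`, a request that finds no changes waits instead of returning immediately, which is the "wait mode" described above.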

justincheema · March 20, 2017, 4:27pm

Hi Adam,

Thanks for your help with this! The issue here is a duplicate of this so you can close out that issue and we’ll just all look at this one.

I’m going to get you the request/response for receiving the null value for the last sequence value, but in the meantime I wanted to keep the conversation going by responding here.

I’ll try to give you some background about how we kick off our one-shot replicator. We store all of the information our application requires in a data bucket in Couchbase. Before the user can look at the content they wish to view, they have to be given access. You can see this in the AUTH log generated (in the other thread). Immediately after this finishes, we perform a one-shot pull of the data. As noted, our initial replicator gets shut down because of the GRANT process. The issue is that we don’t know how to proceed at this point. We do not want to keep attempting one-shot pulls from the Sync Gateway hoping that the channel cache has been updated. What is a reliable way to perform the one-shot pull?

If the supported way is to use the last sequence number, then let’s track down why the null value is being passed back for it. As mentioned I’ll get you that request/response as soon as I can.

adamf · March 20, 2017, 4:33pm

@justincheema In your scenario, is there a reason you can’t do a continuous replication, instead of doing a one-shot replication? That’s the usual approach for this scenario (listening for changes on the Sync Gateway).

If it’s the case that the contents of the channel are static, so you really only want to do a one-shot replication once - a single longpoll replication would be ideal, but I don’t know if that’s supported by Couchbase Lite today. Your choices might be limited to retrying the one-shot replication, or starting a continuous replication, and stopping it once you’ve received the expected results.
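The second option (start continuous, stop once the expected results have arrived) could look roughly like the sketch below. It does not use any Couchbase Lite API: `pull_until`, the simulated feed and the document IDs are all hypothetical, and a real continuous feed would not end on its own, so a real version would also need a timeout.

```python
def pull_until(change_feed, is_done):
    """Read doc IDs from a (simulated) continuous feed until is_done says stop.

    change_feed: any iterable of doc IDs, standing in for a continuous pull
    is_done:     predicate over the list of doc IDs received so far
    """
    received = []
    for doc_id in change_feed:
        received.append(doc_id)
        if is_done(received):
            return received, True    # the caller would stop the replicator here
    return received, False           # feed ended before the expected results arrived


expected = {"page::1", "page::2"}    # hypothetical IDs the client knows to wait for
feed = iter(["page::1", "other::9", "page::2", "page::3"])
received, done = pull_until(feed, lambda ids: expected.issubset(ids))
print(received, done)  # ['page::1', 'other::9', 'page::2'] True
```

The approach only works if the client can state what "expected results" means, which is the open question for an empty channel.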

cheng.ustc · March 20, 2017, 6:01pm

@adamf We have tried using continuous replication only, but we still need to address the same issue: how do we differentiate the unready channel from the empty channel, or in your words, what are the expected results?

In our use case, we need to let the user know at some point that the document is ready to go. But we have no way to say that a document is ready even though it is empty. To help you understand, let me give you an example similar to our model:

Consider the parent document to be a File that corresponds to a read-only channel and a rw channel.
A File will have Pages (also documents) that will be pulled down by the user whenever they open the File.

Here is the problem: when the user opens the File, they face one of two situations:

The File is simply empty (no Pages at all), so the user will create a blank page to start with
The File has Pages, but they have not been pulled down yet

However, we have no way to tell which situation we are dealing with, so we end up with one of two unexpected behaviors:

User creates unnecessary blank pages even though the File isn’t empty, or
the pull replicator keeps running (displaying the loading screen) waiting for the empty File to pull down some documents which will never happen

justincheema · March 21, 2017, 4:05pm

Some additional information we have found. The Sync Gateway is not returning a null for the initial one-shot pull as you predicted. The log that was being generated was for the pull replicator. So now the current flow is this:

We authorize a user on a channel that will be used by our application
Start the one-shot pull replicator for the channel above
The backfill process correctly blocks until the cache is updated
The one-shot pull replicator returns and stops the replicator

Not sure where to go from here. We don’t want to re-start the replicator as we will never know if the channel is actually empty or not. i.e. How long do we keep re-starting the replicator until we give up and declare that the channel has no data?

Once the backfill process is unblocked, do we get notified somehow? If we can plug into that system then that would solve our issue.

justincheema · March 21, 2017, 5:03pm

One other thing I’d like to ask about is how the Sync Gateway caches replicator API calls. Here’s a snippet of a log from my Sync Gateway:

2017-03-21T10:13:47.496-06:00 Cache: Initializing changes cache with options {ChannelCacheOptions:{ChannelCacheMinLength:0 ChannelCacheMaxLength:0 ChannelCacheAge:0s} CachePendingSeqMaxWait:5s CachePendingSeqMaxNum:10000 CacheSkippedSeqMaxWait:1h0m0s}
2017-03-21T10:14:32.108-06:00 HTTP:  #001: POST /data/_changes?feed=longpoll&heartbeat=30000&style=all_docs&since=11017&filter=sync_gateway%2Fbychannel  (as firstfirstfirst.WN-Aqf7bXT8n2XYWM8QqPqpo)
2017-03-21T10:14:32.108-06:00 Changes+: Int sequence multi changes feed...
2017-03-21T10:14:32.108-06:00 Changes: MultiChangesFeed({rw7yvTxmlMlar9MjJ070HfNdgm}, {Since:11017 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc42044e600 HeartbeatMs:30000 TimeoutMs:300000 ActiveOnly:false}) ...   (to firstfirstfirst.WN-Aqf7bXT8n2XYWM8QqPqpo)
2017-03-21T10:14:32.108-06:00 Changes+: MultiChangesFeed: channels expand to "rw7yvTxmlMlar9MjJ070HfNdgm:11016" ...   (to firstfirstfirst.WN-Aqf7bXT8n2XYWM8QqPqpo)
2017-03-21T10:14:32.108-06:00 Cache: Initialized cache for channel "rw7yvTxmlMlar9MjJ070HfNdgm" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
2017-03-21T10:14:32.108-06:00 Cache: getCachedChanges("rw7yvTxmlMlar9MjJ070HfNdgm", 11017) --> 0 changes valid from #11021
2017-03-21T10:14:32.108-06:00 Cache:   Querying 'channels' view for "rw7yvTxmlMlar9MjJ070HfNdgm" (start=#11018, end=#11021, limit=0)
2017-03-21T10:14:32.113-06:00 Cache:     Got no rows from view for "rw7yvTxmlMlar9MjJ070HfNdgm"
2017-03-21T10:14:32.113-06:00 Cache: GetChangesInChannel("rw7yvTxmlMlar9MjJ070HfNdgm") --> 0 rows
2017-03-21T10:14:32.113-06:00 Changes+: [changesFeed] Found 0 changes for channel rw7yvTxmlMlar9MjJ070HfNdgm
2017-03-21T10:14:32.113-06:00 Changes+: MultiChangesFeed waiting...   (to firstfirstfirst.WN-Aqf7bXT8n2XYWM8QqPqpo)
2017-03-21T10:14:32.113-06:00 Changes+: Waiting for "Data"'s count to pass 0

This call wasn’t made from our application at this time. It was made about an hour before seeing this log. That tells me that the call is cached somehow and is trying to be processed now. What I’m wondering is where this cache lives. I restarted all parts of the system involved here and I still see this in the logs. Does this mean that the request is cached in Couchbase Server?

adamf · March 21, 2017, 5:15pm

@justincheema There’s no caching of API calls - they are processed immediately. Possibly there’s a clock difference between the client and SG that accounts for the difference you’re seeing?

adamf · March 21, 2017, 5:35pm

On the channel replication question - there’s no concept of notification that a channel is “ready”. A one-shot replication is always just going to reflect the stable set of changes that are ready for replication, which will have some latency from the time those changes are written. In a multi-author environment, there’s always the possibility of concurrent updates, and typically applications need to be written to account for that.

In the scenarios described here, though, the core issue is related to channel grants, and backfill of documents already present in the channel once the user has been granted access. To differentiate between the ‘channel doesn’t exist’ and the ‘channel exists but hasn’t been replicated yet’ scenarios, the user being issued the channel grant could make an update to a document that resides in the channel (potentially some sort of subscription document). Once they receive that document, the user will know they are up to date in the channel to that point, and proceed accordingly.
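On the client side, the marker-document check might look something like this. It is a sketch under assumptions: the marker ID, the `page::` prefix and the `classify` helper are hypothetical, and the set of local document IDs would come from whatever query the application already runs after the pull stops.

```python
from enum import Enum


class ChannelState(Enum):
    NOT_READY = "grant not replicated yet"
    EMPTY = "channel replicated, no pages"
    HAS_PAGES = "channel replicated, pages present"


def classify(local_doc_ids, marker_id, page_prefix="page::"):
    """Decide what a finished pull means, based on a marker document."""
    if marker_id not in local_doc_ids:
        # The marker lives in the channel, so its absence means the
        # channel grant has not been processed yet.
        return ChannelState.NOT_READY
    has_pages = any(d.startswith(page_prefix) for d in local_doc_ids)
    return ChannelState.HAS_PAGES if has_pages else ChannelState.EMPTY


marker = "marker::file42"   # hypothetical marker document ID
print(classify(set(), marker))                   # ChannelState.NOT_READY
print(classify({marker}, marker))                # ChannelState.EMPTY
print(classify({marker, "page::1"}, marker))     # ChannelState.HAS_PAGES
```

The marker only says the client is up to date in the channel up to the marker's own sequence; pages written after it still arrive through later replications.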

justincheema · March 21, 2017, 6:57pm

Hi Adam,

Thank you so much for your continued support on this issue.

When starting the Sync Gateway, I see the following log:

2017-03-21T11:20:55.007-06:00 Cache: Initializing changes cache with options {ChannelCacheOptions:{ChannelCacheMinLength:0 ChannelCacheMaxLength:0 ChannelCacheAge:0s} CachePendingSeqMaxWait:5s CachePendingSeqMaxNum:10000 CacheSkippedSeqMaxWait:1h0m0s}

Is this the cache that the backfill pulls from? Is there a way to configure it so that the AUTH action we perform gets updated in the cache more quickly?

cheng.ustc · March 21, 2017, 7:05pm

@adamf, yes, you are absolutely right, a one-shot replication should reflect the stable set of changes that are ready for replication, but in fact SG is simply lying to the one-shot replication (when it skips the iteration) by saying changesCount = 0 rather than returning the stable set of changes, isn’t it?

If you look at #062 and #064 in the SG log, SG internally retried the long pull once the cached sequence was up to date; however, the retry will not notify the one-shot replicator because the replicator has already stopped. I do believe this is a bug in CBL or SG in handling this edge case for one-shot replication. Can SG do one of the following:

1. for one-shot replication only, don’t skip the iteration, just return the stable set of changes
2. when skipping an iteration, return changesCount = -1 rather than changesCount = 0, because 0 is a legitimate value that conflates the two cases: unready channel vs. empty channel
3. rather than using the STOP_IMMEDIATE trigger to stop the one-shot replicator once the skipped iteration happens, use IDLE, so that a subsequent retry of the long pull will notify the one-shot replicator.

The solution you suggested will work, but it is more of a hack at the application layer to maintain the state of the channel, which should be the responsibility of SG itself.

adamf · March 21, 2017, 7:37pm

In this scenario, the channel grant is not yet included in the stable set of changes. Sync Gateway hasn’t received and buffered the sequence that triggers the channel grant, so it’s correct that Sync Gateway shouldn’t return any changes to the one-shot replication request. Sync Gateway is skipping the channel in this changes iteration because it’s not yet valid to send any changes for that channel (because the channel grant occurs later than the set of sequences Sync Gateway has buffered).

In no sense is Sync Gateway reporting something other than the stable set of changes known to Sync Gateway at the point in time the request is made.

cheng.ustc · March 22, 2017, 12:14am

What changed in 1.4, compared to 1.3 or 1.4.dp, that causes the grant to happen after the SG buffered sequences? Prior to 1.4, this had been working for us.

Even if it is the case that SG can’t validly return any result, changesCount=0, which is possibly a valid result, shouldn’t be returned either; it is apparently confusing the one-shot replicator.

How about my suggested approaches 2 and 3? Wouldn’t they be better than simply returning changesCount=0?

adamf · March 22, 2017, 12:38am

I agree that prior to 1.4 the race issue was allowing you to retrieve documents from the channel immediately, even if Sync Gateway hadn’t actually buffered the channel grant. Although that made things work for you in this scenario, it had the more serious implication that clients could permanently miss data (in particular, non-processed changes with sequences earlier than the channel grant).
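The "permanently miss data" case can be illustrated with a toy model. This is not Sync Gateway's implementation: the `changes` function, the channel names, the sequence numbers and the `honor_unbuffered_grants` flag are all made up to show the idea that moving last_seq to an unbuffered grant can skip an earlier, not-yet-buffered change.

```python
def changes(buffered, grants, since, stable_seq, honor_unbuffered_grants):
    """Return (results, last_seq) for one toy changes iteration.

    buffered: {sequence: channel} for documents the gateway has buffered
    grants:   {channel: sequence at which the user was granted access}
    """
    results, last_seq = [], since
    for channel, grant_seq in grants.items():
        if grant_seq > stable_seq and not honor_unbuffered_grants:
            continue  # wait until the grant itself has been buffered
        in_channel = [s for s, c in buffered.items() if c == channel]
        if grant_seq > since:
            # New grant: backfill the whole channel, move last_seq to the grant.
            results += in_channel
            last_seq = max(last_seq, grant_seq)
        else:
            new = [s for s in in_channel if s > since]
            results += new
            last_seq = max([last_seq] + new)
    return sorted(results), last_seq


def run(honor_unbuffered_grants):
    buffered = {2101: "B", 2250: "B", 2400: "A"}
    grants = {"A": 1, "B": 2471}
    # First pull: client is at 2468; 2469 (channel A) and the grant (2471)
    # have been written but are not buffered yet.
    first, last_seq = changes(buffered, grants, 2468, 2468, honor_unbuffered_grants)
    buffered[2469] = "A"  # the gateway catches up
    second, _ = changes(buffered, grants, last_seq, 2471, honor_unbuffered_grants)
    return first + second


print(run(True))   # [2101, 2250]        sequence 2469 is never delivered
print(run(False))  # [2101, 2250, 2469]  delivered once the grant is buffered
```

In the first run the client gets the channel contents immediately but its last_seq jumps past 2469, so that change is never requested again.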

changesCount isn’t something returned by Sync Gateway - I expect it’s just Couchbase Lite returning a count of the number of results in the changes response.

Currently there’s no mechanism in the Sync Gateway _changes response to indicate that there are changes that aren’t being sent because they are later than the stable sequence. I see the value in this particular scenario, but I’m not convinced it’s a widely applicable scenario. In general client applications shouldn’t be making any assumptions about whether or not they will see a given (recent) change in a one-shot replication. In this case, prior to 1.4, the application was taking advantage of a zero latency bug (recent updates to the user doc were being applied immediately, instead of in the correct sequence order).

cheng.ustc · March 22, 2017, 2:03am

@adamf, first of all, thanks for your valuable input, it really helps us understand the issue. I think you are right in the sense that the application was relying on the existing bug (the zero latency bug) to work properly, and I think the fix in 1.4 solved a critical issue.

On the other hand, I am not sold on the idea that the client shouldn’t make any assumption about the one-shot replication, because according to the documentation of one-shot replication, “by default a replication runs long enough to transfer all the changes from the source to the target database, then quits.” Now, if we can’t make any assumption based on the quitting of one-shot replication, how can we use one-shot replication reliably? Does that mean we need to start the one-shot replication over and over again expecting all the documents to get pulled? If so, how is that different from continuous replication?

Can you confirm whether Couchbase (likely in 2.0) is heading in the direction of getting rid of one-shot replication? If so, we will make sure we get on board early and use only continuous replication. If not, would you mind providing some suggestions on a reliable way of using one-shot replication down the road?

Again, we really appreciate your help with the issue we encountered.

adamf · March 22, 2017, 5:36am

One point that may be worth noting is that the change in 1.4 was only to fix the handling of a user channel grant. It’s always been the case that a one-shot replication issued immediately after a document write wasn’t guaranteed to see that write. e.g. if user A wrote document foo to a channel, and then user B immediately issued a one-shot replication against that channel, they wouldn’t be guaranteed to see that change (going back to 1.0). The change in 1.4 was just about preventing a potential data loss scenario, and not making a change in philosophy about one-shot vs. continuous.

I think there are a lot of valid use cases for one-shot replication - basically anywhere where immediate read-your-own-write type responsiveness isn’t required, such as users coming online and pushing/pulling any changes made while they were offline.

In your type of application design, where (if I’m understanding it correctly), users request access to a channel and then immediately want to replicate the contents of that channel - I think you’re stuck with some extra work to ensure you’re waiting long enough for the access grant to be buffered by Sync Gateway. The suggestion I provided earlier would be one approach to handle this without using continuous replication - using some sort of marker document in this case to notify the client that the channel grant has been processed, regardless of other channel contents.
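Waiting "long enough" with one-shot pulls could be bounded by retrying until the marker document shows up. The sketch below is an assumption-laden illustration: `run_one_shot_pull` is a hypothetical callable standing in for starting a one-shot pull and returning the local document IDs once it stops, and the attempt count and delays are arbitrary.

```python
import time


def pull_until_marker(run_one_shot_pull, marker_id, attempts=5, delay_s=0.5,
                      sleep=time.sleep):
    """Re-run a one-shot pull until marker_id arrives or attempts run out.

    Returns (marker_seen, number_of_pulls_made).
    """
    for attempt in range(attempts):
        local_ids = run_one_shot_pull()
        if marker_id in local_ids:
            return True, attempt + 1
        if attempt < attempts - 1:
            sleep(delay_s * (2 ** attempt))   # simple exponential backoff
    return False, attempts


# Simulated pulls: the grant is only processed before the third attempt.
responses = iter([set(), set(), {"marker::file42"}])
ok, tries = pull_until_marker(lambda: next(responses), "marker::file42",
                              sleep=lambda s: None)
print(ok, tries)  # True 3
```

Because the marker is expected to arrive even when the channel has no other content, running out of attempts points to a real problem rather than to an empty channel.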

I think it’s a valid request that Sync Gateway and Couchbase Lite try to find additional ways to communicate information about replication status. It’s difficult for Sync Gateway to notify a client that it’s ‘caught up’, though - in a system under load, Sync Gateway is always going to be in ‘catching up’ mode, and a single Sync Gateway node doesn’t have any visibility into the channel assignment of the documents it hasn’t seen yet.