In rare cases documents with recycled ids are not pushed up

benjamin_glatzeder · June 12, 2020, 10:56am

CBL 2.7.1 on Android
SG 2.7.3

This forum post from 3 years ago is similar to my situation. In my case the document ID is derived from user-entered content. If the contents are known, this dramatically speeds up getting a document by its ID compared to running a N1QL query.
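To illustrate what a content-derived ID can look like, here is a minimal sketch. The post does not say how the ID is actually built, so the `ownerId::type::hash` layout, the SHA-256 hash, and the class and method names are all assumptions made for this example. It shows why such IDs get recycled: deleting a document and re-entering the same content yields the same ID again.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

public class DerivedDocId {
    /** Builds "<ownerId>::<type>::<hash>" (hypothetical layout); same inputs give the same ID. */
    static String derive(String ownerId, String type, String userContent) {
        // Normalize so that " Milk " and "milk" map to the same document.
        String normalized = userContent.trim().toLowerCase(Locale.ROOT);
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) { // first 8 bytes keep the ID short
                hex.append(String.format("%02x", digest[i]));
            }
            return ownerId + "::" + type + "::" + hex;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        String first = derive("user1", "item", "  Milk ");
        String recreated = derive("user1", "item", "milk");
        System.out.println(first.equals(recreated)); // true: the ID is recycled
    }
}
```

With IDs like this, a lookup by ID replaces a query, at the cost that a re-created document shares its ID with the earlier, deleted one.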

For a long time now I have been receiving user reports that documents with recycled IDs stop syncing. This seems to happen randomly. I was never able to reproduce the issue myself, either by doing it manually over and over or by running automated UI tests. I was also never able to catch an exception for this specific issue in the Android app, neither for a user who reported it nor for anyone else:

for (ReplicatedDocument document : replication.getDocuments()) {
    CouchbaseLiteException err = document.getError();
    if (err != null) {
        // log exception
    }
}

And I could never find any log entry about a document being rejected by the SG sync function, even though I knew from user reports which document didn’t sync. I ran grep over the SG logs, similar to this:

grep -r "<user entered content>" *
grep -r "<document ID>" *

Recently I changed the architecture and document IDs are now created randomly. Before I go live with the update I’m happy to try another approach to somehow log the issue and to find the root cause.

borrrden · June 12, 2020, 11:29am

Looking at the logs would be more interesting in that case, both on the Lite side and the Sync Gateway side. You might think they are not getting pushed up, but there are other factors involved too (e.g. is a delete winning a conflict for some reason, is your sync function not granting access in some cases, etc.). But the issue as you describe it now is a needle-in-a-haystack situation. There are too many moving parts to try to form a sane theory.

benjamin_glatzeder · June 24, 2020, 11:50am

I started to log these exceptions in an analytics service and found the following:

ReplicatedDocument{PcF0tJwfNIU3g4F6ye5gTbGatOw1::shoppingList::0tvgi2f9w31w,
err=CouchbaseLiteException{CouchbaseLite,10409,'Document revision conflict'}} -- 
Document:{PcF0tJwfNIU3g4F6ye5gTbGatOw1::shoppingList::0tvgi2f9w31w@1461-433428666fb5d35d767f0889d192bd8e37b9b1ce.channels=>PcF0tJwfNIU3g4F6ye5gTbGatOw1,name=> (... document contents  ...)}

This document can be found in the Couchbase Server console. Documents of this type always get a random ID when created. The same ID is reused only if the user deletes the document and immediately undeletes it. So the architecture change I mentioned above won’t solve this issue. The default CBL conflict resolver is used.

Since the exception in the Android app shows the document ID I grepped the sg_warn and sg_info logs of that day.

grep -r  PcF0tJwfNIU3g4F6ye5gTbGatOw1::shoppingList::0tvgi2f9w31w
-> no logs
grep -r  PcF0tJwfNIU3g4F6ye5gTbGatOw1
-> many info logs. No issue found

It looks like I have only the second half of that day stored in the logs. So I might have missed it. Soon I’ll release this update to a wider user base.

What would be a good way to diagnose the issue? Is grep -r <doc-ID> a good start?

hyling · June 25, 2020, 11:37am

I think we’re running into the same issue: “CBL 2.x auto conflict resolution results in unresolved conflicts”.

benjamin_glatzeder · June 25, 2020, 2:47pm

True, it’s a 10409 error on the client. I’m currently rolling out an app update which logs the errors. Later I’ll release an update which uses random IDs for a specific document type. Hopefully this will bring the sync error rate down by a lot. As @borrrden mentioned, there most likely won’t be a bug fix unless we, the developer community, create a reproducible sync bug.
I only code for Android. Is your stack iOS-only, @hyling? In your linked thread I came across a comment from snej. There he describes rapidly updating a document on both sides and running into issues. I think with such a program one would run into many 10409 errors, too. But proving that it stops updating and never resolves the conflict, even when the server has a new document revision, won’t be easy. Would it be easier to do so manually? I see that you were able to reproduce it, but only every 200th time. Since it is likely a threading issue, is it somehow possible to store all the lower level logs until one runs into the issue? Being able to reproduce it every time sounds unlikely.

hyling · June 25, 2020, 9:39pm

Yes, my stack is iOS only.

I’m not sure what you meant by “manually”. If you meant forcing a revision conflict between the client and server, I’ve tried that with one-shot replicators and those do resolve the conflict correctly. In my experience it only occurs with continuous replicators, which makes it hard to manually reproduce a conflict.

I think you’re right, it’s probably a threading issue. I agree that coming up with a test case that reproduces this every time is unlikely. What lower level logs are you referring to?

benjamin_glatzeder · June 26, 2020, 7:16am

About lower level logs: I have this custom logging class in Android for CBL:

    // ...
    Database.log.setCustom(new LogTestLogger(LogLevel.VERBOSE));
    // ...
    private static class LogTestLogger implements Logger {

        @NonNull
        private final LogLevel level;

        LogTestLogger(@NonNull LogLevel level) {
            this.level = level;
        }

        @NonNull
        @Override
        public LogLevel getLevel() {
            return level;
        }

        @Override
        public void log(@NonNull LogLevel level, @NonNull LogDomain domain, @NonNull String message) {
            // Map each CBL log level to the matching android.util.Log call.
            switch (level) {
                case DEBUG:
                    Log.d(domain.name(), message);
                    break;
                case VERBOSE:
                    Log.v(domain.name(), message);
                    break;
                case INFO:
                    Log.i(domain.name(), message);
                    break;
                case WARNING:
                    Log.w(domain.name(), message);
                    break;
                case ERROR:
                    Log.e(domain.name(), message);
                    break;
                case NONE:
                    Log.wtf(domain.name(), message);
                    break;
            }
            // this method will never be called if param level < this.level
            // handle the message, for example piping it to a third party framework
        }
    }

But I don’t know if all logs appear which are necessary to find the root cause. By all logs I mean the logs from the core library.
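One way to keep the lower level logs around until the issue shows up is a bounded in-memory buffer that always holds the most recent lines. The sketch below is plain Java with no Couchbase Lite dependency; `RecentLogBuffer` and its methods are hypothetical names. The assumption is that a custom logger's `log` method would call `add`, and that `snapshot` would be called and its result sent to an analytics service once a document error such as 10409 is reported.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RecentLogBuffer {
    private final int capacity;
    private final Deque<String> lines = new ArrayDeque<>();

    RecentLogBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Appends a line, evicting the oldest one once the buffer is full. */
    synchronized void add(String domain, String message) {
        if (lines.size() == capacity) {
            lines.removeFirst();
        }
        lines.addLast(domain + ": " + message);
    }

    /** Returns a copy of the buffered lines, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(lines);
    }

    public static void main(String[] args) {
        RecentLogBuffer buffer = new RecentLogBuffer(3);
        for (int i = 1; i <= 5; i++) {
            buffer.add("REPLICATOR", "message " + i);
        }
        // Only the three newest lines survive: messages 3, 4 and 5.
        System.out.println(buffer.snapshot());
    }
}
```

Both methods are synchronized because log callbacks may arrive from several threads. The capacity trades memory for how far back the captured history reaches.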

Would you say that the following scenario will cause conflicts and eventually run into a 10409 error:

- There is a document with a single field.
- Two clients update this document over and over using continuous replication.
- One device loses its internet connection every so often for an unspecified duration.

If you think so, I’m happy to build this app and share my findings!
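The scenario above can be sketched as a toy model of why a client that was offline gets a conflict on its next push. This is not Sync Gateway's actual revision-tree logic. `ToyServer`, its integer revision numbers, and the `-1` conflict signal are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyConflictDemo {
    /** Toy server: accepts a push only if it builds on the current revision. */
    static class ToyServer {
        private final Map<String, Integer> currentRev = new HashMap<>();

        /** Returns the new revision number, or -1 to signal a conflict. */
        int push(String docId, int parentRev) {
            int current = currentRev.getOrDefault(docId, 0);
            if (parentRev != current) {
                return -1; // stale parent: the pusher must pull and resolve first
            }
            currentRev.put(docId, current + 1);
            return current + 1;
        }

        /** Returns the current revision number, or 0 if the document is unknown. */
        int pull(String docId) {
            return currentRev.getOrDefault(docId, 0);
        }
    }

    public static void main(String[] args) {
        ToyServer server = new ToyServer();
        int revA = server.push("doc", 0);          // client A creates the document: rev 1
        int revB = server.pull("doc");             // client B pulls rev 1, then goes offline
        revA = server.push("doc", revA);           // A keeps updating: rev 2
        int rejected = server.push("doc", revB);   // B reconnects, pushes on stale rev 1
        System.out.println(rejected);              // -1, a conflict
        revB = server.pull("doc");                 // B pulls rev 2 and resolves locally
        System.out.println(server.push("doc", revB)); // 3, the push now succeeds
    }
}
```

In this model the conflict clears as soon as the offline client pulls. The behaviour reported in this thread would correspond to that pull-and-resolve step never happening.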

borrrden · June 26, 2020, 8:09am

Custom logging implementations will receive logs from core as well. But what you’ve done is the same thing that the console logger does (log to the Android console). You could just change the logging level on the console logger to get the same result.

hyling · June 26, 2020, 12:45pm

@benjamin_glatzeder
Thanks for the log info, I’ll look into adding those to my app as well.

Yes, I believe those conditions should replicate this problem. I was going to use the Network Link Conditioner for iOS and macOS (see the NSHipster article “Network Link Conditioner”) to reproduce the lossy network connection. Hopefully there’s something similar on Android.

I’ll create a test app for iOS and share my findings as well.

hyling · June 26, 2020, 12:56pm

@rob-keepsafe
This is how we plan on reproducing the “10409 conflicts with newer server revision” issue. Have you had any success reproducing it in your setup?

rob-keepsafe · June 26, 2020, 4:51pm

Our issue happens with a single client only (using continuous replication), which points more towards a threading issue and divergent branches of the same document’s revision tree.

MutableDocument is not thread-safe, so you have to be sure that all changes to all documents are serialized on the same queue. We do a lot of concurrent ops that all get serialized into the database, so I wouldn’t doubt there are some issues here.
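On Android, one common way to serialize all changes on the same queue is a single-threaded executor. The sketch below is plain Java. The `HashMap` is only a stand-in for a non-thread-safe mutable document, and `SerializedWrites` and its methods are hypothetical names, not part of Couchbase Lite.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SerializedWrites {
    // Stand-in for a non-thread-safe mutable document (not a CBL type).
    private final Map<String, Integer> doc = new HashMap<>();
    // All mutations run on this single thread, so they never interleave.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    /** Queues an increment; callers on any thread may invoke this. */
    void increment(String key) {
        writer.execute(() -> doc.merge(key, 1, Integer::sum));
    }

    /** Stops accepting writes, waits for queued ones, then returns the value. */
    int finishAndGet(String key) throws InterruptedException {
        writer.shutdown();
        if (!writer.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("writes did not finish in time");
        }
        return doc.getOrDefault(key, 0);
    }

    public static void main(String[] args) throws InterruptedException {
        SerializedWrites writes = new SerializedWrites();
        Thread[] callers = new Thread[4];
        for (int t = 0; t < callers.length; t++) {
            callers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    writes.increment("count");
                }
            });
            callers[t].start();
        }
        for (Thread caller : callers) {
            caller.join();
        }
        System.out.println(writes.finishAndGet("count")); // 4000, no lost updates
    }
}
```

Four threads request updates concurrently, but only the writer thread ever touches the map, so no update is lost. The same pattern would apply if the queued task modified and saved a document instead.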

I would also advise you to use an ID prefix for different document types; it’ll help with conflicts and also help with deletes, so you at least know what kind of document it was once all metadata is purged.

benjamin_glatzeder · July 6, 2020, 8:56am

Here’s an update:

I didn’t get to create a test app yet and might not for a while. I plan to put a physical device in and out of a turned-off microwave when testing. That should make sure that the device loses its connection and that documents will be updated while offline. The other device will keep its connection and will update the same document, and thus there should be conflicts.

I believe the hard part will be to get into a situation where it just stops syncing, so that the client never pulls the revisions from SG and never resolves the conflicts locally. Any new document update would then need to run into the 10409 error, as happens for some of our users with some documents.

I also found some new error messages:

CouchbaseLite,10409,'conflicts with server document'
CouchbaseLite,10409,'Document revision conflict'
CouchbaseLite,10404,'Document revision is not accessible'
CouchbaseLite,10403,'(unknown HTTP status)'

Edit: The errors appear for all types of documents in my app. It’s very likely that most conflicts are solved when the next pull takes place - and in rare cases never. Some types appear more often but that is probably because those are updated more often.

hyling · July 10, 2020, 7:33pm

Ah, thanks for the clarification and the tips! Yes, I’ve been bitten by the delete case a couple of times and had to start using an ID prefix as well. I believe most of my changes happen on just one background thread, but I’ll look into that.

hyling · July 10, 2020, 7:34pm

Thanks for the update. I had to get an app release out and I’m just now getting to the test app. I’ll update when I have the results.