Changes feed failing on SG 1.5.1 with "Unrecoverable error attempting to update xattr for key"


#1

Hi there,

Yesterday I spotted that a large number of updates coming into SG from mobile devices are failing on the bulk docs update from Couchbase lite. However, my colleagues have noticed over the last few weeks that actions that sync data from SG or read views from SG have been taking longer and longer.

The errors that are happening look like this:

2018-10-03T13:54:35.566Z CRUD+: Invoking sync on doc "files.shapes.966|97b59d2d9155be283f91040e36746a51aa2675576d664bd50031ec9df93048a6" rev 2-c8fc7b55e8d0319aa4bff2c9acf406ee
2018-10-03T13:54:35.567Z CRUD+: No old revision "files.shapes.966|97b59d2d9155be283f91040e36746a51aa2675576d664bd50031ec9df93048a6" / "1-25f1e1dab482852272b683ac7fc25058"
2018-10-03T13:54:35.567Z CRUD+: Saving doc (seq: #18585714, id: files.shapes.966|97b59d2d9155be283f91040e36746a51aa2675576d664bd50031ec9df93048a6 rev: 2-c8fc7b55e8d0319aa4bff2c9acf406ee)
2018-10-03T13:54:35.568Z WARNING: Unrecoverable error attempting to update xattr for key:files.shapes.966|97b59d2d9155be283f91040e36746a51aa2675576d664bd50031ec9df93048a6 cas:0 deleteBody:false error:key not found -- base.CouchbaseBucketGoCB.UpdateXattr.func1() at bucket_gocb.go:1029
2018-10-03T13:54:35.568Z CRUD+: Did not update document "files.shapes.966|97b59d2d9155be283f91040e36746a51aa2675576d664bd50031ec9df93048a6" w/ xattr: key not found
2018-10-03T13:54:35.569Z CRUD+: Released unused sequence #18585714
2018-10-03T13:54:35.569Z 	BulkDocs: Doc "files.shapes.966|97b59d2d9155be283f91040e36746a51aa2675576d664bd50031ec9df93048a6" --> 404 missing (key not found)

We’re running “Couchbase Sync Gateway/1.5.1(4;cb9522c)” on top of Couchbase “couchbase-server-community/now 5.0.1-5003-1”. This set up has been stable for us since January 2018. The only recent issue was that we had some power outage on one of our three CB nodes, but we were able to bring that back and everything seemed fine afterwards.

Our SG settings look like:

"databases": {
  "construct": {
    "server": "http://*******:8091",
    "bucket": "construct",
    "username": "******",
    "password": "******",
    "enable_shared_bucket_access": true,
    "unsupported": {
      "user_views": {"enabled":true}
    }
  }
},

What made me spot this was that I have a graph that tracks the sequence number that’s in use - that’s increasing by around 2M per day as there are 10s of thousands of these updates coming in per hour and failing. This has been happening since around the time that we had the power outage - the sequence number graph for the last 6 months looks like a hockey stick, with the start of the curve around the time of the power outage.

When looking at the nature of the updates, around 96% of them are on documents that SG is not able to retrieve via the admin port - so it looks like they don’t exist at all as far as ADMIN is concerned. Regarding the remaining 4%, they are in all sorts of states, there are failed revs that are:

  • are older than the current known to SG
  • are equal to the current in SG - but have a different rev hash, so have failed.
  • are younger than the current known to SG.

The nearest thing that I can find while searching is this closed error: https://github.com/couchbase/sync_gateway/issues/3307 - I’m wondering if it could be related?

Any help you can give / suggest would be great - we’ve been pulling our hair out on this one.

Cheers,

James


#2

This just came up in a developer meeting, and that 3307 issue you mentioned definitely looks like it could be the cause.

You could verify this is the case to see if this is happening for docs that are being pushed to sync gateway as tombstone (deleted) docs before Sync Gateway has ever seen them.

Is it possible to setup a parallel environment to see if the issue is fixed for you in SG 2.1?


#3

Yes, I’ve been able to reproduce the error on our development environment. In aeroplane mode I create a new form and save it (I guess this creates rev 1-x) then delete the form (I guess this creates the rev 2-x instance with no channels). When coming back online and triggering sync, this generates the exact same error as before:

2018-10-10T16:26:05.032Z WARNING: Unrecoverable error attempting to update xattr for key:forms.data.20.pxKtzJ61zeHrMq5S1d1P2|e33b99eb02a7f56e6af1f1349cd05e6119c0852545e2315c24d302c4270ec77c cas:0 deleteBody:false
 error:key not found -- base.CouchbaseBucketGoCB.UpdateXattr.func1() at bucket_gocb.go:1029

I need to discuss with the team, but I think we might be able to try out SG 2.1 in our dev environment. Do you have a recommended upgrade process? Maybe that’s documented somewhere? Is it OK for this situation to go from SG 1.5.1 to SG 2.1 directly or are there any versions / migrations we need to walk?


#4

FYI I’ve put in place a dirty work around which appears to have unblocked the devices from syncing their tombstones.

First I built a list of all unique doc IDs for the failures in the logs (this is a terrible emergency grep - use at your own risk):

sudo grep -E "BulkDocs.* 404 missing" /var/log/sync_gateway/sync_gateway_access* | grep -Eo '".*"' | grep -Eo '[^"]*' | sort -u

Then for each document in the list, created a new doc via the admin port with content {"to_tombstone": true}. Finally, deleted that new document via the admin port so that the tombstone is left in place on the SG.

This has already has positive results as I can see that docs are coming in from the devices that were failing before, instead they are able to create a second tombstone:

2018-10-11T15:44:17.073Z CRUD+: Invoking sync on doc "forms.data.6|159c861a327b8e1eb5cb4b0bc3af1234afcf8052f83b6b5b86a75ea161801f8b" rev 2-3897b3f0532b80616621888b48b8a725
2018-10-11T15:44:17.073Z CRUD+: No old revision "forms.data.6|159c861a327b8e1eb5cb4b0bc3af1234afcf8052f83b6b5b86a75ea161801f8b" / "1-2b6ceedbe14775cad471c66d266cdd23"
2018-10-11T15:44:17.073Z CRUD+: updateDoc("forms.data.6|159c861a327b8e1eb5cb4b0bc3af1234afcf8052f83b6b5b86a75ea161801f8b"): Rev "2-3897b3f0532b80616621888b48b8a725" leaves "2-a3875f219406583e19a63fb30d2dfc3c" still current
2018-10-11T15:44:17.073Z CRUD+: Saving doc (seq: #29739227, id: forms.data.6|159c861a327b8e1eb5cb4b0bc3af1234afcf8052f83b6b5b86a75ea161801f8b rev: 2-a3875f219406583e19a63fb30d2dfc3c)
2018-10-11T15:44:17.074Z CRUD: Stored doc "forms.data.6|159c861a327b8e1eb5cb4b0bc3af1234afcf8052f83b6b5b86a75ea161801f8b" / "2-3897b3f0532b80616621888b48b8a725"

It looks like this will be able to reduce the traffic load so that our views are quicker and SG performance rises back to where it used to be. Also - we’ve ruled out the power outage on one of our CB nodes as a contributing factor.