Errors in SG 1.5.1 with CB 5.0

We are running SG 1.5.1 (Couchbase Sync Gateway/1.5.1(4;cb9522c)) on top of CB 5.0, both community editions. We recently upgraded from SG 1.4 and CB 4 - we upgraded the SGs first, then the CB cluster. We’re having a problem where our Sync Gateways lock up and database access times out. We can “unlock” the problem by restarting the SG.

After a restart of the SGs, this is how the error seems to unfold:

First, we get lots of JSON parse errors as users get their feeds:

2018-01-15T14:33:03.536Z HTTP:  #928: POST /[database]/_revs_diff  (as user.2)
2018-01-15T14:33:03.597Z WARNING: Unexpected error parsing body of rev "19-63fb3798d240d46358c2842b41030532" -- db.(*document).getNonWinningRevisionBody() at document.go:309
2018-01-15T14:33:03.603Z WARNING: Unexpected error parsing body of rev "4-13c5a67bf539c4f1b5d36b40cfdd308c" -- db.(*document).getNonWinningRevisionBody() at document.go:309
2018-01-15T14:33:03.610Z WARNING: Unexpected error parsing body of rev "36-2fc399f9b4091cbe8afb61b3cbff8bae" -- db.(*document).getNonWinningRevisionBody() at document.go:309

Then the SGs lock up. I’m not sure if it’s a fluke, but this has happened twice now when the number of changes feeds has reached 100… this appears in the logs:

2018-01-15T14:54:23.696Z WARNING: RetryLoop for Get acc.settings.117.jQbFKJNOjnsIsS19Ok-7CA|snag_codes giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:54:23.757Z WARNING: RetryLoop for Get acc.users.117|2208 giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:54:23.757Z WARNING: Changes feed: error getting doc sync data "acc.users.117|2208": operation has timed out -- db.(*Database).addDocToChangeEntry() at changes.go:106
2018-01-15T14:54:23.929Z WARNING: RetryLoop for Get _sync:user:user.2553 giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:54:23.929Z HTTP auth failed for username="user.2553"
2018-01-15T14:54:23.929Z HTTP:  #3317: PUT /[database]/_local/b993d11acdec2288a2024897aac6a684ed3af316
2018-01-15T14:54:23.929Z HTTP: #3317:     --> 401 Invalid login  (32619.6 ms)
...
2018-01-15T14:58:11.791Z HTTP:  #3496: POST /[database]/_bulk_docs  (as user.2554)
2018-01-15T14:58:12.046Z WARNING: RetryLoop for Get acc.settings.117.jQbFKJNOjnsIsS19Ok-7CA|snag_codes giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:15.259Z WARNING: RetryLoop for Get acc.settings.117.jQbFKJNOjnsIsS19Ok-7CA|snag_codes giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:16.455Z WARNING: RetryLoop for Get _sync:seq giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:16.460Z HTTP:  #3497: GET /_expvar  (ADMIN)
2018-01-15T14:58:16.460Z HTTP: Recording snapshot of current debug variables.
2018-01-15T14:58:16.488Z HTTP:  #3498: GET /[database]/  (ADMIN)
2018-01-15T14:58:18.652Z WARNING: RetryLoop for Incr with key: _sync:seq giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:18.652Z WARNING: Error from Incr in sequence allocator (1) - attempt (1/3): operation has timed out -- db.(*sequenceAllocator).incrWithRetry() at sequence_allocator.go:101
2018-01-15T14:58:18.662Z WARNING: Too many unsuccessful Incr attempts in sequence allocator - giving up (1): operation has timed out -- db.(*sequenceAllocator).incrWithRetry() at sequence_allocator.go:107
2018-01-15T14:58:18.662Z WARNING: Error from Incr in _reserveSequences(1): Unable to increment sequence: operation has timed out -- db.(*sequenceAllocator)._reserveSequences() at sequence_allocator.go:68
2018-01-15T14:58:19.112Z HTTP:  #3499: POST /[database]/_bulk_docs  (as user.2554)
2018-01-15T14:58:21.550Z WARNING: RetryLoop for Get _sync:seq giving up after 11 attempts -- base.RetryLoop() at util.go:298

After that, the stats monitoring calls that we make to our database URL and the _expvar endpoint start to time out, taking about 30s. Our service goes down and we need to restart the SG to bring it back.

Earlier today we were able to wait around 40 minutes and the Sync Gateways came back online; however, our clients are up and using the service now, so we can’t wait that long at the moment.

This is our config:

{
  "interface": "localhost:4984",
  "adminInterface": "[IP address]:4985",
  "CORS": {
    "Origin": ["https://[url]", "https://[url]"],
    "Headers": ["Authorization", "Content-Type"]
  },
  "maxFileDescriptors": 250000,
  "databases": {
    "[database]": {
      "server": "http://cb01:8091",
      "bucket": "[bucket]",
      "username": "[user]",
      "password": "",
      "unsupported": {
        "user_views": {"enabled": true},
        "enable_extended_attributes": {"enabled": false}
      }
    }
  },
  "logging": {
    "default": {
      "logFilePath": "/var/log/sync_gateway/sync_gateway_access.log",
      "logKeys": ["HTTP"],
      "logLevel": "debug",
      "rotation": {
        "maxsize": 1,
        "maxage": 30,
        "maxbackups": 7,
        "localtime": true
      }
    }
  }
}

We realised when this error started happening that we’d been running the SGs with enable_extended_attributes.enabled: true, and we have since reverted it to false - could that have created this issue?

Has anyone got any suggestions / fixes please? Is this related to xattrs? Any advice would be great.

enable_extended_attributes is a pre-GA flag that’s not supported in the GA release. It has been replaced by the enable_shared_bucket_access flag.

Yes - setting this to true will result in the sync metadata being stored in XAttrs. If it is set to false, the metadata is embedded in the document within the _sync attribute.
Judging by the errors, it appears that there is an issue accessing the document sync metadata.
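For reference, in the GA release this is a per-database setting; a minimal sketch of how that part of your posted config could look (same [database]/[bucket] placeholders, with the unsupported.enable_extended_attributes block removed) would be something like:

{
  "databases": {
    "[database]": {
      "server": "http://cb01:8091",
      "bucket": "[bucket]",
      "username": "[user]",
      "password": "",
      "enable_shared_bucket_access": true
    }
  }
}

Note that enable_shared_bucket_access requires Couchbase Server 5.0 or later.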

Hi Priya,

I am working with James on this project. I have some additional information and would like some clarification regarding your reply.

Some extra background.

We have never used a pre-GA release. I found the enable_extended_attributes setting on the forum a while back, and we set it to false while we were running against a 4.1.1 server. This worked fine, but I’ve since realised that enable_shared_bucket_access defaults to false anyway.

We were told back in October by a Couchbase architect that the cluster would automatically upgrade once the SG detected it was at v5 and xattrs-compatible. After we upgraded the cluster I set enable_extended_attributes to true, thinking the data would have moved forward. We have never used enable_shared_bucket_access or the continuous import setting in the SG config.

Is enable_extended_attributes acknowledged by the GA release? Or does the enable_shared_bucket_access default of false take precedence?

We have rolled the SG back to 1.4.1 against the v5 server, as we didn’t have the confidence to go forward, and we aren’t seeing the retry-loop issue there, so there is potentially something else going on.

It would also help if you could clarify what the default server read and write timeouts are set to. On 1.5.1 we had to raise them to get some documents handled; on 1.4.1 we have never set any value and the default seems OK.

Thanks in advance.

I checked with the engineering team and it appears that there is enough going on here that we should track this via a GitHub ticket.

Can you please file a ticket and include all the relevant details from both of your posts?

As for this question:

This attribute should be ignored in the GA release, so in theory XAttrs should never have been enabled.

Hi Priya,

We will compile a GitHub issue today. In the meantime, could you confirm the defaults for the server read and write timeout settings in Sync Gateway? We’ve not set these in 1.4.1, but had to set them to 60 to enable this transaction to work from our server.
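For reference, this is roughly what we added at the top level of the 1.5.1 config (assuming we have the key names right - serverReadTimeout / serverWriteTimeout, in seconds):

{
  "serverReadTimeout": 60,
  "serverWriteTimeout": 60
}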

The same transaction does not work from the client (or rather, the changes feed directly after it causes the same retry loop). Potentially this can be mitigated by raising the changes feed timeout. Do you know the default for this as well?

Thanks in advance.

Mike

Hi Priya, thanks for the replies.

Mike - thanks for following up.

I’ve created a GitHub issue with more logs and info: https://github.com/couchbase/sync_gateway/issues/3212


James - thanks for the GitHub ticket. We will take a look at it.

I think the default is left at 0, which I believe is interpreted as no timeout. @traun?

Here you go: