Sync Gateway fails to connect after partial cluster upgrade


#1

Hello,

We’re testing an upgrade from a Couchbase Server 2.5.1 Enterprise cluster with Sync Gateway 1.0.3 to Couchbase Server 4.0.0 Community with Sync Gateway 1.2 Community. It’s important that our application remains online during the upgrade, so we are attempting a rolling upgrade.

I removed one node from the cluster, upgraded it to 4.0.0, and then re-added the upgraded node. Immediately after rebalancing, Sync Gateway 1.0.3 (configured to connect directly to the 2.5.1 node) started showing the following errors:

17:36:43.526750 WARNING: Couldn't interpret error type *errors.errorString, value Unable to complete action after 4 attemps -- base.ErrorAsHTTPStatus() at error.go:63
17:36:43.526804 HTTP: #10487: --> 500 Internal error: Unable to complete action after 4 attemps (38.8 ms)

Restarting Sync Gateway doesn’t seem to help; the following errors are logged and it stops:

17:37:57.426363 ==== Couchbase Sync Gateway/1.0.3(81;fa9a6e7) ====
17:37:57.426378 Configured Go to use all 8 CPUs; setenv GOMAXPROCS to override this
17:37:57.426391 Opening db /sync_gateway_sw1 as bucket "sync_gateway_sw1", pool "default", server <http://localhost:8091>
17:37:57.426425 Opening Couchbase database sync_gateway_sw1 on <http://localhost:8091>
17:37:57.446078 FATAL: Error opening database: 502 Unable to connect to server: json: cannot unmarshal string into Go value of type int -- rest.RunServer() at config.go:415

I upgraded the affected Sync Gateway to 1.2, and got similar errors:

2016-03-31T17:51:44.837Z ==== Couchbase Sync Gateway/1.2.0(79;9df63a5) ====
2016-03-31T17:51:44.837Z requestedSoftFDLimit < currentSoftFdLimit (5000 < 10240) no action needed
2016-03-31T17:51:44.837Z Opening db /sync_gateway_sw1 as bucket "sync_gateway_sw1", pool "default", server <http://localhost:8091>
2016-03-31T17:51:44.837Z Opening Couchbase database sync_gateway_sw1 on <http://localhost:8091>
2016-03-31T17:51:44.859Z FATAL: Error opening database: 502 Unable to connect to Couchbase Server (connection refused). Please ensure it is running and reachable at the configured host and port. Detailed error: json: cannot unmarshal string into Go value of type int -- rest.RunServer() at config.go:644
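
For context on the "cannot unmarshal string into Go value of type int" part of both FATAL lines: this is the kind of error Go's encoding/json package returns when a JSON value arrives as a string but the field it is decoded into is an int. The sketch below only reproduces that class of error. The nodeInfo struct and its "port" field are hypothetical and are not taken from Sync Gateway's source, so it does not show which field in the cluster's response actually triggered the failure.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeInfo is a hypothetical struct used only for illustration.
// It is NOT the type Sync Gateway decodes the cluster response into.
type nodeInfo struct {
	Port int `json:"port"`
}

// decodePort parses a JSON document into nodeInfo and returns the port.
func decodePort(doc string) (int, error) {
	var n nodeInfo
	if err := json.Unmarshal([]byte(doc), &n); err != nil {
		return 0, err
	}
	return n.Port, nil
}

func main() {
	// A JSON number where an int is expected decodes cleanly.
	port, err := decodePort(`{"port": 8091}`)
	fmt.Println(port, err)

	// A JSON string where an int is expected makes json.Unmarshal
	// return a type error. The exact wording varies by Go version.
	_, err = decodePort(`{"port": "8091"}`)
	fmt.Println(err)
}
```

If this is what is happening here, it would suggest that the upgraded node reports some value in a different JSON type than the client expects, which would be consistent with the web console still responding on :8091.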

In all cases, the web console remains responsive on :8091.

Is this behavior expected? Our understanding is that this approach (remove a server, upgrade, and reattach) was the recommended process for upgrading a cluster live.


#2

Could you run the following on the Sync Gateway server?

# netstat -al | grep 8091

On the Couchbase Server side, do you see new connections coming in, or are they being dropped?

Could you also double-check that the bucket password is still there? Open http://{ip}:8091/pools/default/buckets and look for saslPassword : {something} on the bucket.
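
The bucket password check can also be scripted. This is a minimal sketch in Go, assuming the endpoint returns a JSON array of bucket objects that each carry "name" and "saslPassword" fields, as described above. The bucketInfo type and the parseBuckets helper are names made up for this example, and the server may additionally require administrator credentials, which the sketch does not supply.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// bucketInfo holds only the fields this check needs (hypothetical type).
type bucketInfo struct {
	Name         string `json:"name"`
	SaslPassword string `json:"saslPassword"`
}

// parseBuckets decodes the body of /pools/default/buckets, assuming it
// is a JSON array of bucket objects.
func parseBuckets(body []byte) ([]bucketInfo, error) {
	var buckets []bucketInfo
	if err := json.Unmarshal(body, &buckets); err != nil {
		return nil, err
	}
	return buckets, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkbuckets http://{ip}:8091/pools/default/buckets")
		return
	}
	resp, err := http.Get(os.Args[1])
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	buckets, err := parseBuckets(body)
	if err != nil {
		fmt.Println("could not parse response:", err)
		return
	}
	// Report whether a password is set without printing the password itself.
	for _, b := range buckets {
		fmt.Printf("%s: saslPassword set = %t\n", b.Name, b.SaslPassword != "")
	}
}
```

Run it with the buckets URL as the only argument, once against the 2.5.1 node and once against the upgraded node, and compare the output for sync_gateway_sw1.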


#3

househippo,

Thanks for the reply. We’ve since decided to go with a complete cluster rebuild after testing various ways to do it cleanly, and are now up in production. The next time we plan an upgrade I’ll definitely look at the things you suggested.