Intermittent time-outs (client 2.0.5, server 2.2.0 community edition)


#1

Hello all,

We are experiencing intermittent Couchbase time-outs in our Node.js application. Our Node.js Couchbase module version is 2.0.5, and the server runs Couchbase 2.2.0. We have two Node.js servers connecting to a cluster of two Couchbase servers. Time-outs occur in both servers, under minimal or even no load at all.

Our API connects to Couchbase like this:

var cluster = new couchbase.Cluster('couchbase://dbserver1.local,dbserver2.local');
var bucket = cluster.openBucket('some-bucket', function (err) {
    // Say something if there's an error, otherwise we are cool
});

I managed to get some logs by setting LCB_LOGLEVEL=5 and LCB_CNTL_DETAILED_ERRCODES=1.

This is what happens during a timeout:

245730974ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 0 ms
245730974ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 0 ms
245730974ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 0 ms

### These are the lines logged rigth when the error occurs

245730974ms [I0] {22131} [WARN] (server - L:374) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Failing command (pkt=0x22ad1b0, opaque=15834, opcode=0x0) with error 0x17
245730976ms [I0] {22131} [TRACE] (confmon - L:292) Start refresh requested
245730977ms [I0] {22131} [ERROR] (server - L:457) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Server timed out. Some commands have failed
245730977ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 2500 ms
245730977ms [I0] {22131} [INFO] (confmon - L:166) Not applying configuration received via HTTP. No changes detected. A.rev=-1, B.rev=-1
245730977ms [I0] {22131} [TRACE] (confmon - L:271) Current provider is HTTP
245731502ms [I1] {22133} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1e688a0,IX=0) Scheduling next timeout for 2500 ms
245731502ms [I0] {22133} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1cc17b0,IX=1) Scheduling next timeout for 2500 ms
245731502ms [I0] {22133} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1cb5f90,IX=0) Scheduling next timeout for 2500 ms
245731502ms [I1] {22133} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1cb6ff0,IX=1) Scheduling next timeout for 2500 ms
245731659ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x23fab70,IX=1) Scheduling next timeout for 2500 ms
245731659ms [I1] {22131} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x2402d80,IX=1) Scheduling next timeout for 2500 ms
245731659ms [I1] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23fda10,IX=0) Scheduling next timeout for 2500 ms
245732222ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1a9aee0,IX=0) Scheduling next timeout for 2500 ms
245732882ms [I0] {22135} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1de7d70,IX=0) Scheduling next timeout for 2500 ms
245732882ms [I1] {22135} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1ea5e60,IX=0) Scheduling next timeout for 2500 ms
245732882ms [I1] {22135} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1f0f920,IX=1) Scheduling next timeout for 2500 ms
245732882ms [I0] {22135} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1ea4fd0,IX=1) Scheduling next timeout for 2500 ms
245732887ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1aa01a0,IX=1) Scheduling next timeout for 2500 ms

### And these happen slightly later after the time-out error

245732976ms [I0] {22131} [ERROR] (htconfig - L:375) <10.99.24.24:8091> HTTP Provider timed out waiting for I/O
245732976ms [I0] {22131} [DEBUG] (ioctx - L:151) <10.99.24.24:8091> (CTX=0x109d9750,bc_http) Destroying. PND=1,ENT=0,SORC=1
245732976ms [I0] {22131} [INFO] (connection - L:441) <10.99.24.25:8091> (SOCK=0x10c20c90) Starting. Timeout=2000000us
245732978ms [I0] {22131} [TRACE] (connection - L:335) <10.99.24.25:8091> (SOCK=0x10c20c90) Received completion handler. Status=0. errno=0
245732978ms [I0] {22131} [INFO] (connection - L:116) <10.99.24.25:8091> (SOCK=0x10c20c90) Connected
245732978ms [I0] {22131} [DEBUG] (htconfig - L:346) Successfuly connected to REST API 10.99.24.25:8091
245732978ms [I0] {22131} [DEBUG] (ioctx - L:101) <10.99.24.25:8091> (CTX=0x10c21d30,unknown) Pairing with SOCK=0x10c20c90
245732980ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 198 bytes on HTTP stream
245732980ms [I0] {22131} [WARN] (htconfig - L:155) <10.99.24.25:8091> Got 404 on config stream. Assuming terse URI not supported on cluster
245732981ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 10 bytes on HTTP stream
245732982ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 224 bytes on HTTP stream
245732990ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 5792 bytes on HTTP stream
245732990ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 1448 bytes on HTTP stream
245732991ms [I0] {22131} [TRACE] (htconfig - L:243) <10.99.24.25:8091> Received 890 bytes on HTTP stream
245732991ms [I0] {22131} [DEBUG] (htconfig - L:260) <10.99.24.25:8091> Generation 2 -> 3
245732991ms [I0] {22131} [INFO] (confmon - L:174) Setting new configuration. Received via HTTP
245732991ms [I0] {22131} [INFO] (bootstrap - L:152) Got async step callback..
245732992ms [I0] {22131} [DEBUG] (bootstrap - L:55) Instance configured!
245732992ms [I0] {22131} [INFO] (newconfig - L:107) Config Diff: [ vBuckets Modified=1024 ], [Sequence Changed=1]
245732992ms [I0] {22131} [INFO] (newconfig - L:206) Reusing server 10.99.24.24:11210 (0x23e3e70). OldIndex=0. NewIndex=1
245732992ms [I0] {22131} [INFO] (newconfig - L:206) Reusing server 10.99.24.25:11210 (0x23fab70). OldIndex=1. NewIndex=0

And this is what occurs when things, suddenly, start working again:

822938ms [I1] {6526} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x2bbe420,IX=1) Scheduling next timeout for 2500 ms
822938ms [I1] {6526} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x2bbf010,IX=0) Scheduling next timeout for 2500 ms
781071ms [I1] {6583} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x16816b0,IX=0) Scheduling next timeout for 2500 ms
781255ms [I1] {6583} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1686170,IX=1) Scheduling next timeout for 2500 ms
247330599ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1a9aee0,IX=0) Scheduling next timeout for 2500 ms
247330854ms [I1] {22137} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1aa01a0,IX=1) Scheduling next timeout for 2500 ms
247331205ms [I0] {22137} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x1aae750,IX=1) Scheduling next timeout for 2500 ms
247331647ms [I0] {22137} [DEBUG] (server - L:461) <10.99.24.25:11210> (SRV=0x1a97ee0,IX=0) Scheduling next timeout for 2500 ms
247331784ms [I0] {22131} [DEBUG] (server - L:461) <10.99.24.24:11210> (SRV=0x23e3e70,IX=0) Scheduling next timeout for 2500 ms
825255ms [I0] {6526} [DEBUG] (lcbio_mgr - L:416) <10.99.24.24:11210> (HE=0x33e6440) Creating new connection because none are available in the pool
825255ms [I0] {6526} [DEBUG] (lcbio_mgr - L:321) <10.99.24.24:11210> (HE=0x33e6440) Starting connection on I=0x2f6b000
825255ms [I0] {6526} [INFO] (connection - L:441) <10.99.24.24:11210> (SOCK=0x2f6b160) Starting. Timeout=2500000us
825256ms [I0] {6526} [TRACE] (connection - L:335) <10.99.24.24:11210> (SOCK=0x2f6b160) Received completion handler. Status=0. errno=0
825256ms [I0] {6526} [INFO] (connection - L:116) <10.99.24.24:11210> (SOCK=0x2f6b160) Connected
825256ms [I0] {6526} [DEBUG] (lcbio_mgr - L:271) <10.99.24.24:11210> (HE=0x33e6440) Received result for I=0x2f6b000,C=0x2f6b160; E=0x0
825256ms [I0] {6526} [DEBUG] (lcbio_mgr - L:223) <10.99.24.24:11210> (HE=0x33e6440) Assigning R=0x342c5b0 SOCKET=0x2f6b160
825256ms [I0] {6526} [TRACE] (server - L:487) <10.99.24.24:11210> (SRV=0x2bbeab0) Session not yet negotiated. Negotiating
825256ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.24:11210> (CTX=0x2f3f060,unknown) Pairing with SOCK=0x2f6b160
825259ms [I0] {6526} [DEBUG] (ioctx - L:151) <10.99.24.24:11210> (CTX=0x2f3f060,sasl) Destroying. PND=0,ENT=1,SORC=1
825259ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.24:11210> (CTX=0x2f18110,unknown) Pairing with SOCK=0x2f6b160
825259ms [I0] {6526} [DEBUG] (server - L:506) <10.99.24.24:11210> (SRV=0x2bbeab0,IX=0) Setting initial timeout=2495ms
825266ms [I0] {6526} [DEBUG] (lcbio_mgr - L:416) <10.99.24.25:11210> (HE=0x3095700) Creating new connection because none are available in the pool
825266ms [I0] {6526} [DEBUG] (lcbio_mgr - L:321) <10.99.24.25:11210> (HE=0x3095700) Starting connection on I=0x3095c30
825266ms [I0] {6526} [INFO] (connection - L:441) <10.99.24.25:11210> (SOCK=0x2fb8960) Starting. Timeout=2500000us
825268ms [I0] {6526} [TRACE] (connection - L:335) <10.99.24.25:11210> (SOCK=0x2fb8960) Received completion handler. Status=0. errno=0
825268ms [I0] {6526} [INFO] (connection - L:116) <10.99.24.25:11210> (SOCK=0x2fb8960) Connected
825268ms [I0] {6526} [DEBUG] (lcbio_mgr - L:271) <10.99.24.25:11210> (HE=0x3095700) Received result for I=0x3095c30,C=0x2fb8960; E=0x0
825268ms [I0] {6526} [DEBUG] (lcbio_mgr - L:223) <10.99.24.25:11210> (HE=0x3095700) Assigning R=0x3350de0 SOCKET=0x2fb8960
825268ms [I0] {6526} [TRACE] (server - L:487) <10.99.24.25:11210> (SRV=0x2d19310) Session not yet negotiated. Negotiating
825268ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.25:11210> (CTX=0x2f51b20,unknown) Pairing with SOCK=0x2fb8960
825272ms [I0] {6526} [DEBUG] (ioctx - L:151) <10.99.24.25:11210> (CTX=0x2f51b20,sasl) Destroying. PND=0,ENT=1,SORC=1
825272ms [I0] {6526} [DEBUG] (ioctx - L:101) <10.99.24.25:11210> (CTX=0x2f184b0,unknown) Pairing with SOCK=0x2fb8960
825272ms [I0] {6526} [DEBUG] (server - L:506) <10.99.24.25:11210> (SRV=0x2d19310,IX=1) Setting initial timeout=2494ms

Following advice from other posts, we switched to the Couchbase protocol (from http://server:8091 to couchbase://server), but we also noticed that it fails and falls back to HTTP. From the logs:

12ms [I0] {8937} [INFO] (connection - L:116) <jmbadb06.jmba.local:11210> (SOCK=0x18b0ad0) Connected
12ms [I0] {8937} [DEBUG] (lcbio_mgr - L:271) <jmbadb06.jmba.local:11210> (HE=0x187ced0) Received result for I=0x1878a40,C=0x18b0ad0; E=0x0
12ms [I0] {8937} [DEBUG] (lcbio_mgr - L:223) <jmbadb06.jmba.local:11210> (HE=0x187ced0) Assigning R=0x187c520 SOCKET=0x18b0ad0
12ms [I0] {8937} [DEBUG] (ioctx - L:101) <jmbadb06.jmba.local:11210> (CTX=0x187cd30,unknown) Pairing with SOCK=0x18b0ad0
15ms [I0] {8937} [DEBUG] (ioctx - L:151) <jmbadb06.jmba.local:11210> (CTX=0x187cd30,sasl) Destroying. PND=0,ENT=1,SORC=1
15ms [I0] {8937} [DEBUG] (ioctx - L:101) <jmbadb06.jmba.local:11210> (CTX=0x186f250,unknown) Pairing with SOCK=0x18b0ad0
16ms [I0] {8937} [WARN] (cccp - L:392) <jmbadb06.jmba.local:11210> CCCP Packet responded with 0x81; nkey=0, nbytes=0, cmd=0xb5, seq=0xf00d
16ms [I0] {8937} [DEBUG] (ioctx - L:151) <jmbadb06.jmba.local:11210> (CTX=0x186f250,bc_cccp) Destroying. PND=0,ENT=1,SORC=1
16ms [I0] {8937} [INFO] (lcbio_mgr - L:491) <jmbadb06.jmba.local:11210> (HE=0x187ced0) Placing socket back into the pool. I=0x1878a40,C=0x18b0ad0
16ms [I0] {8937} [INFO] (confmon - L:202) Provider 'CCCP' failed
16ms [I0] {8937} [DEBUG] (confmon - L:236) Will try next provider in 0us
16ms [I0] {8937} [TRACE] (confmon - L:271) Current provider is HTTP

Any idea or suggestion will be more than welcome. Thanks!


#2

The logs would seem to indicate a topology change in the cluster (either a rebalance or failover) – (specifically, it’s stating all the vBuckets have changed and the order of the servers have changed as well):

245732992ms [I0] {22131} [INFO] (newconfig - L:107) Config Diff: [ vBuckets Modified=1024 ], [Sequence Changed=1]
245732992ms [I0] {22131} [INFO] (newconfig - L:206) Reusing server 10.99.24.24:11210 (0x23e3e70). OldIndex=0. NewIndex=1
245732992ms [I0] {22131} [INFO] (newconfig - L:206) Reusing server 10.99.24.25:11210 (0x23fab70). OldIndex=1. NewIndex=0

While this condition in itself should not cause operations to time out, rebalance operations to put some extra load on the server, especially in older versions.

While it is highly recommended to use the couchbase:// scheme (rather than the http:// scheme) for all cluster versions, bootstrapping/configuring via the memcached protocol (port 11210) is only available from Couchbase version 2.5 and higher. Since you are on 2.2 only the HTTP configuration is available.

Finally, it would indicate that either the network or the cluster is under some load by additionally examining the total time to retrieve the cluster configuration (several milliseconds). While this is within the margin of normal operations, typically configurations are received in a smaller timeframe.


#3

Thanks for your reply, Mark.

We’ve been investigating this issue for a very long time already. Today we found the root cause. As it turned out, there was a Squid proxy causing some trouble in the environment. Long story short, the solution was to add our Couchbase hostnames to the env variable no_proxy.