Client side timeout exceeded for operation


#1

Sometimes couchbase operations time out from our applications.

Application connects to various buckets in the same couchbase server.

Operations to all the buckets timeout when the issue occurs.

The timeout doesn’t occur generally. It is random and occurs as a spike. It doesn’t occur on all clients but on any random client. Usually occurs twice every day.

The client does not have any network connectivity issue.

Details -
Couchbase version - Community Edition 5.0.1 build 5003
Client SDK - node 2.3.3
Ops - 30k
Each key is created with an expiry of 10 mins

How can I proceed to debug/fix the issue?


#2

Hi, as a first step have you tried bumping up the log level (e.g. LCB_LOGLEVEL=5 node index.js) to see if there are any clues being shown there?


#3

This is the logs received during this particular error

7142630ms [I0] {31801} [WARN] (server - L:459) <X:11210> (SRV=0x2b96e70,IX=0) Failing command (pkt=0x2b81de0, opaque=234694, opcode=0x1d) with error LCB_ETIMEDOUT (0x17)
Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout, code=23
7142630ms [I0] {31801} [TRACE] (confmon - L:252) Refreshing current cluster map
7142630ms [I0] {31801} [ERROR] (server - L:534) <X:11210> (SRV=0x2b96e70,IX=0) Server timed out. Some commands have failed
7142630ms [I0] {31801} [TRACE] (server - L:538) <X:11210> (SRV=0x2b96e70,IX=0) Scheduling next timeout for 2499 ms. This is not an error
7142630ms [I0] {31801} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=4816, B.rev=4816
7142630ms [I0] {31801} [TRACE] (confmon - L:239) Attempting to retrieve cluster map via CCCP
7142630ms [I0] {31801} [INFO] (cccp - L:143) Re-Issuing CCCP Command on server struct 0x2b96e70 (X:11210)
7142743ms [I0] {31801} [WARN] (server - L:325) <X:11210> (SRV=0x2b96e70,IX=0) Server sent us reply for a timed-out command. (OP=0x1d, RC=0x0, SEQ=234694)
7142743ms [I0] {31801} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=4816, B.rev=4816


#4

You say that it usually occurs twice a day, is it usually around the same time each day?


#5

@chvck

I see these timeouts occurring in 2 different use cases -

One case is where timeouts occur throughout the day for random operations. This occurs across all clients. Haven’t been able to figure out a reason for this. The logs attached previously are for this case

Second case is where timeouts occur as a spike on individual clients. This affects about 4k-5k operations and spans a duration of 2-3 seconds. These clients are those with high load so it could be an impact of that. But I would like to know why it occurs at high load too. Attaching the logs for this use case too. (And adding the filtered logs in the message for quick reference). It looks like client got the response back after 3 seconds but it should not ideally take that long. Our end to end 99%tile latency is < 50 ms

Couchbase Errors.zip (27.0 KB)

17475778ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 2491 ms. This is not an error
17476583ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 2500 ms. This is not an error
17478268ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 2500 ms. This is not an error
17479132ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 2500 ms. This is not an error
17480771ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 2494 ms. This is not an error
17482296ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 1803 ms. This is not an error
17484514ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 622 ms. This is not an error
17484607ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 1057 ms. This is not an error

17487837ms [I0] {23096} [WARN] (server - L:459) <X:11210> (SRV=0x25d76b0,IX=0) Failing command (pkt=0xcc217b0, opaque=11550189, opcode=0x1d) with error LCB_ETIMEDOUT (0x17)
17488673ms [I0] {23096} [TRACE] (confmon - L:252) Refreshing current cluster map
17488673ms [I0] {23096} [ERROR] (server - L:534) <X:11210> (SRV=0x25d76b0,IX=0) Server timed out. Some commands have failed
17488673ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 0 ms. This is not an error

17488799ms [I2] {23096} [TRACE] (confmon - L:252) Refreshing current cluster map
17488799ms [I2] {23096} [ERROR] (server - L:534) <X:11210> (SRV=0x2718780,IX=0) Server timed out. Some commands have failed
17488799ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 2500 ms. This is not an error


17489275ms [I0] {23096} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=4816, B.rev=4816
17489275ms [I0] {23096} [TRACE] (confmon - L:239) Attempting to retrieve cluster map via CCCP
17489275ms [I0] {23096} [INFO] (cccp - L:143) Re-Issuing CCCP Command on server struct 0x25d76b0 (X:11210)


17491568ms [I0] {23096} [ERROR] (server - L:534) <X:11210> (SRV=0x25d76b0,IX=0) Server timed out. Some commands have failed
17491568ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 0 ms. This is not an error
17491683ms [I0] {23096} [ERROR] (cccp - L:164) <NOHOST:NOPORT> Could not get configuration: LCB_ETIMEDOUT (0x17)
17491683ms [I0] {23096} [INFO] (confmon - L:177) Provider 'CCCP' failed
17491683ms [I0] {23096} [TRACE] (confmon - L:201) Maximum provider reached. Resetting index
17491824ms [I0] {23096} [INFO] (bootstrap - L:164) Not requesting a config refresh because of throttling parameters. Next refresh possible in 6849ms or 99 errors. See LCB_CNTL_CONFDELAY_THRESH and LCB_CNTL_CONFERRTHRESH to modify the throttling settings

17492660ms [I0] {23096} [WARN] (server - L:325) <X:11210> (SRV=0x25d76b0,IX=0) Server sent us reply for a timed-out command. (OP=0x1d, RC=0x0, SEQ=11550189)
17492660ms [I0] {23096} [WARN] (server - L:325) <X:11210> (SRV=0x25d76b0,IX=0) Server sent us reply for a timed-out command. (OP=0x1d, RC=0x0, SEQ=11550190)

17494297ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 788 ms. This is not an error
17494937ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 2450 ms. This is not an error
17495095ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 2264 ms. This is not an error
17497359ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 2497 ms. This is not an error
17497388ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 2500 ms. This is not an error
17499857ms [I0] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x25d76b0,IX=0) Scheduling next timeout for 2500 ms. This is not an error
17499888ms [I2] {23096} [TRACE] (server - L:538) <X:11210> (SRV=0x2718780,IX=0) Scheduling next timeout for 2500 ms. This is not an error

#6

Have you tried looking in the server logs? There might be more information there if you look at timestamps around the same time as you’re seeing issue client side. You can get logs via the logs tab in the Admin UI or via cbcollect.


#7

Server logs don’t show anything related to connection issues or any error during the timeouts.


#8

Anything to go forward here or any more data points which could help to find the issue here? @chvck


#9

Do you have any server load (CPU etc…) monitoring in place? As the server is responding with the data but not within the timeout timespan it could be that the request/response is getting held up in the network somewhere? As you say the response is coming back from the server, just not fast enough.


#10

Yes we do. The server metrics are all good. Also if it was a case of an issue with the server all clients would be affected right?

We have about 150 clients and the spike in timeout occurs only for 1 or 2 clients during peak load. The node SDK documentation states that operations are inherently batched between node event loops. Could the batch size cause this issue? I couldn’t find anything to limit batch sizes if this is true.


#11

Hey @krishan1390,

It is indeed possible that the batching is the cause. The Node.js SDK won’t dispatch any of your operations until one iteration of the event loop has completed. If you are performing specific operations which block the event loop, or are spending a very large amount of time scheduling ops, its possible that the ops will expire before the SDK even has a chance to dispatch them. Does this sound possible?

Cheers, Brett


#12

You mean the expiry is based on internal queue and not based upon time taken to receive response from server after the request has been dispatched

If that was the case we wouldn’t get Server timed out. Some commands have failed from the SDK right?


#13

Hey @krishan1390,

The timer is started when the operation is scheduled, not when it is written to the network. Long amounts of time spent waiting in the queue could cause the operation to be timed out before it even gets written to the network.

Cheers, Brett


#14

Got it. If that’s the case it could mean an issue with my application. Will debug it further.

But even in this case would we get

Server timed out. Some commands have failed

message from the SDK?