Ruby SDK Times Out with Back-to-Back Queries

query

#1

We have been seeing a decent number of timeouts to queries using the Ruby SDK.

Environment:

  • AWS EC2
  • Database - m3.2xlarge - Ubuntu 14.04 LTS
  • Couchbase 4.1.1-5914-enterprise (single server)
  • Separate test node - Ubuntu 14.04 LTS, Same Security Group
  • libcouchbase 2.6.2
  • Ruby SDK 1.3.14
  • NO CLIENTS AT ALL EXCEPT THE TEST CODE BELOW

Sample code to prompt the failure:

require "couchbase"
require "securerandom"

connection = Couchbase.connect(
  hostname: '10.90.7.218',
  port: 8091,
  bucket: 'sync_gateway_sw1',
  timeout: 15000000
)

1000.times do |x|
  id = SecureRandom.uuid
  query_results = connection.design_docs['users_v2'].all_users_by_external_id( key: id ).fetch
  puts " #{x} -- #{id} == #{query_results}"
end

Super simple block of code, execute the ruby, and while the exact number of iterations will vary, we will get a Couchbase::Error::Timeout (error 0x17) with the message “failed to execute HTTP request, Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout”. Invariably it seems the failing key is “/pools/default/buckets/sync_gateway_sw1/ddocs”.

We’ve got detailed libcouchbase logs, and the last few lines will usually look similar to this:

18031ms [I0] {1372} [TRACE] (http-io - L:230) <cb-d02.beta-int.sightplan.com:8091> GET  http://cb-d02.beta-int.sightplan.com:8091/pools/default/buckets/sync_gateway_sw1/ddocs. Body=0 bytes
18031ms [I0] {1372} [DEBUG] (lcbio_mgr - L:416) <cb-d02.beta-int.sightplan.com:8091> (HE=0x26cfa90) Creating new connection because none are available in the pool
18031ms [I0] {1372} [DEBUG] (lcbio_mgr - L:321) <cb-d02.beta-int.sightplan.com:8091> (HE=0x26cfa90) Starting connection on I=0x28d3800
18031ms [I0] {1372} [INFO] (connection - L:450) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Starting. Timeout=75000000us
18031ms [I0] {1372} [TRACE] (connection - L:267) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Got event handler for new connection
18031ms [I0] {1372} [TRACE] (connection - L:314) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Scheduling asynchronous watch for socket.
28041ms [I0] {1372} [DEBUG] (lcbio_mgr - L:464) <cb-d02.beta-int.sightplan.com:8092> (HE=0x251dd60) Idle connection expired
93096ms [I0] {1372} [ERROR] (http-io - L:231) Connection to failed with Err=0x17
93096ms [I0] {1372} [TRACE] (confmon - L:292) Start refresh requested
93096ms [I0] {1372} [ERROR] (connection - L:127) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Failed: lcb_err=0x17, os_errno=0
93096ms [I0] {1372} [DEBUG] (lcbio_mgr - L:271) <cb-d02.beta-int.sightplan.com:8091> (HE=0x26cfa90) Received result for I=0x28d3800,C=(nil); E=0x17

I’ve seen various forum and ruby posts complaining of this issue, but most are older. I curious if this is a known issue, because it seems really easy to generate (just back to back queries on a node that is taking NO writes and is otherwise stable and has ZERO load – no other clients).


Getting couchbase error: Client-Side timeout exceeded for operation
#2

This forum post looks pretty relevant and similar, but for Node.js… Figured I’d link to it just in case this is lower level… Getting couchbase error: Client-Side timeout exceeded for operation

Also, just to make sure it was clear in the post above… Both the database instance and the test instance in the original post are in the same security group on AWS. Shouldn’t be any firewall issues in this case (as the other post said, these nodes are “close” together).


#3

Through a bunch of testing, we identified the core of the problem (in case it helps anyone else). We were able to prove that similar snippets in Java and Python did not have problems on the same box, so we knew it was unlikely to be the backend, general port contention, or even libcouchbase (Python worked, Ruby did not).

In the end it came down to “design_docs” call in the Ruby SDK. It is not enough to cache and reuse the connection in the Ruby SDK. You also need to cache the results of design_docs. In other words, the fix (workaround) to the code first posted is:

require "couchbase"
require "securerandom"

connection = Couchbase.connect(
  hostname: '10.90.7.218',
  port: 8091,
  bucket: 'sync_gateway_sw1',
  timeout: 15000000
)
design_docs = connection.design_docs

1000.times do |x|
  id = SecureRandom.uuid
  query_results = design_docs['users_v2'].all_users_by_external_id( key: id ).fetch
  puts " #{x} -- #{id} == #{query_results}"
end

With that one simple change, all the timeouts seem to go away.

Having said this, given the number of ports we had and the load that we had, I still believe there is an underlying issue with the original code failing so quickly and so simply. Not sure if it is a libcouchbase or Ruby thing, but it still seems there is the chance of an underlying issue when other platforms (Node.js) seem to be experiencing the client side timeouts.

Hopefully other folks can make use of this to improve their timeout situation.