We have been seeing a decent number of timeouts to queries using the Ruby SDK.
Environment:
- AWS EC2
- Database - m3.2xlarge - Ubuntu 14.04 LTS
- Couchbase 4.1.1-5914-enterprise (single server)
- Separate test node - Ubuntu 14.04 LTS, Same Security Group
- libcouchbase 2.6.2
- Ruby SDK 1.3.14
- NO CLIENTS AT ALL EXCEPT THE TEST CODE BELOW
Sample code to prompt the failure:
require "couchbase"
require "securerandom"
connection = Couchbase.connect(
hostname: '10.90.7.218',
port: 8091,
bucket: 'sync_gateway_sw1',
timeout: 15000000
)
1000.times do |x|
id = SecureRandom.uuid
query_results = connection.design_docs['users_v2'].all_users_by_external_id( key: id ).fetch
puts " #{x} -- #{id} == #{query_results}"
end
Super simple block of code, execute the ruby, and while the exact number of iterations will vary, we will get a Couchbase::Error::Timeout (error 0x17) with the message “failed to execute HTTP request, Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout”. Invariably it seems the failing key is “/pools/default/buckets/sync_gateway_sw1/ddocs”.
We’ve got detailed libcouchbase logs, and the last few lines will usually look similar to this:
18031ms [I0] {1372} [TRACE] (http-io - L:230) <cb-d02.beta-int.sightplan.com:8091> GET http://cb-d02.beta-int.sightplan.com:8091/pools/default/buckets/sync_gateway_sw1/ddocs. Body=0 bytes
18031ms [I0] {1372} [DEBUG] (lcbio_mgr - L:416) <cb-d02.beta-int.sightplan.com:8091> (HE=0x26cfa90) Creating new connection because none are available in the pool
18031ms [I0] {1372} [DEBUG] (lcbio_mgr - L:321) <cb-d02.beta-int.sightplan.com:8091> (HE=0x26cfa90) Starting connection on I=0x28d3800
18031ms [I0] {1372} [INFO] (connection - L:450) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Starting. Timeout=75000000us
18031ms [I0] {1372} [TRACE] (connection - L:267) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Got event handler for new connection
18031ms [I0] {1372} [TRACE] (connection - L:314) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Scheduling asynchronous watch for socket.
28041ms [I0] {1372} [DEBUG] (lcbio_mgr - L:464) <cb-d02.beta-int.sightplan.com:8092> (HE=0x251dd60) Idle connection expired
93096ms [I0] {1372} [ERROR] (http-io - L:231) Connection to failed with Err=0x17
93096ms [I0] {1372} [TRACE] (confmon - L:292) Start refresh requested
93096ms [I0] {1372} [ERROR] (connection - L:127) <cb-d02.beta-int.sightplan.com:8091> (SOCK=0x28e1a20) Failed: lcb_err=0x17, os_errno=0
93096ms [I0] {1372} [DEBUG] (lcbio_mgr - L:271) <cb-d02.beta-int.sightplan.com:8091> (HE=0x26cfa90) Received result for I=0x28d3800,C=(nil); E=0x17
I’ve seen various forum and ruby posts complaining of this issue, but most are older. I curious if this is a known issue, because it seems really easy to generate (just back to back queries on a node that is taking NO writes and is otherwise stable and has ZERO load – no other clients).