Couchbase.IO.SendTimeoutExpiredException: The operation has timed out

SDK version: 2.4.8
Server version: 4.5.1

We are intermittently receiving send timeout exceptions from the client SDK. The Exception on the OperationResult object only yields “Couchbase.IO.SendTimeoutExpiredException: The operation has timed out”. These timeouts cascade and effectively stop all transactions from occurring on that service(let’s call this Service A) on that node. The only fix we have at the moment is to recycle the service’s app pool on that node.

There’s a Service B(same SDK version and timeout configs) running on the same server. Service B operates on the same bucket as Service A, but does not have send timeout problems even when Service A goes bad. This makes it seem like this isn’t load related.

  • We’ve enabled Log4Net, but that did not log any useful information on any log levels.
  • We’ve made all Couchbase calls in Service A async, but we still get the issue.

Could you direct us to any troubleshooting methods we could use to identify the issue?

2 Likes

@alexc -

  • Posting (or sending the logs if your concerned about exposing the contents) may help since were used to reading and interpreting them.

  • Post your configuration if possible (put dummy values for IPs if needed)

  • After that, packet capture with WireShark can/may be useful

  • Note that a SendTimeoutException is thrown when the client is waiting to read the response from the server; this timeout is controlled by ClientConfiguration.SendTimeout - if this value is too low any network hiccup can cause timeout (or large docs or contention, etc.). Tuning this value may help.

1 Like

Thanks for getting back so quickly.

I’ll have to wait til the next time our issue pops up (this tends to happen every 2-3 days) to post logs. I just re-enabled Log4Net and we didn’t keep backups of the logs.

Here’s our configuration. We opted to use default values for near everything.

<couchbaseClients>
<couchbase useSsl="false" enableConfigHeartBeat="false">
  <servers>
    <add uri="[cacheserver1]" />
    <add uri="[cacheserver2]" />
    <add uri="[cacheserver3]" />
    <add uri="[cacheserver4]" />
    <add uri="[cacheserver5]" />
  </servers>
  <buckets>
    <add name="[bucket]" password="" useSsl="false"></add>
  </buckets>
</couchbase>
</couchbaseClients>
1 Like

@jmorris Here’s the logs for 2 back to back failed transactions:

2017-09-05 08:14:32,199 [196] DEBUG Couchbase.IO.Services.PooledIOService - Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.																								
at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)																								
at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)																								
2017-09-05 08:14:32,199 [196] DEBUG Couchbase.Core.Buckets.CouchbaseRequestExecuter - Operation doesn't support retries for key UAccountMem-14837038																								
2017-09-05 08:14:32,199 [196] DEBUG Couchbase.Core.Buckets.RequestExecuterBase - Operation for key UAccountMem-14837038 failed after 1 retries using vb823 from rev65898 and opaque18714363. Reason: The operation has timed out.																								
2017-09-05 08:14:32,199 [196] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 196																								
2017-09-05 08:14:32,199 [196] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 421 for key UAccountSet-14837038 - rev65898																								
2017-09-05 08:14:32,199 [196] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn f9eaf90c-5516-4e0a-9a6d-67b93b3a2450 on [IP3]:11210																								
2017-09-05 08:14:32,199 [59] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 59																								
2017-09-05 08:14:32,199 [59] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 215 for key UAccountSec-14837038 - rev65898																								
2017-09-05 08:14:32,199 [59] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn fd4ae96a-ff82-46f8-8fa0-f2a95d3cf9b6 on [IP2]:11210																								
2017-09-05 08:14:32,246 [188] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 188																								
2017-09-05 08:14:32,246 [188] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 729 for key UAccount-1080675 - rev65898																								
2017-09-05 08:14:32,246 [188] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn e59f0c32-6437-42e3-8c3f-411c307c8a04 on [IP4]:11210																								
2017-09-05 08:14:32,246 [171] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 171																								
2017-09-05 08:14:32,246 [171] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 201 for key UAccountMem-1080675 - rev65898																								
2017-09-05 08:14:32,246 [171] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 95263714-30ea-471c-934d-4c4b4a15b30d on [IP1]:11210																								
2017-09-05 08:14:32,246 [196] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 196																								
2017-09-05 08:14:32,246 [196] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 139 for key UAccountSec-1080675 - rev65898																								
2017-09-05 08:14:32,246 [196] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn b2b9872a-9804-4f17-9bdb-8d5d93caa8de on [IP1]:11210																								
2017-09-05 08:14:32,246 [188] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 188																								
2017-09-05 08:14:32,246 [188] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 729 for key UAccount-1080675 - rev65898																								
2017-09-05 08:14:32,246 [188] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 899c4338-6e0d-41ba-9f4a-13351ce97089 on [IP4]:11210																								
2017-09-05 08:14:32,246 [176] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 176																								
2017-09-05 08:14:32,246 [176] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 201 for key UAccountMem-1080675 - rev65898																								
2017-09-05 08:14:32,246 [176] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 95263714-30ea-471c-934d-4c4b4a15b30d on [IP1]:11210																								
2017-09-05 08:14:32,246 [59] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 59																								
2017-09-05 08:14:32,246 [59] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 42 for key UAccountSet-1080675 - rev65898																								
2017-09-05 08:14:32,246 [59] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn b2b9872a-9804-4f17-9bdb-8d5d93caa8de on [IP1]:11210																								
2017-09-05 08:14:32,261 [177] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 177																								
2017-09-05 08:14:32,261 [177] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 139 for key UAccountSec-1080675 - rev65898																								
2017-09-05 08:14:32,261 [177] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 95263714-30ea-471c-934d-4c4b4a15b30d on [IP1]:11210																								
2017-09-05 08:14:32,277 [188] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 188																								
2017-09-05 08:14:32,277 [188] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 729 for key UAccount-1080675 - rev65898																								
2017-09-05 08:14:32,277 [188] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn e59f0c32-6437-42e3-8c3f-411c307c8a04 on [IP4]:11210																								
2017-09-05 08:14:32,293 [171] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 171																								
2017-09-05 08:14:32,293 [171] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 201 for key UAccountMem-1080675 - rev65898																								
2017-09-05 08:14:32,293 [171] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn b2b9872a-9804-4f17-9bdb-8d5d93caa8de on [IP1]:11210																								
2017-09-05 08:14:32,293 [196] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 196																								
2017-09-05 08:14:32,293 [196] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 139 for key UAccountSec-1080675 - rev65898																								
2017-09-05 08:14:32,293 [196] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 95263714-30ea-471c-934d-4c4b4a15b30d on [IP1]:11210																								
2017-09-05 08:14:32,293 [188] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 188																								
2017-09-05 08:14:32,293 [188] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 729 for key UAccount-1080675 - rev65898																								
2017-09-05 08:14:32,293 [188] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 899c4338-6e0d-41ba-9f4a-13351ce97089 on [IP4]:11210																								
2017-09-05 08:14:32,293 [164] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 164																								
2017-09-05 08:14:32,293 [164] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 900 for key UAccountDem-1080675 - rev65898																								
2017-09-05 08:14:32,293 [164] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn d5b80179-8ae3-4a80-baf4-0fe8d13ab29d on [IP5]:11210																								
2017-09-05 08:14:32,605 [174] DEBUG Couchbase.IO.Services.PooledIOService - Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.																								
at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)																								
at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)																								
2017-09-05 08:14:32,605 [174] DEBUG Couchbase.Core.Buckets.CouchbaseRequestExecuter - Operation doesn't support retries for key UAccountSet-18908022																								
2017-09-05 08:14:32,605 [174] DEBUG Couchbase.Core.Buckets.RequestExecuterBase - Operation for key UAccountSet-18908022 failed after 1 retries using vb852 from rev65898 and opaque18714366. Reason: The operation has timed out.																								
2017-09-05 08:14:32,605 [174] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 174																								
2017-09-05 08:14:32,605 [174] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 550 for key UAccountSec-18908022 - rev65898																								
2017-09-05 08:14:32,605 [174] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 15fc03a1-9d14-4f0c-a933-ba247189a1cb on [IP3]:11210																								
2017-09-05 08:14:32,886 [188] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 188																								
2017-09-05 08:14:32,886 [188] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 38 for key UAccount-4610258 - rev65898																								
2017-09-05 08:14:32,886 [188] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn b2b9872a-9804-4f17-9bdb-8d5d93caa8de on [IP1]:11210																								
2017-09-05 08:14:32,902 [196] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 196																								
2017-09-05 08:14:32,902 [196] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 566 for key UAccountMem-4610258 - rev65898																								
2017-09-05 08:14:32,902 [196] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn f9eaf90c-5516-4e0a-9a6d-67b93b3a2450 on [IP3]:11210																								
2017-09-05 08:14:32,902 [171] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 171																								
2017-09-05 08:14:32,902 [171] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 628 for key UAccountSec-4610258 - rev65898																								
2017-09-05 08:14:32,902 [171] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn e59f0c32-6437-42e3-8c3f-411c307c8a04 on [IP4]:11210																								
2017-09-05 08:14:32,902 [188] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 188																								
2017-09-05 08:14:32,902 [188] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 38 for key UAccount-4610258 - rev65898																								
2017-09-05 08:14:32,902 [188] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 95263714-30ea-471c-934d-4c4b4a15b30d on [IP1]:11210																								
2017-09-05 08:14:32,902 [164] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 164																								
2017-09-05 08:14:32,902 [164] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 566 for key UAccountMem-4610258 - rev65898																								
2017-09-05 08:14:32,902 [164] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 15fc03a1-9d14-4f0c-a933-ba247189a1cb on [IP3]:11210																								
2017-09-05 08:14:32,902 [143] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 143																								
2017-09-05 08:14:32,902 [143] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 725 for key UAccountSet-4610258 - rev65898																								
2017-09-05 08:14:32,902 [143] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn 899c4338-6e0d-41ba-9f4a-13351ce97089 on [IP4]:11210																								
2017-09-05 08:14:32,902 [176] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 176																								
2017-09-05 08:14:32,902 [176] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 628 for key UAccountSec-4610258 - rev65898																								
2017-09-05 08:14:32,902 [176] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn e59f0c32-6437-42e3-8c3f-411c307c8a04 on [IP4]:11210																								
2017-09-05 08:14:33,293 [178] TRACE Couchbase.Configuration.ConfigContextBase - Getting KeyMapper for rev#65898 on thread 178																								
2017-09-05 08:14:33,293 [178] TRACE Couchbase.Core.Buckets.VBucketKeyMapper - Using index 72 for key UAccountMem-21033734 - rev65898																								
2017-09-05 08:14:33,293 [178] TRACE Couchbase.IO.Services.SharedPooledIOService - Using conn b2b9872a-9804-4f17-9bdb-8d5d93caa8de on [IP1]:11210

@alexc - that timeout appears to be the client waiting on a response from the server. Since no other errors precede it, I suspect perhaps something network/environment related.

-Jeff

We are receiving exactly the same error intermittently on our servers.

2018-01-16 20:56:20.588|STP SmartThreadPool Thread #7667|ERROR|Cruise.Cache.Base.Cache.CouchbaseCache|ExecuteUpsert|Upsert: False The operation has timed out.|key 83399d8b4ed55cb07fd22bee8b576b8ed46e7571|exception Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.-> at Couchbase.IO.MultiplexingConnection.Send(Byte request)-> at Couchbase.IO.Services.MultiplexingIOService.Execute[T](IOperation`1 operation)

2018-01-16 20:56:20.588|STP SmartThreadPool Thread #7667|FATAL|Cruise.Cache.Base.Cache.CouchbaseCache|ExecuteUpsert|cache timeout and stopped working with error The operation has timed out. duration 15.0089136|83399d8b4ed55cb07fd22bee8b576b8ed46e7571| at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)-> at System.Environment.get_StackTrace()-> at Couchbase.IO.SendTimeoutExpiredException…ctor(String message)-> at Couchbase.IO.MultiplexingConnection.Send(Byte request)-> at Couchbase.IO.Services.MultiplexingIOService.Execute[T](IOperation1 operation)-> at Couchbase.Core.Server.Send[T](IOperation1 operation)-> at Couchbase.Core.Buckets.MemcachedRequestExecuter.SendWithRetry[T](IOperation1 operation)-> at Couchbase.MemcachedBucket.Upsert[T](String key, T value, TimeSpan expiration, TimeSpan timeout)-> at Couchbase.MemcachedBucket.Upsert[T](String key, T value, TimeSpan expiration)-> at Cruise.Cache.Base.Cache.CouchbaseCache.ExecuteUpsert[TEntity](String ckey, TEntity valueToSave, TimeSpan timeToLive)-> at Amib.Threading.Internal.WorkItemsGroupBase.<>c__DisplayClass43_03.b__0(Object state) in M:\Dev\STP\STP.git\SmartThreadPool\WorkItemsGroupBase.cs:line 337-> at Amib.Threading.Internal.WorkItem.ExecuteWorkItem() in M:\Dev\STP\STP.git\SmartThreadPool\WorkItem.cs:line 381-> at Amib.Threading.Internal.WorkItem.Execute() in M:\Dev\STP\STP.git\SmartThreadPool\WorkItem.cs:line 314-> at Amib.Threading.SmartThreadPool.ExecuteWorkItem(WorkItem workItem) in M:\Dev\STP\STP.git\SmartThreadPool\SmartThreadPool.cs:line 910-> at Amib.Threading.SmartThreadPool.ProcessQueuedItems() in M:\Dev\STP\STP.git\SmartThreadPool\SmartThreadPool.cs:line 850-> at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)-> at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)-> at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)-> at System.Threading.ThreadHelper.ThreadStart()

After we start receiving that specific error, the client continue to give SendTimeoutExpiredException and continue to stop working until we recycle the pool.

The solution we implemented is to catch the SendTimeoutExpiredException and close and reopen the connections using ClusterHelper.Close and ClusterHelper.Inizialize

I suppose the is an error on the client that doesn’t reopen the connections when they loose something.

We are using Client SDK 2.5.3

That shouldn’t be necessary. Any thoughts @jmorris or @MikeGoldsmith?

@simopala -

A SendTimeoutExpiredException happens when the client times-out out waiting for a response to return from the server. Generally it is some kind of network or transport failure that causes this response to “hang”; often its a TCP RST. When the SDK generates this exception, it will check the state of the connection and recreate a new one if its no longer connected. You should not have to re-cycle the app pool or close and re-open the client by calling ClusterHelper.Close/Initialize; the SDK should repair itself.

Since this is just the exception that is generated when the request times out, it doesn’t contain enough information to determine why it timed out or what the internal state of the SDK is. To get this information you’ll need to enable logging on the SDK and grep through the logs (you can send me a dm with them and i’ll take a look as well).

Another useful tool is to run sdk-doctor to help diagnose any network issues between your app and the cluster.

-Jeff

Logging is enabled in our servers, at warn level, but no other exceptions has been thrown.
When the client start throwing a SendTimeoutExpiredException, it will continue throwing those exceptions and doesn’t reopen the connections. The only workaround we found was to force close and re-initialize the cluster.

I tried executing the sdk-doctor utility on our server (1 of 3) and I received come alerts:

sdk-doctor diagnose http://az-cah-couch001.crociere.lan:8091/pricing -u Administrator -p xxxxxx

09:59:42.464 INFO ? Parsing connection string http://az-cah-couch001.crociere.l an:8091/pricing
09:59:42.501 WARN ? Connection string is using the deprecated http:// scheme.
Use the couchbase:// scheme instead!
09:59:42.502 INFO ? Connection string identifies the following CCCP endpoints:
09:59:42.502 INFO ? Connection string identifies the following HTTP endpoints:
09:59:42.502 INFO ? 1. az-cah-couch001.crociere.lan:8091
09:59:42.503 INFO ? Connection string specifies bucket pricing
09:59:42.503 WARN ? Your connection string specifies only a single host. You sh
ould consider adding additional static nodes from your cluster to this list to i
mprove your applications fault-tolerance
09:59:42.504 INFO ? Performing DNS lookup for host az-cah-couch001.crociere.lan
09:59:42.507 INFO ? Bootstrap host az-cah-couch001.crociere.lan refers to a se
rver with the address 10.70.4.21
09:59:42.507 INFO ? Not attempting CCCP, as the connection string does not suppo
rt it
09:59:42.508 INFO ? Attempting to connect to cluster via HTTP (Terse)
09:59:42.508 INFO ? Attempting to fetch terse config via http from az-cah-couch 001.crociere.lan:8091
09:59:42.524 WARN ? Bootstrap host az-cah-couch001.crociere.lan is not using t
he canonical node hostname of 10.70.4.21. This is not neccessarily an error,
but has been known to result in strange and difficult-to-diagnose errors in the
future when routing gets changed.
09:59:42.525 INFO ? Identified the following nodes:
09:59:42.525 INFO ? [0] 10.70.4.21
09:59:42.526 INFO ? mgmt: 8091, indexStreamMaint: 9105
, capi: 8092
09:59:42.526 INFO ? moxi: 11211, indexAdmin: 9100
, indexScan: 9101
09:59:42.526 INFO ? indexHttp: 9102, indexStreamInit: 9103
, indexStreamCatchup: 9104
09:59:42.532 INFO ? projector: 9999, kv: 11210
, n1ql: 8093
09:59:42.534 INFO ? [1] 10.70.4.22
09:59:42.534 INFO ? mgmt: 8091, indexAdmin: 9100
, indexHttp: 9102
09:59:42.534 INFO ? indexStreamCatchup: 9104, capi: 8092
, projector: 9999
09:59:42.535 INFO ? n1ql: 8093, indexScan: 9101
, indexStreamInit: 9103
09:59:42.535 INFO ? indexStreamMaint: 9105, kv: 11210
, moxi: 11211
09:59:42.536 INFO ? [2] 10.70.4.23
09:59:42.536 INFO ? mgmt: 8091, indexScan: 9101
, indexStreamCatchup: 9104
09:59:42.536 INFO ? capi: 8092, moxi: 11211
, n1ql: 8093
09:59:42.542 INFO ? indexAdmin: 9100, indexHttp: 9102
, indexStreamInit: 9103
09:59:42.543 INFO ? indexStreamMaint: 9105, projector: 9999
, kv: 11210
09:59:42.544 WARN ? Your configuration was fetched via a non-optimal path, you s
hould update your connection string and/or cluster configuration to allow CCCP c
onfig fetch
09:59:42.544 INFO ? Fetching config from 10.70.4.21:8091
09:59:42.547 INFO ? Failed to retreive cluster information (status code: 401)
09:59:42.549 ERRO ? Failed to connect to KV service at 10.70.4.21:11210 (error
: invalid bucket name/password)
09:59:42.571 INFO ? Successfully connected to MGMT service at 10.70.4.21:8091
09:59:42.581 INFO ? Successfully connected to CAPI service at 10.70.4.21:8092
09:59:42.583 INFO ? Successfully connected to N1QL service at 10.70.4.21:8093
09:59:42.591 ERRO ? Failed to connect to KV service at 10.70.4.22:11210 (error
: invalid bucket name/password)
09:59:42.595 INFO ? Successfully connected to MGMT service at 10.70.4.22:8091
09:59:42.605 INFO ? Successfully connected to CAPI service at 10.70.4.22:8092
09:59:42.623 INFO ? Successfully connected to N1QL service at 10.70.4.22:8093
09:59:42.631 ERRO ? Failed to connect to KV service at 10.70.4.23:11210 (error
: invalid bucket name/password)
09:59:42.641 INFO ? Successfully connected to MGMT service at 10.70.4.23:8091
09:59:42.647 INFO ? Successfully connected to CAPI service at 10.70.4.23:8092
09:59:42.650 INFO ? Successfully connected to N1QL service at 10.70.4.23:8093
09:59:42.652 WARN ? Failed to perform KV connection performance analysis on 10. 70.4.21:11210 (error: %!d(string=invalid bucket name/password))
09:59:42.654 WARN ? Failed to perform KV connection performance analysis on 10. 70.4.22:11210 (error: %!d(string=invalid bucket name/password))
09:59:42.657 WARN ? Failed to perform KV connection performance analysis on 10. 70.4.23:11210 (error: %!d(string=invalid bucket name/password))
09:59:42.658 INFO ? Diagnostics completed

Summary:
←[33m[WARN]←[0m Connection string is using the deprecated http:// scheme. Use
the couchbase:// scheme instead!
←[33m[WARN]←[0m Your connection string specifies only a single host. You should
consider adding additional static nodes from your cluster to this list to impro
ve your applications fault-tolerance
←[33m[WARN]←[0m Bootstrap host az-cah-couch001.crociere.lan is not using the c
anonical node hostname of 10.70.4.21. This is not neccessarily an error, but
has been known to result in strange and difficult-to-diagnose errors in the futu
re when routing gets changed.
←[33m[WARN]←[0m Your configuration was fetched via a non-optimal path, you shoul
d update your connection string and/or cluster configuration to allow CCCP confi
g fetch
←[33m[WARN]←[0m Failed to perform KV connection performance analysis on 10.70.4 .21:11210 (error: %!d(string=invalid bucket name/password))
←[33m[WARN]←[0m Failed to perform KV connection performance analysis on 10.70.4 .22:11210 (error: %!d(string=invalid bucket name/password))
←[33m[WARN]←[0m Failed to perform KV connection performance analysis on 10.70.4 .23:11210 (error: %!d(string=invalid bucket name/password))
←[31m[ERRO]←[0m Failed to connect to KV service at 10.70.4.21:11210 (error: in
valid bucket name/password)
←[31m[ERRO]←[0m Failed to connect to KV service at 10.70.4.22:11210 (error: in
valid bucket name/password)
←[31m[ERRO]←[0m Failed to connect to KV service at 10.70.4.23:11210 (error: in
valid bucket name/password)

Found multiple issues, see listing above.

if I try connecting via couchbase:// I’m receiving an unexpected error:

sdk-doctor diagnose couchbase://az-cah-couch001.crociere.lan:8091/pricing -u Administrator -p xxxxxx

10:01:02.536 INFO ? Parsing connection string couchbase://az-cah-couch001.croci ere.lan:8091/pricing
10:01:02.586 INFO ? Connection string identifies the following CCCP endpoints:
10:01:02.586 INFO ? 1. az-cah-couch001.crociere.lan:8091
10:01:02.586 INFO ? Connection string identifies the following HTTP endpoints:
10:01:02.586 INFO ? Connection string specifies bucket pricing
10:01:02.587 WARN ? Your connection string specifies only a single host. You sh
ould consider adding additional static nodes from your cluster to this list to i
mprove your applications fault-tolerance
10:01:02.587 INFO ? Performing DNS lookup for host az-cah-couch001.crociere.lan
10:01:02.591 INFO ? Bootstrap host az-cah-couch001.crociere.lan refers to a se
rver with the address 10.70.4.21
10:01:02.592 INFO ? Attempting to connect to cluster via CCCP
10:01:02.592 INFO ? Attempting to fetch config via cccp from az-cah-couch001.cr ociere.lan:8091
10:06:02.599 ERRO ? Failed to fetch configuration via cccp from az-cah-couch001 .crociere.lan:8091 (error: EOF)
10:06:02.600 INFO ? Not attempting HTTP (Terse), as the connection string does n
ot support it
10:06:02.600 INFO ? Not attempting HTTP (Full), as the connection string does no
t support it
10:06:02.601 ERRO ? All endpoints specified by your connection string were unrea
chable, further cluster diagnostics are not possible
10:06:02.601 INFO ? Diagnostics completed

Summary:
←[33m[WARN]←[0m Your connection string specifies only a single host. You should
consider adding additional static nodes from your cluster to this list to impro
ve your applications fault-tolerance
←[31m[ERRO]←[0m Failed to fetch configuration via cccp from az-cah-couch001.cro ciere.lan:8091 (error: EOF)
←[31m[ERRO]←[0m All endpoints specified by your connection string were unreachab
le, further cluster diagnostics are not possible

Found multiple issues, see listing above.

I am asking for logging to be enabled within the SDK, not your application logs: Couchbase SDKs

Is your bucket name and password correct in your configuration?

I suspect networking issues (possible DNS issue?); your node is not reachable. You’ll need to diagnose and resolve these issues. The timeouts are a side-effect of something not being correct between the application and the cluster.

Thanks for you response.

I know that the problem can be related to a networking problem (maybe a dns issue). I’m investigating about this issue.

What I don’t understand is why I have to catch the timeout exception and force a close and re-initialize of the cluster helper. This is a task that has to be done internally by the client.
If I don’t “restart” the connections on couchbase client it will continue throwing those SendTimeoutExpiredException forever, for every operation. The connections should be automatically reopened by the client.

Do you have some advise or investigating on that?

1 Like

@simopala -

That is interesting because the SDK does indeed handle the creation/destruction of connections, this is how the client supports use-cases such as failover, node swapping and rebalances.

One thing to note is that when you call Initialize you are effectively re-bootstrapping the SDK instead of relying on the inner mechanisms which manage connections. The cluster tells the client via a cluster map (config) update if it has changed or not; however if the cluster doesn’t send a revision, then the cached IPEndPoints, IPAddress, etc maybe reused. That could be your problem, but not sure…grepping the SDK logs would help here.

Additionally, if the client detects too many failures coming from a node, it will go into a back-off state where it won’t send requests until it determines that it can connect to the node and receive NOOPs. In this case it will immediately fail the request and return a NodeUnavailableException until the client can successfully can connect and recieve NOOPs.

-Jeff

Hi All,

I am not sure if you have sorted out the issue?
We are also having the similar issue as @alexc mentioned at the beginning - the client complains operation has timed out and unless we restart the ClusterHelper, all the requests are remained timed out.

The version of .Net client is 2.5.5, and ClusterHelper (mentioned here: https://developer.couchbase.com/documentation/server/4.0/sdks/dotnet-2.2/cluster-helper.html) is adapted in our implementation.

Our setup is as follow:

<couchbaseClients>
  <couchbase useSsl="false">
    <servers>
      <add uri="http://192.168.10.75:8091/pools"/>
    </servers>
    <buckets>
      <add name="myBucket">
        <connectionPool maxSize="10" minSize="5"/>
      </add>
    </buckets>
  </couchbase>
</couchbaseClients>

We found the time out issu happens after “idle” for a while - meaning when we stop sending requests to the server for a while then we send again, time out happens.

Having read @jmorris 's suggestion to enable SDK log, we turned it on. After examining the log a little bit, we found the time out issue happens since the connection that was used before getting disconnected.

Here are part of the log:

[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase 
Handling disconnect for connection b4abdb4d-1763-4aba-9e2e-1c531175ce2f: Couchbase.IO.RemoteHostClosedException: The remote host (10.38.110.11:11210) has gracefully closed this connection.


[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase 
Closing connection b4abdb4d-1763-4aba-9e2e-1c531175ce2f


[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase 
Handling disconnect for connection 1090cb62-ac02-4c1b-ac6b-8f408b9f03a8: Couchbase.IO.RemoteHostClosedException: The remote host (10.38.110.11:11210) has gracefully closed this connection.


[2018-03-16 13:04:14,642] Couchbase.IO.ConnectionBase 
Closing connection 1090cb62-ac02-4c1b-ac6b-8f408b9f03a8


...... Check config and processing config ......


[2018-03-16 13:04:33,549] Couchbase.Core.Server 
Sending Get`1 with key PARBJS180510180526-100-JWZ-1002-Y using server 10.38.110.11:11210


[2018-03-16 13:04:33,549] Couchbase.IO.ConnectionBase 
Handling disconnect for connection b4abdb4d-1763-4aba-9e2e-1c531175ce2f: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
   at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode)
   at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)


[2018-03-16 13:04:33,549] Couchbase.Core.Server 
Sending Get`1 with key PARBJS180510180526-100-JWZ-2007-Y using server 10.38.110.11:11210


[2018-03-16 13:04:33,586] Couchbase.IO.ConnectionBase 
Handling disconnect for connection 1090cb62-ac02-4c1b-ac6b-8f408b9f03a8: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
   at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode)
   at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)

...... Check config and processing config ......

[2018-03-16 13:04:48,588] Couchbase.IO.Services.PooledIOService 
Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.
   at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
   at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)


[2018-03-16 13:04:48,588] Couchbase.IO.Services.PooledIOService 
Couchbase.IO.SendTimeoutExpiredException: The operation has timed out.
   at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
   at Couchbase.IO.Services.IOServiceBase.Execute[T](IOperation`1 operation, IConnection connection)
   at Couchbase.IO.Services.IOServiceBase.EnableServerFeatures(IConnection connection)
   at Couchbase.IO.Services.IOServiceBase.CheckEnabledServerFeatures(IConnection connection)
   at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)


[2018-03-16 13:04:48,588] Couchbase.Core.Server 
Checking if node 10.38.110.11:11210 should be down - last: 12:59:15.2323782, current: 13:04:48.5880404, count: 1


[2018-03-16 13:04:48,588] Couchbase.Core.Server 
Checking if node 10.38.110.11:11210 should be down - last: 12:59:15.2323782, current: 13:04:48.5880404, count: 2


[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.CouchbaseRequestExecuter 
Operation doesn't support retries for key PARBJS180510180526-100-JWZ-2007-Y


[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.RequestExecuterBase 
Operation for key PARBJS180510180526-100-JWZ-2007-Y failed after 1 retries using vb898 from rev69 and opaque326. Reason: The operation has timed out.


[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.CouchbaseRequestExecuter 
Operation doesn't support retries for key PARBJS180510180526-100-JWZ-1002-Y


[2018-03-16 13:04:48,588] Couchbase.Core.Buckets.RequestExecuterBase 
Operation for key PARBJS180510180526-100-JWZ-1002-Y failed after 1 retries using vb989 from rev69 and opaque325. Reason: The operation has timed out.

Hope this helps to solve the problem.

Regards,
Xi

Do you have TCP keep-alive set, either programmatically and/or on the server
https://developer.couchbase.com/documentation/server/current/sdk/dotnet/client-settings.html

We had a few issues with timeouts when the firewall/F5 started closing connections it deemed inactive, having a 60s keep alive ping fixed that.

@clinton1ql Thank you for the tip.

We do not enable any keep live setting explicitly. However, I found the keep live is enabled by default and set to 2 hours according to the document.

The issue we are facing is after the first request sent, we wait for 5 more minutes and send the second one. The second one is very likely to end up with operation time out. When we enable the SDK logging, we wait to send the second request after the connection is closed (we were watching the log in the test). And this guarantees a operation timed out.

This part is what we want to ask. Normally if a connection is disconnected because of idle, I think that’s perfectly fine. We just restablish the connection and send the later requests out. However, my client does a little abnormal - unless I reset it, it will return timed out for each request I try to send later.

@xiduan -

Your connection is definitely being closed probably by something between the app and the server (firewall, LB, etc): The remote host (10.38.110.11:11210) has gracefully closed this connection.

In this case the SDK will re-establish a connection and for operation that allow retries, retry the operation on the new connection. Normally, the application never knows this happens and just works as expected. However, since the operation type is a mutation, retries are not allowed at the SDK level because it may have succeeded on the server and the client just never got the response.:

In this case, the application should retry the operation immediately if that is the behavior you wish to have. This second request will likely succeed assuming that whatever closed the connection initially hasn’t closed it again.

Hopefully, that helps!

-Jeff

Hi Jeff,

Thank you very much for pointing out the issue.
I do not quite understand that:

What we did is request a bucket from the ClustHelper and issue a generic Get option, like this:

IBucket statusBucket = ClusterHelper.GetBucket("status");
var statusGetResult = statusBucket.Get<CacheStatus>(key);

Please could you shed some light that how can I make it able to retry at SDK level?

Regards,
Xi

@xiduan -

I was referring to this log output:

You should only see that when using a some sort of mutation without using CAS.

I can’t see enough of the log to correlate the exact operation that is failing. A GET should be retried until it times out. However, the initial GET might spend its lifetime waiting for the connection response and then when that times out, then it too would time out and not be retried. The next operation should succeed in that case, assuming the connection wasn’t closed again. You can send me a more complete log using a private message if you wish and I can take a look.

This would simply be a retry loop in the application where you check the response and decide if you want to retry or not based on the status of the operation response.

-Jeff

Hi @jmorris

According to our observation, the Get operation was hanging for about 15 seconds before it returned back.
We print the operation result into the log if it is not a successful one as well - we can see the operation result failure happened 15 seconds later after the request was issued (between the 15 seconds there were some config check according to the SDK log) and we don’t do any retry between them.
I am eager to send you the full log but I can’t find a place to draft a private message. As a new member, I am not able to send an attachment as well.
Please could you advise how can I send the private message, or if it possible to send you by email?

Regards,
Xi

20180317-debug.zip (73.2 KB)
Hi @jmorris ,

I have managed to upload my SDK log here.

The problem is that once disconnected, it takes a long time (about 15s) to result in a timed out response and it didn’t recovery by itself because it will give the same timed out response if I send another request after the first one. Moreover, I tried to set up a timeout parameter with the Get operation in another test, but the timeout parameter given by myself was not respected - it still took about 15s to respond back the timed out message.

Both the client and the server are on the same LAN and I have disabled firewall on the server when doing the test.

The server version is 4.5.0 and the client version is 2.5.5.

Many Thanks,
Xi