Bucket Operation Timeout on .GetDocument keeping bucket in bad state


#1

I’m using the latest .NET SDK v2.x with Couchbase Server v3.x. I’m running some perf tests and hitting timeouts on the bucket.GetDocument call. I’m finding that when this happens, the bucket can’t be used again: every subsequent call on it times out. I’m using the ClusterHelper class for my cluster and bucket management. Is this the expected behavior? When I remove the bucket and get a new one, all subsequent calls work fine.

Thanks!


#2

@troy_stone

Are you using SDK 2.2? I know that there were some improvements related to timeout handling in that version.

If you are already on 2.2, then can you provide some more detailed error message text?

Thanks!


#3

I’m using 2.2.2. All I get from the GetDocument call is my result object with Success=false and a message of: “The operation has timed out.” Again, once this occurs, all subsequent calls using that bucket object time out. If I reinitialize the Cluster or use a new bucket, it works fine.


#4

I’m also seeing an issue with my Cluster being disposed, which causes a null reference exception when I call GetBucket().


#5

@troy_stone

You’re not calling Dispose on your buckets or clusters, are you? If you’re using ClusterHelper, you shouldn’t wrap either of them in using statements or call Dispose on them.
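For reference, here is a minimal sketch of the intended ClusterHelper lifecycle in an ASP.NET app (the bucket name, document key, and empty configuration are placeholder assumptions, not your setup):

```csharp
// Sketch only: initialize ClusterHelper once at startup, reuse the bucket
// everywhere, and close only at application shutdown.
using Couchbase;
using Couchbase.Configuration.Client;

// Application_Start in global.asax:
ClusterHelper.Initialize(new ClientConfiguration());

// Anywhere in the app; note: no using block and no Dispose call:
var bucket = ClusterHelper.GetBucket("default");
var result = bucket.GetDocument<dynamic>("some-key");

// Application_End in global.asax; the only place the cluster is torn down:
ClusterHelper.Close();
```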

Brant


#6

I am having to call ClusterHelper.RemoveBucket(), which in turn disposes the bucket. I have to call it because of the original issue of the bucket timing out.


#7

@troy_stone

Previously you mentioned that you were hitting timeouts during some perf testing. Out of curiosity, during this perf testing were you doing GetDocument or GetDocumentAsync calls? If GetDocument, how many simultaneous threads were you using?


#8

GetDocument, I’ve ramped it up to 180 threads per second, but previously it was as low as 10. I’ve also noticed that if I call ClusterHelper.GetBucket() in my app start-up and don’t actually use the bucket, it will timeout as well. Is that expected?


#9

I really need to get to the bottom of this; otherwise I’m going to have to abandon Couchbase for my project. Here’s an example of the Cluster disposing after a perf test finishes, without me removing the bucket or closing the cluster myself.

2015-12-16 06:56:44.3327 | DEBUG | 18 | Couchbase.IO.ConnectionPool`1[[Couchbase.IO.Connection, Couchbase.NetClient, Version=2.2.2.0, Culture=neutral, PublicKeyToken=05e9c6b5a9ec94c2]] | Acquire existing: 6942676f-72ef-4e0c-a315-99565012d3c1 | 192.127.232.90:11210 | [4, 5] - c89c1e70-8251-404b-9997-6d84d624783e - Disposed: False
2015-12-16 06:56:44.3327 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Sending opaque1229 on 6942676f-72ef-4e0c-a315-99565012d3c1
2015-12-16 06:56:44.3327 | DEBUG | 16 | Couchbase.IO.ConnectionBase | Receive 1400 bytes for opaque1229 with 6942676f-72ef-4e0c-a315-99565012d3c1 on server 192.127.232.90:11210 offset16384
2015-12-16 06:56:44.3327 | DEBUG | 14 | Couchbase.IO.ConnectionBase | Receive 6445 bytes for opaque1229 with 6942676f-72ef-4e0c-a315-99565012d3c1 on server 192.127.232.90:11210 offset16384
2015-12-16 06:56:44.3327 | DEBUG | 14 | Couchbase.IO.ConnectionBase | Complete with set 1229 with 6942676f-72ef-4e0c-a315-99565012d3c1 on server 192.127.232.90:11210
2015-12-16 06:56:44.3327 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Complete opaque1229 on 6942676f-72ef-4e0c-a315-99565012d3c1
2015-12-16 06:56:44.3327 | DEBUG | 18 | Couchbase.IO.ConnectionPool`1[[Couchbase.IO.Connection, Couchbase.NetClient, Version=2.2.2.0, Culture=neutral, PublicKeyToken=05e9c6b5a9ec94c2]] | Releasing: 6942676f-72ef-4e0c-a315-99565012d3c1 on 192.127.232.90:11210 - c89c1e70-8251-404b-9997-6d84d624783e
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.CouchbaseBucket | Attempting dispose on thread 18
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.CouchbaseBucket | Current bucket refCount# 1
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.CouchbaseBucket | Removing bucket refCount# 0
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.CouchbaseBucket | Disposing on thread 18
2015-12-16 06:56:52.7103 | INFO | 18 | Couchbase.Configuration.Server.Providers.ConfigProviderBase | Unregistering observer 900000003
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.Configuration.ConfigContextBase | Disposing ConfigContext
2015-12-16 06:56:52.7103 | INFO | 18 | Couchbase.Core.Server | Disposing Server for 192.127.232.90:11210
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.Strategies.DefaultIOStrategy | Disposing DefaultIOStrategy for 192.127.232.90:11210 - cfeac944-cbd0-4b44-988f-51560fe54288
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionPool`1[[Couchbase.IO.Connection, Couchbase.NetClient, Version=2.2.2.0, Culture=neutral, PublicKeyToken=05e9c6b5a9ec94c2]] | Disposing ConnectionPool for 192.127.232.90:11210 - c89c1e70-8251-404b-9997-6d84d624783e
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing fa11d21e-e34a-4191-a694-9dee7ec5f7e6
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for fa11d21e-e34a-4191-a694-9dee7ec5f7e6
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 6942676f-72ef-4e0c-a315-99565012d3c1
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 6942676f-72ef-4e0c-a315-99565012d3c1
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 5cd9f726-17cf-402b-9e32-70f1af3681e7
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 5cd9f726-17cf-402b-9e32-70f1af3681e7
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 089b1662-e9ad-4497-a04f-1ea0305b80fe
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 089b1662-e9ad-4497-a04f-1ea0305b80fe
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 8ea8f2b8-f1d8-4498-aace-1408c952ba95
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 8ea8f2b8-f1d8-4498-aace-1408c952ba95
2015-12-16 06:56:52.7103 | INFO | 18 | Couchbase.Core.Server | Disposing Server for 192.127.232.91:11210
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.Strategies.DefaultIOStrategy | Disposing DefaultIOStrategy for 192.127.232.91:11210 - f33ae2e0-8708-46c6-a51a-6967359ab6af
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionPool`1[[Couchbase.IO.Connection, Couchbase.NetClient, Version=2.2.2.0, Culture=neutral, PublicKeyToken=05e9c6b5a9ec94c2]] | Disposing ConnectionPool for 192.127.232.91:11210 - 1cfde8a0-95a4-4a2f-a3b3-7d9906263e4d
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 38c3b8fd-34a7-47a3-9ebd-4d8415847be7
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 38c3b8fd-34a7-47a3-9ebd-4d8415847be7
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 9ffe9151-a113-4257-9376-b322a815c921
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 9ffe9151-a113-4257-9376-b322a815c921
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 234f366f-a835-4501-bce0-66932076ff7e
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 234f366f-a835-4501-bce0-66932076ff7e
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 82fcdb86-35fa-4bc1-9c62-f2bb689d372f
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 82fcdb86-35fa-4bc1-9c62-f2bb689d372f
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing 8ad9085a-0673-4d98-9e39-aeb13af440fc
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.IO.ConnectionBase | Disposing the timer for 8ad9085a-0673-4d98-9e39-aeb13af440fc
2015-12-16 06:56:52.7103 | INFO | 18 | Couchbase.Configuration.Server.Providers.ConfigProviderBase | Removing config for observer 900000003
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.Configuration.Server.Providers.ConfigProviderBase | Disposing ConfigurationProvider: CarrierPublicationProvider
2015-12-16 06:56:52.7103 | DEBUG | 18 | Couchbase.Cluster | Disposing Cluster


#10

I figured out the issue with the Cluster closing; it was a problem on my end, with ClusterHelper.Close being called when it shouldn’t be.

Thanks for the help!


#11

@troy_stone

So, on the Dispose problem: I think it’s related to disposing and reopening the bucket regularly; the SDK just isn’t designed to handle that use case. So I think we should focus on the original issue, which was getting timeout errors running your bulk insert.

Here’s my suspicion as to what your original problem was, though I don’t have enough facts to say for certain. I’m just going to sketch one way the bulk insert might have been written that could cause your problem:

  1. Start up multiple threads to run the bulk insert
  2. Each thread then runs a series of InsertAsync commands, but doesn’t “await” their completion
  3. This queues insert commands as fast as the processor reasonably can, potentially queueing hundreds or thousands
  4. The SDK tries to send these inserts through the connection pool to your cluster. Under the default configuration, that is a maximum of 10 connections per server, so on a 3-node cluster only 30 operations can be in flight at once.
  5. Depending on the speed of your cluster, it is quite possible that some of the later operations out of the very large queue in the SDK begin timing out because they are waiting on free connections in the ConnectionPool.
  6. Increasing the timeouts, the connection pool size, or server speed would help mitigate this problem, but all that would do is raise the total throughput threshold before timeouts start occurring. The core problem would still remain.
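To illustrate point 6, the knobs involved would look something like this (a sketch against the 2.x ClientConfiguration; the values shown are arbitrary assumptions, not recommendations):

```csharp
using Couchbase;
using Couchbase.Configuration.Client;

// Sketch: raising the operation lifespan and per-server pool size.
// This only raises the threshold before timeouts appear; it does not
// remove the underlying queueing problem.
var config = new ClientConfiguration
{
    DefaultOperationLifespan = 5000, // ms; the default is 2500
    PoolConfiguration = new PoolConfiguration
    {
        MinSize = 5,
        MaxSize = 20 // the default is 10 connections per server
    }
};
ClusterHelper.Initialize(config);
```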

The approach I would recommend is to run the inserts in parallel, but with a concurrency cap. I’d guess maybe double the total size of your connection pool. That way you’ll be getting the next batch of inserts ready while the previous batch is waiting on a response from the server. So if you have 3 servers and 10 connections per server, run 60 tasks, each awaiting its previous insert before another is attempted.
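A minimal sketch of that capped-concurrency pattern, assuming the 2.x IBucket.InsertAsync API (the class name, document type, and cap of 60 are illustrative, not prescriptive):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Couchbase;
using Couchbase.Core;

public static class BulkInsertSketch
{
    // Caps the number of in-flight InsertAsync calls so the SDK's internal
    // queue never grows far beyond the connection pool's capacity.
    public static async Task RunAsync(
        IBucket bucket, IEnumerable<Document<dynamic>> docs, int maxConcurrency = 60)
    {
        var throttle = new SemaphoreSlim(maxConcurrency);
        var pending = new List<Task>();
        foreach (var doc in docs)
        {
            await throttle.WaitAsync(); // waits once maxConcurrency inserts are in flight
            pending.Add(InsertThenReleaseAsync(bucket, doc, throttle));
        }
        await Task.WhenAll(pending);
    }

    private static async Task InsertThenReleaseAsync(
        IBucket bucket, Document<dynamic> doc, SemaphoreSlim throttle)
    {
        try { await bucket.InsertAsync(doc); }
        finally { throttle.Release(); } // frees a slot for the next queued insert
    }
}
```

The SemaphoreSlim is what keeps the queue bounded: instead of firing every InsertAsync immediately, each new insert waits until one of the in-flight operations completes.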

If this wasn’t your implementation, please let me know what the differences are.

Thanks,
Brant


#12

troy_stone -

In a nutshell, an OperationTimeoutException is a generic exception returned when an operation cannot be completed within the timespan defined by ClientConfiguration.DefaultOperationLifespan, which defaults to 2.5s. There are many, many reasons why an operation might fail, so in general each failure should be handled independently. That said, the cause of one OperationTimeout, if not resolved, can cause future OperationTimeouts; that is likely what you are seeing here. As a rule, though, the SDK should recover.

In terms of objects being disposed, this is generally tied to your application: the Cluster and Buckets you open must be maintained for the entire application lifespan. ClusterHelper should help you here, but you still need to ensure that they are not going out of scope or being closed/disposed by your application. Additionally, I suspect you may have to do some tuning to match your performance expectations.

I’m not sure I understand what you mean here; can you elaborate?

If you dispose of the Cluster, most definitely GetBucket will fail!

In order to help, I would need to see your code. If you can create an NCBC and upload an example project, I’ll take a look at it.

Also, feel free to reach out to me on twitter: @jeffrysmorris

Thanks!

Jeff


#13

As I mentioned above, I figured it out. There was a ClusterHelper.Close call in an HttpModule Dispose, when it should have been in the global.asax Application_End().