Unable to locate node errors in client during rebalance

Hello,

Any help/suggestions would be appreciated. Thanks!


Problem:

When a Couchbase node is removed or failed over, we receive “Unable to locate node” errors on every request/insert until rebalancing finishes. These errors are thrown only for keys stored on the node that has been removed. During this time our sites take a huge performance hit, resulting in 404 errors and long response times.


My expected result:

When a node is removed or fails, the client detects right away that it is no longer available and allows our application to fall back to the database.

We have tried reducing the dead timeout so the Couchbase client refreshes its node list right away, but we are still seeing a lot of errors.


Question:

For a production environment, what are the suggested client configuration settings when targeting 300 ms response times for a website?

My next step at this point is to enable logging on the Couchbase client, but hopefully someone here has some insight.


Environment Setup

Server: 2.2.0 enterprise edition (build-821)

  • 3 to 5 nodes
  • 1 bucket

Client: 1.3.4.0

  • We access Couchbase from our site layer and our service layer.
  • Errors are thrown from both the site machines and the service machines.
  • The service layer is a set of .NET 4.5 WCF services hosted in IIS 7.5 with a .NET 4.0 application pool.
  • The site layer is an ASP.NET MVC 4 application on .NET 4.5.
  • We recently enabled ASP.NET compatibility mode so the Couchbase client instance can be stored in HttpContext.Current.Cache and shared across all WCF endpoints.
  • Our configs have entries for 3 Couchbase nodes
  • We run 5 nodes with 1 bucket with about 20 clients connected.
  • ~50000 documents in the bucket
  • RAM / Quota Usage: 8.25GB / 24.4GB
  • Data / Disk Usage: 2.25GB / 2.88GB
  • ~600 to 1200 ops/sec
  • replicas: disabled
  • compaction: disabled

Client Configuration

<servers bucket="default" bucketPassword="" username="couchbase" password="couchbase">
  <add uri="http://couchbase1:8091/pools" />
  <add uri="http://couchbase2:8091/pools" />
  <add uri="http://couchbase3:8091/pools" />
</servers>
<socketPool connectionTimeout="00:00:10" deadTimeout="00:00:05" queueTimeout="00:00:02.500" receiveTimeout="00:00:05" />
<httpClient initializeConnection="false" timeout="00:00:10"/>

Source Files of Interest:

CouchbaseClient.cs
Throws a ClientErrors.FAILURE_NODE_NOT_FOUND error if the server pool [this.Pool.Locate(hashedKey)] returns null.

DefaultServerPool.cs


Related Issues:

http://www.couchbase.com/communities/q-and-a/unable-locate-node-error-net-sdk

http://www.couchbase.com/issues/browse/NCBC-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel

Thanks for the very detailed question(s). There is a lot going on here, so I’ll try to help as best I can.

First, what bucket types are you using? Are you using views or straight k/v?

Second (this answers your “Problem” section): a rebalance should only happen occasionally, when you add, remove or fail over a node. From an operational perspective it should be exceptional, not part of the normal process. During a rebalance you should only occasionally receive a Not My VBucket (NMV) error, on something like less than 1% of your ops. A ClientErrors.FAILURE_NODE_NOT_FOUND indicates something is _really_ wrong with your system. Internally, during a rebalance the client retries operations where an NMV is encountered; this rarely bubbles up to the application layer. I am guessing that there is something else going on here. Enable logging (at a level higher than DEBUG, or you will be swamped with trace statements) and configure it to write to a file; you can then create a Jira ticket and attach the file here: http://www.couchbase.com/issues/browse/NCBC

Third (“My expected result”), you can use exception handling and logging within your application to do this. Reducing the dead timeout won’t help you here; the client creates a streaming connection to the server, and config updates happen when a new config is published.
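The application-level fallback described above can be sketched roughly as follows. This is a language-neutral illustration written in Python, not the .NET client API: `GetResult`, `ReadThrough`, `cache_get` and `db_get` are hypothetical names, and the only value taken from this thread is status code 126 for “Unable to locate node”. The idea is to treat any failed cache read as a miss, go to the database, and count failures per status code rather than logging every request.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Status code this thread reports for "Unable to locate node".
NODE_NOT_FOUND = 126


@dataclass
class GetResult:
    """Hypothetical result object: success flag, optional status code, value."""
    success: bool
    status_code: Optional[int] = None
    value: Optional[str] = None


class ReadThrough:
    """Cache-aside reader that treats any failed cache read as a miss."""

    def __init__(self, cache_get: Callable[[str], GetResult],
                 db_get: Callable[[str], Optional[str]]) -> None:
        self.cache_get = cache_get  # assumed wrapper around the cache client
        self.db_get = db_get        # assumed database lookup
        # Count non-success results per status code instead of logging each one.
        self.failures: Dict[Optional[int], int] = {}

    def get(self, key: str) -> Optional[str]:
        try:
            result = self.cache_get(key)
        except Exception:
            # A thrown error is handled the same way as a failed result.
            result = GetResult(success=False)
        if result.success:
            return result.value
        code = result.status_code
        self.failures[code] = self.failures.get(code, 0) + 1
        return self.db_get(key)


# Usage: a cache that cannot locate the node falls through to the database.
database = {"user:1": "alice"}
reader = ReadThrough(
    lambda key: GetResult(success=False, status_code=NODE_NOT_FOUND),
    database.get,
)
value = reader.get("user:1")
```

The per-code counters can be flushed to a log on a timer, which keeps error reporting from adding to the load while a node is unavailable.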

Fourth (“Question”), this is something you need to determine by tuning, measuring and repeating. Since every deployment and use-case is different, there is no panacea.

Thanks,

Jeff

We are using Couchbase buckets. No views.

I am now reworking our code to never log the “Unable to locate node” errors (API status code 126) to see if that helps with performance.

Thanks for the quick response. I will reply if I make some progress.

My previous post had some formatting issues and was displaying in full. Take a look at it now.