ConfigurationException and OperationTimeoutException with 1 machine down, or in Pending (warm-up) state

Hi,

My application opens connections and does a getBulk, but I’m getting the following two exceptions, at separate times.

1)
Caused by: com.couchbase.client.vbucket.ConfigurationException: Could not fetch a valid Bucket configuration.
at com.couchbase.client.vbucket.provider.BucketConfigurationProvider.bootstrap(BucketConfigurationProvider.java:123) ~[couchbase-client-1.4.2.jar:na]
at com.couchbase.client.vbucket.provider.BucketConfigurationProvider.getConfig(BucketConfigurationProvider.java:373) ~[couchbase-client-1.4.2.jar:na]
at com.couchbase.client.CouchbaseConnectionFactory.getVBucketConfig(CouchbaseConnectionFactory.java:317) ~[couchbase-client-1.4.2.jar:na]
at com.couchbase.client.CouchbaseClient.<init>(CouchbaseClient.java:258) ~[couchbase-client-1.4.2.jar:na]
at com.nsn.ngdb.common.cache.client.impl.couchbase.CouchbaseSpyMemcacheClient.<init>(CouchbaseSpyMemcacheClient.java:227) ~[couchbase-cache-server-plugin-16_7.1.jar:na]
… 11 common frames omitted

2)
Caused by: net.spy.memcached.OperationTimeoutException: Timeout waiting for bulk values: waited 100,000 ms. Node status: Connection Status { xxx10.10.10.164:11210 active: true, authed: true, last read: 97,348 ms ago xxx/10.10.10.162:11210 active: true, authed: true, last read: 97,409 ms ago xxx/10.10.10.163:11210 active: false, authed: false, last read: 228,364 ms ago }
at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1567) ~[spymemcached-2.11.3.jar:2.11.3]
at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1602) ~[spymemcached-2.11.3.jar:2.11.3]
at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1617) ~[spymemcached-2.11.3.jar:2.11.3]
at com.test.cocuhbase.getBulk(CouchbaseSpyMemcacheClient.java:504) ~[couchbase-cache-server-test.jar:na]
… 14 common frames omitted
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Operation timed out. - failing node: xxx/10.10.10.163:11210
at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:127) ~[spymemcached-2.11.3.jar:2.11.3]
at net.spy.memcached.internal.BulkGetFuture.get(BulkGetFuture.java:52) ~[spymemcached-2.11.3.jar:2.11.3]
at net.spy.memcached.MemcachedClient.getBulk(MemcachedClient.java:1556) ~[spymemcached-2.11.3.jar:2.11.3]
… 17 common frames omitted

Environment:
Couchbase version: Version: 2.1.1 community edition (build-764)
3 node couchbase cluster

Memory:
In Use: 33.7 GB
Unused: 14.9 GB

Total buckets: 10
Item count: > 9 Mn
Replication: 1
Operation timeout: 100 sec

I’ve noticed this mostly when 1 of the 3 Couchbase machines is down, or when any machine is in the pending state.
I was expecting that if 1 machine is down, the other 2 machines would still be able to provide a valid bucket configuration, connections, and data.
These errors are seen even after failing over the machine that was down.
Any clue about the possible reasons?

Please let me know if more information/logs are required.

Thanks,
Balkrishan

How do you configure your CouchbaseClient initially? Do you provide the URLs of all 3 nodes or just one? If just one, could it be the node that is down, by any chance?

I’m passing all 3 nodes in the connection URL list.
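
For reference, a bootstrap list covering all three nodes might be built roughly as follows. This is only a sketch: the class and method names are made up for illustration, and port 8091 with the /pools path is an assumption based on the usual convention for the 1.x Java SDK, not something taken from the poster's code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Couchbase SDK.
final class BootstrapNodes {

    // Builds one bootstrap URI per host.
    // Assumption: the REST port is 8091 and the path is /pools.
    static List<URI> bootstrapUris(String... hosts) {
        List<URI> uris = new ArrayList<>();
        for (String host : hosts) {
            uris.add(URI.create("http://" + host + ":8091/pools"));
        }
        return uris;
    }

    // With the SDK on the classpath, the list would then be handed to the client,
    // for example (bucket name and password are placeholders):
    //   new CouchbaseClient(uris, "bucketName", "password");
}
```

Listing every node means the client can still fetch a configuration when the first node in the list is unreachable, as long as at least one listed node responds.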

Caused by: net.spy.memcached.OperationTimeoutException: Timeout waiting for bulk values: waited 100,000 ms. Node status: Connection Status { xxx10.10.10.164:11210 active: true, authed: true, last read: 97,348 ms ago xxx/10.10.10.162:11210 active: true, authed: true, last read: 97,409 ms ago xxx/10.10.10.163:11210 active: false, authed: false, last read: 228,364 ms ago }

The log indicates that the SDK still tries to contact the 3rd (down) node, and that its last successful read from that node was 228,364 ms (about 228 seconds) ago. You say the down node has been failed over? Was that before starting the bulk get? Before starting the client?

Also, since you copied the second exception again in your last response, I’m not sure what you meant to say. Does that mean you changed your code to pass all 3 nodes and you no longer see the first exception?

Sorry for the confusion.
I copied the 2nd exception again just to restate that the connection URL list has had all 3 nodes since the beginning, and to highlight that the Connection Status shows active: true for 2 nodes and active: false for 1 node.

The application continuously tries to do the getBulk, and we didn’t stop the application. So the error was seen before the failover.
Why does the SDK still try to connect to the 3rd machine, which is down? If it is unable to connect, shouldn’t it try the next available node?

OK, thanks for the clarification. Which version of the SDK are you using exactly? If you can provide logs (e.g. in a secret gist), that would be great too.

I’m using SDK 1.4.2 (I also tried 1.4.10, but saw no improvement).
Sorry, the Couchbase stack trace above is all the relevant log output I have.

OK, after looking back at the code for getBulk, it makes sense why this fails if the node hasn’t been failed over before starting the bulk get:

getBulk prepares the operations by distributing the keys among their responsible nodes as they are seen at the start of the bulk get. Keys that map to the down node are therefore still sent to it, and they time out.

If you retry after the failover, the cluster map should have been updated, and the next getBulk would then target the promoted replica.

I’m sure you can even make it so it only retries keys that haven’t been retrieved:

  • in the callback, add the keys that are correctly received to a collection doneKeys (a thread-safe, append-efficient one like a ConcurrentLinkedQueue)
  • if there’s an error, removeAll the doneKeys from the original list of keys to fetch (or from a copy, if that list is used elsewhere)
  • retry with the smaller list of keys
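
The steps above could be sketched roughly like this. Note the assumptions: BulkRetry, BulkFetcher, and getBulkWithRetry are hypothetical names, and BulkFetcher is only a stand-in for however your code issues the bulk get and receives per-key callbacks. It is not the SDK's API. The sketch also assumes that failures surface as a RuntimeException (as the OperationTimeoutException in the stack trace does) and that values are never null.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of the Couchbase or spymemcached API.
final class BulkRetry {

    // Stand-in for a bulk get: calls onValue once per key it manages to read
    // and throws a RuntimeException (e.g. a timeout) if the bulk get fails.
    @FunctionalInterface
    interface BulkFetcher {
        void fetch(Collection<String> keys, BiConsumer<String, Object> onValue);
    }

    static Map<String, Object> getBulkWithRetry(
            BulkFetcher fetcher, Collection<String> keys, int maxAttempts) {
        Map<String, Object> results = new ConcurrentHashMap<>();
        // Work on a copy so the caller's key collection is left untouched.
        List<String> remaining = new ArrayList<>(keys);
        RuntimeException lastError = null;

        for (int attempt = 1; attempt <= maxAttempts && !remaining.isEmpty(); attempt++) {
            // Thread-safe collection of the keys received during this attempt.
            Queue<String> doneKeys = new ConcurrentLinkedQueue<>();
            try {
                fetcher.fetch(remaining, (key, value) -> {
                    results.put(key, value);
                    doneKeys.add(key);
                });
                // No error: keys that were not delivered simply have no value.
                return results;
            } catch (RuntimeException e) {
                lastError = e;
                // Retry only the keys that have not been received yet.
                remaining.removeAll(doneKeys);
            }
        }
        if (lastError != null && !remaining.isEmpty()) {
            throw lastError; // attempts exhausted with keys still missing
        }
        return results;
    }
}
```

In practice you would probably also pause between attempts, so that the cluster map has time to be updated after the failover before the next try.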

I advise you to stay on the latest patch release of the SDK: when a new one comes out, it means bugs have been fixed.

Ok. We can try out this solution.
Thanks.

Also, is the 1st exception (Could not fetch a valid Bucket configuration) related to the same thing?

I don’t think so. If at least one of the nodes in the cluster is up (and you initialize the client with the full list), it should manage to connect. You mentioned that at one point all the nodes in the cluster were in the “pending” state? That could be the source of the error.

Yes, I also suspect the “pending” state is behind this connection error.
Can you please tell me where I can learn more about the Couchbase “pending” (warm-up) state?
What does it mean, when does it occur, how can it be resolved, etc.?

It looks like this post is related: “Couchbase buckets shutting down and server going to pending state”. If you’re from the same team, maybe you can continue the inquiry about the pending state there?

Otherwise, I’d suggest using “reply as a new topic” in the Couchbase Server category, and I’ll try to ping a few folks and dig up the correct documentation.