Couchbase v3 SDK `KvNotMyVBucket` errors after add node + rebalance

Hi,

After we add a new node to a cluster and then perform a rebalance, our applications start throwing timeout exceptions with RetryReasons: KvNotMyVBucket. Note: in my local cluster running in Docker I was able to reproduce it only twice, but it always reproduces in remote environments.
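For context, this is roughly how the failures surface in our application code (a minimal sketch rather than our real code; the collection, key and logger are placeholders, and I'm assuming the standard SDK 3.x exception types):

using Couchbase.Core.Exceptions;
using Couchbase.KeyValue;
using Microsoft.Extensions.Logging;

// Placeholder data-access helper, not our production code.
async Task<string?> TryGetAsync(ICouchbaseCollection collection, string key, ILogger logger)
{
   try
   {
      var result = await collection.GetAsync(key);
      return result.ContentAs<string>();
   }
   catch (UnambiguousTimeoutException ex)
   {
      // ex.Context is where the SDK reports details such as the retry reasons
      // (KvNotMyVBucket in our case) and how many times the op was retried.
      logger.LogWarning(ex, "Get for {Key} timed out; context: {Context}", key, ex.Context);
      return null;
   }
}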

Cluster and SDK information:

  • SDK version: 3.3.2
  • Cluster version: 7.1.2 build 3454
  • There are 4 nodes in the cluster (3 in my local env)
  • Every bucket configured with 2 replicas

According to the logs from my local env, a new node can't be initialized because the SelectBucket operation fails with a KeyNotFound error:

15:47:17 dbug: Couchbase.Core.ClusterNode[104] Op failed: Couchbase.Core.IO.Operations.Authentication.SelectBucket Couchbase.Core.Exceptions.KeyValue.DocumentNotFoundException: Exception of type 'Couchbase.Core.Exceptions.KeyValue.DocumentNotFoundException' was thrown. at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationTokenPair tokenPair)
-----------------------Context Info---------------------------
{
  "dispatchedFrom": "e0d74601479e",
  "dispatchedTo": "172.16.101.14:11207",
  "documentKey": "travel-sample-magma",
  "clientContextId": "705270",
  "cas": 0,
  "status": "keyNotFound",
  "bucketName": "travel-sample-magma",
  "collectionName": null,
  "scopeName": null,
  "message": "KV Error: {Name=\u0022KEY_ENOENT\u0022, Description=\u0022Not Found\u0022, Attributes=\u0022item-only\u0022}",
  "opCode": "selectBucket",
  "retryReasons":
}

The most interesting thing happens right after the error. The SDK tries to perform a connection reinitialization:

15:47:17 dbug: Couchbase.Core.ClusterNode[0] Starting connection reinitialization on server 172.16.101.14:11207.
15:47:17 dbug: Couchbase.Core.ClusterNode[100] Executing op Helo on 172.16.101.14:11207 with key {"i":"bf9049a231a53c4c/3f5737c6d5a269d2","a":"couchbase-net-sdk/3.3.2.0 (clr/.NET 6.0.13) (os/Linux 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022)"} and opaque 705376.

After this I can't find anything else related to opaque 705376, which probably means we never receive a response from the cluster. It also seems that a cancellation token isn't passed to the reinitialization flow in the ClusterNode:

public async Task HelloHello()
{
   foreach (var connection in ConnectionPool.GetConnections())
   {
      ...
      using var rootSpan = RootSpan("reinitialize_connection");
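      // note: the Hello call below is awaited without passing a CancellationToken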
      var serverFeatureList = await Hello(connection, rootSpan).ConfigureAwait(false);
      ...
   }
}

After this I don't see any records related to cluster node updates like this one:

15:47:17 dbug: Couchbase.CouchbaseBucket[7000] Updating cluster nodes for rev#9090 on travel-sample-magma

It seems like I ended up in a situation where the cluster vBucket map indicates that some vBuckets are on the new node, but this node isn't in the cluster node list, so the SDK picks a random node and sends the request to it, which of course responds with a NotMyVBucket error. After 6 retry attempts the operation fails.

I'm going to test the same with the latest SDK version, but it would be great to know what might cause this issue in 3.3.2 and whether there are any workarounds.

The link to the archive with two log files from when I was able to reproduce the issue:

@eugene-shcherbo

Thanks for posting. This looks like it might be related to NCBC-3215; I would try a later version (preferably the latest) and see if it has been resolved as there have been quite a few bug fixes since 3.3.2.

Hopefully it's resolved; otherwise we'll open a Jira ticket.

Jeff


Hi @jmorris ,

Thank you for looking into this. Sure, trying the newest version is my plan. I'll post updates here as soon as I have results.

Hi @jmorris ,

The latest version of the SDK, unfortunately, doesn't fix the issue. It's all the same: retries due to KvNotMyVBucket errors.

BTW I also noticed that the timeout errors show a strange total operation time in these cases:

The Get operation 533339/{key} timed out after 00:00:00.0028336. It was retried 7 times using

When using exponential backoff it's not possible to retry 7 times and elapse almost 0 seconds. I believe this is because the operation is treated as completed the first time this happens, and StopRecording is called on the operation when it completes for the first time.

public void HandleOperationCompleted(in SlicedMemoryOwner<byte> data)
{
   _dispatchSpan?.Dispose();
   var prevCompleted = Interlocked.Exchange(ref _isCompleted, 1);
   if (prevCompleted == 1)
   {
      StopRecording();
      data.Dispose();
      return;
    }
...

This is not directly related to the current issue, but it explains the confusing total time in timeout errors.
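To put a rough number on why the reported duration can't be the real elapsed time (the backoff values here are purely hypothetical; I haven't checked what delays the SDK's retry strategy actually uses), even a very aggressive schedule adds up to far more than the ~3 ms reported:

// Hypothetical schedule: 1 ms base delay, doubling on every attempt.
// Even this accumulates ~127 ms of delay alone across 7 retries,
// far more than the 00:00:00.0028336 reported above.
var baseDelay = TimeSpan.FromMilliseconds(1);
var total = TimeSpan.Zero;
for (var attempt = 0; attempt < 7; attempt++)
{
   total += baseDelay * Math.Pow(2, attempt); // 1 + 2 + 4 + ... + 64 ms
}
Console.WriteLine(total.TotalMilliseconds); // 127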

@eugene-shcherbo

During a rebalance, it's perfectly normal for NotMyVBucket to happen, as the topology has changed before the SDK knows it has changed. In most cases these resolve after being retried with a slight delay, since the SDK will likely have caught up by the time the retries happen; however, in some cases an operation may end up timing out.
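If the main pain point is operations timing out while the rebalance is still in flight, one thing to experiment with is giving KV operations more headroom via ClusterOptions (just a sketch; the connection string and credentials are placeholders, and the right value depends on your environment):

using Couchbase;

var options = new ClusterOptions
{
   ConnectionString = "couchbase://your-host", // placeholder
   UserName = "user",                          // placeholder
   Password = "password",                      // placeholder
   KvTimeout = TimeSpan.FromSeconds(5)         // default is 2.5s; gives retries more room during rebalance
};
var cluster = await Cluster.ConnectAsync(options);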

As for the strange timeout, we can look into that; possibly just a formatting issue w/logs?

@jmorris

Yeah, I agree that it's normal for NotMyVBucket to happen during a rebalance, but we see the timeouts even after the rebalance has completed, and they keep occurring until the application is restarted. I checked the logs and it seems like every time the error occurs the client selects the wrong node to send the request to. The VBucketMap shows one node, but the client sends to another, which seems to be chosen at random.

As for the strange timeout, we can look into that; possibly just a formatting issue w/logs?

I don't think it's a formatting issue, because when we see timeouts due to networking issues, the logs show the correct 2.5 seconds. But here it seems that when the server returns a response (with a retriable status), the operation is considered completed and the timer stops.

That would not be expected behavior; can you create a Jira ticket and include the SDK logs?

@jmorris

Sure I will. Thanks

The link to the ticket https://issues.couchbase.com/browse/NCBC-3334

KeyNotFound can be returned temporarily when select_bucket is called during a rebalance, especially if the bucket is still in its "warm up" phase. The code will then try to recreate the connections and start the process over again. So, nothing too unusual here.

You can see evidence of that here:

Couchbase.Core.ClusterNode[8000] The Bucket [travel-sample-magma] could not be selected. Either 
it does not exist, is unavailable or the node itself does not have the Data service enabled.

This is likely because, for whatever reason, the operation never returned and is an "orphan"; this can happen in many different cases, but most likely the connection was closed because of the earlier failure, since it will be recreated from the beginning. I see evidence here:

15:47:17 dbug: Couchbase.Core.ClusterNode[104] Op failed: Couchbase.Core.IO.Operations.Hello System.ObjectDisposedException: Cannot access a disposed object. Object name: 'MultiplexingConnection'.    at Couchbase.Utils.ThrowHelper.ThrowObjectDisposedException(String objectName)    at Couchbase.Core.IO.Connections.MultiplexingConnection.SendAsync(ReadOnlyMemory`1 request, IOperation operation, CancellationToken cancellationToken)    at Couchbase.Core.IO.Operations.OperationBase.SendAsync(IConnection connection, CancellationToken cancellationToken)    at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationTokenPair tokenPair)

Looking at your logs, I do see a very large number of NotMyVBuckets, but I am not sure why. I am not sure about your log formatting, but every failure is on Couchbase.Core.ClusterNode[104] (this includes Helo, GET and SelectBucket commands). Any insight into what the 104 value is? Managed thread id?

  • I am wondering if it's possible that the bucket is or was deleted/removed from the CB server during the rebalance?
  • The time span between when the NMVBs are encountered and when the log ends is not very long: 15:45:58-15:48:53. Depending on various factors, a rebalance may take considerably longer.

@jmorris

Thanks for the detailed analysis. This all makes sense, but to answer your questions:

Q: I am wondering if it's possible that the bucket is or was deleted/removed from the CB server during the rebalance?
A: No, we didn’t do anything special.

Q: The time span between when the NMVBs are encountered and when the log ends is not very long: 15:45:58-15:48:53. Depending on various factors, a rebalance may take considerably longer.
A: I see the error at the very end of the *-magma log file:

15:53:07 Retrying op 724139/route_39938 because KvNotMyVBucket and always retry.

As for 104, I think this is an event id used for logging failures in the ClusterNode class:

 [LoggerMessage(104, LogLevel.Debug, "Op failed: {op}")]
 private partial void LogKvOperationFailed(Exception ex, IOperation op);

Hi @jmorris

I have more details on the issue. It seems like in our env we failed to authenticate on a new node the first time (I don't know the reason yet, but will keep looking). So it fails with an OperationCanceledException, I think because of a timeout:

Error processing new clusterOptions
The operation was canceled.
Type: System.OperationCanceledException
Source: System.Private.CoreLib
TargetSite: Void Throw()
StackTrace:
   at Couchbase.Core.IO.Authentication.SaslMechanismBase.SendAsync[T](IOperation`1 op, IConnection connection, CancellationToken cancellationToken)
   at Couchbase.Core.IO.Authentication.PlainSaslMechanism.AuthenticateAsync(IConnection connection, CancellationToken cancellationToken)
   at Couchbase.Core.ClusterNode.Couchbase.Core.IO.Connections.IConnectionInitializer.InitializeConnectionAsync(IConnection connection, CancellationToken cancellationToken)
   at Couchbase.Core.IO.Connections.ConnectionPoolBase.CreateConnectionAsync(CancellationToken cancellationToken)
   at Couchbase.Core.IO.Connections.Channels.ChannelConnectionPool.<>c__DisplayClass28_0.<g__StartConnection|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Couchbase.Core.IO.Connections.Channels.ChannelConnectionPool.InitializeAsync(CancellationToken cancellationToken)
   at Couchbase.Core.ClusterNode.InitializeAsync()
   at Couchbase.Core.DI.ClusterNodeFactory.CreateAndConnectAsync(HostEndpointWithPort endPoint, NodeAdapter nodeAdapter, CancellationToken cancellationToken)
   at Couchbase.Core.ClusterContext.ProcessClusterMapAsync(BucketBase bucket, BucketConfig config)
   at Couchbase.CouchbaseBucket.ConfigUpdatedAsync(BucketConfig newConfig)
   at Couchbase.Core.Configuration.Server.ConfigHandler.ProcessAsync(BucketConfig newMap)

But the main thing that worries me is that when the SDK attempts to add a new node because the cluster map has changed, and that attempt fails, it doesn't try to add the node again until the cluster map changes again. This is due to how ConfigHandler and CouchbaseBucket work.

Let's say we add a new node to the cluster and ConfigHandler.ProcessAsync(BucketConfig newMap) has been called with a new map. In this code it adds or updates the map:

var stored = _configs.AddOrUpdate(newMap.Name, key =>
{
   isNewOrUpdate = true;
   return newMap;
},
(key, map) =>
{
   if (newMap.Equals(map)) return map;

   isNewOrUpdate = true;
   return newMap.Rev > map.Rev ? newMap : map;
});

So note that regardless of whether the connection was established or not, we update the _configs object, which means that on all subsequent NotMyVBucket responses (which carry the cluster map in the body) we won't try to re-add the node, because this code never sees any changes in the map.

Of course, eventually when the rebalance finishes this isNewOrUpdate variable will be true, so the handler will notify the CouchbaseBucket of the updates, but in the bucket code we have the following:

if (CurrentConfig.ClusterNodesChanged)
{
  // Connect and Update bucket nodes                   
}

So the bucket won't do anything, because the nodes in the config haven't changed. It probably makes sense for the bucket code to compare the config nodes with the actual bucket nodes. I'm not sure, but I think we want a way to reconnect to a node if it is in the cluster map but not in the bucket's node list. A rough sketch of the comparison I have in mind is below.
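Purely illustrative and not actual SDK code (the hostname-based comparison is just my assumption of how such a check could work), but something along these lines:

// Instead of relying only on CurrentConfig.ClusterNodesChanged, diff the
// hostnames in the incoming config against the nodes the bucket actually has.
static bool ConfigHasNodesMissingFromBucket(
   IEnumerable<string> configNodeHosts,  // e.g. hostnames from the new config's node list
   IEnumerable<string> bucketNodeHosts)  // e.g. hostnames of the bucket's live nodes
{
   var connected = new HashSet<string>(bucketNodeHosts, StringComparer.OrdinalIgnoreCase);
   return configNodeHosts.Any(host => !connected.Contains(host));
}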
