Changes to DNS SRV require application restart

I’ve written a test client in .NET 5 using CouchbaseNetClient 3.2.5 that performs a small, simple SELECT query every second.

I’m using the Kubernetes Operator 2.2 and Couchbase 7.0.2 with a single node. The connection string is: couchbase://cb-srv
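For reference, this is roughly what the test client does (a minimal sketch; the credentials and bucket name here are placeholders, not my real ones):

```csharp
using System;
using System.Threading.Tasks;
using Couchbase;

class Program
{
    static async Task Main(string[] args)
    {
        // Bootstrap via DNS SRV; the SRV record is resolved here.
        var cluster = await Cluster.ConnectAsync(
            "couchbase://cb-srv", "Administrator", "password");

        // Run a tiny query once per second, forever.
        while (true)
        {
            var result = await cluster.QueryAsync<dynamic>(
                "SELECT meta().id FROM location_bucket LIMIT 1");
            await foreach (var row in result)
                Console.WriteLine(row);
            await Task.Delay(TimeSpan.FromSeconds(1));
        }
    }
}
```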

If I start the test client running and then make a change to the Couchbase cluster that results in a new node being started and the old one being shut down, I start getting the following exception forever, until I restart the client:

Couchbase.Core.Exceptions.RequestCanceledException: The query was canceled.
 ---> System.Net.Http.HttpRequestException: Name or service not known (cb-0004.cb.cb-test.svc:8093)
 ---> System.Net.Sockets.SocketException (0xFFFDFFFF): Name or service not known
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|283_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.DefaultConnectAsync(SocketsHttpConnectionContext context, CancellationToken cancellationToken)
   at System.Net.Http.ConnectHelper.ConnectAsync(Func`3 callback, DnsEndPoint endPoint, HttpRequestMessage requestMessage, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(Func`3 callback, DnsEndPoint endPoint, HttpRequestMessage requestMessage, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.SendAsyncCore(HttpRequestMessage request, HttpCompletionOption completionOption, Boolean async, Boolean emitTelemetryStartStop, CancellationToken cancellationToken)
   at Couchbase.Query.QueryClient.ExecuteQuery[T](QueryOptions options, ITypeSerializer serializer, IRequestSpan span)
   --- End of inner exception stack trace ---
   at Couchbase.Query.QueryClient.ExecuteQuery[T](QueryOptions options, ITypeSerializer serializer, IRequestSpan span)
   at Couchbase.Query.QueryClient.QueryAsync[T](String statement, QueryOptions options)
   at Couchbase.Cluster.<>c__DisplayClass33_0`1.<<QueryAsync>g__Func|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Couchbase.Core.Retry.RetryOrchestrator.RetryAsync[T](Func`1 send, IRequest request)
   at Couchbase.Cluster.QueryAsync[T](String statement, QueryOptions options)
   at CouchbaseConnectionTester.Program.Main(String[] args) in C:\Users\PaulClark\source\repos\CouchbaseConnectionTester\Program.cs:line 30
-----------------------Context Info---------------------------
{"Statement":"[{\"statement\":\"SELECT meta().id FROM location_bucket LIMIT 1\",\"timeout\":\"75000ms\",\"client_context_id\":\"1c401494-33ec-4165-85af-8f4a99721511\"}]","ClientContextId":"1c401494-33ec-4165-85af-8f4a99721511","Parameters":"{\"Named\":{},\"Raw\":{},\"Positional\":[]}","HttpStatus":408,"QueryStatus":6,"Errors":null,"Message":null}

As you can see, the client is still referencing the previous node: cb-0004.cb.cb-test.svc:8093.

I’d expect it to re-resolve cb-srv and start connecting to cb-0005. If this expectation is incorrect, I will need to restart my application every time a new node is spun up, which doesn’t seem right.

@pc

That behavior is expected if you’re running a single-node cluster. The Couchbase architecture is really designed for multi-node clusters in many respects; this is just one of them.

The cb-srv domain name is only resolved at bootstrap to find the cluster. After that, the SDK always communicates directly with all of the nodes in the cluster via the nodes’ own addresses. In a production scenario, if a node fails, communication is still available to the other nodes in the cluster. As a replacement node comes online, the surviving nodes inform the SDK, and it will connect to the new node and begin using it.

Your test is failing because you’re killing the entire cluster at once, so there is no communication path to inform the SDK about the new nodes. If you were to add more nodes and repeat the test, keeping at least one node live, it should work fine.


Ok, that makes sense. I will write a wrapper that manages the connection for this scenario.
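Something like this, perhaps (a rough sketch, not production-hardened: no locking around the reset, and it retries only once). The idea is to dispose the stale `ICluster` on failure and re-bootstrap, which forces a fresh SRV lookup of cb-srv:

```csharp
using System;
using System.Threading.Tasks;
using Couchbase;
using Couchbase.Core.Exceptions;
using Couchbase.Query;

// Hypothetical wrapper: when a query fails with RequestCanceledException
// (as in the stack trace above), throw away the cluster object and
// bootstrap again so the DNS SRV record is re-resolved.
public class ReconnectingCluster : IDisposable
{
    private readonly string _connectionString;
    private readonly string _username;
    private readonly string _password;
    private ICluster _cluster;

    public ReconnectingCluster(string connectionString, string username, string password)
    {
        _connectionString = connectionString;
        _username = username;
        _password = password;
    }

    private async Task<ICluster> GetClusterAsync()
    {
        // ConnectAsync performs the SRV lookup as part of bootstrap.
        _cluster ??= await Cluster.ConnectAsync(_connectionString, _username, _password);
        return _cluster;
    }

    public async Task<IQueryResult<T>> QueryAsync<T>(string statement)
    {
        try
        {
            var cluster = await GetClusterAsync();
            return await cluster.QueryAsync<T>(statement);
        }
        catch (RequestCanceledException)
        {
            // The old node is gone; reset and retry once against a
            // freshly bootstrapped connection.
            Reset();
            var cluster = await GetClusterAsync();
            return await cluster.QueryAsync<T>(statement);
        }
    }

    private void Reset()
    {
        _cluster?.Dispose();
        _cluster = null;
    }

    public void Dispose() => Reset();
}
```

A real implementation would also need to guard against concurrent resets and probably back off between reconnect attempts, but this is the shape of it.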
