Is it possible to provide a small project that I can use to recreate the issue? For whatever reason, I cannot replicate the error, but I know it exists.
In order to help us diagnose this, I'd like to summarize what the IdleHttpConnectionTimeout value is meant to accomplish.
The error you are seeing is due to the connection being passively closed on the server, without actively notifying the HTTP client that it has been closed. The HTTP client reuses connections for subsequent HTTP requests for performance gains, but in this case it thinks it can reuse a connection it actually can't. When it tries to use the connection, the underlying socket discovers it's dead and throws an IOException, which bubbles up and eventually surfaces as a RequestCanceledException.
There are many possible causes for passive TCP stream closes like this, including network hardware such as firewalls and routers between the client and the server. However, one known cause is newer versions of Couchbase Server (6.0.4 and later, I'm told). As a mitigation against slowloris DoS attacks, the server may passively close HTTP connections after 5 seconds of idle time. Therefore, our change in 3.2.5 is to set the client-side idle timeout to 4.5 seconds. In theory, this means that we'll throw away the connection before the server does.
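To make that concrete, here's a minimal sketch of the idea (not the SDK's exact internals; the SDK wires this up for you). In .NET, the relevant knob on SocketsHttpHandler is PooledConnectionIdleTimeout:

```csharp
using System;
using System.Net.Http;

class IdleTimeoutSketch
{
    static void Main()
    {
        // Sketch of the idea behind the 3.2.5 change: pooled connections
        // that sit idle for 4.5s are discarded client-side, before the
        // server's 5s slowloris timeout can close them out from under us.
        var handler = new SocketsHttpHandler
        {
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(4.5)
        };
        using var client = new HttpClient(handler);
        // ... issue HTTP requests with 'client' as usual ...
    }
}
```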
Based on your comments above, it sounds like you've been experimenting with raising the value. In reality, if our theory is correct, the experiment should be lowering the value. It's possible our 4.5 second idle timeout isn't quite low enough in some environments, especially those with high latency.
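If you want to experiment with that, here's a minimal sketch of lowering the value programmatically (connection string and credentials are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Couchbase;

class LowerIdleTimeout
{
    static async Task Main()
    {
        var options = new ClusterOptions
        {
            UserName = "user",        // placeholder credentials
            Password = "password",
            // Try a value *below* the 4.5s default, e.g. 4 seconds,
            // to stay further clear of the server's 5s idle close.
            IdleHttpConnectionTimeout = TimeSpan.FromSeconds(4)
        };
        var cluster = await Cluster.ConnectAsync("couchbase://localhost", options);
        // ... run your workload ...
    }
}
```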
You also mention running the query every 3 seconds. Note that this does not necessarily keep the HTTP connection from going idle. The internals of SocketsHttpHandler may be managing more than one connection, each with its own idle time. Additionally, if you have more than one query node, the queries are spread randomly across those nodes, each with its own connection pool.
On another note, I recently completed an upgrade of a .NET 6 service that uses the Query service extensively from SDK 2 to 3.2.5 (against Server 6.0.4). After several days, I have seen no instances of the errors you mentioned in our logs. This is far from definitive, since you are on Server 6.6.3 and .NET Core 3.1 and querying at a different frequency, but it does point toward the possibility of a network-layer issue rather than an idle timeout issue; we should keep that in mind. Can you provide any details about the network topology between your clients and servers?
I've spent some time looking at this myself in Wireshark, trying to reproduce the problem. The more I see, the more I'm thinking this may be network-related.
I've been testing using SDK 3.2.5, .NET Core 3.1, and Server 6.6.3 to try to replicate your environment. In my case, the server is running on Linux via Docker on WSL 2, but I'm running the test from Windows. When I test the behavior around idle HTTP connections, I can see Couchbase Server closing the connection after 5 seconds of inactivity. However, in my case, I receive a graceful TCP FIN packet from the server right at the 5 second mark. Even if the client-side idle timeout is pushed up high (say, 1 minute), the connection is closed gracefully. When the next query request is sent, a new TCP connection (with a new client port) is created.
However, in your case, you said you were getting TCP RST packets when the next request was sent, as if something was causing the FIN packet either never to be sent or never to arrive. Another possibility is that your problem isn't related to connection reuse at all, and that new connections are sometimes failing with this error. I'd be very interested in seeing a Wireshark trace of the traffic to port 8093 when this issue occurs, including some of the requests leading up to the error.
Thanks for the additional information this morning. My test setup:
SDK 3.2.5, .NET Core 3.1, Server 6.6.3 build 9800. The server is running in a Docker container on CentOS 7, and the client on a Windows 10 VM, running the app from the command line.
I am attaching a Wireshark capture of a session from start to finish.
There are two events that I saw, one at 2021/12/21 12:10:28.822 and one at 2021/12/23 12:12:23.838.
What value for IdleHttpConnectionTimeout was configured at the time you took that trace? Looking at the failures, in both cases it appears that the final HTTP request went out just microseconds before the 5 second idle window closed. Perhaps the outgoing request was being sent at the same time the FIN was coming back from the server, a form of race condition.
If your IdleHttpConnectionTimeout was the default 4.5 seconds, can you try lowering it to 4 seconds and see what that does?
IdleHttpConnectionTimeout was set to 2500. I changed the value to 4500 this morning and captured another Wireshark trace. The event happened at 08:44:59.516.
I'm a bit befuddled at the moment. Can you please send me the exact code (appsettings.json snippet or Startup.cs snippet) where you're setting the IdleHttpConnectionTimeout? What I'm seeing seems to indicate it's not being set correctly.
In my testing (on .NET Core 3.1 and SDK 3.2.5), I've tried setting a low value for IdleHttpConnectionTimeout, like you are doing, and then running a query every 4 seconds. In my Wireshark traces, it behaves as I would expect: the next query doesn't occur before the idle timeout is reached, so .NET closes the connection, and a new TCP connection is started for the next query. A sketch of my test loop is below.
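For reference, the test loop looks roughly like this (a sketch; connection string, credentials, and the trivial query are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Couchbase;

class IdleRepro
{
    static async Task Main()
    {
        var options = new ClusterOptions
        {
            UserName = "user",        // placeholder credentials
            Password = "password",
            // Idle timeout deliberately shorter than the query interval below.
            IdleHttpConnectionTimeout = TimeSpan.FromSeconds(3)
        };
        var cluster = await Cluster.ConnectAsync("couchbase://localhost", options);

        while (true)
        {
            // With a 4s interval the connection always exceeds the idle
            // timeout, so each query should open a fresh TCP connection
            // (visible in Wireshark as a new client port on 8093).
            var result = await cluster.QueryAsync<dynamic>("SELECT 1;");
            await foreach (var row in result) { /* drain the result */ }
            await Task.Delay(TimeSpan.FromSeconds(4));
        }
    }
}
```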
However, in your traces it is continuing to reuse the same connection past the idle timeout you have configured. This leads me to believe that either the value is somehow not being passed down to the SocketsHttpHandler correctly, or you're hitting some bug in SocketsHttpHandler. Can you also confirm the exact patch version of .NET Core 3.1 your app is using?
I think I see your problem. IdleHttpConnectionTimeout is a TimeSpan, not a number of milliseconds. The .NET config binding is interpreting 4500 as 4500 days. Try this instead:
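In appsettings.json, express the value in TimeSpan format (a sketch, assuming you're binding a "Couchbase" section to ClusterOptions; adjust the section name to match your setup):

```json
{
  "Couchbase": {
    "IdleHttpConnectionTimeout": "00:00:04.5"
  }
}
```

Or, if you're configuring the ClusterOptions instance in code (here `options` stands in for your options object):

```csharp
options.IdleHttpConnectionTimeout = TimeSpan.FromMilliseconds(4500);
```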