Is it possible to provide a small project that I can use to recreate the issue? For whatever reason, I cannot replicate the error, but I know it exists.
In order to help us diagnose this, I'd like to summarize what the IdleHttpConnectionTimeout value is meant to accomplish.
The error you are seeing is due to the connection being passively closed on the server, without actively notifying the HTTP client that it has been closed. The HTTP client reuses connections for subsequent HTTP requests for performance gains, but in this case it thinks it can reuse a connection it actually can't. When it tries to use the connection, the underlying socket discovers it's dead and throws an IOException, which bubbles up and eventually surfaces as a RequestCanceledException.
There are many possible causes for passive TCP stream closes like this, including network hardware such as firewalls and routers between the client and the server. However, one known cause is newer versions of Couchbase Server (6.0.4 and later, I'm told). As a mitigation against slowloris DoS attacks, the server may passively close HTTP connections after 5 seconds of idle time. Therefore, our change in 3.2.5 is to set the client-side idle timeout to 4.5 seconds. In theory, this means that we'll throw away the connection before the server does.
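To make that concrete, here's a minimal sketch of the idea (not the SDK's exact internals; the SDK wires this up for you). In .NET, the relevant knob on SocketsHttpHandler is PooledConnectionIdleTimeout:

```csharp
using System;
using System.Net.Http;

class IdleTimeoutSketch
{
    static void Main()
    {
        // Sketch of the idea behind the 3.2.5 change: pooled connections
        // that sit idle for 4.5s are discarded client-side, before the
        // server's 5s slowloris timeout can close them out from under us.
        var handler = new SocketsHttpHandler
        {
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(4.5)
        };
        using var client = new HttpClient(handler);
        // ... issue HTTP requests with 'client' as usual ...
    }
}
```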
Based on your comments above, it sounds like you've been experimenting with raising the value. In reality, if our theory is correct, the experiment should be lowering the value. It's possible our 4.5 second idle timeout isn't quite low enough in some environments, especially those with high latency.
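If you want to experiment with that, here's a minimal sketch of lowering the value programmatically (connection string and credentials are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Couchbase;

class LowerIdleTimeout
{
    static async Task Main()
    {
        var options = new ClusterOptions
        {
            UserName = "user",        // placeholder credentials
            Password = "password",
            // Try a value *below* the 4.5s default, e.g. 4 seconds,
            // to stay further clear of the server's 5s idle close.
            IdleHttpConnectionTimeout = TimeSpan.FromSeconds(4)
        };
        var cluster = await Cluster.ConnectAsync("couchbase://localhost", options);
        // ... run your workload ...
    }
}
```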
You also mention running the query every 3 seconds. Note that this does not necessarily keep the HTTP connection from going idle. The internals of SocketsHttpHandler may be managing more than one connection, each with its own idle time. Additionally, if you have more than one query node, the queries are spread randomly across those nodes, each with its own connection pool.
On another note, I recently completed an upgrade of a .NET 6 service that uses the Query service extensively from SDK 2 to 3.2.5 (against Server 6.0.4). After several days, I have seen no instances of the errors you mentioned in our logs. This is far from definitive, since you are on Server 6.6.3 and .NET Core 3.1 and querying at a different frequency, but it does point toward the possibility of a network-layer issue rather than an idle timeout issue; we should keep that in mind. Can you provide any details about the network topology between your clients and servers?
I've spent some time looking at this myself in Wireshark, trying to reproduce the problem. The more I see, the more I'm thinking this may be network-related.
I've been testing using SDK 3.2.5, .NET Core 3.1, and Server 6.6.3 to try to replicate your environment. In my case, the server is running on Linux via Docker on WSL 2, but I'm running the test from Windows. When I test the behavior around idle HTTP connections, I can see Couchbase Server closing the connection after 5 seconds of inactivity. However, in my case, I receive a graceful TCP FIN packet from the server right at the 5 second mark. Even if the client-side idle timeout is pushed up high (say, 1 minute), the connection is closed gracefully. When the next query request is sent, a new TCP connection (with a new client port) is created.
However, in your case, you said you were getting TCP RST packets when the next request was sent, as if something was causing the FIN packet either never to be sent or never to arrive. Another possibility is that your problem isn't related to connection reuse at all, and that new connections are sometimes failing with this error. I'd be very interested in seeing a Wireshark trace of the traffic to port 8093 when this issue occurs, including some of the requests leading up to the error.
Thanks for the additional information this morning. My test setup:
SDK 3.2.5, .NET Core 3.1, Server 6.6.3 build 9800. The server is running in a Docker container on CentOS 7, and the client on a Windows 10 VM, running the app from the command line.
I am attaching a Wireshark capture of a session from start to finish.
There are two events that I saw, one at 2021/12/21 12:10:28.822 and one at 2021/12/23 12:12:23.838.
What value for IdleHttpConnectionTimeout was configured at the time you took that trace? Looking at the failures, in both cases it appears that the final HTTP request went out just microseconds before the 5 second idle window closed. Perhaps the outgoing request was being sent at the same time the FIN was coming back from the server, a form of race condition.
If your IdleHttpConnectionTimeout was the default 4.5 seconds, can you try lowering it to 4 seconds and see what that does?
IdleHttpConnectionTimeout was set to 2500. I changed the value to 4500 this morning and captured another Wireshark trace. The event happened at 08:44:59.516.
I'm a bit befuddled at the moment. Can you please send me the exact code (appsettings.json snippet or Startup.cs snippet) where you're setting the IdleHttpConnectionTimeout? What I'm seeing seems to indicate it's not being set correctly.
In my testing (on .NET Core 3.1 and SDK 3.2.5), I've tried setting a low value for IdleHttpConnectionTimeout, like you are doing, and then running a query every 4 seconds. In my Wireshark traces, it behaves as I would expect: the next query doesn't occur before the idle timeout is reached, so .NET closes the connection, and a new TCP connection is started for the next query. A sketch of my test loop is below.
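For reference, the test loop looks roughly like this (a sketch; connection string, credentials, and the trivial query are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Couchbase;

class IdleRepro
{
    static async Task Main()
    {
        var options = new ClusterOptions
        {
            UserName = "user",        // placeholder credentials
            Password = "password",
            // Idle timeout deliberately shorter than the query interval below.
            IdleHttpConnectionTimeout = TimeSpan.FromSeconds(3)
        };
        var cluster = await Cluster.ConnectAsync("couchbase://localhost", options);

        while (true)
        {
            // With a 4s interval the connection always exceeds the idle
            // timeout, so each query should open a fresh TCP connection
            // (visible in Wireshark as a new client port on 8093).
            var result = await cluster.QueryAsync<dynamic>("SELECT 1;");
            await foreach (var row in result) { /* drain the result */ }
            await Task.Delay(TimeSpan.FromSeconds(4));
        }
    }
}
```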
However, in your traces it is continuing to reuse the same connection past the idle timeout you have configured. This leads me to believe that either the value is somehow not being passed down to the SocketsHttpHandler correctly, or you're hitting some bug in SocketsHttpHandler. Can you also confirm the exact patch version of .NET Core 3.1 your app is using?
I think I see your problem. IdleHttpConnectionTimeout is a TimeSpan, not a number of milliseconds. The .NET config binding is interpreting 4500 as 4500 days. Try this instead:
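In appsettings.json, express the value in TimeSpan format (a sketch, assuming you're binding a "Couchbase" section to ClusterOptions; adjust the section name to match your setup):

```json
{
  "Couchbase": {
    "IdleHttpConnectionTimeout": "00:00:04.5"
  }
}
```

Or, if you're configuring the ClusterOptions instance in code (here `options` stands in for your options object):

```csharp
options.IdleHttpConnectionTimeout = TimeSpan.FromMilliseconds(4500);
```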