Existing connection forcibly closed by remote host

At the company where I work we are having problems with the Couchbase SDK for .NET; we currently use version 2.2.1 with Couchbase Server 4.5.
I don’t believe we have any unusual configuration for Couchbase Server.
Couchbase buckets are replicated to Elasticsearch.
Couchbase is deployed in a cluster; the host machines run Windows Server 2012.
We have a .NET application that works with Couchbase (backend services and ASP.NET frontend).
We use .NET 4.5.2 when building and running the project, and on the server it’s .NET 4.0 updated to 4.5.2. My dev machine shows .NET 4.0 updated to 4.6.2 (according to the dotnetversiondetector app). Don’t know if that’s relevant.
This problem usually appears in the ASP.NET project.
On a developer machine in debug mode, everything works just fine - we can connect, get/update the documents, etc.
However, when we deploy the app to a test server, we get an exception that says “System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host”. The thing is, we make several round trips to the DB, and the exception usually occurs only after we’ve made several queries.
The most frequent trigger is when we attempt to get fairly heavy documents (some are 800-850 KB or more) by a list of IDs.
I figured increasing waitTimeout on the client might help, but it didn’t; the error is still the same.
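
For reference, these are roughly the client-side knobs I was adjusting (a simplified sketch; the node URI and the values are placeholders, and the exact property names may differ slightly between SDK 2.x versions):

    using System;
    using System.Collections.Generic;
    using Couchbase;
    using Couchbase.Configuration.Client;

    // Illustrative wrapper, not our real class; shows where the timeout knobs live.
    static class CouchbaseSetup
    {
        public static Cluster CreateCluster()
        {
            var config = new ClientConfiguration
            {
                Servers = new List<Uri> { new Uri("http://cb-node-1:8091/") }, // placeholder node
                DefaultOperationLifespan = 5000, // ms an operation is allowed to live end-to-end
                PoolConfiguration = new PoolConfiguration
                {
                    WaitTimeout = 5000,  // ms to wait for a free connection from the pool
                    SendTimeout = 30000  // ms for socket send/receive (the default is 15000)
                }
            };
            return new Cluster(config);
        }
    }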
I thought it might be a problem with Release builds, but then I tried Debug binaries on the server and the problem is still there.
We restarted the IIS app pool, the whole IIS server, and the VM that hosts the app - without any result.
Sometimes there is another error: "Couchbase.IO.RemoteHostTimeoutException: The connection has timed out while an operation was in flight. The default is 15000ms"
After I increased the timeout, the first error appeared.
We didn’t have any changes in network configuration recently.
I can try updating the client to a recent version (ours is almost a year old), but that’s the only option I see left.

I enabled logging in the code that queries Couchbase to display the IOperationResult state. On my dev machine the logs look like this:
DEBUG assert_success - op.success is True, op.status is Success
DEBUG assert_success - op.success is True, op.status is Success
DEBUG assert_success - op.success is True, op.status is Success
on the server, they look like:
DEBUG assert_success - op.success is True, op.status is Success
DEBUG assert_success - op.success is True, op.status is Success
DEBUG assert_success - op.success is True, op.status is Success
DEBUG assert_success - op.success is True, op.status is Success
DEBUG assert_success - op.success is True, op.status is Success
System.Net.Http.HttpRequestMessageExtensions.DisposeRequestResources DEBUG Disposing
DEBUG assert_success - op.success is False, op.status is ClientFailure
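
For context, the helper behind those lines is essentially the following (a simplified sketch with a console logger standing in for the real one; the Message/Exception lines are additions for illustration, since that is where the detail behind a ClientFailure shows up):

    using System;
    using Couchbase;

    // Illustrative name; logs the IOperationResult state and, on failure, the details.
    static class OperationLogging
    {
        public static void AssertSuccess<T>(IOperationResult<T> op)
        {
            Console.WriteLine("DEBUG assert_success - op.success is {0}, op.status is {1}",
                op.Success, op.Status);

            if (!op.Success)
            {
                // Message and Exception usually explain what is behind a ClientFailure,
                // e.g. the SocketException from the connection being closed.
                Console.WriteLine("DEBUG assert_success - op.message is {0}", op.Message);
                Console.WriteLine("DEBUG assert_success - op.exception is {0}", op.Exception);
            }
        }
    }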

Any ideas?

@gchermennov

This isn’t really my area of specialty, but I can point you at a couple of things to look at.

  1. Couchbase isn’t really designed to run on Windows for production, just development. In production they strongly recommend Linux.

  2. I’d also look at your firewall settings. Normally the connection pool sits at a minimum number of connections to the server. As load increases, it can scale up the number of connections to the configured maximum. It sounds like it might be trying to open more connections under load and failing to open those additional connections.

  3. I’ve also seen issues like this if you are not using a singleton to connect to Couchbase. You should open each bucket only once and keep the object in memory for the application lifetime. The ClusterHelper class is designed to help with this (see the sketch below).
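
Something along these lines is what I mean (a rough sketch; the URI, bucket name and pool sizes are placeholders):

    using System;
    using System.Collections.Generic;
    using System.Web;
    using Couchbase;
    using Couchbase.Configuration.Client;

    // Initialize ClusterHelper once per application and reuse the cached bucket
    // everywhere, instead of opening Cluster/Bucket objects per request.
    public class Global : HttpApplication
    {
        protected void Application_Start()
        {
            ClusterHelper.Initialize(new ClientConfiguration
            {
                Servers = new List<Uri> { new Uri("http://cb-node-1:8091/") }, // placeholder
                PoolConfiguration = new PoolConfiguration
                {
                    MinSize = 5,   // connections kept open per node
                    MaxSize = 10   // ceiling the pool can scale to under load
                }
            });
        }

        protected void Application_End()
        {
            ClusterHelper.Close();
        }
    }

    // In services/controllers: ClusterHelper caches the bucket internally,
    // so this does not open a new connection on every call.
    // var bucket = ClusterHelper.GetBucket("my-bucket");

The key point is that Initialize runs once per app domain and Close runs once at shutdown; GetBucket can then be called freely from request code.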

Brant

So you don’t think this is because we have an old client library, or because there’s a slight mismatch in .NET versions between the dev and web server machines?

  1. Noted. We don’t have a big cluster: 4 machines, recently expanded to 6. We’ve run without significant problems for more than a year (except when updating between versions), with the machines under load almost 24x7 (we run calculations that involve read/write operations).
  2. You mean the Windows firewall on the web server? The firewall doesn’t throttle traffic; it only opens/closes ports for particular programs/protocols.
  3. I thought about this too. Currently we instantiate Bucket objects as needed, so there may be a resource leak. I’ll switch to a singleton and report back.

@btburnett3 This is not correct; Couchbase Server is supported on Windows. We have many users running production workloads successfully in Windows environments.


@gchermennov -

Also, I would upgrade the client to the latest stable if you can (2.3.8 at the time of this post). There are a lot of bugs that have been fixed in the dozen or so releases since 2.2.1!

-Jeff

@pvarley My apologies, that wasn’t my previous understanding. Maybe that used to be the case with older versions but isn’t anymore. I know that on my dev machine I used to have problems upgrading between versions and was told that Windows was dev-only, so I needed to uninstall and reinstall because upgrades weren’t well supported. Thanks for letting me know!

I am also experiencing this issue with Couchbase Server 4.5.0-2601 Enterprise Edition (build-2601) and the latest .NET SDK (2.3.8).

My connections (simple get calls) work for 30-40 minutes just fine, then I begin to see this error:

System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
 at Couchbase.IO.Connection.Send(Byte[] buffer)
 at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation, IConnection connection)
 at Couchbase.Authentication.SASL.ScramShaMechanism.Authenticate(IConnection connection, String username, String password)
 at Couchbase.IO.Services.PooledIOService.Authenticate(IConnection connection)
 at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)

This happens almost always 30-40 minutes after an app pool recycle, and can be resolved by doing another app pool recycle. I am also using the ClusterHelper to manage the bucket connections. In addition, this is only happening with one bucket. Other controllers/bucket interactions remain working during this time.

Ryan, maybe we can isolate this? What kind of actions do you perform right before you get this exception?
In my case, I get a list of pretty heavy documents (800+ KB each) by their IDs.
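
Roughly, the call that triggers it looks like this on our side (a simplified sketch; the document type, the way we obtain the bucket, and the error handling are placeholders):

    using System;
    using System.Collections.Generic;
    using Couchbase;
    using Couchbase.Core;

    // Illustrative only.
    static class HeavyDocumentFetch
    {
        private class MyDocument { } // stand-in for our ~800 KB document type

        public static void FetchByIds(IBucket bucket, IList<string> ids)
        {
            // Bulk get: each id maps to its own IOperationResult.
            IDictionary<string, IOperationResult<MyDocument>> results = bucket.Get<MyDocument>(ids);

            foreach (var pair in results)
            {
                if (!pair.Value.Success)
                {
                    Console.WriteLine("{0} failed: {1} ({2})",
                        pair.Key, pair.Value.Status, pair.Value.Message);
                }
            }
        }
    }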

@rmendoza -

What do the server logs indicate? Generally this error is caused when the connection is terminated by the server (no surprises there), but I believe it could also be anything between the server and the app machine closing the connection.

-Jeff

I’ll try to get you the logs.
By the way, how do I know which node’s logs to look into?

It would be the node that the SDK cannot maintain a connection with. You should be able to deduce which one it is by turning up the verbosity of the logging on the client and then looking through the logs.

-Jeff

OK, I’ll dig into the client logs. I had that config somewhere.

@jmorris I don’t think this is a real connectivity issue. It doesn’t affect any of the other buckets, which have FAR more traffic. I was able to resolve this by removing one node from the cluster (chosen at random) and bringing it back in to force a rebalance.


@rmendoza - that’s a one-off fix (e.g. if I have 10 buckets, your solution would have to be applied to each of them); I’d prefer a permanent one. I’m back in the office tomorrow and will investigate further.

Hi there.

Did you manage to get to the bottom of this issue? We’re seeing the same thing and it doesn’t look like a real network error either. We create two cluster objects in our process (pointing at the same cluster), and on each cluster instance we open a single bucket (the same bucket in Couchbase). We do not use the ClusterHelper class.

Thanks.
Emile

@emilevr -

To help, you’ll need to provide more information:

  • A Wireshark capture taken when the error occurs

  • Client logs taken when the error occurs

Generally, when you see an IO error with the message “Existing connection forcibly closed by remote host”, you should suspect something between the app server and the cluster (including the cluster itself). This could be a load balancer timing out idle connections, server configuration, or a number of other things.

-Jeff