Couchbase Server - Connection leaks

.net
connections
#1

I have a setup with Couchbase Server v. 4.0 in a 3-node cluster on Windows servers. They run fine, and a lot of our .NET Windows services connect to the cluster using the Couchbase .NET client. However, over time we can see a connection leak. I believe that when an error occurs during a query or document operation, the connection is somehow reset on the client side and we get a new one the next time. The server side does not pick this up, and we have a leak.

After a couple of weeks we go from 130 connections in total to 1600 connections, in a linear increase. We have been struggling badly with connection leaks with Couchbase, and we really need better handling of this.

In the end, the Windows server is just full of TCP connections pointing to itself rather than coming from remote clients. How can this be?

Sample of TCP connections from a collection of 1600 in total:

PROCESS PID LOCAL ADDRESS PORT REMOTE ADDRESS PORT
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51372 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51354 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51192 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51191 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51190 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51155 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51152 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51147 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51142 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51139 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51136 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51131 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51130 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51120 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51115 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51114 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51110 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51108 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51100 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51093 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51085 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51083 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51081 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51080 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51078 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51077 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51074 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51069 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51063 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51059 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51058 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51056 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51054 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51049 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51040 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51035 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51031 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51025 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51019 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51013 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51012 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51010 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51008 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 51002 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50997 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50991 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50985 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50976 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50973 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50963 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50940 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50937 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50930 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50910 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50904 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50890 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50885 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50877 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50869 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50866 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50864 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50861 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50857 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50846 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50844 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50839 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50828 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50826 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50812 - -
memcached.exe 1852 10.166.75.139 11210 10.166.75.139 50794 - -

and the list goes on…

#2

Hey Chris,

Is your application making use of N1QL to retrieve data from the Couchbase cluster?

There is a known issue in Couchbase Server 4.0 and 4.1 where cbq-engine.exe consumes a large number of TCP connections. This has been fixed, and the fix will be released as part of Couchbase Server 4.1.1, due within the next month or two.

#3

Yes, absolutely. We mostly use N1QL, and we can see the same number of TCP connections for cbq-engine.exe too. It’s great news that it is fixed. We currently use the Community Edition, which lags behind. Any idea when a fix for the Community Edition will arrive? It is such a major issue that we are actually considering replacing Couchbase with another solution, simply because we cannot run for a week or two without the solution becoming unresponsive. Couchbase halts at 10,000+ connections.

#4

This fix will also be included in the 4.5 final GA release, which I would imagine will be released as a Community Edition too, assuming it follows the pattern of previous release cycles.

That said, the Engineering team has discovered that if a query uses any Secondary Index rather than the Primary Index, the issue does not occur. This is the case even if the query engine then needs to perform KV operations, which suggests a potential workaround.

It would be possible to create a substitute for the Primary Index by creating a Secondary Index that indexes only the document key, for example: CREATE INDEX my_index ON my_bucket(meta().id);. This will include every document in the bucket (as every document has to have a key). Unfortunately, as this is a Secondary Index, the query engine cannot assume that the index contains every document; this can be circumvented by adding the requirement meta().id IS NOT MISSING to the WHERE clause of each query.
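As a sketch, assuming a bucket named my_bucket (the index name and the type filter below are purely illustrative), the workaround might look like this:

```sql
-- Secondary index on the document key, acting as a stand-in for the primary index.
-- Every document has a key, so this index covers the whole bucket.
CREATE INDEX my_index ON my_bucket(meta().id);

-- Queries must state that the key exists so the planner can
-- choose the secondary index instead of requiring a primary index.
SELECT *
FROM my_bucket
WHERE meta().id IS NOT MISSING
  AND type = "user";
```

The extra meta().id IS NOT MISSING predicate is always true, so it does not change the result set; it only gives the planner the guarantee it needs to use the secondary index.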

A slightly messier alternative would be to periodically kill the cbq-engine.exe process, at which point all of the held (unused) connections are released. Do note that you will lose any queries in flight at that point in time (although, as a best practice, your application should have logic for handling failed queries anyway).
The Couchbase Server babysitter process should automatically restart cbq-engine.exe once it has been killed.

#5

Thanks for your reply. For now, we are using the last workaround you suggested. We created a scheduled task on all our Couchbase nodes that kills cbq-engine.exe every 24 hours. This works pretty well, and the kill/restart cycle seems to take only a second. All failed queries are handled on the client side, so this is quite a good workaround.

Thanks, and looking forward to 4.5 GA!