Memory leak in Couchbase .NET SDK

Hi everyone,

For the past few weeks I’ve been seeing a memory leak when using the Couchbase .NET SDK. The app is a worker which receives a few hundred messages per minute from Azure Service Bus, and it throws an OutOfMemoryException after 2-3 days of running.
It uses the Azure.Messaging.ServiceBus package to receive messages and inserts each record into Couchbase.
I’m using the latest Couchbase client, version 3.2.3, with Couchbase Server Enterprise Edition 6.0.3 build 2895.
The reason I think this is related to the Couchbase client is that when I comment out all the code related to Couchbase, the problem goes away. Here are all 3 lines of Couchbase-related code in that worker:

var bucket = await _namedBucketProvider.GetBucketAsync();
var collection = bucket.DefaultCollection();
await collection.UpsertAsync("some_document_key", dataAsJsonString,
    _defaultUpsertOptions.CancellationToken(cancellationToken));

Commenting out just the third line, or both the second and third lines, still leaves the leak. Interestingly, I can’t reproduce the leak locally (on a Windows 10 machine); it only happens on our k8s pod.
I’ve taken a few memory dumps for this issue

Just a couple of quick thoughts:

  • Your code is the owner of the cancellation token source here. Is it possible that you’re not disposing it after using it?
  • Have you tried removing the code where the cancellation token is passed in?

Jeff

@thinh-ng

I’ve glanced at the memory dumps, and it looks to me like there is an issue with CancellationTokens somehow being used recursively. I’m not sure how that happens, but something is definitely chaining callbacks together over and over again to create all the CallbackNode delegates referencing each other:

Note how that chain shows over 30k CallbackNode references. However, they’re rooted back to the AsyncMethodBuilder for the ServiceBusProcessor, so I’m not sure it’s in the Couchbase SDK. Seems more likely that it’s in ServiceBusProcessor and how it’s using CancellationTokens. If you’re willing to share that code I’m happy to take a look at it, though.

This could be confirmed by removing the _defaultUpsertOptions.CancellationToken(cancellationToken) option.

@btburnett3
I’m not the owner of the CancellationTokenSource; .NET Core is. The token was passed to BackgroundService by .NET Core and then passed down to my Couchbase service.

@jmorris
I ran the service without the Upsert call (and, before that, without the _bucket.DefaultCollection() call either, so only the GetBucketAsync() call was left), and the leak still occurred. However, I remember the leak being less severe. So I ran it again without passing the CancellationToken to Couchbase and took dumps; here are the results:
Dump at T0: dump-1635820876.zip - Google Drive
Dump at T0 +3h: dump-1635830967.zip - Google Drive
Dump at T0 +5h: dump-1635840140.zip - Google Drive

I can still see CancellationTokenSource instances being leaked, but in much smaller numbers after a longer time compared to the dumps above:

And here is the memory graph over a 6-hour period (at some point there were no messages coming in, so the memory stayed the same):

My guess is that some linked CancellationTokenSource is not disposed correctly somewhere in Couchbase, and passing my CTS in the UpsertOptions magnified the memory leak because I shoved so many messages into Couchbase (a few million documents), which resulted in so many CTS callbacks in the first set of dumps. Just a wild guess, maybe it is in CancellationTokenPair?

Let me know what I can do to further troubleshoot this problem.

@thinh-ng

I’ve dug into this a bit deeper, and here’s what I discovered.

First, the very large chain of CallbackNode instances is due to the way they are stored on the CancellationTokenSource. In order to avoid array resizing costs, CallbackNodes are stored as a linked list. This makes it fast to call CancellationToken.Register to add a callback to be triggered when the token is canceled. Therefore, we can infer that the issue is that a large number of callbacks are being registered on the same CancellationToken.
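To make the registration behavior concrete, here is a minimal sketch using only System.Threading. The class and method names are made up for illustration and are not part of the SDK or of the dumps above. It registers several callbacks on one token, optionally disposes some of the registrations, and counts how many callbacks run when the source is canceled. A registration that is never disposed stays attached to the source for as long as the source lives.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class RegistrationDemo
{
    // Registers 'total' callbacks on a single token, disposes the first
    // 'disposed' registrations, then cancels the source and returns how
    // many callbacks were actually invoked.
    public static int CountInvoked(int total, int disposed)
    {
        using var cts = new CancellationTokenSource();
        var registrations = new List<CancellationTokenRegistration>();
        int invoked = 0;

        for (int i = 0; i < total; i++)
        {
            // Each Register call attaches one more callback to the source.
            registrations.Add(cts.Token.Register(() => invoked++));
        }

        for (int i = 0; i < disposed; i++)
        {
            // Disposing a registration detaches its callback from the source.
            registrations[i].Dispose();
        }

        cts.Cancel(); // runs the callbacks that are still registered
        return invoked;
    }

    public static void Main()
    {
        Console.WriteLine(CountInvoked(total: 5, disposed: 0)); // 5
        Console.WriteLine(CountInvoked(total: 5, disposed: 5)); // 0
    }
}
```

With a long-lived token, such as one that is only canceled at host shutdown, every registration that is never disposed remains reachable from the source until the process exits.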

The next item of interest is, in fact, the link you provided above to CancellationTokenPair. I had originally looked at that and dismissed it, based on my comment there that a Dispose is not required because we’re never setting a timer using CancelAfter. And that is true for a plain CancellationTokenSource. However, digging into the internals a bit more I found that calling CreateLinkedTokenSource with two tokens which are not CancellationToken.None (CanBeCanceled == true) doesn’t create a plain CancellationTokenSource. It creates a Linked2CancellationTokenSource, an internal .NET class that adds some additional behaviors.

When this type is created, it registers callbacks on the source tokens (adding CallbackNode instances to the linked list mentioned before). When disposed, it also disposes of those registrations (removing CallbackNode instances from the linked list). So, in summary, I believe you are correct and we DO need to dispose of our CancellationTokenPair in cases where it’s linking two tokens.
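The effect of Dispose on a linked source can be observed without looking at internals. The following is a rough sketch, not SDK code: the names longLived and perCall are hypothetical stand-ins for a host shutdown token and an SDK-internal token. It checks whether canceling the long-lived parent still reaches the linked token, which is only possible while the linked source is still registered on the parent.

```csharp
using System;
using System.Threading;

public static class LinkedTokenDemo
{
    // Returns true if canceling the long-lived parent still propagates to
    // the linked token, i.e. the linked source is still registered on it.
    public static bool ParentCancelReachesLinked(bool disposeLinked)
    {
        using var longLived = new CancellationTokenSource(); // hypothetical: host shutdown token
        using var perCall = new CancellationTokenSource();   // hypothetical: per-operation token

        var linked = CancellationTokenSource.CreateLinkedTokenSource(
            longLived.Token, perCall.Token);

        // Capture the token first; linked.Token throws after Dispose.
        CancellationToken linkedToken = linked.Token;

        if (disposeLinked)
        {
            // Removes the callbacks the linked source registered on both parents.
            linked.Dispose();
        }

        longLived.Cancel();
        return linkedToken.IsCancellationRequested;
    }

    public static void Main()
    {
        Console.WriteLine(ParentCancelReachesLinked(disposeLinked: false));
        Console.WriteLine(ParentCancelReachesLinked(disposeLinked: true));
    }
}
```

When the linked source is not disposed, the parent keeps a callback that references it, so one such object accumulates on the long-lived token for every operation.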

I’ve created an issue to track this: [NCBC-2993] CancellationTokenPair should Dispose the linked CancellationTokenSource it creates - Couchbase database

For now, the workaround is to avoid using a long-lived CancellationToken on calls to Key/Value operations on ICouchbaseCollection. You can either not pass the CancellationToken, or create your own short-lived CancellationTokenSource for each message on your service bus via using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken); and pass cts.Token instead.
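As a sketch of the second option, assuming a per-message handler shaped roughly like yours: the upsert delegate below is a hypothetical stand-in for the real collection.UpsertAsync call, used so the example runs without the Couchbase package, and HandleMessageAsync is an invented name.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PerMessageTokenWorkaround
{
    // 'upsert' is a hypothetical stand-in for the real call, for example:
    // (key, json, ct) => collection.UpsertAsync(key, json, options.CancellationToken(ct))
    public static async Task HandleMessageAsync(
        string key,
        string json,
        Func<string, string, CancellationToken, Task> upsert,
        CancellationToken hostStoppingToken)
    {
        // Short-lived source: still canceled if the host stops, but disposed
        // at the end of this method, which removes its registration from the
        // long-lived host token.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(hostStoppingToken);

        // Only the short-lived token is handed to the Key/Value operation.
        await upsert(key, json, cts.Token);
    }

    public static async Task Main()
    {
        using var host = new CancellationTokenSource();

        // Fake upsert that just completes; replace with the real SDK call.
        await HandleMessageAsync(
            "some_document_key",
            "{}",
            (key, json, ct) => Task.CompletedTask,
            host.Token);

        Console.WriteLine("message handled");
    }
}
```

Anything the operation registers lands on the per-message source instead of on the host token, so it becomes collectable once the message is done.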

I’m also not 100% sure why you still saw leaks when you didn’t pass the token. It’s possible that there are some other internal background tasks within the SDK that are triggering this leak more slowly. But I’m not seeing the same symptoms in your later memory dump; it seems like it may be related to the OrphanTraceListener and OrphanReporter. I think we should probably treat that as a separate issue and return to it after the first, more significant issue is resolved.

@thinh-ng

I’ve put up a changeset that I believe fixes your key problem: http://review.couchbase.org/c/couchbase-net-client/+/165009

If you have time, it would be helpful if you could build the SDK from the source and try it out to confirm that it’s fixed.

@thinh-ng

I believe I’ve also narrowed down the smaller memory leak you observed. It’s actually a design flaw in .NET that we’re encountering in OrphanReporter (some would argue it’s a bug in .NET). We’ll need to redesign our approach a bit to work around this limitation. Details are in this new bug report: [NCBC-2995] Slow memory leak in OrphanReporter - Couchbase database

@btburnett3

Sorry, I hadn’t checked this thread until now. The workaround of not passing a long-lived cancellation token is good enough: when the load goes down, the memory can be GC’d, or so it seems at the moment.

I’ll try to build the SDK from source, but I also wonder when 3.2.5 will be published. If it will be published soon then I’ll just use that instead, since we’ll have to set up the load test machines again.

@thinh-ng

The normal release cadence is monthly, so there should be another release in early December.