ExecutionContext should be suppressed / LoggerFactory injection

Hi,

we are using the .net client (3.2.0). When registered via the DI extensions, this automatically hooks up the ILoggerFactory.
When using the AddCouchbase() extension, the options.WithLogging() however is ignored, so there’s no way to override the loggers being used.

Reason I want to do this is due to our custom Logger using AsyncLocal and there are several parts in the SDK that create long running task or start timers where the ExecutionContext is not suppressed.
For example:

and

I believe such should be started in a suppressed context like
using (ExecutionContext.SuppressFlow())
{
_timer = new Timer(GenerateAndLogReport, logger, thresholdOptions1.EmitInterval, thresholdOptions1.EmitInterval);
}

As a reference: Timers that capture the execution context, evil AsyncLocal and the quick and ugly solution used by Microsoft devs which holds links to several similar topics.

Anyone experiencing the same problems?
For me specifically, this is causing ObjectDisposedExceptions during logging operations, causing the timers / long running tasks to crash.
Reason is, the AsyncLocal has a reference on these objects which are created in a different context (therefore context should be suppressed)
Also, potentially, there can be memory leaks in these parts of the sdk implementation as the tasks always keep references to any asynclocals that existed when the task started.

Then for the second issue, if I manually set a ILoggerFactory via the Couchbase options and use the DI extensions, my setup will be ignored. Reference:

It will always override it

@jbroumels

We are definitely using ExecutionContext.SuppressFlow() in several places due to the exact problem you mentioned, seems like we have missed a few spots.

I’ve filed this bug for tracking: https://issues.couchbase.com/browse/NCBC-2965

As to your logging problem, I’d like to understand it a bit better. It’s currently working as we designed it, which is to use the same logger from DI for Couchbase. Are you saying that you want to wire in a different ILoggerFactory than the one used by the rest of the application? If so, is this because of the above ExecutionContext bug? Would fixing that bug suffice, or are there other reasons?

I’ve seen indeed that in several spots the ExecutionContext is suppressed, but not in all places where it should have been done.

A way for me to work around it, would have been something like doing:
services.AddCouchbase(options => options.WithLogging(LoggerFactory.Create()));
Basically an empty factory, mitigating the problem, ignoring all Couchbase logs.

But the ClusterProvider will always override this option with the default ILoggerFactory, regardless what is set like above. I guess it would be better to only do this in case nothing is set at all.

What currently happens in my applications is that they run just fine, until the server cluster has an issue.
Then the background tasks will start logging errors/exceptions, which crashes because of disposed scopes (this of course is due to the way we do our logging, but would not have been an issue if context was suppressed)
This basically gets these tasks into an exception and stops the entire application.

There are several logging implementations using AsyncLocals, which most likely all have this problem, I presume.
Also, any AsyncLocal being set somewhere in call stacks that open Couchbase (create initial instances) will live forever in those local async contexts (kind of a memory leak as they don’t dispose)

So I think it’s kind of 2 issues, not sure you agree?:

  • Context suppression missing in several spots
  • AddCouchbase allows you to set a LoggerFactory, but it is ignored regardless
1 Like

@jbroumels

Okay, this makes sense. In your case you don’t need to override if the first issue is fixed. But, logically, it makes sense to offer the ability to override.

I filed this issue to track: https://issues.couchbase.com/browse/NCBC-2966

2 Likes

@jbroumels

Here’s a PR for the logging change (it was the simpler of the two). http://review.couchbase.org/c/couchbase-net-client/+/161769

1 Like

@btburnett3 Thanks for the quick fix :slight_smile:
I can indeed imagine the other one is a lot harder

@broumels

I’ve also pushed up a PR for the other bug, if you want to give it a try and see if it resolves your problem.

http://review.couchbase.org/c/couchbase-net-client/+/161775

Also, while you’re waiting for these changes to be officially released, I do have a workaround for you. If you copy/paste the ClusterProvider code into your project, you can then change it as you need and register it on DI using services.AddSingleton<IClusterProvider, ClusterProvider>(). So long as you make this call BEFORE you call AddCouchbase it should take precedence over the internal implementation.

1 Like

@btburnett3
I tried out your fixes, I still had some problems however.

2 spots I found where the context still flows, where I believe it shouldn’t:

@jbroumels

I’ve pushed up the fix for your first issue. For the second issue, I’ve been staring at it for a while, and I only have one explanation I’ve arrived at so far.

That particular branch of DataFlowConnectionPool is being awaited correctly, so there shouldn’t (normally) be any case where it logs after the request is out of scope. However, that branch of code is specifically applicable when there is an issue with the cluster causing connections to die and need to be reopened. This scenario would also tend to result in circuit breakers being open and needing to go half-open for testing.

My theory is that the only time you’re having a problem in DataFlowConnectionPool is actually when operations are being sent to the pool from the same branch of code in ClusterNode you identified. Therefore, I’m hoping that my fix to that logic also fixes DataFlowConnectionPool.

On another, unrelated note, I’d really love it if you could try the new ChannelConnectionPool. It gives some significant performance and throughput gains over the DataFlowConnectionPool. It’s especially noticable if you’re reading large documents, but my benchmarks show it improving across all operations. You can enable it in your configuration like this:

{
    "Couchbase": {
        "ConnectionString": "couchbase://localhost",
        "Username": "Administrator",
        "Password": "password",
        "Experiments": {
            "ChannelConnectionPools": true
        }
    }
}
1 Like

@btburnett3 I’ve tried out the expirimental ChannelConnectionPools.
In my test setup, I run just 1 Couchbase server. When I restart it (what I do to trigger errors regarding logging), it will however never re-connect.
Issue getting Cluster Map on server [::1]:11210!

 message: "Issue getting Cluster Map on server [::1]:11210!", exceptionmessage: "The operation /488 timed out after 00:00:02.5000000. It was retried 0 times using Couchbase.Core.Retry.BestEffortRetryStrategy.", rootexceptionmessage: "[Couchbase.Core.Exceptions.UnambiguousTimeoutException]: The operation /488 timed out after 00:00:02.5000000. It was retried 0 times using Couchbase.Core.Retry.BestEffortRetryStrategy.", stacktrace: "   at Couchbase.Utils.ThrowHelper.ThrowTimeoutException(IOperation operation, IErrorContext context) in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Utils\\ThrowHelper.cs:line 74\r\n   at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationTokenPair tokenPair) in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\ClusterNode.cs:line 591\r\n   at Couchbase.Core.ClusterNode.GetClusterMap() in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\ClusterNode.cs:line 317\r\n   at Couchbase.Core.Configuration.Server.ConfigHandler.<Poll>b__13_0() in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\Configuration\\Server\\ConfigHandler.cs:line 109"

With the normal pool, it does reconnect automatically after failures

Back to the problem with the ExecutionContext;

I dove bit further and found it eventually runs into the problem here:

When run from the original CleanupDeadConnectionsAsync

Callstack: (bottom to top)

 	[External Code]	
 	Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.ConnectionFactory.CreateAndConnectAsync(System.Net.IPEndPoint endPoint = {System.Net.IPEndPoint}, Couchbase.HostEndpoint hostEndpoint = {Couchbase.HostEndpoint}, System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 61	C#
 	[Resuming Async Method]	
 	[External Code]	
 	[Async Call Stack]	
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.ConnectionPoolBase.CreateConnectionAsync(System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 70	C#
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.AddConnectionsAsync.__StartConnection|0() Line 272	C#
 	[External Code]	
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.AddConnectionsAsync(int count = 2, System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 314	C#
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.CleanupDeadConnectionsAsync() Line 415	C#

@jbroumels

Do you have that call stack going any further back to trace back to what was calling DataFlowConnectionPool.SendAsync? That way I can review the call sites and see if I can find some other spot we’re choosing not to await.

>	Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.SendAsync(Couchbase.Core.IO.Operations.IOperation operation = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 117	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.ExecuteOp.AnonymousMethod__129_0(Couchbase.Core.IO.Operations.IOperation op2 = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, object state = {Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool}, System.Threading.CancellationToken effectiveToken = IsCancellationRequested = false) Line 624	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.ExecuteOp(System.Func<Couchbase.Core.IO.Operations.IOperation, object, System.Threading.CancellationToken, System.Threading.Tasks.Task> sender = {Method = {System.Reflection.RuntimeMethodInfo}}, Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, object state = {Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 495	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.ExecuteOp(Couchbase.Core.IO.Connections.IConnectionPool connectionPool = {Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool}, Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 624	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.SendAsync(Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 402	C#
 	Couchbase.NetClient.dll!Couchbase.CouchbaseBucket.SendAsync(Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 221	C#
 	Couchbase.NetClient.dll!Couchbase.Core.Retry.RetryOrchestrator.RetryAsync(Couchbase.Core.BucketBase bucket = Name = "default", BucketType = Couchbase, Couchbase.Core.IO.Operations.IOperation operation = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 181	C#
 	Couchbase.NetClient.dll!Couchbase.Core.BucketBase.RetryAsync(Couchbase.Core.IO.Operations.IOperation operation = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 184	C#
 	Couchbase.NetClient.dll!Couchbase.KeyValue.CouchbaseCollection.UpsertAsync<System.DateTime>(string id = "c4a1936d-e51a-4fc9-867b-909bd89fb34f", System.DateTime content = {System.DateTime}, Couchbase.KeyValue.UpsertOptions options = {Couchbase.KeyValue.UpsertOptions}) Line 552	C#
 	Couchbase.NetClient.dll!Couchbase.KeyValue.CollectionExtensions.UpsertAsync<System.DateTime>(Couchbase.KeyValue.ICouchbaseCollection collection = {Couchbase.KeyValue.CouchbaseCollection}, string id = "c4a1936d-e51a-4fc9-867b-909bd89fb34f", System.DateTime content = {System.DateTime}, System.Action<Couchbase.KeyValue.UpsertOptions> configureOptions = {Method = {System.Reflection.RuntimeMethodInfo}}) Line 97	C#

So it’s just a regular Upsert being executed when the server is down

@jbroumels

I’m a bit stumped by this, however I recognized that the execution context doesn’t really need to flow into CleanupDeadConnectionsAsync either, since it isn’t going to flow into sends. So I changed the code there to wrap it all in one global suppress so it applies to CleanupDeadConnectionsAsync and the ContinueWith. I also made it a bit smarter to recognize if flow was already suppressed, which could happen in the circuit breaker scenario mentioned earlier.

Can you try this out and see if it helps?

1 Like

Hi @btburnett3

with these last changes I have no more errors. So this seems to have fixed it indeed.
Thanks :slight_smile:

For the additional checking of the current context suppression. I often use this helper:

public class SafeExecutionContextSuppressor
    {
        public static IDisposable SuppressFlow()
        {
            return ExecutionContext.IsFlowSuppressed() ? NullDisposable.Instance : ExecutionContext.SuppressFlow();
        }
    }

so with that, you can just always do:

using (SafeExecutionContextSuppressor.SuppressFlow())
{
  Task.Run(() => {} );
}
1 Like

@jbroumels

I considered something like that, but it has performance implications. ExecutionContext.SuppressFlow returns a structure, so it’s stack allocated. But if you cast it to IDisposable it will box it onto the heap. It also prevents inlining of the Dispose method call. Since SendAsync is such a hot path, I’d prefer to keep it on the stack.

2 Likes

@jbroumels

Thanks for bringing this problem with ChannelConnectionPools to my attention, turns out it was a pretty silly oversight. A pull request is up to fix it: http://review.couchbase.org/c/couchbase-net-client/+/161915

1 Like

@btburnett3
I can confirm now it reconnects correctly with the ChannelConnectionPools.
And indeed, agree with the helper suppressing context is indeed less optimal for hot code paths :slight_smile:

Thanks for all the work done fixing these issues!

1 Like