ExecutionContext should be suppressed / LoggerFactory injection

jbroumels · September 17, 2021, 7:40am

Hi,

we are using the .net client (3.2.0). When registered via the DI extensions, this automatically hooks up the ILoggerFactory.
When using the AddCouchbase() extension, the options.WithLogging() however is ignored, so there’s no way to override the loggers being used.

Reason I want to do this is due to our custom Logger using AsyncLocal and there are several parts in the SDK that create long running task or start timers where the ExecutionContext is not suppressed.
For example:

github.com

couchbase/couchbase-net-client/blob/master/src/Couchbase/Core/Diagnostics/Tracing/OrphanResponseReporting/OrphanReporter.cs#L47

    
      
          /// <summary>
          /// Internal total count of all pending operation contexts to have been recorded.
          /// </summary>
          internal uint TotalCount => _kvOrphanCount + _viewOrphanCount + _queryOrphanCount + _searchOrphanCount + _analyticsOrphanCount;
          
          
public OrphanReporter(ILogger<OrphanReporter> logger, OrphanOptions options)
          {
              _logger = logger;
              Interval = (int)options.EmitInterval.TotalMilliseconds;
              SampleSize = options.SampleSize;
              Task.Factory.StartNew(DoWork, TaskCreationOptions.LongRunning);
          }
          
          
private async Task DoWork()
          {
              while (!_source.IsCancellationRequested)
              {
                  try
                  {
                      // determine if we need to write to log yet
                      if (DateTime.UtcNow.Subtract(_lastRun) > TimeSpan.FromMilliseconds(Interval))

and

github.com

couchbase/couchbase-net-client/blob/master/src/Couchbase/Core/Diagnostics/Tracing/ThresholdTracing/ThresholdTraceListener.cs#L26

    
      
          /// </summary>
          internal class ThresholdTraceListener : TraceListener
          {
              private readonly IReadOnlyDictionary<string, TimeSpan> _serviceThresholds;
              private readonly Timer _timer;
          
          
    public ThresholdTraceListener(ILoggerFactory loggerFactory, ThresholdOptions options)
              {
                  var thresholdOptions1 = options;
                  var logger = loggerFactory.CreateLogger<RequestTracer>();
                  _timer = new Timer(GenerateAndLogReport, logger, thresholdOptions1.EmitInterval, thresholdOptions1.EmitInterval);
                  ThresholdServiceQueue.SetSampleSize((int)thresholdOptions1.SampleSize);//change to uint
          
          
        _serviceThresholds = options.GetServiceThresholds();
                  Start();
              }
          
          
    private static void GenerateAndLogReport(object state)
              {
                  ILogger logger = null;
                  try

I believe such should be started in a suppressed context like
using (ExecutionContext.SuppressFlow())
{
_timer = new Timer(GenerateAndLogReport, logger, thresholdOptions1.EmitInterval, thresholdOptions1.EmitInterval);
}

As a reference: Timers that capture the execution context, evil AsyncLocal and the quick and ugly solution used by Microsoft devs which holds links to several similar topics.

Anyone experiencing the same problems?
For me specifically, this is causing ObjectDisposedExceptions during logging operations, causing the timers / long running tasks to crash.
Reason is, the AsyncLocal has a reference on these objects which are created in a different context (therefore context should be suppressed)
Also, potentially, there can be memory leaks in these parts of the sdk implementation as the tasks always keep references to any asynclocals that existed when the task started.

Then for the second issue, if I manually set a ILoggerFactory via the Couchbase options and use the DI extensions, my setup will be ignored. Reference:

github.com

couchbase/couchbase-net-client/blob/master/src/Couchbase.Extensions.DependencyInjection/Internal/ClusterProvider.cs#L21-L41

    
      
              _loggerFactory = loggerFactory ?? throw new ArgumentNullException(nameof(loggerFactory));
          
          
    _cluster = new AsyncLazy<ICluster>(() => CreateClusterAsync(options.Value));
          }
          
          
public virtual ValueTask<ICluster> GetClusterAsync()
          {
              if (_disposed)
              {
                  throw new ObjectDisposedException(nameof(ClusterProvider));
              }
          
          
    return new ValueTask<ICluster>(_cluster.Value);
          }
          
          
/// <summary>
          /// Seam for injecting mock.
          /// </summary>
          protected virtual Task<ICluster> CreateClusterAsync(ClusterOptions clusterOptions)
          {
              clusterOptions.WithLogging(_loggerFactory);

It will always override it

btburnett3 · September 17, 2021, 12:28pm

@jbroumels

We are definitely using ExecutionContext.SuppressFlow() in several places due to the exact problem you mentioned, seems like we have missed a few spots.

I’ve filed this bug for tracking: https://issues.couchbase.com/browse/NCBC-2965

As to your logging problem, I’d like to understand it a bit better. It’s currently working as we designed it, which is to use the same logger from DI for Couchbase. Are you saying that you want to wire in a different ILoggerFactory than the one used by the rest of the application? If so, is this because of the above ExecutionContext bug? Would fixing that bug suffice, or are there other reasons?

jbroumels · September 17, 2021, 12:41pm

I’ve seen indeed that in several spots the ExecutionContext is suppressed, but not in all places where it should have been done.

A way for me to work around it, would have been something like doing:
services.AddCouchbase(options => options.WithLogging(LoggerFactory.Create()));
Basically an empty factory, mitigating the problem, ignoring all Couchbase logs.

But the ClusterProvider will always override this option with the default ILoggerFactory, regardless what is set like above. I guess it would be better to only do this in case nothing is set at all.

What currently happens in my applications is that they run just fine, until the server cluster has an issue.
Then the background tasks will start logging errors/exceptions, which crashes because of disposed scopes (this of course is due to the way we do our logging, but would not have been an issue if context was suppressed)
This basically gets these tasks into an exception and stops the entire application.

There are several logging implementations using AsyncLocals, which most likely all have this problem, I presume.
Also, any AsyncLocal being set somewhere in call stacks that open Couchbase (create initial instances) will live forever in those local async contexts (kind of a memory leak as they don’t dispose)

So I think it’s kind of 2 issues, not sure you agree?:

Context suppression missing in several spots
AddCouchbase allows you to set a LoggerFactory, but it is ignored regardless

btburnett3 · September 17, 2021, 1:30pm

@jbroumels

Okay, this makes sense. In your case you don’t need to override if the first issue is fixed. But, logically, it makes sense to offer the ability to override.

I filed this issue to track: https://issues.couchbase.com/browse/NCBC-2966

btburnett3 · September 17, 2021, 2:27pm

@jbroumels

Here’s a PR for the logging change (it was the simpler of the two). http://review.couchbase.org/c/couchbase-net-client/+/161769

jbroumels · September 17, 2021, 2:35pm

@btburnett3 Thanks for the quick fix
I can indeed imagine the other one is a lot harder

btburnett3 · September 17, 2021, 4:08pm

@broumels

I’ve also pushed up a PR for the other bug, if you want to give it a try and see if it resolves your problem.

http://review.couchbase.org/c/couchbase-net-client/+/161775

Also, while you’re waiting for these changes to be officially released, I do have a workaround for you. If you copy/paste the ClusterProvider code into your project, you can then change it as you need and register it on DI using services.AddSingleton<IClusterProvider, ClusterProvider>(). So long as you make this call BEFORE you call AddCouchbase it should take precedence over the internal implementation.

jbroumels · September 18, 2021, 4:54pm

@btburnett3
I tried out your fixes, I still had some problems however.

2 spots I found where the context still flows, where I believe it shouldn’t:

ClusterNode.cs
A non awaited task, although not long running, which runs out of scope of any request:
https://github.com/couchbase/couchbase-net-client/blob/master/src/Couchbase/Core/ClusterNode.cs#L442
and
DataFlowConnectionPool.cs
https://github.com/couchbase/couchbase-net-client/blob/master/src/Couchbase/Core/IO/Connections/DataFlow/DataFlowConnectionPool.cs#L379
This last one only happens when called from
https://github.com/couchbase/couchbase-net-client/blob/master/src/Couchbase/Core/IO/Connections/DataFlow/DataFlowConnectionPool.cs#L117
But I can’t really wrap my head around that one

btburnett3 · September 19, 2021, 9:49pm

@jbroumels

I’ve pushed up the fix for your first issue. For the second issue, I’ve been staring at it for a while, and I only have one explanation I’ve arrived at so far.

That particular branch of DataFlowConnectionPool is being awaited correctly, so there shouldn’t (normally) be any case where it logs after the request is out of scope. However, that branch of code is specifically applicable when there is an issue with the cluster causing connections to die and need to be reopened. This scenario would also tend to result in circuit breakers being open and needing to go half-open for testing.

My theory is that the only time you’re having a problem in DataFlowConnectionPool is actually when operations are being sent to the pool from the same branch of code in ClusterNode you identified. Therefore, I’m hoping that my fix to that logic also fixes DataFlowConnectionPool.

On another, unrelated note, I’d really love it if you could try the new ChannelConnectionPool. It gives some significant performance and throughput gains over the DataFlowConnectionPool. It’s especially noticable if you’re reading large documents, but my benchmarks show it improving across all operations. You can enable it in your configuration like this:

{
    "Couchbase": {
        "ConnectionString": "couchbase://localhost",
        "Username": "Administrator",
        "Password": "password",
        "Experiments": {
            "ChannelConnectionPools": true
        }
    }
}

jbroumels · September 20, 2021, 8:23am

@btburnett3 I’ve tried out the expirimental ChannelConnectionPools.
In my test setup, I run just 1 Couchbase server. When I restart it (what I do to trigger errors regarding logging), it will however never re-connect.
Issue getting Cluster Map on server [::1]:11210!

 message: "Issue getting Cluster Map on server [::1]:11210!", exceptionmessage: "The operation /488 timed out after 00:00:02.5000000. It was retried 0 times using Couchbase.Core.Retry.BestEffortRetryStrategy.", rootexceptionmessage: "[Couchbase.Core.Exceptions.UnambiguousTimeoutException]: The operation /488 timed out after 00:00:02.5000000. It was retried 0 times using Couchbase.Core.Retry.BestEffortRetryStrategy.", stacktrace: "   at Couchbase.Utils.ThrowHelper.ThrowTimeoutException(IOperation operation, IErrorContext context) in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Utils\\ThrowHelper.cs:line 74\r\n   at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationTokenPair tokenPair) in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\ClusterNode.cs:line 591\r\n   at Couchbase.Core.ClusterNode.GetClusterMap() in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\ClusterNode.cs:line 317\r\n   at Couchbase.Core.Configuration.Server.ConfigHandler.<Poll>b__13_0() in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\Configuration\\Server\\ConfigHandler.cs:line 109"

With the normal pool, it does reconnect automatically after failures

jbroumels · September 20, 2021, 8:45am

Back to the problem with the ExecutionContext;

I dove bit further and found it eventually runs into the problem here:

github.com

couchbase/couchbase-net-client/blob/master/src/Couchbase/Core/IO/Connections/ConnectionFactory.cs#L61

    
      
                              throw new SocketException(connectionTimedOut);
                          }
                      }
                      catch
                      {
                          socket.Dispose();
                          throw;
                      }
          
          
#if NETCOREAPP_GTE_3_0
                      _multiplexLogger.LogDebug("Setting TCP Keep-Alives using SocketOptions - enable keep-alives {EnableTcpKeepAlives}, time {TcpKeepAliveTime}, interval {TcpKeepAliveInterval}.",
                          _clusterOptions.EnableTcpKeepAlives, _clusterOptions.TcpKeepAliveTime, _clusterOptions.TcpKeepAliveInterval);
          
          
            if (!socket.TryEnableKeepAlives(_clusterOptions.EnableTcpKeepAlives,
                          (int)_clusterOptions.TcpKeepAliveTime.TotalSeconds,
                          (int)_clusterOptions.TcpKeepAliveInterval.TotalSeconds, out string setKeepAliveMessage)
                      )
                      {
                          _multiplexLogger.LogWarning(setKeepAliveMessage);
                      }
          #else

When run from the original CleanupDeadConnectionsAsync

Callstack: (bottom to top)

 	[External Code]	
 	Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.ConnectionFactory.CreateAndConnectAsync(System.Net.IPEndPoint endPoint = {System.Net.IPEndPoint}, Couchbase.HostEndpoint hostEndpoint = {Couchbase.HostEndpoint}, System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 61	C#
 	[Resuming Async Method]	
 	[External Code]	
 	[Async Call Stack]	
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.ConnectionPoolBase.CreateConnectionAsync(System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 70	C#
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.AddConnectionsAsync.__StartConnection|0() Line 272	C#
 	[External Code]	
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.AddConnectionsAsync(int count = 2, System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 314	C#
 	[Async] Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.CleanupDeadConnectionsAsync() Line 415	C#

btburnett3 · September 20, 2021, 11:54am

@jbroumels

Do you have that call stack going any further back to trace back to what was calling DataFlowConnectionPool.SendAsync? That way I can review the call sites and see if I can find some other spot we’re choosing not to await.

jbroumels · September 20, 2021, 12:44pm

>	Couchbase.NetClient.dll!Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.SendAsync(Couchbase.Core.IO.Operations.IOperation operation = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, System.Threading.CancellationToken cancellationToken = IsCancellationRequested = false) Line 117	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.ExecuteOp.AnonymousMethod__129_0(Couchbase.Core.IO.Operations.IOperation op2 = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, object state = {Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool}, System.Threading.CancellationToken effectiveToken = IsCancellationRequested = false) Line 624	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.ExecuteOp(System.Func<Couchbase.Core.IO.Operations.IOperation, object, System.Threading.CancellationToken, System.Threading.Tasks.Task> sender = {Method = {System.Reflection.RuntimeMethodInfo}}, Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, object state = {Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 495	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.ExecuteOp(Couchbase.Core.IO.Connections.IConnectionPool connectionPool = {Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool}, Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 624	C#
 	Couchbase.NetClient.dll!Couchbase.Core.ClusterNode.SendAsync(Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 402	C#
 	Couchbase.NetClient.dll!Couchbase.CouchbaseBucket.SendAsync(Couchbase.Core.IO.Operations.IOperation op = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 221	C#
 	Couchbase.NetClient.dll!Couchbase.Core.Retry.RetryOrchestrator.RetryAsync(Couchbase.Core.BucketBase bucket = Name = "default", BucketType = Couchbase, Couchbase.Core.IO.Operations.IOperation operation = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 181	C#
 	Couchbase.NetClient.dll!Couchbase.Core.BucketBase.RetryAsync(Couchbase.Core.IO.Operations.IOperation operation = {Couchbase.Core.IO.Operations.Set<System.DateTime>}, Couchbase.Core.IO.Operations.CancellationTokenPair tokenPair = {Couchbase.Core.IO.Operations.CancellationTokenPair}) Line 184	C#
 	Couchbase.NetClient.dll!Couchbase.KeyValue.CouchbaseCollection.UpsertAsync<System.DateTime>(string id = "c4a1936d-e51a-4fc9-867b-909bd89fb34f", System.DateTime content = {System.DateTime}, Couchbase.KeyValue.UpsertOptions options = {Couchbase.KeyValue.UpsertOptions}) Line 552	C#
 	Couchbase.NetClient.dll!Couchbase.KeyValue.CollectionExtensions.UpsertAsync<System.DateTime>(Couchbase.KeyValue.ICouchbaseCollection collection = {Couchbase.KeyValue.CouchbaseCollection}, string id = "c4a1936d-e51a-4fc9-867b-909bd89fb34f", System.DateTime content = {System.DateTime}, System.Action<Couchbase.KeyValue.UpsertOptions> configureOptions = {Method = {System.Reflection.RuntimeMethodInfo}}) Line 97	C#

So it’s just a regular Upsert being executed when the server is down

btburnett3 · September 20, 2021, 3:00pm

@jbroumels

I’m a bit stumped by this, however I recognized that the execution context doesn’t really need to flow into CleanupDeadConnectionsAsync either, since it isn’t going to flow into sends. So I changed the code there to wrap it all in one global suppress so it applies to CleanupDeadConnectionsAsync and the ContinueWith. I also made it a bit smarter to recognize if flow was already suppressed, which could happen in the circuit breaker scenario mentioned earlier.

Can you try this out and see if it helps?

jbroumels · September 20, 2021, 3:38pm

Hi @btburnett3

with these last changes I have no more errors. So this seems to have fixed it indeed.
Thanks

For the additional checking of the current context suppression. I often use this helper:

public class SafeExecutionContextSuppressor
    {
        public static IDisposable SuppressFlow()
        {
            return ExecutionContext.IsFlowSuppressed() ? NullDisposable.Instance : ExecutionContext.SuppressFlow();
        }
    }

so with that, you can just always do:

using (SafeExecutionContextSuppressor.SuppressFlow())
{
  Task.Run(() => {} );
}

btburnett3 · September 20, 2021, 4:01pm

@jbroumels

I considered something like that, but it has performance implications. ExecutionContext.SuppressFlow returns a structure, so it’s stack allocated. But if you cast it to IDisposable it will box it onto the heap. It also prevents inlining of the Dispose method call. Since SendAsync is such a hot path, I’d prefer to keep it on the stack.

btburnett3 · September 21, 2021, 3:15am

jbroumels:

@btburnett3 I’ve tried out the expirimental ChannelConnectionPools.
In my test setup, I run just 1 Couchbase server. When I restart it (what I do to trigger errors regarding logging), it will however never re-connect.
Issue getting Cluster Map on server [::1]:11210!

 message: "Issue getting Cluster Map on server [::1]:11210!", exceptionmessage: "The operation /488 timed out after 00:00:02.5000000. It was retried 0 times using Couchbase.Core.Retry.BestEffortRetryStrategy.", rootexceptionmessage: "[Couchbase.Core.Exceptions.UnambiguousTimeoutException]: The operation /488 timed out after 00:00:02.5000000. It was retried 0 times using Couchbase.Core.Retry.BestEffortRetryStrategy.", stacktrace: "   at Couchbase.Utils.ThrowHelper.ThrowTimeoutException(IOperation operation, IErrorContext context) in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Utils\\ThrowHelper.cs:line 74\r\n   at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationTokenPair tokenPair) in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\ClusterNode.cs:line 591\r\n   at Couchbase.Core.ClusterNode.GetClusterMap() in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\ClusterNode.cs:line 317\r\n   at Couchbase.Core.Configuration.Server.ConfigHandler.<Poll>b__13_0() in D:\\dotnet\\couchbase-net-client\\src\\Couchbase\\Core\\Configuration\\Server\\ConfigHandler.cs:line 109"

With the normal pool, it does reconnect automatically after failures

@jbroumels

Thanks for bringing this problem with ChannelConnectionPools to my attention, turns out it was a pretty silly oversight. A pull request is up to fix it: http://review.couchbase.org/c/couchbase-net-client/+/161915

jbroumels · September 21, 2021, 6:54am

@btburnett3
I can confirm now it reconnects correctly with the ChannelConnectionPools.
And indeed, agree with the helper suppressing context is indeed less optimal for hot code paths

Thanks for all the work done fixing these issues!