Frozen application after application pool starts or recycles under load

Summary
We have had some problems with our application after the application pool started or recycled. When the application pool started or recycled under load it became completely frozen. Some information about our application:

  • Couchbase SDK: CouchbaseNetClient 2.4.8 (at the time)
  • .NET framework: .NET 4.6.1 (64-bit)
  • Web framework: ASP.NET Core
  • OS: Windows Server 2012 R2
  • Web server: IIS/Kestrel

The application is an identity and access management application which handles over a million user authentications daily and is continually growing. This results in over 500 requests per second at peak times. We use a load balancer to share the load across multiple IIS web servers which in turn pass requests to the Kestrel process.

To determine the cause we analyzed our logging events (we log all dependent services extensively, such as Couchbase and SQL Server, using syslog messages) and captured process dumps (which can of course be shared with you on request), which we analyzed using WinDbg. We also asked Microsoft for help; an escalation engineer analyzed the dumps with WinDbg as well. I will describe our findings, what we did to work around the problem and how we think it can be fixed permanently.

The problem
We use lazy initialization of the Couchbase Cluster and IBucket objects, which means that initialization happens upon first use instead of during application initialization. Upon first use a Cluster is created, which is stored and reused throughout the application lifetime, then OpenBucket is called. The resulting IBucket from the OpenBucket call is also stored and reused throughout the application lifetime. This logic is similar to the ClusterHelper logic.

When our application starts it immediately receives 100+ requests per second which mostly depend on Couchbase, thus running into the Couchbase initialization through calling OpenBucket (as the IBucket is not yet created and stored for later use). In OpenBucket a lock statement ensures that bucket initialization only happens once, so the first thread obtains the lock and the other threads must wait until it is released. The first thread starts initializing the bucket which will eventually call the UriExtensions.GetIpAddress method (as we use hostnames for our Couchbase nodes). Inside the UriExtensions.GetIpAddress method the asynchronous method Dns.GetHostEntryAsync is called and, because the calling method is not asynchronous, the Result property is used to obtain the resulting host entry. This is where it starts to go south.

The task returned by Dns.GetHostEntryAsync, although consumed synchronously, needs another thread pool thread to complete on. Because the thread pool is fresh and still growing, all threads are in use and waiting for the lock inside the OpenBucket method, so at that point there is no thread available to execute the Dns.GetHostEntryAsync task. Every time the thread pool adds threads, they are immediately used by Kestrel for executing other queued requests. These new threads also run into bucket initialization and end up waiting for the lock inside the OpenBucket method. The result is an application which is completely locked up.
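The blocking pattern described above can be sketched roughly as follows. This is a simplified stand-in, not the SDK's actual code: BlockingInitSketch, GetOrResolve and the resolveAsync delegate are hypothetical names, with resolveAsync playing the role of Dns.GetHostEntryAsync.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

// Hypothetical stand-in for the OpenBucket path; not the SDK's real code.
public static class BlockingInitSketch
{
    private static readonly object SyncRoot = new object();
    private static IPAddress _cached;

    public static IPAddress GetOrResolve(Func<Task<IPAddress>> resolveAsync)
    {
        // Every request thread that arrives during initialization blocks here.
        lock (SyncRoot)
        {
            if (_cached == null)
            {
                // Sync-over-async: .Result blocks the lock holder until the
                // task completes. If completing the task needs a thread pool
                // thread and all of them are parked on the lock above,
                // nothing can make progress.
                _cached = resolveAsync().Result;
            }
            return _cached;
        }
    }
}
```

Without contention this works fine, which is why the problem only shows up when the pool starts or recycles under load.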

Possible fixes
We have made a temporary fix in a wrapper method around the OpenBucket method which uses a SemaphoreSlim for locking during bucket initialization (which of course also can be shared with you if requested). Because our application is mostly asynchronous we can use the SemaphoreSlim.WaitAsync method which, when used in an asynchronous context, will return the waiting thread to the thread pool while waiting for the lock. Because these threads are returned to the thread pool there is an available thread for executing the task from the Dns.GetHostEntryAsync method.
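A minimal sketch of such a wrapper, assuming a generic helper rather than the actual code used in the application: AsyncInitGate and GetAsync are hypothetical names, and the open delegate stands in for the synchronous OpenBucket call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: runs a synchronous initializer once, while other
// callers wait asynchronously instead of blocking thread pool threads.
public sealed class AsyncInitGate<T> where T : class
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly Func<T> _open;
    private T _instance;

    public AsyncInitGate(Func<T> open)
    {
        _open = open ?? throw new ArgumentNullException(nameof(open));
    }

    public async Task<T> GetAsync()
    {
        // Fast path once initialization has completed.
        var existing = Volatile.Read(ref _instance);
        if (existing != null) return existing;

        // Waiters give their thread back to the pool here, so a thread
        // remains available to complete any task the initializer blocks on.
        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (_instance == null)
            {
                // Only one caller at a time reaches this point.
                Volatile.Write(ref _instance, _open());
            }
            return _instance;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

In an application the delegate would wrap the existing call, for example `() => cluster.OpenBucket("default")` with a hypothetical bucket name. Callers must await GetAsync; blocking on its result would reintroduce the original problem.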

As said, this is just a temporary fix. The real fix would be to use the synchronous Dns.GetHostEntry method, but this method is only available in .NET Standard 2.0 and above which would result in another target framework. Another approach would be to make bucket initialization asynchronous all the way by providing an OpenBucketAsync method. Pull request NCBC-1549 by @dlemstra, a colleague, is a first step towards providing the asynchronous creation of buckets and contains essentially the fix we made in our wrapper method mentioned earlier.

This would fix the problem in asynchronous contexts, but it does not remove the need to call Dns.GetHostEntryAsync from the synchronous OpenBucket method. Asynchronous methods seem to be used in synchronous contexts quite a lot in the CouchbaseNetClient library, but Dns.GetHostEntryAsync is the only one that has given us serious trouble (until now…).

@arjenpost

Thanks for this excellent summary of the situation. @jmorris and I were just discussing the implications of adding OpenBucketAsync yesterday due to this problem.

I do have a question about your use case that I’d like to clarify. Why do you open the bucket on demand at the first request? Is there a special reason for this based on your specific needs? The more common use case is to connect to the cluster and open the bucket during application initialization.

For example, the pattern I’ve used most recently is to open the bucket as part of the health check at a /health URL. Then the load balancer doesn’t route traffic to the server until it’s fully healthy, including bootstrapped to Couchbase. This has the additional advantage that requests don’t get hung up waiting for bootstrapping: all traffic goes to the other servers until the newly started one is ready to serve responses rapidly.
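One possible shape for such a health check, as a framework-neutral sketch rather than the actual implementation: ReadinessCheck and GetStatusCode are hypothetical names, the warmUp delegate stands in for connecting to the cluster and opening the bucket, and wiring the status code to a /health route is left to the web framework.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: reports 503 until the warm-up work has completed.
public sealed class ReadinessCheck
{
    private readonly Lazy<Task> _warmUp;

    public ReadinessCheck(Func<Task> warmUp)
    {
        // Lazy<T> defaults to thread-safe mode, so warm-up starts only once
        // even if several probes arrive at the same time.
        _warmUp = new Lazy<Task>(warmUp);
    }

    // Status code to return from the health endpoint.
    public int GetStatusCode()
    {
        // The first probe starts the warm-up; no probe waits for it.
        var task = _warmUp.Value;

        // A faulted or cancelled warm-up keeps reporting 503 in this sketch.
        return task.Status == TaskStatus.RanToCompletion ? 200 : 503;
    }
}
```

Because the bucket is opened synchronously, the delegate would typically push that work to a background thread, for example `() => Task.Run(() => OpenBucketOnce())`, where OpenBucketOnce is a hypothetical method that opens and caches the bucket.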

Thanks,
Brant

Thanks for your reply. We open the buckets upon first use rather than during application initialization to keep application startup fast. The load balancer currently monitors /, so monitoring a health endpoint instead sounds like a very good idea. The logic of opening the bucket can remain the same because only the health endpoint will be hit at first. I’ll discuss that with my team today.

We have been discussing contributing to the Couchbase SDK by sending more pull requests and iteratively moving to a completely asynchronous opening of buckets. We’re not sure we can make a big investment at this time because we are very busy, but we do intend to. Could you please tell us your position on opening buckets asynchronously?

Thanks again!

@arjenpost

I think the biggest concern that we have with OpenBucketAsync is that it may encourage misuse of the SDK. We regularly have issues with users confused by poor performance, and we find out they are using OpenBucket on each call rather than caching it or using ClusterHelper. If we add OpenBucketAsync, this may exacerbate that problem.

We’ve got our heads together on it, and we’ll see what ideas we can come up with. One idea I had was to make OpenBucketAsync internal, and then just expose GetBucketAsync on ClusterHelper which would ensure a cache. But since you aren’t using ClusterHelper this wouldn’t help you.

Any suggestions you have would be welcome as well.

I’m also going to look into what we can do to help prevent the thread pool depletion problem in Dns.GetHostEntryAsync. In the worst case we can add a netstandard20 build to the NuGet package that uses Dns.GetHostEntry.

I understand your concerns on the misuse of the SDK, but I’m not sure why adding the OpenBucketAsync method would exacerbate the misuse. Making the OpenBucketAsync available through GetBucketAsync on ClusterHelper would already be nice though.

Adding netstandard20 as a target framework and using a directive to call Dns.GetHostEntry does not seem like an intrusive modification but it does provide benefit for applications using netstandard20. Of course, the problem still exists for applications using netstandard16 or lower.
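Such a directive could look roughly like the sketch below. HostResolver and Resolve are hypothetical names, not the SDK's UriExtensions code, and the sketch assumes the project multi-targets so that the build system defines NETSTANDARD2_0 for the netstandard2.0 build.

```csharp
using System.Net;

// Hypothetical helper showing the conditional compilation only.
public static class HostResolver
{
    public static IPAddress[] Resolve(string host)
    {
#if NETSTANDARD2_0
        // Synchronous API, available from .NET Standard 2.0 onwards.
        return Dns.GetHostEntry(host).AddressList;
#else
        // Older targets: still blocks on the asynchronous API, so the
        // thread pool problem remains for these builds.
        return Dns.GetHostEntryAsync(host).GetAwaiter().GetResult().AddressList;
#endif
    }
}
```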

@arjenpost

I actually had some trouble trying to add the call to Dns.GetHostEntry in netstandard20. The documentation says it’s there, but Visual Studio kept claiming it wasn’t on build. I might go back and look at it some more at some point.

Regarding the misuse problem, we may have come up with another approach to help prevent misuse. This may alleviate the concern so that we could move forward with OpenBucketAsync. I’m interested in @jmorris’s opinion, as he was the one who originally raised the concern with me. The idea I had was using Roslyn analyzers for VS2015 and 2017 to help detect misuse and provide squiggles in the editor as a warning. Work in progress can be viewed here: http://review.couchbase.org/#/c/85103/

@btburnett3 @arjenpost

Sorry for the late reply:

@btburnett3 I like the approach you took with the Roslyn analyzers; that is probably the best we can do for alerting users that they may be using the API wrong. I have some more input for your PR, which I’ll make on the CR in Gerrit.

@arjenpost -

We’ve discussed async OpenBucket in the past and the reason it wasn’t implemented is two-fold:

  1. Our best practices state that OpenBucket should be called by the main thread during application startup and the reference disposed by the main thread when the application shuts down. This is because bootstrapping is a synchronous process done once, after which the bucket reference can be used across worker threads. For this we have always suggested bootstrapping in Application_Start in global.asax or in Startup.cs’s Configure or ConfigureServices, since they are called only once when the application starts up. The same goes for disposing.

  2. I can’t say that it’s never been requested, but it has seldom been requested, so its priority has been lower than that of other features.

That being said, I think your request/proposal/pr adds a lot of value and we are more than happy to work with you to get this feature into the SDK!

The synchronous API came about before the asynchronous API and that boat sailed long ago! What we have been doing is making any synchronous code in an asynchronous path asynchronous, but it takes time. Definitely room for improvement there. The 3.0 version, whenever we get around to it, will most likely be purely async.

-Jeff