How to handle infinite retry loop cause by TemporaryFailure with the .NET client

I’m doing some proof-of-concept tests with Couchbase in Docker and an ASP .NET Core 3.1 application using the :NET SDK v.3.0.4.811 and I’m getting stuck in a sort of infinite loop of retries.

I think the issue is that I’m barraging the server with too many messages too quickly and I’m runing into a out-of-memory situation on the server side, ref https://docs.couchbase.com/couchbase-devguide-2.5/#handling-temporary-out-of-memory-errors. The line of code causing my issues is the following:

result = await _bucket.DefaultCollection().InsertAsync("mykey", myObject).ConfigureAwait(false);

After some tens of thousands of inserts this starts to fail and the console is flooded with the following logs:

[16:20:41 DBG] Server [::1]:11210 returned TemporaryFailure for op Add with key myKey and opaque 28316.
[16:20:41 DBG] Retrying op 28316/myKey because KvTemporaryFailure.

It’s trying to perform the same operation over and over again. The InsertAsync() method apparantly never returns and no exception is thrown. How am I supposed to detect this sort of situation? The result returned by InsertAsync() does not have any indication of success/failure, and without any exceptions I don’t see how I can detect the problem and back off the client.

I’m guessing there is some sort of config option somewhere that I am missing. Something to configure the client to not retry forever and eventually throw an exception?

Yes, that sounds like the issue you are running into.

The operations should be timing out and that is the exception (TimeoutException) you should be receiving. I am not sure why you are not; I’ll look into what the problem is.

Can you post your code?

-Jeff

@jmorris it could be that the ops aren’t timing out and just taking longer, which is really working as designed, right? These appear to be debug level logs.

@olavaukan, if the cluster’s data service is under memory pressure, it returns a temporary failure (sometimes called TMP OOM, technically erroneously) and the SDK will backoff and retry automatically up until the timeout. If you’re doing something like a bulk load, you can increase the timeout.

Of course, you can also speed it up by giving the cluster more IO (disks), especially if you’re doing this regularly.

Thanks for the quick response @jmorris and @ingenthr. Unfortunately I can’t just share my code directly, but I’ve created a very basic .NET Core console application that can reproduce the issue.

At fist all is well:

[17:13:58 DBG] Completed executing op [::1]:11210 on Add with key item-23853 and opaque 24208
[17:13:58 DBG] CB: Current state is Closed.
[17:13:58 DBG] CB: Sending 24209 to [::1]:11210.
[17:13:58 DBG] Executing op Add on [::1]:11210 with key item-23854 and opaque 24209.
Items saved: 23 855
[17:13:58 DBG] Completed executing op [::1]:11210 on Add with key item-23854 and opaque 24209
[17:13:58 DBG] CB: Current state is Closed.
[17:13:58 DBG] CB: Sending 24210 to [::1]:11210.
[17:13:58 DBG] Executing op Add on [::1]:11210 with key item-23855 and opaque 24210.
Items saved: 23 856
[17:13:58 DBG] Completed executing op [::1]:11210 on Add with key item-23855 and opaque 24210
[17:13:58 DBG] CB: Current state is Closed.
[17:13:58 DBG] CB: Sending 24211 to [::1]:11210.
[17:13:58 DBG] Executing op Add on [::1]:11210 with key item-23856 and opaque 24211.
Items saved: 23 857
[17:13:58 DBG] Completed executing op [::1]:11210 on Add with key item-23856 and opaque 24211
[17:13:58 DBG] CB: Current state is Closed.
[17:13:58 DBG] CB: Sending 24212 to [::1]:11210.
[17:13:58 DBG] Executing op Add on [::1]:11210 with key item-23857 and opaque 24212.
Items saved: 23 858

Then after a while it starts a retrying:

[17:15:01 DBG] Executing op Add on [::1]:11210 with key item-25256 and opaque 25633.
[17:15:01 DBG] Server [::1]:11210 returned TemporaryFailure for op Add with key item-25256 and opaque 25633.
[17:15:01 DBG] Retrying op 25633/item-25256 because KvTemporaryFailure.
[17:15:02 DBG] CB: Current state is Closed.
[17:15:02 DBG] CB: Sending 25633 to [::1]:11210.
[17:15:02 DBG] Executing op Add on [::1]:11210 with key item-25256 and opaque 25633.
[17:15:02 DBG] Server [::1]:11210 returned TemporaryFailure for op Add with key item-25256 and opaque 25633.
[17:15:02 DBG] Retrying op 25633/item-25256 because KvTemporaryFailure.
[17:15:03 DBG] CB: Current state is Closed.
[17:15:03 DBG] CB: Sending 25633 to [::1]:11210.
[17:15:03 DBG] Executing op Add on [::1]:11210 with key item-25256 and opaque 25633.
[17:15:03 DBG] Server [::1]:11210 returned TemporaryFailure for op Add with key item-25256 and opaque 25633.
[17:15:03 DBG] Retrying op 25633/item-25256 because KvTemporaryFailure.
[17:15:04 DBG] Executing op GetClusterConfig on [::1]:11210 with key  and opaque 25665.
[17:15:04 DBG] Completed executing op [::1]:11210 on GetClusterConfig with key  and opaque 25665
[17:15:04 DBG] {"NetworkResolution":"auto","SurrogateHost":null,"rev":48,"name":"test","uri":"/pools/default/buckets/test?bucket_uuid=6de709bb6bc95b4485a7df46d75a9a41","streamingUri":"/pools/default/bucketsStreaming/test?bucket_uuid=6de709bb6bc95b4485a7df46d75a9a41","nodes":[{"couchApiBase":"http://$HOST:8092/test%2B6de709bb6bc95b4485a7df46d75a9a41","couchApiBaseHttps":"https://$HOST:18092/default","hostname":"localhost:8091","ports":{"direct":11210,"proxy":11211,"sslDirect":11207,"httpsCAPI":18092,"httpsMgmt":18091},"services":null,"version":null}],"nodesExt":[{"thisNode":true,"services":{"mgmt":8091,"mgmtSSL":0,"indexAdmin":9100,"indexScan":9101,"indexHttp":9102,"indexStreamInit":9103,"indexStreamCatchup":9104,"indexStreamMaint":9105,"indexHttps":0,"kv":11210,"kvSSL":0,"capi":8092,"capiSSL":0,"projector":9999,"n1ql":8093,"n1qlSSL":0,"cbas":0,"cbasSSL":0,"fts":8094,"ftsSSL":0,"moxi":0},"hostname":"localhost","alternateAddresses":null,"HasAlternateAddress":false}],"nodeLocator":"vbucket","uuid":"6de709bb6bc95b4485a7df46d75a9a41","ddocs":{"uri":"/pools/default/buckets/test/ddocs"},"vBucketServerMap":{"hashAlgorithm":"CRC","numReplicas":0,"serverList":["localhost:11210"],"vBucketMap":[[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0]],"vBucketMapForward":[]},"bucketCapabilitiesVer":"","bucketCapabilities":["durableWrite","tombstonedUserXAttrs","couchapi","dcp","cbhello","touch","cccp","xdcrCheckpointing","nodesExt","xattr"],"clusterCapabilitiesVer":[1,0],"clusterCapabilities":{"n1ql":["enhancedPreparedStatements"]},"VBucketMapChanged":false,"ClusterNodesChanged":false}

I’m running this against a bucket configured with (the minimum) RAM of 100MB and the issue starts showing up after adding about 25.000 items. I undestand that the retry behavior is “by design” but shouldn’t it eventually time out? I’m not setting any explicit timeout values here, so I guess it’s using the default behavior? After ten minutes it’s still retrying the same operation (item-25256):

[17:27:46 DBG] Executing op Add on [::1]:11210 with key item-25256 and opaque 25633.
[17:27:46 DBG] Server [::1]:11210 returned TemporaryFailure for op Add with key item-25256 and opaque 25633.
[17:27:46 DBG] Retrying op 25633/item-25256 because KvTemporaryFailure.
[17:27:47 DBG] CB: Current state is Closed.
[17:27:47 DBG] CB: Sending 25633 to [::1]:11210.
[17:27:47 DBG] Executing op Add on [::1]:11210 with key item-25256 and opaque 25633.
[17:27:47 DBG] Server [::1]:11210 returned TemporaryFailure for op Add with key item-25256 and opaque 25633.
[17:27:47 DBG] Retrying op 25633/item-25256 because KvTemporaryFailure.
[17:27:48 DBG] CB: Current state is Closed.
[17:27:48 DBG] CB: Sending 25633 to [::1]:11210.
[17:27:48 DBG] Executing op Add on [::1]:11210 with key item-25256 and opaque 25633.
[17:27:48 DBG] Server [::1]:11210 returned TemporaryFailure for op Add with key item-25256 and opaque 25633.
[17:27:48 DBG] Retrying op 25633/item-25256 because KvTemporaryFailure.

To reproduce create a new ASP .NET Core 3.1 console app with the following files.

Program.cs:

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

namespace ConsoleApp4
{
    class Program
    {
        public static async Task Main(string[] args)
        {
            Console.WriteLine("[DEBUG] Initializing application...");
              
            var services = new ServiceCollection();
            var startup = new Startup();

            startup.ConfigureServices(services);

            var serviceProvider = services.BuildServiceProvider();
            var app = serviceProvider.GetService<IApp>();

            Console.WriteLine("[DEBUG] Application Initialized");
            Console.WriteLine("[DEBUG] Running Tasks...");

            await app.ExecuteAsync();

            Console.WriteLine("[DEBUG] All Tasks Completed");
            Console.ReadKey();
        }
    }
}

Startup.cs

using System.IO;
using Couchbase.Extensions.DependencyInjection;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Serilog;

namespace ConsoleApp4
{
    public class Startup
    {
        public IConfiguration Configuration { get; }

        public Startup()
        {
            var builder = new ConfigurationBuilder()
                .SetBasePath(Directory.GetCurrentDirectory())
                .AddJsonFile("appsettings.json");

            Configuration = builder.Build();
        }

        public void ConfigureServices(IServiceCollection services)
        {
            // Configuration & Settings
            services.AddSingleton(Configuration);

            // Serilog
            var loggerConfiguration = new LoggerConfiguration().ReadFrom.Configuration(Configuration);
            services.AddLogging(builder =>
            {
                builder.AddSerilog(loggerConfiguration.CreateLogger(), true);
            });

            // Couchbase
            services.AddCouchbase(Configuration.GetSection("Couchbase"));

            // Main Application
            services.AddTransient<IApp, App>();
        }
    }
}

IApp.cs

using System.Threading.Tasks;

namespace ConsoleApp4
{
    public interface IApp
    {
        Task ExecuteAsync();
    }
}

App.cs

using System;
using System.Text;
using System.Threading.Tasks;
using Couchbase.Extensions.DependencyInjection;

namespace ConsoleApp4
{
    public class App : IApp
    {
        private IBucketProvider _bucketProvider;

        public App(IBucketProvider bucketProvider)
        {
            _bucketProvider = bucketProvider;
        }

        public async Task ExecuteAsync()
        {
            Console.WriteLine("Starting main application...");

            var bucket = await _bucketProvider.GetBucketAsync("test");
            var collection = bucket.DefaultCollection();

            var itemsSaved = 0;
            var rnd = new Random(DateTime.Now.Millisecond);

            var exiting = false;
            while (!exiting)
            {
                var key = $"item-{itemsSaved}";
                var data = new DummyObject
                {
                    DummyInt1 = rnd.Next(10000),
                    DummyInt2 = rnd.Next(10000),
                    DummyInt3 = rnd.Next(10000),
                    DummyInt4 = rnd.Next(10000),
                    DummyInt5 = rnd.Next(10000),
                    DummyString1 = RandomString(rnd, 50000),
                    DummyString2 = RandomString(rnd, 50000),
                    DummyString3 = RandomString(rnd, 50000),
                    DummyString4 = RandomString(rnd, 50000),
                    DummyString5 = RandomString(rnd, 50000)
                };

                try
                {
                    await collection.InsertAsync(key, data);

                    itemsSaved++;
                }
                catch (Exception e)
                {
                    Console.WriteLine($"[EXCEPTION] {e.Message}");
                    exiting = true;
                }

                Console.WriteLine($"Items saved: {itemsSaved:N0}");
            }

            Console.WriteLine("Closing main application...");
        }

        // Generates a random string with a given size.    
        private static string RandomString(Random rnd, int size, bool lowerCase = false)  
        {  
            var builder = new StringBuilder(size);
            char offset = lowerCase ? 'a' : 'A';  
            const int lettersOffset = 26;
            for (var i = 0; i < size; i++)  
            {  
                var @char = (char)rnd.Next(offset, offset + lettersOffset);  
                builder.Append(@char);  
            }
            return lowerCase ? builder.ToString().ToLower() : builder.ToString();  
        } 
    }
}

DummyObject.cs

namespace ConsoleApp4
{
    public class DummyObject
    {
        public int DummyInt1 { get; set; }
        public int DummyInt2 { get; set; }
        public int DummyInt3 { get; set; }
        public int DummyInt4 { get; set; }
        public int DummyInt5 { get; set; }

        public string DummyString1 { get; set; }
        public string DummyString2 { get; set; }
        public string DummyString3 { get; set; }
        public string DummyString4 { get; set; }
        public string DummyString5 { get; set; }
    }
}

appsettings.json

{
  "Couchbase": {
    "ConnectionString": "couchbase://localhost",
    "Username": "username",
    "Password": "password"
  },
  "Serilog": {
    "MinimumLevel": {
      "Default": "Debug",
      "Override": {
        "Microsoft": "Information",
        "System": "Information"
      }
    },
    "WriteTo": [
      {
        "Name": "Async",
        "Args": {
          "configure": [
            { "Name": "Console" }
          ]
        }
      }
    ]
  }
}

@olavaukan -

Thanks for the detail in your post; the operations should be timing out, if not its a bug in the SDK. I’ll use your example to verify and create a bug ticket.

-Jeff

@olavaukan -

I ran your application and I see a TimeoutException being thrown after the internal timeout threshold is met:

This is expected; are you seeing something different when you run your app?

-Jeff

@jmorris That is very strange. I’m not seeing any exceptions at all. What version of the SDK are you using?

Not sure if it make a difference but my Couchbase instance is running in Docker using the default docker networking. I seem to remember reading on the DockerHub page that this might cause some issues?

@olavaukan -

It turns out that was against HEAD of master branch on github. I then tested against the last release 3.0.5 and was able to recreate the infinite bootstrap behavior you described above. It turns out its fixed by this commit for ticket NCBC-2557 which improves the timeout handling across the entire SDK.

The fix will be on NuGet for release 3.0.6 which is scheduled for the first week of October, 2020. Note that some of the tickets in the release will likely be moved into a later release.

-Jeff

1 Like