Couchbase 2.2.1 Performance Degregation vs 2.1.4 + Durability Bug


#1

tl;dr - There appears to be a missing ! on line in the 2.2.1 CouchbaseNetClient which causes all operations with durability to execute minimally for ObserveTimeout milliseconds and always reports back failed durability when running Couchbase Server without SupportsEnhancedDurability (<4.x?).

Running Couchbase Server 3.1.0-1776, we recently upgraded the CouchbaseNetClient from version 2.1.4 to 2.2.1 and imediately started noticing a significant performance degregation during upsert operations with ReplicateTo.One, as well as false durability failures. Source to reproduce is -

using System;
using System.Collections.Generic;
using System.Diagnostics;
using Couchbase;
using Couchbase.Configuration.Client;

namespace Petridish
{
    class Program
    {
        static void Main(string[] args)
        {
            var doc = $"{{\"doc\":\"{new string('a', 5000)}\"}}";

            var config = new ClientConfiguration
            {
                BucketConfigs = new Dictionary<string, BucketConfiguration>
                {
                    { "test_1", new BucketConfiguration { BucketName = "test_1" } }
                },
                Servers = new List<Uri> { new Uri("http://111.111.111.111:8091/") },
            };

            using (var cluster = new Cluster(config))
            {
                using (var bucket = cluster.OpenBucket("test_1"))
                {
                    var stopwatch = Stopwatch.StartNew();

                    var result = bucket.Upsert(Guid.NewGuid().ToString(), doc, ReplicateTo.One);

                    Console.WriteLine($"Duration(ms): {stopwatch.ElapsedMilliseconds}");
                    Console.WriteLine($"Success: {result.Success}; Durability: {result.Durability}");
                }
            }
        }
    }
}

Again, the important thing to consider when reproducing are the durability constraint of ReplicateTo.One which require testing with a multi node cluster, and Couchbase Server < 4.X. The above outputs -

> Duration(ms): 579
> Success: True; Durability: NotSatisfied

There are two things which appear to be an indicator of a deeper bug: 1. The duration of just over 500 ms, 2. The Durability of NotSatisfied.

From dotTrace, we found the following difference (2.1.4 on the left, 2.2.1 on the right) -

You’ll notice highlighted in yellow that in 2.2.1 there were 16 observe operations sent to the server vs 2 in 2.1.4, which is what accounts for the performance degregation. Looking into KeyObserver.Observe, we noticed a peculiar difference between 2.1.4 on the left and 2.2.1 on the right

where in 2.2.1 ObserveParams.Reset was being executed, which reset the PersistedToCount and ReplicatedToCount that were being incremented inside of the CheckReplica method. This appears to be a bug fix for 2.1.4 where if you were to specify a ReplicateTo of anything more than 1, you could technically get back that you satisfied your durability constraints falsly because you will double count a single replica node each iteration of the do/while loop, eventually escaping the loop so long as the operation has replicated to at least 1 node. With the introduction of the p.Reset(); highlighted in yellow, the do/while loop never terminates early, only terminating after operation.TimedOut(), which is why we see a duration of just over 500ms = default configured ObserveTimeout, and a Durability of NotSatisfied in the above code snippet. The fact that the do/while is never able to terminate, points to what appears to be a bug in KeyObserver.CheckReplicas, see below -

I don’t understand what “key has mutated” enough to say with 100% confidence, but from the comment on line 378 that CheckReplica returns true if not mutated, it appears that line 368 is really calculating if not mutated; i.e. line 368 should be var mutated = !replicas.All(index => CheckReplica(observeParams, operation, index));. This appears to fix the issues, allowing the do/while loop to be broken out of without timing out, which reports accurately the durability and fixes the timings issues.


#2

@mhederi -

Indeed it does appear to be a bug, I created a ticket to track here. I’ll take a deeper look ASAP and confirm. If you would like you can always submit a pull request since it looks like you have a fix for it. If you need help with that, lmk!

-Jeff


#3

Thanks Jeff! I wish I had some spare cycles, but realistically, it’d probably be a while before I could get around to it. We went ahead and downgraded back to 2.1.4 in the mean time.

  • Miller