What's the most efficient way to bulk update the expiry of documents?


#1

I first stored documents in a bucket without expiry, but now I retroactively want to set an explicit expiry for all my documents.

The only solution I found so far is to create the following view:

function (doc, meta) {
  emit(null, null);
}

I then query this view from a C# Console application, iterate over the rows, and call Touch for each document.

This approach works, but in my first try I could only process ~130 documents per second. That is too slow for my use case: I have more than 15 million documents in my bucket, so a full pass would take more than 32 hours at that rate.
Is there a more efficient approach? Or am I on the right path and I just have to fine-tune my Console app doing the processing to be more efficient?


#2

Btw. I tried multiple approaches in my Console app:

  • just naively iterate over the documents and call the blocking Touch() on them one by one.
  • use Parallel.ForEach() and do the same thing
  • use TouchAsync() instead, start up a bunch of updates in parallel from one thread, and then wait for all of them with Task.WaitAll()

The first two give comparable results; the parallelized one is somewhat faster.
The third option on the other hand always throws an exception after a couple of operations saying “Couchbase Failed to acquire a pooled client connection on 12.23.34.45 after 5 tries”.


#3

I think I would have expected option 3 to work well. Do you have a code snippet? Which version of the client?

Also, do you have muxio on? See the docs or the blog from when the feature shipped. It should help a lot with the asynchronous approach.


#4

@markvincze

Parallel.ForEach limits its parallelism to roughly the number of cores on your computer, so it’s good for processor-intensive work but not great for work that mostly waits on the network.

My suggestion would be a modification of option 3. Instead of spinning up all the tasks at once, use a SemaphoreSlim to make sure you’re not doing more than a few dozen at a time. I’ve seen that trying to start too many tasks at once can have a negative perf impact.

Also, make sure you’re using multiplex IO (default after 2.4.0) or that you have a very high maximum number of connections.
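A minimal sketch of the SemaphoreSlim idea, not tied to the SDK: ThrottledRunner, RunAsync, and the operation delegate are hypothetical names for illustration. In practice the delegate would wrap the TouchAsync call from option 3, and the right concurrency limit is something to measure rather than assume.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: not part of the Couchbase SDK.
public static class ThrottledRunner
{
    // Runs `operation` once per key, with at most `maxConcurrency`
    // operations in flight at any moment.
    public static async Task RunAsync(
        IEnumerable<string> keys,
        Func<string, Task> operation,
        int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = new List<Task>();

            foreach (var key in keys)
            {
                // Wait (without blocking a thread) until a slot is free.
                await gate.WaitAsync();
                tasks.Add(RunOneAsync(key, operation, gate));
            }

            // Wait for the stragglers; rethrows if any operation failed.
            await Task.WhenAll(tasks);
        }
    }

    private static async Task RunOneAsync(
        string key, Func<string, Task> operation, SemaphoreSlim gate)
    {
        try
        {
            await operation(key);
        }
        finally
        {
            // Always free the slot, even when the operation throws.
            gate.Release();
        }
    }
}
```

A call might look like ThrottledRunner.RunAsync(keys, key => bucket.TouchAsync(key, expiry), 32), where 32 is only a starting guess for "a few dozen".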

Brant


#5

@ingenthr, @btburnett3,
It’s interesting, I don’t understand yet why option 3 with the async call does not work. I even tried only starting 1 task at a time, and I still get the same exception.

This is a simplified version of the code (I stripped the printing and error handling):

var batchSize = 100;
var defaultExpiry = TimeSpan.FromDays(10);
var bucket = ClusterHelper.GetBucket(bucketName);
var cnt = 0; // index of the current page of view results

while (true)
{
    var query = bucket.CreateQuery("all_keys", "all_keys", false).Skip(cnt * batchSize).Limit(batchSize);

    var queryResult = bucket.Query<object>(query);
    var rows = queryResult.Rows.ToList();

    foreach (var r in rows)
    {
        var result = bucket.Touch(r.Id, defaultExpiry);

        if (!result.Success)
        {
            throw new Exception($"Touch operation failed. Message: {result.Message}", result.Exception);
        }
    }

    if (rows.Count < batchSize)
    {
        break;
    }

    cnt++;
}

This is the simplest approach, doing the Touch() call one by one, and this works.
However, if I replace the line calling Touch() with this:

var result = bucket.TouchAsync(r.Id, defaultExpiry).Result;

So I’m not even doing anything in parallel, just waiting for every task synchronously. Still, after a couple of iterations I receive the response “Touch operation failed. Message: Failed to acquire a pooled client connection on 146.148.21.160:11210 after 5 tries.”. (And of course if I try to start more of them in parallel, I get the same thing.)

The version of my client library is 2.4.2. @btburnett3 what do you mean by multiplex IO exactly? Is that a different interface available on the SDK?


#6

@markvincze -

Can you include your configuration?

Multiplex IO (MUX) is the “newer” IO engine for the SDK. It’s the default IO engine if you are using 2.4.0 or greater and do not have SSL enabled. There is an overview in this post; however, you no longer have to explicitly configure it.

This makes me think you are using the older pooled IO engine and need to increase the PoolConfiguration.MaxSize to a higher value. Your configuration will help :)

-Jeff


#7

@jmorris: shouldn’t operations try until timeout, not a fixed number of tries though? Is that message wrong, or is there a bug?


#8

@ingenthr -

It likely timed out and the response status should indicate that it did; that message should help the user identify the issue - connection pool starvation for whatever reason.


#9

I have played around quite a lot with those numbers. This is the current configuration I’m testing with:

var config = new ClientConfiguration()
{
    Servers = new List<Uri>
    {
        new Uri(serverAddress)
    },
    UseConnectionPooling = true,
    DefaultConnectionLimit = 100,
};

ServicePointManager.DefaultConnectionLimit = 100;

config.BucketConfigs.First().Value.PoolConfiguration.MaxSize = 100;
config.BucketConfigs.First().Value.PoolConfiguration.MinSize = 100;

ClusterHelper.Initialize(config);

So it turns out I’m not using MUX, since if I understand correctly, setting UseConnectionPooling to true falls back to the old pooling behavior, right?
The weird thing is that if I change it to false, then immediately the first bucket.TouchAsync(r.Id, defaultExpiry).Result call returns an exception with the message “The operation has timed out.” and nothing else. Maybe the async call and the synchronization context are causing a deadlock? It’s only happening with the Async version, the synchronous version works fine. I’m still trying to experiment with various different ways to call this.


#10

That is possible. Try awaiting on the task and see what happens.

Your MaxSize is large, so I am really surprised to see that error even with the pooled connections. There does seem to be something odd going on there. OTOH, if you have a huge number of scheduled tasks, that may be the result in certain circumstances. There are resource limits.

BTW, using the standard sync operations will pretty much always be better than using Task.Result, because of the overhead of running an async task synchronously. The best bet would probably be to partition your tasks into discrete sets and use await Task.WhenAll(tasksToBeRun) on each set.
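A rough sketch of that partitioning, under stated assumptions: BatchRunner, RunInBatchesAsync, and the operation delegate are hypothetical names, and the delegate stands in for a TouchAsync call that reports success as a bool. Each batch is started together and fully awaited before the next one begins.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: not part of the Couchbase SDK.
public static class BatchRunner
{
    // Processes `keys` in consecutive batches of `batchSize`.
    // Returns the number of operations that reported failure.
    public static async Task<int> RunInBatchesAsync(
        IReadOnlyList<string> keys,
        Func<string, Task<bool>> operation,
        int batchSize)
    {
        var failures = 0;

        for (var offset = 0; offset < keys.Count; offset += batchSize)
        {
            var count = Math.Min(batchSize, keys.Count - offset);
            var batch = new Task<bool>[count];

            // Start every operation in this batch without awaiting yet.
            for (var i = 0; i < count; i++)
            {
                batch[i] = operation(keys[offset + i]);
            }

            // Await the whole batch before moving on to the next one.
            bool[] results = await Task.WhenAll(batch);
            failures += results.Count(ok => !ok);
        }

        return failures;
    }
}
```

Compared with a semaphore-based throttle, this lets each batch drain completely before new work starts, which is simpler but leaves connections idle while the slowest operation in a batch finishes.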


#11

I managed to reproduce this second error with the following Console application:

static void Main(string[] args)
{
    var config = new ClientConfiguration()
    {
        Servers = new List<Uri>
        {
            new Uri("http://myserver:8091/pools")
        }
    };

    ClusterHelper.Initialize(config);

    var bucket = ClusterHelper.GetBucket("myBucket");

    var result = bucket.TouchAsync("ExistingKey", TimeSpan.FromDays(1)).Result;

    Console.WriteLine(result.Message);
}

The result is an error saying “The operation has timed out.”


#12

I tried in a couple of different ways, but still the same error. Also, in a Console app on the top level you’ll ultimately have to call Wait or Result, this is what I tried:

static void Main(string[] args)
{
    Task.Run(Repro).Wait();
}

static async Task Repro()
{
    var config = new ClientConfiguration()
    {
        Servers = new List<Uri>
        {
            new Uri("http://myServer:8091/pools")
        }
    };

    ClusterHelper.Initialize(config);

    var bucket = ClusterHelper.GetBucket("myBucket");

    var result = await bucket.TouchAsync("00017031-636c-4f2d-8018-18c91f079e28", TimeSpan.FromDays(1));

    Console.WriteLine(result.Message);
}

This also returns “The operation has timed out.”, and interestingly it returns almost immediately: I measured 94 ms.

Yep, definitely, I was planning to spin up multiple tasks in parallel, and wait for them in chunks, but I’m stuck with even running them one by one.


#13

@markvincze -

I’ll see if I can reproduce, it looks suspect!

  • Is this Core or .NET Full?
  • Windows or?

-Jeff


#14

@jmorris thanks a lot for helping. I’m sure it’ll turn out I’m making some trivial mistake somewhere, I just can’t see it :)
I’m running this now on .NET Full on Windows. (I also tried .NET Core on Windows before, and that produced the same problems, but I haven’t tried this small repro code there.)


#15

@markvincze -

I tried both awaiting the task and using Task.Result on .NET Core and Full Framework using VS2015 and all returned:

Can you provide an example project and/or upload your logs?

Thanks,

Jeff


#16

@jmorris, when I called TouchAsync with a key that doesn’t exist, I got the same output. The problem only happened when I passed in a key that actually exists. Can you try with the key of an existing document?
If it still doesn’t happen, I’ll try to isolate it even more and upload the example project to Github.


#17

@markvincze -

I tried using an existing key and I couldn’t replicate. I think enabling logging and attaching the logs might help us pinpoint the issue.

-Jeff


#18

@jmorris,
It seems that it’s related to the connection to my Couchbase cluster.
I tried to reproduce it with a CB cluster installed on my machine, but I couldn’t; the code works that way for me too. The problem only happens if I try to access my actual cluster hosted in Google Cloud. What I don’t understand is this: even if there is something wrong with the connection to the Google Cloud machine, why is it only happening with TouchAsync() and not Touch()?
I uploaded the logs to this gist: https://gist.github.com/markvincze/f283fa03813335f929822ad7cce1e074, I don’t see anything obvious there, only that the connection was disconnected.
Based on this, can you guess what could be going wrong? Or if you can give me some pointers about which part to debug, I can step through the code of the SDK to see what happens.


#19

@markvincze

A couple of questions that might help @jmorris when he gets online in a bit:

  1. What versions of Couchbase Server are you using on your machine versus in the cloud? Any difference there?
  2. What OS is Couchbase Server running on for your machine versus the cloud?
  3. What OS is the client running on for your machine versus the cloud?

Just trying to narrow down potential differences.

Thanks,
Brant


#20

They are different:

  • locally: Windows, and Couchbase 4.6.1-3652 Enterprise Edition (build-3652)
  • on GCloud: Linux (Debian), and 4.5.0-2601 Community Edition (build-2601)