[4.1.X issue?] No purge for 128MB bucket + full ejection, even for manual compaction

Hello!
CB Server 4.1.X, a couple of questions:

I do know about

The Metadata Purge Interval setting runs as part of auto-compaction.

from “Couchbase SDKs”, but I observe different behavior with “metadata purge interval” = 0.04: even 2–7 hours after expiration, metadata remains until manual compaction is called (and a “sign of non-purged metadata” is not the bucket data size).
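(For reference, 0.04 days × 24 × 60 ≈ 57.6 minutes, i.e. just under an hour; that is where the “1+ hour” wait used throughout this thread comes from.)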

So, my questions are:

  1. Is there a way to verify that a metadata purge was executed (for example, log entries at a corresponding log level)?
  2. Does metadata purge always happen during auto-compaction, or are there “hidden special conditions” needed to run metadata purge during compaction?

AFAIK metadata purge happens during all types of compaction (manual or auto), but note that it only purges tombstones which are at least as old as the purge interval.

In other words, if purge interval is set to 3 days and you delete an item, that item’s tombstone will not be removed until 3 days after it was deleted, irrespective of how often you run compaction.
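For example: with a 3-day purge interval, if you delete an item at 12:00 on Monday, a compaction run at 12:00 on Tuesday leaves its tombstone in place (it is only 1 day old); the first compaction run at or after 12:00 on Thursday is the one that can remove it.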

@drigby,
so, if I set “Metadata purge interval” = 0.04 before inserting any data, then 1+ hour after the last item deletion all items should be purged, correct?

Assuming compaction is run after 1 hour - Yes.

@drigby
that’s not true, at least for builds 6070 and 6074; I’ve checked.
Should I raise an MB, or have you got any ideas about additional checks?
BTW, setting “Metadata purge interval” = 0.04 was a workaround for MB-19758 :wink:

@drigby,
you know, let’s clarify: your statement

Assuming compaction is run after 1 hour - Yes.

is a little bit ambiguous.

Consider this scenario with “database fragmentation = 10%” and “metadata purge interval = 0.04” set for auto-compaction:

  1. documents (with expiration = 1 minute) are upserted at a rate of 100K per hour for 5 hours (500K upserts total)
  2. after 5 hours, all upserts stop.

Should there be any documents in the bucket 1+ hours after the upserts stopped?

So the algorithm is (as far as I know; note it’s in the cluster manager, which isn’t my area of expertise) roughly as follows, with a sketch of the check after the list:

  1. Whenever compaction is run, calculate the purge-before timestamp as current time - purge interval.
  2. For each deleted item (tombstone) found in the couchstore file, check if it was deleted before the purge-before timestamp. If true, remove the tombstone; if false, leave it.
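A minimal sketch of that check in Java (illustration only; the class and method names here are hypothetical, and the real logic lives in couchstore / the cluster manager, not in the SDK):

import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustration only: mimics the tombstone-purge decision described above.
final class PurgeSketch {

    static final class Tombstone {
        final String key;
        final long deletedAtMs; // when the item was deleted, ms since epoch
        Tombstone(final String key, final long deletedAtMs) {
            this.key = key;
            this.deletedAtMs = deletedAtMs;
        }
    }

    // purgeIntervalDays mirrors the "Metadata Purge Interval" setting (e.g. 0.04)
    static void purgeTombstones(final List<Tombstone> tombstones,
                                final double purgeIntervalDays) {
        // 1. purge-before timestamp = current time - purge interval
        final long purgeBeforeMs = System.currentTimeMillis()
                - (long) (purgeIntervalDays * TimeUnit.DAYS.toMillis(1));
        // 2. remove tombstones deleted before that timestamp; leave the rest
        for (final Iterator<Tombstone> it = tombstones.iterator(); it.hasNext();) {
            if (it.next().deletedAtMs < purgeBeforeMs) {
                it.remove();
            }
        }
    }
}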

How are you verifying that compaction isn’t removing your items? If you are making use of views and using development design docs, you might be hitting a known issue with development design docs.

If you are making use of views and using development design docs, you might be hitting a known issue with development design docs.

No, my design docs are production ones.

How are you verifying that compaction isn’t removing your items?

It’s funny, but that’s where MB-19758 comes into play. As I understand it, because COUCHSTORE_NO_OPTIONS is used instead of COUCHSTORE_NO_DELETES in the code referenced there, when I press “Documents” in the UI, I see log messages for items that were deleted but not purged:

memcached<0.75.0>: 2016-07-18T08:31:35.648089Z WARNING (bucket) Warning: failed to fetch data from database, vBucket=165 key=key_was_here error=document not found [none]

and an “error” message in the UI because of a “timeout for data retrieval”.

The repro algorithm is easy:

1. Make a cluster and a default bucket (with 10% DB fragmentation, MPI = 0.04)
2. Simply insert 100K items with a 1-minute TTL:

package expireddocserrors;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
import java.util.LinkedList;
import java.util.concurrent.Phaser;

public class ExpiredDocsErrors extends Thread {

    private final int tid;

    // configure here
    private static final int threads = 10;
    private static final int docsPerThread = 10000;
    private static final int docTTLms = 60 * 1000;
    private static final int dumpToConsoleEachNDocs = 1000;

    // main() is the extra party, hence threads + 1
    private static final Phaser phaser = new Phaser(threads + 1);
    private static final CouchbaseEnvironment ce;
    private static final Cluster cluster;
    private static final String bucket = "default";
    // switches between the two test clusters (build 6070 vs 6074)
    private static final boolean is6070 = false;

    static {
        ce = DefaultCouchbaseEnvironment.create();
        final LinkedList<String> nodes = new LinkedList<>();
        nodes.add(is6070 ? "A.node" : "X.node");
        nodes.add(is6070 ? "B.node" : "Y.node");
        nodes.add(is6070 ? "C.node" : "Z.node");
        cluster = CouchbaseCluster.create(ce, nodes);
        // warm up the connection once before the worker threads start
        final Bucket b = cluster.openBucket(bucket);
    }

    public ExpiredDocsErrors(final int tid) {
        this.tid = tid;
    }

    @Override
    public final void run() {
        try {
            Bucket b;
            synchronized (cluster) {
                b = cluster.openBucket(bucket);
            }
            final JsonObject jo = JsonObject.empty();
            for (int i = 0; i < docsPerThread; i++) {
                // expiry is an absolute Unix timestamp (seconds): now + TTL
                b.upsert(JsonDocument.create(
                        tid + ":" + System.currentTimeMillis() + ":" + System.currentTimeMillis(),
                        (int) ((System.currentTimeMillis() + docTTLms) / 1000),
                        jo));
                if (i % dumpToConsoleEachNDocs == 0) System.out.println("T[" + tid + "] = " + i);
            }
        } catch (final Exception e) {
            e.printStackTrace();
        } finally {
            phaser.arriveAndAwaitAdvance();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < threads; i++) new ExpiredDocsErrors(i).start();
        phaser.arriveAndAwaitAdvance(); // wait for all worker threads to finish
        System.out.println("DONE");
        cluster.disconnect(); // let the JVM exit cleanly
    }
}

3. Wait for 1+ hours
4. Click around the UI and catch MB-19758

Any ideas?

As your repro is described, no purge will happen, as there’s no guarantee compaction will happen at all. Assuming you don’t have a constant mutation load running (and don’t manually run compaction after 1 hr), I expect the following will happen:

  • At (2) - insert 100k items - auto-compaction will probably run (although that depends on data size / fragmentation settings). However, at this point nothing has been deleted yet, so nothing is even considered for purging, let alone actually purged.

  • At (3) - wait 1 hour - unless you actually have mutations happening (which increase fragmentation above the threshold), or you have timed compaction or manual compaction, no check for items to purge will run.

Try running a manual compaction after step 3 (i.e. after waiting 1 hour after the items were deleted), and see if that makes a difference. One way to trigger it outside the UI is shown below.
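For reference, manual compaction doesn’t have to go through the UI; if I recall the 4.x REST API correctly, something like this kicks off compaction of a bucket (adjust the host, credentials and bucket name to your cluster):

curl -u Administrator:password -X POST http://localhost:8091/pools/default/buckets/default/controller/compactBucket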

“Le ballet de la Merlaison”, part 1 (spoiler: there are fewer than 16 parts):

  1. ~11:00-11:10 GMT: I created a new cluster (3 nodes, all services, 4 vCPU / 4G RAM), set auto-compaction globally (DB-frag = 10% or 1MB) and created the default bucket (2400M per node, 1 replica, value ejection, view index replicas, flush = enabled), then populated it with 100K items in ~49 sec.
  2. ~14:10 GMT: via UI->Data Buckets I saw “0”, and that was expected (all items are deleted).
  3. ~14:10: I checked “were there any compactions with purging” by catching MB-19758 (i.e. pressing “Documents”) … and nothing happened! No entries in the logs, nothing at all. No docs, everything is purged. A little bit unexpected, but let’s suppose we fell under one of the two auto-compaction-run criteria (10% or 1MB DB-defrag size).

“Le ballet de la Merlaison”, part 2:

  1. ~14:30…14:35 GMT: I removed the default bucket and re-created it with the following params: 128M per node, 1 replica, full ejection, view index replicas, flush = enabled. 100K items were populated too. Global auto-compaction settings have not been changed.
  2. ~16:15 GMT: via UI->Data Buckets I saw “0”, and again, that is expected.
  3. Clicked UI->Data Buckets->“Documents” … Action!:

root@xxx:~# tail -f /opt/couchbase/var/lib/couchbase/logs/babysitter.log -n 10
memcached<0.75.0>: 2016-07-18T16:16:48.841629Z WARNING (default) Warning: failed to fetch data from database, vBucket=300 key=9:1468852437391:1468852437391 error=document not found [none]
memcached<0.75.0>: 2016-07-18T16:16:48.841713Z WARNING (default) Warning: failed to fetch data from database, vBucket=306 key=9:1468852435554:1468852435554 error=document not found [none]
memcached<0.75.0>: 2016-07-18T16:16:48.841940Z WARNING (default) Warning: failed to fetch data from database, vBucket=259 key=9:1468852437438:1468852437438 error=document not found [none]
memcached<0.75.0>: 2016-07-18T16:16:48.842071Z WARNING (default) Warning: failed to fetch data from database, vBucket=299 key=9:1468852435565:1468852435565 error=document not found [none]
memcached<0.75.0>: 2016-07-18T16:16:48.842164Z WARNING (default) Warning: failed to fetch data from database, vBucket=44 key=9:1468852437488:1468852437488 error=document not found [none]
memcached<0.75.0>: 2016-07-18T16:16:48.842377Z WARNING (default) Warning: failed to fetch data from database, vBucket=26 key=9:1468852435599:1468852435599 error=document not found [none]
memcached<0.75.0>: 2016-07-18T16:16:48.842511Z WARNING (default) Warning: failed to fetch data from database, vBucket=13 key=9:1468852437519:1468852437519 error=document not found [none]
memcached<0.75.0>: 2016-07-18T16:16:48.842592Z WARNING (default) Warning: failed to fetch data from database, vBucket=10 key=9:1468852435684:1468852435684 error=document not found [none]

OK, that was babysitter.log after pressing “Documents” twice (kinda nervously clicking after the first one). Now let’s count the unique non-purged items across the memcached logs:

root@xxx:~# awk -F"(default)" '{ print $2; }' /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | grep "Warning: failed to fetch data from database" | sort | uniq | wc -l
33247

Oops, 33.25% of all items were not purged. (Well, really it should be divided by 2, because I pressed “Documents” twice => ~16%.)

But maybe there was something left over from the first run?

root@xxx:~# grep "Warning: failed to fetch data from database" /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | grep "2016-07-18T11:" | wc -l
0

root@xxx:~# grep "Warning: failed to fetch data from database" /opt/couchbase/var/lib/couchbase/logs/memcached.log.* | grep "2016-07-18T14:" | wc -l
0

No: all 33247 (/2) were not purged at the second attempt.

So, what do you think about that? Are there any reasonable explanations?

And, the final chord: at ~17:50 a manual compaction was initiated. And the UI “clicking” result in the log is (note the “jump” between 17:46, before compaction, and 17:51, after compaction):

[ns_server:warn,2016-07-18T17:46:08.154Z,babysitter_of_ns_1@127.0.0.1:<0.75.0>:ns_port_server:log:215]Dropped 265 log lines from memcached
[ns_server:info,2016-07-18T17:51:19.120Z,babysitter_of_ns_1@127.0.0.1:<0.75.0>:ns_port_server:log:210]memcached<0.75.0>: 2016-07-18T17:51:18.919014Z WARNING (default) Warning: failed to fetch data from database, vBucket=20 key=0:1468852325469:1468852325469 error=document not found [none]
memcached<0.75.0>: 2016-07-18T17:51:18.919238Z WARNING (default) Warning: failed to fetch data from database, vBucket=2 key=0:1468852327580:1468852327580 error=document not found [none]
memcached<0.75.0>: 2016-07-18T17:51:18.919506Z WARNING (default) Warning: failed to fetch data from database, vBucket=306 key=0:1468852330430:1468852330430 error=document not found [none]
memcached<0.75.0>: 2016-07-18T17:51:18.919506Z WARNING (default) Warning: failed to fetch data from database, vBucket=20 key=0:1468852325469:1468852325469 error=document not found [none]
memcached<0.75.0>: 2016-07-18T17:51:18.919762Z WARNING (default) Warning: failed to fetch data from database, vBucket=2 key=0:1468852327580:1468852327580 error=document not found [none]
memcached<0.75.0>: 2016-07-18T17:51:18.919837Z WARNING (default) Warning: failed to fetch data from database, vBucket=147 key=0:1468852332468:1468852332468 error=document not found [none]
memcached<0.75.0>: 2016-07-18T17:51:18.920052Z WARNING (default) Warning: failed to fetch data from database, vBucket=282 key=0:1468852336614:1468852336614 error=document not found [none]
memcached<0.75.0>: 2016-07-18T17:51:18.920154Z WARNING (default) Warning: failed to fetch data from database, vBucket=306 key=0:1468852330430:1468852330430 error=document not found [none]

OK, let’s click “Documents” once again (~18:00):

memcached<0.75.0>: 2016-07-18T18:00:23.653114Z WARNING (default) Warning: failed to fetch data from database, vBucket=60 key=9:1468852415096:1468852415096 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.653307Z WARNING (default) Warning: failed to fetch data from database, vBucket=283 key=9:1468852417214:1468852417214 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.653495Z WARNING (default) Warning: failed to fetch data from database, vBucket=165 key=9:1468852419549:1468852419549 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.653686Z WARNING (default) Warning: failed to fetch data from database, vBucket=278 key=9:1468852423098:1468852423098 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.653904Z WARNING (default) Warning: failed to fetch data from database, vBucket=15 key=9:1468852423427:1468852423427 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.655489Z WARNING (default) Warning: failed to fetch data from database, vBucket=89 key=9:1468852424529:1468852424529 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.655676Z WARNING (default) Warning: failed to fetch data from database, vBucket=1 key=9:1468852425478:1468852425478 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.655865Z WARNING (default) Warning: failed to fetch data from database, vBucket=99 key=9:1468852429052:1468852429052 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.656056Z WARNING (default) Warning: failed to fetch data from database, vBucket=36 key=9:1468852433623:1468852433623 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.656247Z WARNING (default) Warning: failed to fetch data from database, vBucket=40 key=9:1468852434487:1468852434487 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.656436Z WARNING (default) Warning: failed to fetch data from database, vBucket=33 key=9:1468852436122:1468852436122 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.657966Z WARNING (default) Warning: failed to fetch data from database, vBucket=245 key=9:1468852436217:1468852436217 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.658170Z WARNING (default) Warning: failed to fetch data from database, vBucket=139 key=9:1468852439203:1468852439203 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.658366Z WARNING (default) Warning: failed to fetch data from database, vBucket=59 key=9:1468852443223:1468852443223 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.658565Z WARNING (default) Warning: failed to fetch data from database, vBucket=19 key=9:1468852443660:1468852443660 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.658760Z WARNING (default) Warning: failed to fetch data from database, vBucket=123 key=9:1468852445921:1468852445921 error=document not found [none]
memcached<0.75.0>: 2016-07-18T18:00:23.658957Z WARNING (default) Warning: failed to fetch data from database, vBucket=152 key=9:1468852446269:1468852446269 error=document not found [none]

So, not purged even after manual compaction. Should I raise an MB?

@drigby,
so, with 128M/full ejection there is a problem with purge, even for manually-invoked compaction.
Should I run any additional tests (2400M/full or 128M/value)?
Would you like to take a look at the logs?
MB?
Or is there a reasonable explanation?

UPDATE: docs are still not purged, even after ~12 hours:

memcached<0.75.0>: 2016-07-19T07:16:42.061187Z WARNING (default) Warning: failed to fetch data from database, vBucket=99 key=9:1468852429052:1468852429052 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.061380Z WARNING (default) Warning: failed to fetch data from database, vBucket=36 key=9:1468852433623:1468852433623 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.061569Z WARNING (default) Warning: failed to fetch data from database, vBucket=40 key=9:1468852434487:1468852434487 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.061758Z WARNING (default) Warning: failed to fetch data from database, vBucket=33 key=9:1468852436122:1468852436122 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.063314Z WARNING (default) Warning: failed to fetch data from database, vBucket=245 key=9:1468852436217:1468852436217 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.063506Z WARNING (default) Warning: failed to fetch data from database, vBucket=139 key=9:1468852439203:1468852439203 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.063695Z WARNING (default) Warning: failed to fetch data from database, vBucket=59 key=9:1468852443223:1468852443223 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.063915Z WARNING (default) Warning: failed to fetch data from database, vBucket=19 key=9:1468852443660:1468852443660 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.064102Z WARNING (default) Warning: failed to fetch data from database, vBucket=123 key=9:1468852445921:1468852445921 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:16:42.064290Z WARNING (default) Warning: failed to fetch data from database, vBucket=152 key=9:1468852446269:1468852446269 error=document not found [none]

UPDATE 2: I’ve just run manual compaction once again and pressed “Documents”:

memcached<0.75.0>: 2016-07-19T07:26:23.388158Z WARNING (default) Warning: failed to fetch data from database, vBucket=137 key=1:1468852339161:1468852339161 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:26:23.388347Z WARNING (default) Warning: failed to fetch data from database, vBucket=38 key=1:1468852342861:1468852342861 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:26:23.388540Z WARNING (default) Warning: failed to fetch data from database, vBucket=222 key=1:1468852343991:1468852343991 error=document not found [none]
memcached<0.75.0>: 2016-07-19T07:26:23.390160Z WARNING (default) Warning: failed to fetch data

There is definitely an issue.

Hi egrep,

Thanks for investigating. The behaviour you’re seeing with full eviction does appear unexpected. I had a brief look but couldn’t find an existing issue. Could you raise an MB with the contents of your comment here?

Thanks,

Dave

https://issues.couchbase.com/browse/MB-20197

And I would appreciate it if this became a happy end to the uneasy fate of MB-19758 and our little clash about it :wink:

For now it looks like the purge process during compaction is unpredictable (or, at least, “too sophisticated to be explained easily”) and does not have a “clear workflow”.
Maybe someone would like to join the investigation of https://issues.couchbase.com/browse/MB-20197 ?