20k items stuck in Tap Queue

Hi folks,

First of all, thank you for this product; it looks really promising. We are currently evaluating Couchbase, primarily as a memcached replacement. Our setup looks like this:

php -> localhost moxi -> couchbase bucket (Total bucket size = 10240 MB (2048 MB x 5 nodes with replica count 1))
The servers have 16 GB RAM and are SSD-backed.
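
As a rough sanity check on the sizing above, the arithmetic can be sketched in Python. This assumes that 2048 MB is the per-node bucket quota and that each replica copy shares that quota with the active data; the function name is made up for illustration.

```python
def bucket_capacity_mb(per_node_quota_mb, nodes, replicas):
    """Return (total quota, approximate share left for active data) in MB."""
    total = per_node_quota_mb * nodes
    # Assumption: every replica copy needs as much RAM as the active copy,
    # so the quota is split evenly between active data and its replicas.
    active = total / (replicas + 1)
    return total, active


total, active = bucket_capacity_mb(2048, 5, 1)
print(total, active)  # 10240 5120.0
```

Under these assumptions, only about half of the 10240 MB would be available for active items.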

We were inserting at about 400 ops/s and had no problems for a few days. When we reached about 13 million items, we found out that we had forgotten to implement the delete function in our test setup and that a lot of keys had no expiration set.

To start over, we flushed the bucket through the web interface. This is where our problems began.
We started to see temp OOMs and back-offs, and the tap queue filled up with 20k items. The drain rate and the fill rate were nearly the same. See attached screenshot.

What also caught our eye was that node 4 had only 220k items, whereas every other node had around 1.39M.
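
A skew like this can be spotted automatically by comparing each node's item count against the median. The snippet below is only a sketch: the node names, the function name, and the 50% tolerance are assumptions for illustration, and the counts are the approximate figures quoted above.

```python
from statistics import median


def find_skewed_nodes(item_counts, tolerance=0.5):
    """Return nodes whose item count deviates from the median
    by more than the given fraction of the median."""
    mid = median(item_counts.values())
    return sorted(
        node for node, count in item_counts.items()
        if abs(count - mid) > tolerance * mid
    )


# Approximate per-node counts from the post; node names are hypothetical.
counts = {
    "node1": 1_390_000,
    "node2": 1_390_000,
    "node3": 1_390_000,
    "node4": 220_000,
    "node5": 1_390_000,
}
print(find_skewed_nodes(counts))  # ['node4']
```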

Somehow it looks like the replication messed something up, but I'm relatively new to Couchbase. Any hints or suggestions?

We were able to solve the problem by removing the suspicious node 4. After rebalancing, the items were removed from the tap queue, and the back-offs and temp OOMs are gone as well.

Could someone sketch a scenario of what happened here?

This seems to happen when inserting very big items (> 1 MB) at a high rate. The tap queue once again has 20k items stuck.

Seeing as you just want to delete all items in the bucket, have you tried just deleting and re-creating the bucket?

This will be much faster than flush, as flush actually needs to send a delete request for every document in the bucket.
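
If you want to script the delete and re-create, something along these lines might work against the REST API on the admin port. Treat it as a sketch under assumptions: the bucket name `default`, the host, and the helper name are placeholders, authentication is left out, and depending on your version the create call may need further parameters (for example bucket type or auth settings). The code only builds the two requests; nothing is sent until you pass them to `urllib.request.urlopen`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_recreate_requests(host, bucket, ram_quota_mb, replicas):
    """Build (but do not send) the DELETE and POST requests for a bucket."""
    base = f"http://{host}:8091/pools/default/buckets"
    # Dropping the bucket removes all of its items in one step.
    delete_req = Request(f"{base}/{bucket}", method="DELETE")
    # Assumption: ramQuotaMB is the per-node quota, as in the setup above.
    body = urlencode({
        "name": bucket,
        "ramQuotaMB": ram_quota_mb,
        "replicaNumber": replicas,
    }).encode()
    create_req = Request(base, data=body, method="POST")
    return delete_req, create_req


delete_req, create_req = build_recreate_requests("localhost", "default", 2048, 1)
print(delete_req.get_method(), delete_req.full_url)
print(create_req.get_method(), create_req.full_url)
```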

I can’t find it in the docs at the moment, but I think Flush is not really recommended with the latest versions.

Which version and which OS are you using?

In the meantime, since it looks like you can start from scratch, delete the bucket and recreate it. (This will not help to find the source of the issue, but at least you won't be stuck.)

Tug
@tgrall