Growing DCP Replication Items Remaining

I have a cluster of 3 nodes using the Community Edition and I am seeing a lot of “DCP Replication Items Remaining”.

I don’t know why this is happening, do you have any ideas?
I can give more information.

Hello Martin_Mauchauffee

This is commonly caused by a disk or network issue between your Couchbase Server nodes. You can get more details about your DCP replication with the cbstats command on each node:

/opt/couchbase/bin/cbstats localhost:11210 dcp -b gamestate -u Administrator -p password

That should let you know whether it’s a specific node or vBucket that is having an issue, or whether it’s wider resource contention. Once you identify the source of the contention, have a look at your disk IO stats to see if there’s headroom available and that you aren’t hitting any other disk IO or permissions issues at the OS level.
You can also check /opt/couchbase/var/lib/couchbase/logs/memcached.log* for any errors.
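If the full cbstats dcp dump is too large to eyeball, a quick filter helps. This is only a sketch: the sample lines below stand in for real output of the cbstats command above, and the per-stream items_remaining counter names are assumed to follow the usual eq_dcpq naming, which may differ between versions:

```shell
# Sketch: keep only non-zero items_remaining counters and sort them,
# so the worst-lagging replication stream floats to the top.
# (The sample lines stand in for:
#  /opt/couchbase/bin/cbstats localhost:11210 dcp -b gamestate -u Administrator -p password)
sample='eq_dcpq:replication:ns_1@node1->ns_1@node2:gamestate:items_remaining: 0
eq_dcpq:replication:ns_1@node1->ns_1@node3:gamestate:items_remaining: 5241
eq_dcpq:replication:ns_1@node1->ns_1@node4:gamestate:items_remaining: 17'
printf '%s\n' "$sample" \
  | awk '/items_remaining/ && $NF + 0 > 0 {print $NF, $1}' \
  | sort -rn
```

Run on each node, this narrows thousands of lines down to the handful of streams that are actually behind.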

Ian McCloy (Principal Product Manager, Couchbase)

Thank you for your answer.

We tried cbstats; it returned 18,000+ lines and we failed to find anything interesting.
Anyway, we made some changes by reducing the number of replicas and disabling swap. It seemed better, but we still have problems.

We need to import 600GB from another (key/value) database into a cluster that is already in production. We do that by limiting the speed at which we import. This is more than 660M keys. Obviously, not all the keys will fit in RAM, and that’s OK for us as long as we have a copy on disk. The number of “active” keys is not that big in our case, maybe 1M.

On this screenshot you can see that at 1AM, Couchbase started to respond to our requests slower and slower, until throughput reached almost zero. The RAM was already full and the resident ratio was already low. We don’t have any explanation for that.
Do you?

It feels like your cluster is undersized. I can’t tell from your screenshot what the colours represent, but are you showing a resident ratio dropping to under 8%? That will put a lot of strain on your disk IO, and you’ll see a drop in performance when the cluster no longer has memory and has to rely on disk for the throughput. You mentioned an active working set of 1M of a total 660M keys, which is only 0.15%. We’d typically see working sets and resident ratios of 10% or higher in Couchbase Server use cases. Have a look at our blog post on sizing
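The working-set arithmetic above can be checked in one line (only the numbers from the posts, nothing Couchbase-specific):

```shell
# 1M active keys out of 660M total gives the working-set fraction quoted above.
awk 'BEGIN {
  active = 1e6; total = 660e6
  printf "working set: %.2f%%\n", 100 * active / total
}'
```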

Ian McCloy (Principal Product Manager, Couchbase)

I wanted to add: you’ll want to check your application logs for “TMPFAIL” type errors, see

These indicate to the client that the server is out of memory. You can also see these in the Couchbase Server UI as Tmp OOM errors.

You can slow down your ingestion rate so that it doesn’t exceed your disk IO, or add more memory, more disk IO, or more nodes for additional capacity.
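A minimal sketch of what "slow down your ingestion rate" can look like on the client side. This is a hypothetical import loop, not official tooling; the dummy key feed and the import stand-in are assumptions:

```shell
# Sketch of client-side throttling: sleep between fixed-size batches so
# the sustained write rate stays at or below RATE ops/sec.
RATE=1000   # target ops/sec
BATCH=100   # writes per batch
n=0
for key in $(seq 1 250); do          # 250 dummy keys stand in for the real feed
  : "import of key-$key goes here"   # stand-in for the real KV write
  n=$((n + 1))
  if [ $((n % BATCH)) -eq 0 ]; then
    # pause so each batch of BATCH writes takes at least BATCH/RATE seconds
    sleep "$(awk -v b="$BATCH" -v r="$RATE" 'BEGIN {printf "%.3f", b / r}')"
  fi
done
echo "imported $n keys"
```

The same shape works whatever the real write call is; the point is that the pause is derived from the target rate, so bursts never outrun what the cluster's disk IO can absorb.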

Ian McCloy (Principal Product Manager, Couchbase)

Thank you for the tip about OOM errors.

We arrived at the same conclusion: limit the import rate.
Because 99%+ of our data will remain on disk and less than 1% needs to be resident, we will try to minimise removing active users from the database. Our plan is to use Couchbase disk as cold storage until users show up (eventually).
We still have 70% of the data to import. We will monitor and add more nodes to the cluster if required.

Thank you for your time.

Hello Martin_Mauchauffee,
If you’re deployed in AWS and have mostly a cold storage requirement, it might be a good use-case for our S3 backed Analytics, see our blog post about this feature:

Ian McCloy (Principal Product Manager, Couchbase)


We are not using AWS; we rent our own servers.

I have some new information.

We managed to import something like 400M new keys into our cluster, at a limited rate of 1K ops. It worked well until yesterday at ~7:30PM, when I noticed a reduction in the ops of the import process, and at ~8:30PM I decided to stop the import to let Couchbase rest a bit.

The OOM count remained flat the whole time. The servers slowly regained some free RAM.

We have not found the root cause, but we noticed a couple of behaviors:

  1. In the memcached logs at ~7:30PM we started to see a lot of “Slow operation” and “DCP (Producer) […] BufferLog is no longer full” messages. We also had a lot of corresponding timeouts on our client side.

  2. Our nominal rate is ~150 ops, but the cluster was able to process fewer and fewer ops, until it reached almost zero.

  3. At the same time, all the “Active Items, Replica Items, Active Resident Ratio, Replica Resident Ratio” graphs were completely flat. We read that as: no update queries were processed at all (but I can’t prove it).

  4. And last, as soon as we restarted the one node that seemed to be blocked, everything worked again.
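For point 1, counting the “Slow operation” messages per hour can pin down exactly when the stall began. This is a sketch: the sample lines are invented stand-ins for the real /opt/couchbase/var/lib/couchbase/logs/memcached.log* entries, and the timestamp format is an assumption:

```shell
# Sketch: bucket "Slow operation" log lines by hour to see when the
# slowdown started (sample lines stand in for the real memcached.log).
log='2023-05-10T19:31:02.1 WARNING Slow operation: get took 2100 ms
2023-05-10T19:32:15.9 WARNING Slow operation: set took 3400 ms
2023-05-10T20:05:44.0 WARNING Slow operation: get took 5100 ms'
printf '%s\n' "$log" \
  | awk '/Slow operation/ {split($1, t, ":"); count[t[1]]++}
         END {for (h in count) print h, count[h]}' \
  | sort
```

If the per-hour counts ramp up just before ~7:30PM, that supports the disk-IO-saturation picture rather than a sudden node failure.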