Errors during XDCR replication


#1

Hi all,

I have two clusters running Couchbase 2.5.1 on Amazon Linux. I set up an XDCR replication from cluster A to cluster B with the default settings. I have 2.5 million docs in my bucket. It works fine until about 1.7 million documents, then it just spits out error messages in the XDCR console (“cannot replicate Bucket…, check logs”).
On the cluster A machines, the logs show:

** {failed_write,{"in batch of 78 docs: flushed: 0, rejected (eexists): 0; remote memcached errors: enoent: 0, not-my-vb: 0, invalid: 0, tmp fail: 78, enomem: 0, others: 0",
                  "Error with their keys:78 keys with etmpfail errors(dump first 10 keys: [<<\"doc_xxxxxxxxxxxxxxxxxxx\">>,\n              

What does this error message mean, and what is its cause? What is “tmp fail”?

Then I updated some XDCR settings:
XDCR Batch Size (kB): 6144 instead of 2048 (default)
XDCR Batch Count: 1500 instead of 500 (default)

It is now working with no errors, and all documents have been replicated.
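In case it helps anyone else: the same two settings can also be changed outside the UI through the admin REST API. Below is a sketch only; the endpoint and parameter names (`POST /settings/replications`, `docBatchSizeKb`, `workerBatchSize`) are taken from the Couchbase 2.x docs, and the host and credentials are placeholders.

```python
# Hypothetical sketch: raising the global XDCR batch settings via the
# admin REST API instead of the UI. Host and credentials are placeholders.
import base64
import urllib.parse
import urllib.request


def build_settings_request(host, user, password,
                           doc_batch_size_kb=6144, worker_batch_size=1500):
    """Build a POST that raises the cluster-wide XDCR batch settings."""
    body = urllib.parse.urlencode({
        "docBatchSizeKb": doc_batch_size_kb,   # "XDCR Batch Size (kB)" in the UI
        "workerBatchSize": worker_batch_size,  # "XDCR Batch Count" in the UI
    }).encode()
    req = urllib.request.Request(
        f"http://{host}:8091/settings/replications", data=body, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req


def apply_settings(host, user, password):
    """Send the request to the source cluster's admin port (8091)."""
    with urllib.request.urlopen(build_settings_request(host, user, password)) as r:
        return r.status
```

Changing the settings per replication (rather than globally) is also possible in newer versions, but the global form above matches what the UI sliders do.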

Why do I get those XDCR replication errors with the default configuration?

Cheers


#2

Hi, tmpfail errors mean that the destination cluster is not able to eject items fast enough to make room for new mutations (see http://docs.couchbase.com/admin/admin/Concepts/concept-workingset-mgmt.html). XDCR retries several times without spewing errors in such a situation, but after a fixed number of attempts the errors are shown to the user. Nevertheless, if you gave it enough time, XDCR would eventually retry and be able to replicate the rest of the data.
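To illustrate the behaviour described above: a temporary failure ("tmp fail") is not fatal, it just means "try again later", so the writer retries the same mutation with a backoff until the destination has room again. This is a generic sketch with made-up names, not XDCR's actual internals.

```python
# Illustrative only: retrying a write that hits memcached-style temporary
# failures (ETMPFAIL), backing off between attempts instead of failing hard.
import time


class TemporaryFailure(Exception):
    """Stand-in for a memcached ETMPFAIL response from the destination."""


def store_with_retry(store, key, value, max_attempts=5, base_delay=0.01):
    """Retry a write on temporary failures, with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return store(key, value)
        except TemporaryFailure:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error, as the XDCR console does
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

For example, a write that tmpfails twice while the destination is ejecting items still succeeds on the third attempt; only when every attempt fails does the error reach the user.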
Thanks!

Anil Kumar


#3

Hi Anil,
Thanks a lot for your reply and your explanation. However, I let it run for a long time (probably more than 5 hours, whereas an entire replication should take only a few minutes) and it never managed to replicate more documents; it was still retrying. How long do you think I should have waited?

But the most important thing I would like to understand is: why did my changes to the XDCR configuration allow it to replicate without errors?

Thanks


#4

Hi, to answer that question we would have to look into many things: the topology of both the source and remote clusters (do they have the same number of nodes?), whether the replication is uni- or bi-directional, etc. It would be great if you could open a JIRA issue and attach cbcollect_info output from both clusters. That way we can investigate what's going on and help answer your question.

Thanks!

Anil Kumar


#5

Hi,
Thanks for your reply.
It was a uni-directional replication from 3 nodes to 3 nodes.
This is the cbcollect_info for the main cluster (let’s call it cluster A): https://www.dropbox.com/s/snjudb3ktd93du2/clusterA.zip?dl=0
And the second one:
https://www.dropbox.com/s/ff304mfue6jea2g/clusterB.zip?dl=0

OK, I cannot reproduce the issue anymore. I would still like to understand why the default configuration failed; any clue?