How do I resolve failed conflict resolutions? (with Revision-based strategy)


#1

My setup includes 5 Couchbase clusters connected by a ring of unidirectional XDCR. Clients write to those clusters simultaneously by design, and documents have a limited TTL that the client app extends if a document shows regular usage. My expectation was that records would eventually be either replicated or expired, but that seems not to be the case. Out of 4.4 billion documents in every cluster, I have a significant number of failed conflict resolutions.

The values of set_failed_cr_source, expiry_failed_cr_source, and docs_failed_cr_source are all about 700 million, which is quite a lot, and they have grown gradually over a long time (years). At the same time, deletion_failed_cr_source is not greater than 12 (twelve).

How is anyone supposed to resolve those accumulated conflicts? At the very least, how can I get the keys of the conflicted documents?

What do set_failed_cr_source, expiry_failed_cr_source, docs_failed_cr_source, and deletion_failed_cr_source mean exactly? I tried to follow the sources of goxdcr and haven’t come to a conclusion.


#2

The failed_cr_source stats are not an indication that some documents failed to get replicated. Unless you see different document counts in your clusters, you need not be worried.

For large documents, before XDCR replicates them to the target, it checks whether a document of the same or a higher revision already exists on the target. If so, replication is not necessary: XDCR will not replicate the document to the target and will increment the failed_cr_source stats instead.

If you have a ring topology like A->B->C->A, the failed_cr_source stats are guaranteed not to be zero. A document mutation originating from A will get replicated to B and then to C. When XDCR on C tries to replicate the mutation back to A, it detects that A already has the mutation and increments its failed_cr_source counter.
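
As a rough illustration of the ring case, here is a minimal Python sketch. It is a toy model, not goxdcr code: the Cluster class and the replicate function are made up for this example, and a single integer revision stands in for the real conflict-resolution metadata. It only shows why the counter on C goes up even though every cluster ends up with the same document.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a cluster (hypothetical, for illustration only)."""
    name: str
    docs: dict = field(default_factory=dict)  # key -> (revision, value)
    failed_cr_source: int = 0                 # counted on the source side

    def write(self, key, value):
        # A local client write bumps the document's revision by one.
        rev = self.docs.get(key, (0, None))[0] + 1
        self.docs[key] = (rev, value)


def replicate(source, target, key):
    """Send one document from source to target unless target is up to date.

    Simplifying assumption: only an integer revision is compared.
    Returns True if the document was actually sent.
    """
    src_rev, value = source.docs[key]
    tgt_rev = target.docs.get(key, (0, None))[0]
    if tgt_rev >= src_rev:
        # Target already has the same or a higher revision: nothing to send,
        # and the source-side counter is incremented.
        source.failed_cr_source += 1
        return False
    target.docs[key] = (src_rev, value)
    return True


# Ring A -> B -> C -> A with a single mutation originating on A.
a, b, c = Cluster("A"), Cluster("B"), Cluster("C")
a.write("k", "v1")
sent_ab = replicate(a, b, "k")  # B does not have the document yet
sent_bc = replicate(b, c, "k")  # C does not have the document yet
sent_ca = replicate(c, a, "k")  # A already has this revision

for cluster in (a, b, c):
    print(cluster.name, cluster.docs["k"], cluster.failed_cr_source)
```

In this model nothing is lost: all three clusters hold the same revision of the document, and the only side effect of closing the ring is the non-zero counter on C.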


#3

Thank you, that sounds reassuring, since the item count is almost the same in all clusters.

However, we started paying attention to these metrics because of some clearly unresolved conflicts, i.e., different clusters had different document content for the same key. We could not estimate the scale of the conflicts because we had to take emergency measures: we recreated the ring from scratch, after which the clearly visible decline in item count gave way to rapid growth back to the level of 10 days earlier. But we know from the clients of our application that those conflicts were massive, i.e., tens or hundreds of thousands of documents, if not more.

Can we somehow estimate the current count of actual unresolved conflicts?