XDCR Inconsistency

k_reid · April 9, 2018, 5:09pm

Hello,

We have 5 nodes running in 3 different data centers with replication set to 3 in each. We use XDCR to keep the data centers consistent with each other. However, in one of the data centers we are seeing that a particular document exists, but it does not exist in the other 2. Under what circumstances will XDCR not be consistent? Are there any suggestions to fix this problem?

Version: 4.5.1-2844 Community Edition (build-2844)

Thanks,

K

matthew.groves · April 10, 2018, 2:26pm

Hi @k_reid,

I’m not sure if this is related to your problem, but the math isn’t adding up. If you have 3 data centers with replication set to 3 in each, that implies more than 5 total nodes.

How is your XDCR configured? Is it bidirectional between each combination of data centers?

unhuman · April 12, 2018, 8:39pm

I read it as 15 total nodes - 5 in each of 3 datacenters.

You do have to be careful with bi-directional XDCR and how you sync your clusters… Do you XDCR from every cluster to every other cluster? In a ring?
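To make the topology question concrete, here is a rough sketch that lists which directed replications are still missing for every cluster to replicate to every other cluster. The cluster names and the `missing_links` helper are made up for illustration; the configured pairs would be whatever you actually see in your XDCR settings.

```python
from itertools import permutations

def missing_links(clusters, configured):
    """Return the directed replications a full bidirectional mesh still needs."""
    required = set(permutations(clusters, 2))  # every ordered (source, target) pair
    return sorted(required - set(configured))

# Hypothetical cluster names; a one-directional ring dc-a -> dc-b -> dc-c -> dc-a.
clusters = ["dc-a", "dc-b", "dc-c"]
ring = [("dc-a", "dc-b"), ("dc-b", "dc-c"), ("dc-c", "dc-a")]

print(missing_links(clusters, ring))
# [('dc-a', 'dc-c'), ('dc-b', 'dc-a'), ('dc-c', 'dc-b')]
```

With three clusters, a full mesh needs six directed replications, so a one-way ring leaves three of them out.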

You do need to be careful with simultaneous updates (if you are updating the same document id in each data center) because Couchbase’s conflict resolution may not be something that you should rely on.
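As a simplified illustration of why: as far as I understand it, the default conflict resolution in 4.x is revision-based, so the copy with more mutations wins and CAS is only compared on a tie. The sketch below models just those two fields and leaves out the further tiebreakers; `DocMeta`, `xdcr_winner` and the numbers are invented for the example and are not a Couchbase API.

```python
from typing import NamedTuple

class DocMeta(NamedTuple):
    rev_seqno: int  # number of mutations applied to the document
    cas: int        # CAS value, used here only as a tiebreaker

def xdcr_winner(a: DocMeta, b: DocMeta) -> DocMeta:
    """Pick the surviving version: more mutations wins, then the higher CAS."""
    return a if (a.rev_seqno, a.cas) >= (b.rev_seqno, b.cas) else b

# Made-up values: dc-a has updated the document three times,
# dc-b has written it once, more recently.
in_dc_a = DocMeta(rev_seqno=3, cas=1000)
in_dc_b = DocMeta(rev_seqno=1, cas=2000)

print(xdcr_winner(in_dc_a, in_dc_b))
# DocMeta(rev_seqno=3, cas=1000)
```

So the most recent write is not necessarily the one that survives, which is easy to get surprised by if the same document id is written in more than one data center.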

k_reid · April 24, 2018, 3:15pm

Yes @unhuman @matthew.groves we use bidirectional replication. It has been recommended that we change our replication to 2 from 3 and we will do this. However, I fail to see how this would cause the issue we are seeing. Any other suggestions are welcome.

Thanks,

-K

k_reid · May 2, 2018, 1:56pm

@ingenthr @vsr1 @daschl Can any of you help with this?

ingenthr · May 3, 2018, 1:30am

Assuming the topology @unhuman suggested (three clusters of 5 nodes each), I can’t think of any reason a particular document would not replicate to the other clusters. I’d probably recommend verifying the replication is configured as expected. If so, then have a look at the XDCR logs to see if there is a clue.

One common problem, if it is across something like EC2 regions, may be blocked ports. That could cause this. Ports 8091 and 11210 for sure need to be open between the clusters.
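A quick way to test this is a plain TCP connect from a node in one cluster to the nodes of another. This is only a sketch: `port_open` and `check_ports` are helper names made up here, and a successful connect only shows the port is reachable from the machine running the script, not that XDCR itself is healthy.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or DNS failure
        return False

def check_ports(hosts, ports=(8091, 11210)):
    """Map each (host, port) pair to whether it is reachable from this machine."""
    return {(h, p): port_open(h, p) for h in hosts for p in ports}
```

For example, `check_ports(["node1.dc-b.example.com"])` (a placeholder hostname) run on a node in another data center returns a dict keyed by `(host, port)`; any `False` entry is worth chasing with whoever manages the firewall or security groups.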

Hope that helps!

ysui6888 · May 7, 2018, 6:28pm

@k_reid I am an engineer from the XDCR team. It would be helpful if you could attach the goxdcr.log files from all three clusters. It would also help to extract the metadata of the document involved from the cluster where it is present, e.g., through couch_dbdump.