At my company we run 3 Couchbase clusters of 3 nodes each, distributed across Amazon AWS regions.
We have 3 buckets. One of them (bucket1) has master->slave XDCR replication from clusterA to clusters B and C.
The other two buckets (bucket2 & bucket3) have master<->master XDCR between every region.
We noticed that at some point documents went missing on one slave (clusterB) for bucket1 and on another cluster (clusterA) for bucket3. Bucket2 is fine in every region, and each problematic bucket is affected in only one of the 3 clusters.
The number of missing documents is about 300 out of 13,000 for bucket1 and about 1 million out of 32 million for bucket3.
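To put those numbers in perspective, here is a minimal Python sketch of how the missing documents could be identified and the loss ratio computed. It assumes the document keys of a bucket have already been exported from each cluster into plain lists; the key names and the helpers `find_missing` and `missing_ratio` are hypothetical and only for illustration.

```python
def find_missing(source_keys, target_keys):
    """Return keys present on the source cluster but absent on the target."""
    return sorted(set(source_keys) - set(target_keys))


def missing_ratio(missing_count, total_count):
    """Share of missing documents, as a percentage of the total."""
    return 100.0 * missing_count / total_count


if __name__ == "__main__":
    # Hypothetical key dumps; real ones would be exported from each cluster.
    cluster_a_keys = ["user::1", "user::2", "user::3", "user::4"]
    cluster_b_keys = ["user::1", "user::3"]
    print(find_missing(cluster_a_keys, cluster_b_keys))

    # Approximate counts from this post.
    print(round(missing_ratio(300, 13_000), 1))            # bucket1
    print(round(missing_ratio(1_000_000, 32_000_000), 1))  # bucket3
```

With the approximate counts above, this comes to roughly 2.3% of bucket1 and 3.1% of bucket3.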
According to the CB web panel, every XDCR is running well.
We did not find any correlation between the time the problem started and our maintenance/changes, nor between the start times for the two affected buckets.
We’ve checked the log files but did not find anything specific in them (perhaps we are looking in the wrong place or for the wrong message?).
Stopping and starting XDCR did not help.
We found that the problem could be “solved” by creating a new bucket on the problematic cluster, using cbtransfer to copy the data from the old bucket to the newly created one, and setting up XDCR to it. But we want to find the real cause, since this hurt us quite a lot and we’d like to be sure that we will not have similar issues again.
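For anyone wanting to try the same workaround, the cbtransfer step could be scripted roughly as in the Python sketch below. This is not our exact command: the host, bucket names and credentials are placeholders, the helper `build_cbtransfer_cmd` is hypothetical, and the flags (`-b` for the source bucket, `-B` for the destination bucket, `-u`/`-p` for credentials) should be checked against `cbtransfer --help` for your Couchbase version. The sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_cbtransfer_cmd(host, source_bucket, dest_bucket, user, password):
    """Build a cbtransfer argv that copies one bucket into another on the same cluster."""
    url = f"http://{host}:8091"  # assumes the default admin port
    return [
        "cbtransfer", url, url,
        "-b", source_bucket,  # bucket to read from
        "-B", dest_bucket,    # newly created bucket to write to
        "-u", user,
        "-p", password,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmd = build_cbtransfer_cmd(
        "clusterB-node1.example.com", "bucket1", "bucket1_new",
        "Administrator", "password",
    )
    print(shlex.join(cmd))
```

The printed command can be reviewed first and then run by hand or passed to `subprocess.run(cmd, check=True)`.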
I would kindly ask you guys to point me in the right direction to find the real cause and fix it.