Massive amounts of inter-zone traffic when enabling XDCR

polfer · May 4, 2016, 8:02pm

We are currently running a 3 node cluster using 4.0.0-4051 Community Edition on Amazon Web Services. Each node runs in a different availability zone, and the cluster has been relatively stable for some time.

Several days ago we began an effort to shift to a new instance type with more RAM. We started a new cluster with three additional instances and established a one-way XDCR replication between the existing cluster and the new one. Replication seems to have gone well, views were built on the target, etc.

The problem is that we later discovered that inter-zone network traffic for the original cluster (our replication source, NOT the replication target) had shot through the roof. By that I mean we racked up charges for over 10TB (!) of inter-zone traffic in a day and a half. Again, this occurred only on the original cluster. The new cluster has a fraction of the traffic. The moment we killed XDCR everything went back to normal on the source cluster.

This is not a huge database. The database is relatively stable with just over 1.5M documents and a total size less than 25GB. Also, we’re not turning the data over frequently, and material is primarily added in a gradual fashion (not a high write load on the primary cluster). That excessive traffic had continued at a pretty constant rate even though the replication looked to have caught up long ago.
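To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 1000 GB) and treats "over 10TB" as exactly 10 TB and "less than 25GB" as exactly 25 GB, so the results are only approximate.

```python
# Rough figures taken from the description above; decimal units assumed.
traffic_gb = 10_000    # "over 10TB" of inter-zone traffic, treated as 10 TB
window_hours = 36      # "a day and a half"
dataset_gb = 25        # "total size less than 25GB", treated as 25 GB

# Average sustained transfer rate over the whole window.
avg_mb_per_s = traffic_gb * 1000 / (window_hours * 3600)
avg_mbit_per_s = avg_mb_per_s * 8

# How many full copies of the dataset that traffic corresponds to.
dataset_multiples = traffic_gb / dataset_gb

print(f"average rate: {avg_mb_per_s:.0f} MB/s ({avg_mbit_per_s:.0f} Mbit/s)")
print(f"equivalent to copying the dataset {dataset_multiples:.0f} times")
```

Under those assumptions that is roughly 77 MB/s sustained between the source nodes, or about 400 full copies of the data set in 36 hours.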

Any ideas or guidance on troubleshooting or understanding this? I haven’t run into anything similar on the forums yet, and am about to start digging through the logs. Having said that, I can’t believe this is expected or normal behavior. Thanks in advance for any ideas.

cihangirb · May 4, 2016, 10:59pm

Let me make sure I understand this right - you are saying the source cluster is placed across AZs with zone awareness (server groups) and the XDCR replication to the new cluster caused the source cluster's intra-cluster data movement to shoot up?
Is that accurate?
thanks
-cihan

polfer · May 5, 2016, 12:43am

Thank you for helping. In this case there are no server groups configured (community edition).

There are three servers in the source cluster each in a distinct AZ. Two buckets are in use, and each bucket is configured for a single replica. Although we do not have server groups, since there are only three servers and a single replica, we should be seeing reasonable partitioning of the data.

Having said that, you are correct. Configuring XDCR replication to the new cluster caused the source cluster's intra-cluster traffic to shoot up. If the original cluster had nodes A B C, and the new cluster had D E F, configuring XDCR from ABC to DEF caused traffic between nodes A, B and C to shoot through the roof. D E F did not appear to have an unusual amount of traffic.

polfer · May 5, 2016, 6:47pm

For what it is worth, the traffic is definitely on port 11210 between nodes in the source cluster. Also, we’ve confirmed it is repeatable on that cluster. Turn XDCR back on and it spikes up again, turn it off and it drops back to almost nothing. Note that we have also reproduced the spike on a different test cluster.
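For anyone wanting to check the same thing, one possible way to attribute traffic to port 11210 is to sum bytes per node pair from VPC Flow Logs. This is only a sketch: it assumes flow logs are enabled and use the default space-separated format (version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status). The function name, IP addresses and sample records below are made up for illustration.

```python
from collections import defaultdict

DATA_PORT = 11210  # port on which the traffic was observed


def bytes_by_pair(lines, port=DATA_PORT):
    """Sum bytes per (srcaddr, dstaddr) for records touching `port`."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 14:
            continue  # not a default-format record
        src, dst, sport, dport, nbytes = (
            fields[3], fields[4], fields[5], fields[6], fields[9]
        )
        try:
            sport, dport, nbytes = int(sport), int(dport), int(nbytes)
        except ValueError:
            continue  # header line, or NODATA/SKIPDATA record with "-" fields
        if port in (sport, dport):
            totals[(src, dst)] += nbytes
    return dict(totals)


# Hypothetical sample records, not real log data.
SAMPLE = [
    "2 123456789012 eni-aaa 10.0.1.10 10.0.2.20 51234 11210 6 100 150000 1462400000 1462400060 ACCEPT OK",
    "2 123456789012 eni-aaa 10.0.2.20 10.0.1.10 11210 51234 6 90 50000 1462400000 1462400060 ACCEPT OK",
    "2 123456789012 eni-aaa 10.0.1.10 10.0.3.30 40000 8091 6 10 2000 1462400000 1462400060 ACCEPT OK",
    "2 123456789012 eni-aaa - - - - - - - 1462400000 1462400060 - NODATA",
]

totals = bytes_by_pair(SAMPLE)
for (src, dst), n in sorted(totals.items()):
    print(f"{src} -> {dst}: {n} bytes on port {DATA_PORT}")
```

Comparing the per-pair totals with XDCR on and off should show whether the extra bytes are flowing between the source nodes themselves rather than to the target cluster.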

nwood888 · May 8, 2016, 12:31am

I experienced this as well - High intra-cluster xdcr bandwidth usage

Tracked as fixed in this issue - https://issues.couchbase.com/browse/MB-17481 (4.1.1)

polfer · May 9, 2016, 2:34am

Very helpful! Thank you!