Some XDCR routes stop working when the cluster is under load

I routinely see XDCR routes getting “stuck”: there are pending mutations, but the replication rate sits at 0. Once in this state, the problem persists even after all load stops. As far as I can tell, the only fix is to delete the problematic XDCR routes and re-create them.
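
For reference, here is roughly how we script the delete/re-create workaround. This is only a sketch against the XDCR REST endpoints as I understand them (controller/cancelXDCR and controller/createReplication); the host, credentials, and route parameters are placeholders for our setup:

    import requests  # third-party; pip install requests

    ADMIN = "http://cb-node1:8091"        # placeholder: any node in the source cluster
    AUTH = ("Administrator", "password")  # placeholder credentials

    def recreate_route(replication_id, from_bucket, to_cluster, to_bucket):
        """Delete a stuck XDCR replication and re-create it."""
        # The replication id has the form "<remote-uuid>/<fromBucket>/<toBucket>"
        # and must be URL-encoded because it contains slashes.
        requests.delete(
            ADMIN + "/controller/cancelXDCR/" +
            requests.utils.quote(replication_id, safe=""),
            auth=AUTH,
        ).raise_for_status()
        # Re-create the route with the same source and destination.
        requests.post(
            ADMIN + "/controller/createReplication",
            data={"fromBucket": from_bucket,
                  "toCluster": to_cluster,   # remote cluster reference name
                  "toBucket": to_bucket,
                  "replicationType": "continuous"},
            auth=AUTH,
        ).raise_for_status()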

Has anyone else seen something like this, or found a need to recreate XDCR routes frequently?

Here are additional details:

We’re using 3.0.1 Community Edition (build-1444) and XDCR version 2 (xmem). All servers are virtualized and have 8 vCPUs each. There is plenty of free disk space, and each bucket has plenty of free RAM.

I’ve noticed that this typically happens when the system is under load (put/delete operations).

There are three buckets in the system. The problem usually occurs when one bucket is busy (~2k ops/second) and the others are more or less idle (a few ops/minute). The busy bucket never seems to get stuck, but the idle buckets do every few minutes.

Here’s a screen cap showing what I refer to as a “stuck” XDCR route. Note that this typically happens from only one of the 3 nodes in the cluster, though it’s not always the same node.
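
The same condition is visible outside the UI. Here is a rough check I have been using; it assumes the aggregate replication_changes_left counter exposed by the bucket stats REST endpoint, which, as far as I can tell, is what the UI graph is drawn from:

    import requests  # third-party; pip install requests

    def xdcr_changes_left(host, bucket, auth=("Administrator", "password")):
        # Fetch the last minute of bucket stats and return the sampled
        # outbound-XDCR backlog ("replication_changes_left").
        url = "http://%s:8091/pools/default/buckets/%s/stats" % (host, bucket)
        stats = requests.get(url, auth=auth).json()
        return stats["op"]["samples"]["replication_changes_left"]

    # A route looks "stuck" when this stays positive and flat while the
    # replication rate in the UI sits at 0.
    print(xdcr_changes_left("cb-node1", "mybucket")[-5:])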

  • Under light load (~500 ops/second) we don’t see this problem.
  • We have an interesting topology: two main clusters of 3 nodes each,
    connected with bi-directional XDCR. Each main cluster then fans out
    with XDCR to several single-node “edge” instances on 3 buckets, and
    one of those buckets now also has bi-directional XDCR back to the
    main cluster. However, we believe we saw these stuck routes even
    before setting up the bi-directional XDCR on that bucket.
  • When we write key/value pairs to CS, we use persistTo=0,
    replicateTo=1 (there’s a sketch of this write after the list). We
    wouldn’t expect this to affect XDCR, but it does mean our ingest
    rates are lower than if we did not specify any durability
    constraints.
  • In our experience, the “high load” can take many different forms. In
    our most recent tests, where we have the best data, we have been
    writing values into the “edge” instances and relying on XDCR to
    replicate them to the main clusters and out again. In previous
    tests, we have done batch loading directly into the main clusters,
    or written custom code to simulate load. No matter how we generate
    the load, we still end up with stuck routes.
  • Under this “high load” we see CPU utilization over 80% according to
    the CS “minute” stats UI, but even under “light load” it can be
    around 70%.
  • The docs are simple key/value pairs and are all very small (under 1 KB).
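
To make the write path concrete (see the durability bullet above), here is roughly what one of our writes looks like, illustrated with the Python SDK (python-couchbase 2.x); our real ingest code is equivalent. The connection string, bucket name, and document are placeholders:

    from couchbase.bucket import Bucket  # python-couchbase 2.x client

    # Placeholder connection string for one of our "edge" nodes.
    bucket = Bucket("couchbase://edge-node1/mybucket")

    # persist_to=0: don't wait for anything to hit disk.
    # replicate_to=1: block until the mutation reaches the memory of one
    # intra-cluster replica. This throttles our ingest rate, but as far
    # as we understand it should be independent of XDCR, which queues
    # mutations asynchronously.
    bucket.upsert("user::1001", {"name": "example", "visits": 1},
                  persist_to=0, replicate_to=1)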

Any help would be greatly appreciated. Thanks in advance!

Bump. Anyone have thoughts on this?

Hey Craig,

I’m looking through bug reports to see if I can find anything similar.

Is there anything else you can tell us about your environment?

I’m guessing, from what you’ve described, that a rebalance isn’t happening just before the replication stops.

Could you share your logs with us?

I’m looking through bug reports to see if I can find anything similar.

[Craig] Thank you

Is there anything else you can tell us about your environment?

[Craig] I tried to be as complete as possible. If there’s something in particular you’d like let me know and I’ll see what I can get you.

I’m guessing, from what you’ve described, that a rebalance isn’t happening just before the replication stops.

[Craig] This occurs when there are no rebalances in progress.

Could you share your logs with us?

[Craig] Which logs would you like?

We have done some additional testing and re-configuration to try to improve system performance and see whether that would eliminate the stuck-route issue. In our original configuration we had 3-node clusters with each node allocated 8 vCPUs in a virtual machine. We rarely saw stuck routes when the load average was below 7, but as it climbed above that (over 8 and into the teens), we saw more and more stuck routes.

We rebuilt the CS clusters so that each has 3 nodes running on bare metal with 12 physical cores (24 with hyper-threading). We see improved data ingest performance and improved XDCR performance, with load averages as high as 23 - and we are still seeing stuck routes, though not nearly as often as before. It still seems to be the case that the harder we drive the system, the more likely a stuck route becomes.

One more thing… The routes that seem to get stuck most often are the ones under the least load.