XDCR high CPU load

goxdcr CPU usage on both clusters is permanently around 30-90%.

We have a very simple installation: 4.0.0 Community Edition (upgraded from 3.0.1).
Cluster A - 1 node, 3 buckets
Cluster B - 1 node, 3 buckets
Bidirectional XDCR: Cluster A [3 buckets] <=> Cluster B [3 buckets]
CentOS 6 (64-bit), no virtualization
The workload is very low:

  • the total item count is less than 3k
  • there are fewer than 0.1k set/delete operations per day

There is a problem with one bucket:
in the web console on both Cluster A and Cluster B, every minute we see
Number of mutations to be replicated to other clusters (measured from replication_changes_left): ~25k
and
Incoming XDCR total ops/sec: ~0.4k

The goxdcr.log file on Cluster A has errors:

"errMsg":"dcp_a753ef60019bfd8ce8c4555a8003656b/freeswitchconf/freeswitchconf_10.2.1.201:11210_0:Dcp is stuck for dcp nozzle dcp_a753ef60019bfd8ce8c4555a8003656b/freeswitchconf/freeswitchconf_10.2.1.201:11210_0"

The goxdcr.log file on Cluster B has errors:

"errMsg":"CheckpointMgr:Target bucket's topology has changed"

and

"errMsg":"dcp_6ddca8d8e0311b2cb2b634cc8e5e57a6/freeswitchconf/freeswitchconf_10.2.1.202:11210_0:Dcp is stuck for dcp nozzle dcp_6ddca8d8e0311b2cb2b634cc8e5e57a6/freeswitchconf/freeswitchconf_10.2.1.202:11210_0"

The network link is good (~1 Gb).

Keys on the clusters are synchronized, but the CPU load and the errors make no sense for such a small workload.

This seems to be related to a known issue.
After reducing the nozzles to 2, CPU usage dropped as well (to 5-15%).
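
For anyone who wants to script the nozzle change instead of using the web console, here is a minimal Python sketch. It assumes the cluster-wide XDCR settings endpoint `/settings/replications` on the admin port 8091 and the `sourceNozzlePerNode` / `targetNozzlePerNode` parameters from the Couchbase 4.x REST documentation; please verify both against your version. The helper name `build_nozzle_request`, the host, and the credentials are placeholders of mine. The request is only built here, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_nozzle_request(host, user, password, nozzles=2):
    """Build (but do not send) a POST that sets XDCR nozzles per node.

    Assumption: endpoint and parameter names follow the Couchbase 4.x
    REST docs for advanced XDCR settings; check them for your version.
    """
    body = urllib.parse.urlencode({
        "sourceNozzlePerNode": nozzles,
        "targetNozzlePerNode": nozzles,
    }).encode("ascii")
    req = urllib.request.Request(
        f"http://{host}:8091/settings/replications",
        data=body,
        method="POST",
    )
    # HTTP basic auth header built by hand to keep the sketch stdlib-only.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    return req


if __name__ == "__main__":
    # Placeholder host and credentials; replace before use.
    req = build_nozzle_request("cluster-a.example", "Administrator", "password")
    print(req.get_method(), req.full_url, req.data.decode())
    # To actually apply the setting: urllib.request.urlopen(req)
```

The same request would have to be sent to each cluster, since the replication runs in both directions.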

But this error still repeats in the goxdcr.log file on both clusters (about every 10 seconds):

"errMsg":"dcp_6ddca8d8e0311b2cb2b634cc8e5e57a6/freeswitchconf/freeswitchconf_10.2.1.202:11210_0:Dcp is stuck for dcp nozzle dcp_6ddca8d8e0311b2cb2b634cc8e5e57a6/freeswitchconf/freeswitchconf_10.2.1.202:11210_0"

Hi, the issue you mentioned, MB-16244, is already fixed in the 4.1.0 release. Can you test it and let us know if that fixes what you’re seeing?

Unfortunately, 4.1.0 is not available as a Community Edition, so there are many barriers to testing the Enterprise Edition and reporting bugs against it.

We compiled 4.1.1 from source, and this issue seems to be fixed, with one exception.

There was an error at 14:48:

GenericSupervisor 2016-06-20T14:48:18.440+03:00 [ERROR] Received error report : map[CheckpointMgr:Target bucket's topology has changed]

After that, StatisticsManager keeps printing this OLD message (compare the log line time with the error time inside the message) every few seconds, until the server is restarted:

StatisticsManager 2016-06-20T18:18:08.722+03:00 [INFO] Stats for pipeline a753ef60019bfd8ce8c4555a8003656b/current_calls/current_calls-608599951 {"CkptMgr": {"num_checkpoints": 5, "num_failedckpts": 0, "time_committing": {"count": 5, "max": 0, "mean": 0, "min": 0}}, "Errors": "[{"time":"2016-06-20T14:48:18.443761112+03:00","errMsg":"CheckpointMgr:Target bucket's topology has changed"}]",
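
One way to spot this kind of stale entry automatically is to compare the timestamp at the start of each StatisticsManager line with the "time" value inside its "Errors" field. Below is a rough Python sketch, assuming the line layout shown above (component name, then timestamp, then message). The helper names `parse_ts` and `stale_errors` and the 5-minute threshold are my own choices, not part of goxdcr.

```python
import re
from datetime import datetime, timedelta

# Timestamp shape as it appears in the log lines quoted above.
TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?[+-]\d{2}:\d{2}"
# Assumed layout: "<Component> <timestamp> [LEVEL] message".
LINE_TS = re.compile(r"^\S+\s+(" + TS + r")")
# The word "time", a few quote/colon/backslash characters, then a timestamp.
EMBEDDED_TS = re.compile(r"time\W{1,6}(" + TS + r")")
TS_PARTS = re.compile(r"(.{19})(?:\.(\d+))?([+-]\d{2}:\d{2})")


def parse_ts(text):
    """Parse a timestamp, truncating the fraction to microseconds."""
    base, frac, tz = TS_PARTS.fullmatch(text).groups()
    frac = (frac or "")[:6].ljust(6, "0")
    return datetime.fromisoformat(f"{base}.{frac}{tz}")


def stale_errors(line, max_age=timedelta(minutes=5)):
    """Return (error_time, age) pairs for embedded errors older than max_age."""
    head = LINE_TS.match(line)
    if head is None:
        return []
    logged_at = parse_ts(head.group(1))
    stale = []
    for match in EMBEDDED_TS.finditer(line):
        error_time = parse_ts(match.group(1))
        age = logged_at - error_time
        if age > max_age:
            stale.append((error_time, age))
    return stale
```

For the StatisticsManager line above, the embedded error is roughly three and a half hours older than the log line itself, so it would be reported as stale.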