XDCR Replication missing some items

We’re running Couchbase Server 5.1.1, and we’ve found that some items are not getting replicated to one of our other clusters. I would appreciate any assistance the community could provide on how to troubleshoot/resolve this issue.

Here’s my testing methodology:

  1. Upsert new items until the unique count of $document.vbucket_uuid is 1024
  2. Wait for XDCR replication (usually only 30 seconds)
  3. Get those same items from the remote cluster (bucket.get) and note the failures
  4. Generate a report consisting of source vbucket_uuid, number of items found, number of items missing
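
For anyone who wants to reproduce step 4, the report can be sketched roughly as below. This is not my exact script: it assumes you have already collected one (vbucket_uuid, found_on_remote) pair per test item, and the names vbucket_report and fully_missing are made up for illustration.

```python
from collections import defaultdict


def vbucket_report(results):
    """Tally found/missing items per source vBucket.

    `results` is an iterable of (vbucket_uuid, found_on_remote) pairs,
    one per test item. How the pairs are collected is left to the caller.
    """
    counts = defaultdict(lambda: {"found": 0, "missing": 0})
    for vbucket_uuid, found in results:
        counts[vbucket_uuid]["found" if found else "missing"] += 1
    return dict(counts)


def fully_missing(report):
    """Return the vBuckets for which no item at all reached the remote cluster."""
    return sorted(
        vb for vb, c in report.items()
        if c["found"] == 0 and c["missing"] > 0
    )
```

Calling fully_missing(vbucket_report(pairs)) gives the list of vBuckets that replicated nothing, which is the number quoted below.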

What we see is that 164 vBuckets have not replicated ANY of their items to the remote cluster. The other vBuckets have replicated ALL of their items to the remote cluster. I’ve tried waiting several days, and the results are the same.

We’ve tried several means to resolve this:

  • Pause/resume replication
  • Delete/recreate replication
  • Flush the remote bucket
  • Delete/recreate the remote bucket
  • Replicate from a different cluster (source->intermediate->remote)

To better explain that last one, we have 3 clusters, with items replicating from A->B and A->C. I stopped A->B and created a new replication from C->B, resulting in an A->C->B replication. A->C remained successful, but C->B had missing items.

In each of these tests, we also see that items created prior to the test are missing. For example, we had a “pre-flush” set of test items that were partially replicated upon re-enabling replication.

The problem turned out to be that goxdcr had gotten “stuck” a few weeks ago during a networking outage. Changing my Python script to report .tracing_output['r'] helped me narrow the search down to just the problematic nodes. In the Couchbase UI, we also noticed that those nodes showed an outbound XDCR “percent complete” of either less than 10% or 119%.

In goxdcr.log, I found the “Replication Status” multiline log entry that showed the affected replications having a status={Pending}, along with errors stating “no route to host” timestamped at the time of the networking outage. Restarting couchbase-server on the affected servers one-by-one resolved the problem.
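
If it helps anyone else check their nodes, here is a rough sketch of how the log search could be scripted. It only looks for the two strings mentioned above and assumes nothing else about the goxdcr.log layout; the function name find_stuck_replication_lines is hypothetical.

```python
def find_stuck_replication_lines(lines, markers=("status={Pending}", "no route to host")):
    """Return (line_number, matched_markers, text) for lines containing any marker.

    `lines` is any iterable of strings, e.g. an open file handle for goxdcr.log.
    The default markers are the two strings quoted in this post; this is a plain
    substring search and does not parse the log format.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        matched = [m for m in markers if m in line]
        if matched:
            hits.append((lineno, matched, line.rstrip("\n")))
    return hits
```

Open goxdcr.log from wherever your install keeps its logs and pass the file handle in; any hits dated around a known outage are worth a closer look.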

Is this a known issue that goxdcr can get stuck in this manner?
