XDCR and persistence blocked each other


#1

We have a very heavy write workload once a day, which lasts 1-2 hours with no further writes after that. We found that XDCR didn't finish; in fact, persistence to disk could not complete until I deleted the XDCR replication (about 6 hours later). When XDCR was deleted, the disk write queue quickly grew from 800k to 20M and then drained to 0; beam.smp memory usage dropped from 4.5 GB to 1.6 GB; and after a while docs fragmentation decreased from 43% to 41%.

So my guess is that XDCR and persistence compete with each other, and this blocks persistence, fragment compaction, XDCR, and even incoming writes (the first three contend for disk, the last for buffer space?).

An easy workaround seems to be a scheduled XDCR, started by a timer. Alternatively, Couchbase could detect this contention itself and pause XDCR until the write burst completes.

Are there any other solutions I can try? Thanks.


#2

XDCR replicates after the document has been written to disk, so XDCR itself shouldn't affect how quickly a write goes to disk; however, it will of course generate an equivalent write across the network to the other data centre. XDCR is continuous once enabled, so I don't understand your question about a timer.

It sounds like your cluster may not be sized sufficiently to handle the load you are putting on it. If you haven't already, it might be worth reviewing the sizing guidelines: http://docs.couchbase.com/couchbase-manual-2.2/#sizing-guidelines


#3

The problem is that we are using Hadoop to write to Couchbase, which means disk I/O is never enough. When XDCR is not enabled, it takes the cluster about 30 minutes to drain the disk write queue after the Hadoop job completes. When XDCR is enabled, the queue still hasn't drained after 5 hours. So is it possible to tell XDCR to sleep for a while when the disk write queue is too large? I was thinking a timer could be used, so that XDCR pauses while the timer is running.
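Something like this could be scripted against the admin REST API rather than waiting for it as a built-in feature. A minimal sketch of the idea, with several assumptions to verify against your Couchbase version: the bucket stats endpoint (`/pools/default/buckets/<bucket>/stats`) and its `disk_write_queue` sample key, and the `pauseRequested` replication setting (which only exists in Couchbase 2.5 and later; older versions would have to delete and recreate the replication instead):

```python
import base64
import json
import urllib.request

# Thresholds are illustrative, not recommendations: pause XDCR when the
# disk write queue climbs above HIGH, resume once it drains below LOW.
QUEUE_HIGH = 1_000_000
QUEUE_LOW = 100_000


def should_pause(queue_size, currently_paused):
    """Hysteresis: pause above HIGH, resume below LOW, else keep state."""
    if queue_size > QUEUE_HIGH:
        return True
    if queue_size < QUEUE_LOW:
        return False
    return currently_paused


def basic_auth(user, password):
    return base64.b64encode(f"{user}:{password}".encode()).decode()


def disk_write_queue(host, bucket, auth):
    # Assumption: bucket stats are exposed at this endpoint and include a
    # "disk_write_queue" series; the last sample is the most recent value.
    url = f"http://{host}:8091/pools/default/buckets/{bucket}/stats"
    req = urllib.request.Request(url, headers={"Authorization": "Basic " + auth})
    stats = json.load(urllib.request.urlopen(req))
    return stats["op"]["samples"]["disk_write_queue"][-1]


def set_xdcr_paused(host, replication_id, auth, paused):
    # Assumption: the pauseRequested setting (Couchbase >= 2.5). On 2.2,
    # the only option is deleting and later recreating the replication.
    data = f"pauseRequested={'true' if paused else 'false'}".encode()
    url = f"http://{host}:8091/settings/replications/{replication_id}"
    req = urllib.request.Request(
        url, data=data, headers={"Authorization": "Basic " + auth}
    )
    urllib.request.urlopen(req)
```

A cron job (or a loop with `time.sleep`) would poll `disk_write_queue`, feed it through `should_pause`, and call `set_xdcr_paused` only when the decision changes. The hysteresis gap between the two thresholds avoids flapping XDCR on and off while the queue hovers near a single cutoff.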


#4

The typical use case for XDCR is to maintain an exact, up-to-date replica of one cluster at another (normally remote) cluster. Hence it would be undesirable to pause or stop XDCR once it is set up.

Note that XDCR never "stops" in the general sense - it is a continuous stream from the source to the destination cluster. There is an initial burst of data to bring the clusters into sync, but after that any updates to the source are streamed to the destination.

As mentioned previously, it sounds like you need a larger cluster to provide the additional resources XDCR requires. There is a good blog post on sizing for XDCR at: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster