Write failures during live cbbackup

We are testing our application with Couchbase, and we currently have a five-node, ~200 GB (total data) cluster on Amazon AWS. We ran a backup using cbbackup (which failed at 97% - a separate issue), but more importantly we saw some critical write failures during the ~2 hour backup.
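
For reference, this was a plain cbbackup run against one node's REST port, along the lines of the example below (the hostname, credentials and backup path are placeholders, not our real values):

    # run from the remote backup machine against a cluster node's REST port (8091)
    # (hostname, credentials and output directory are placeholders)
    cbbackup http://cb-node1.example.com:8091 /backups/cluster-full \
        -u Administrator -p password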

The documentation for cbbackup says you can back up a live cluster, but obviously write failures during a backup are not ideal. The machine we ran the backup from is not in the cluster; it is a remote machine that connected to the cluster over HTTP, for performance reasons. We did notice that the TAP queue graph spiked dramatically during this period, which makes sense, since backups, like replication, apparently rely on TAP (per the Couchbase backup blog post).
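
In case it helps anyone compare, we also sampled TAP activity from the command line rather than just the UI graph, with something along these lines (the hostname is a placeholder, and exact stat names vary by Couchbase version):

    # sample TAP queue/backfill stats on a data node every 10 seconds
    # (hostname is a placeholder; add bucket credentials if the bucket is SASL-protected)
    while true; do
        date
        cbstats cb-node1.example.com:11210 tap | grep -iE 'backfill|qlen|queue'
        sleep 10
    done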

I'm digging through the logs for more information on the failures, but in the meantime: has anyone else had similar issues backing up what I assume is a fairly large Couchbase cluster?

Hi,

Write failures typically point to an issue with the underlying I/O subsystem. I would check /var/log/messages around the time the failures happened. Are these instances running in a virtual environment? We have seen cases where VMware marked disks as read-only because of I/O controller saturation caused by the backup script.
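
For example, something along these lines will usually surface it (the log path assumes a RHEL/CentOS-style syslog; adjust for your distribution):

    # look for I/O errors and read-only remounts around the backup window
    grep -iE 'i/o error|read-only|aborting journal|ext[34]-fs error' /var/log/messages

    # check whether any filesystem is currently mounted read-only
    awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $4}' /proc/mounts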

Also, could you confirm that the filesystem you're backing up to isn't NFS? SQLite files are known to have issues with some NFS implementations - see the SQLite NFS FAQ.
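
A quick way to check the filesystem type of the backup target (the path is a placeholder):

    # print the filesystem type backing the backup directory (path is a placeholder)
    df -T /backups/cluster-full

    # or equivalently with GNU stat
    stat -f -c '%T' /backups/cluster-full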


Abhishek