Backup with S3 taking ages

Hi,

Just a little bit about our setup :slightly_smiling_face:
We are running a CB cluster on AWS EC2 Graviton2 (ARM arch), version 7.1.0 build 2556. There are 6 nodes in the cluster:

  • 3x indexer
  • 2x data+query
  • 1x data+query+backup

We have set up a backup using S3 as the archive dir and a staging directory on our backup node. Every Sunday morning we create a new repo (and archive the previous one), then run hourly backups during the week… and these hourly backups are making my hair grey :confused:
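For context, the setup is roughly equivalent to the following cbbackupmgr commands (just a sketch — we actually drive this through the Backup Service; the bucket name, staging path and credentials below are placeholders, only the repo name matches a real one):

```
# Sunday morning: create a new repository in the S3 archive
# (bucket name and staging path are placeholders).
cbbackupmgr config \
  --archive s3://our-backup-bucket/cb-backups \
  --repo PersistentStorage_06_09_2022 \
  --obj-staging-dir /mnt/backup-staging

# During the week: hourly backups into that repository.
cbbackupmgr backup \
  --archive s3://our-backup-bucket/cb-backups \
  --repo PersistentStorage_06_09_2022 \
  --obj-staging-dir /mnt/backup-staging \
  --cluster couchbase://localhost \
  --username Administrator \
  --password '<password>'
```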

Each backup takes longer to complete than the previous one, and in the final days (Friday, Saturday… before we make a new repo) it can take more than an hour, in which case a scheduled backup is skipped.

This is a list of the last 10 backups… the one marked is the backup from the attached log file…

If I look at S3, it looks like the folder on S3 was created only a few minutes before the backup completed…

(screenshot attached)

Nodes are running on m6gd.xlarge instances, and we have an S3 gateway endpoint set up in the VPC in which the instances are running.
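(This is roughly how we verified the gateway endpoint; the region and VPC ID are placeholders:)

```
# Confirm there is an S3 gateway endpoint attached to the VPC
# (region and VPC ID are placeholders).
aws ec2 describe-vpc-endpoints \
  --region eu-central-1 \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
            Name=service-name,Values=com.amazonaws.eu-central-1.s3
```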

CBLogs.zip (3.5 KB)

Can someone please help us understand this? At least a hint as to what is going on for those 47+ minutes before the directory on S3 is created?

I can’t find a reason why the backups are taking so long to complete.
I know we can create a new repo during the week as a work-around, but I would like to at least understand what is happening before doing that…

Just FYI… we also have another cluster (same version, hourly backups using S3), also with 6 nodes and the same services set up, but running on the x86_64 arch… there the backups complete in a few minutes all week long…

Hi @matjazs. I’ve taken a look at the logs you uploaded and can see that the time reported in the UI is accurate. During that time we see a lot of lines like:

```
2022-09-12T10:02:35.814Z WARN (Worker) No progress given by cbbackupmgr {"cluster": "self", "repositoryID": "PersistentStorage_06_09_2022", "state": "active", "taskName": "Hourly"}
```

cbbackupmgr will report progress to the backup service periodically when the transfer from the cluster is in progress. It won’t report progress during any setup or teardown steps however, and when uploading to cloud storage we do download and upload some metadata files at the beginning and end respectively. This, along with the backup folder timestamp, does suggest the download of the metadata might be what is slow here.
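One rough way to check that theory (the archive path and staging dir below are placeholders; the repo name is taken from your logs) is to time the metadata-only operations against the same S3 archive from the backup node:

```
# How many objects sit under the repository prefix? A week of hourly
# backups can accumulate a lot of small files.
time aws s3 ls --recursive \
  s3://our-backup-bucket/cb-backups/PersistentStorage_06_09_2022/ | wc -l

# Time a metadata-only cbbackupmgr operation against the same archive.
time cbbackupmgr info \
  --archive s3://our-backup-bucket/cb-backups \
  --repo PersistentStorage_06_09_2022 \
  --obj-staging-dir /mnt/backup-staging
```

If those take a similar amount of time to the gap you are seeing, that would be consistent with the metadata handling, rather than the data transfer itself, being the slow part.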

Could you please confirm whether the x86_64 cluster is running in the exact same configuration? In particular:

  • a new repo is created every week, with hourly backups in it
  • backup is to S3
  • the EC2 instance type
  • the same node configuration (3x index, 2x data+query, 1x data+query+backup)

It’s perhaps worth noting that we generally advise against running the backup service on the same node as the data service as they will be competing for resources.

Finally, may I ask whether you are an Enterprise customer? If so, it would be best to continue this on a support ticket (https://support.couchbase.com/hc/en-us) so that we can track things a little better.

Thanks

Hi Matt. Thank you for this info… at least now I know where to look for issues :slightly_smiling_face: I will check whether anything is not as it should be around S3 and our backup instance.

Regarding our x86_64 cluster:

  • same backup config (new repo every week and hourly backups during it)
  • yes, backup to S3 (even the same bucket, just another dir)
  • m5d.xlarge
  • same node configuration

I know about running backup on the same node as data… but we are not even close to hitting any resource limits… :confused:

Sadly no… I am not an Enterprise customer :confused:

Again, thank you for this… now I know what to focus on :slightly_smiling_face: