Java SDK 1.4 failing during cbbackup

Hi,

I’m having trouble taking a backup (using cbbackup) from a live production cluster. I posted about this a couple of months ago on the Couchbase Server category (thread link below). After further investigation, the Couchbase processes themselves don’t appear to be failing (as far as I can tell), so I’m now focusing on the Java client.

In a nutshell, we’re on AWS EC2 with a 5-instance cluster. As soon as cbbackup starts, the Java client immediately begins to fail: timeouts on reads and temporary failures (ERR_TEMP_FAIL) on writes. I’ve pared the workload back to nearly zero (load average < 0.25 on all instances) and the client still fails. Our app (i.e. the client) runs on the same 5 instances as the Couchbase cluster. I realize (now) that’s not ideal, but we migrated from a similar memcached configuration.
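For context, here’s roughly how we connect and write. The node URI, bucket, key, and value below are placeholders, and the bounded retry is just a sketch of what we’ve been trying against the temporary failures:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.CouchbaseConnectionFactory;
import com.couchbase.client.CouchbaseConnectionFactoryBuilder;

import net.spy.memcached.internal.OperationFuture;

public class BackupWindowWrite {
  public static void main(String[] args) throws Exception {
    // Placeholders: node URI, bucket name, and password.
    List<URI> nodes = Arrays.asList(URI.create("http://node1:8091/pools"));

    CouchbaseConnectionFactoryBuilder cfb = new CouchbaseConnectionFactoryBuilder();
    cfb.setOpTimeout(5000); // give operations extra headroom during the backup window
    CouchbaseConnectionFactory cf = cfb.buildCouchbaseConnection(nodes, "default", "");
    CouchbaseClient client = new CouchbaseClient(cf);

    // Bounded retry: a temporary failure is the node asking the client to
    // back off, so pause briefly and retry a few times before giving up.
    String key = "example::key";                // illustrative key and value
    String value = "{\"hello\":\"world\"}";
    boolean stored = false;
    for (int attempt = 1; attempt <= 5 && !stored; attempt++) {
      OperationFuture<Boolean> f = client.set(key, 0, value);
      stored = f.get();
      if (!stored) {
        System.err.println("set failed (attempt " + attempt + "): " + f.getStatus().getMessage());
        Thread.sleep(100L * attempt);           // simple linear backoff
      }
    }

    client.shutdown();
  }
}
```

The idea behind the backoff is that a temporary failure is the node asking the client to slow down rather than a hard error, but of course it doesn’t address the underlying cause.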

We’re running Couchbase Server 2.2.0 with Java SDK 1.4.2, and cbbackup runs on a separate EC2 instance from the cluster. Our instances are m3.large - I realize they only have 2 vCPUs, but with CPU utilization this low, I wouldn’t expect that to be the issue.

I’m happy to post more info about our setup, but looking for ideas on what could be going wrong and steps to take to further debug and isolate the problem.

Thanks!
Travis

Previous post from Couchbase Server category

Based on the description, it sounds like the system is running out of memory; that is exactly what would cause the TEMPFAIL.

How is the cluster configured? Does the trouble correlate with any paging activity? Also, if you observe the same elevated latencies at the cluster with cbstats timings, that indicates the problem is in the cluster nodes rather than in the client.
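If it helps, a small client-side probe like the one below (the key name and sample count are placeholders) would give you numbers to compare against what cbstats timings reports on each node while cbbackup is running:

```java
import com.couchbase.client.CouchbaseClient;

public class LatencyProbe {
  // Times a handful of gets against a key known to exist and prints the
  // worst case. The key name and sample count are placeholders.
  static void probe(CouchbaseClient client, String key, int samples) {
    long worstMs = 0;
    for (int i = 0; i < samples; i++) {
      long start = System.nanoTime();
      client.get(key);
      long elapsedMs = (System.nanoTime() - start) / 1000000L;
      if (elapsedMs > worstMs) {
        worstMs = elapsedMs;
      }
    }
    System.out.println("worst get latency over " + samples + " samples: " + worstMs + " ms");
  }
}
```

If the client-side numbers and the cbstats timings jump together during the backup, the latency is coming from the cluster rather than from the SDK.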

Hi Matt,

Thanks for the reply. I can see that two of the five nodes did trigger paging activity when cbbackup started. Memory (mem_used) also spiked at that time, along with the TAP queue stats: items, back-off rate, backfill remaining, and remaining on disk. The TAP queue drain rate stayed fairly constant throughout the backup.

Does cbbackup consume additional memory on the cluster?

The bucket in question is allocated 2 GB per node (x5 => 10 GB), with one replica and a read/write concurrency of 3. Our working set is a very small percentage of the total data; most of it is historical and infrequently accessed. We average a 0.7 cache miss ratio, 0.8% active docs resident, 8.05 GB low water mark, 9.13 GB high water mark, and 8.7 GB mem_used - so mem_used already sits above the low water mark and only about 0.4 GB below the high water mark.

We haven’t had memory issues (that I’m aware of), so I thought we were OK on memory. That said, I’m going back through our memory sizing in light of this potential issue.

I haven’t worked with cbstats yet, but I figure I need to get familiar with it soon.

Any insights you have are welcomed.

Thanks!
Travis

As an aside, is there a better way to handle current vs. historical data? Maybe separate them into different buckets or something? I’m sure this has been discussed before, but I haven’t been able to find much on the specific topic.

cbbackup does use additional memory, but it should stay within the bucket quota if the cluster is sized properly. The TMPFAIL is an indication that the quota is exhausted, and if you see timeouts where you normally don’t in steady state, the backup is probably causing extra ejections, so later requests for items outside the working set have to be backfilled from disk.

As far as managing older items: if you mean you want to get rid of them, it’s best to set an expiration and let the system remove them at that time. If you mean you want to make sure they’re not consuming memory, the system will automatically keep the working set hot in the cache.
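For example, with the 1.4 SDK something along these lines would set a TTL on a new document and refresh it on an existing one (the key, value, and the 90-day figure are just placeholders):

```java
import com.couchbase.client.CouchbaseClient;

import net.spy.memcached.internal.OperationFuture;

public class ExpirationExample {
  // Stores a document with a ~90-day TTL and refreshes the TTL on an existing
  // document. Key name, value, and the 90-day figure are placeholders.
  static void writeWithTtl(CouchbaseClient client, String key, String json) throws Exception {
    // Expirations longer than 30 days are passed as an absolute Unix timestamp (seconds).
    int expiry = (int) (System.currentTimeMillis() / 1000L) + 90 * 24 * 60 * 60;

    OperationFuture<Boolean> set = client.set(key, expiry, json);
    if (!set.get()) {
      System.err.println("set failed: " + set.getStatus().getMessage());
    }

    // touch() updates the TTL of a document that already exists.
    client.touch(key, expiry);
  }
}
```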

Of course, the 3.0 releases are much better at this in general; 2.2 is quite old at this stage. It doesn’t look like you’re hitting an issue - you are just shy of resources.

Thank you for the input, Matt. I’ll continue to dig into our memory usage. Hopefully getting that sorted will let us run a nightly backup without affecting production service.

Then we can get to work upgrading to 3.0.

Thanks again!
Travis