Sizing & cbbackup

#1

Hi,

I’ve never been able to get cbbackup to run without destabilizing my cluster. (See previous posts referenced at end of post.) The problem seems to be a lack of RAM. I plan to change my app to actively archive a significant portion of its data, in the hope that this will free up RAM and allow cbbackup to run successfully. I’ve explained my situation and assumptions below, but I’d like to hear any insights into couchbase or suggestions for better solutions. Maybe my app is a unique use case (seems unlikely), or I’m doing something wrong.

I started with a cluster RAM quota of 10GB. While memory usage ran high, all reads and writes seemed to work well, and Temp OOM was at zero. After cbbackup was tried and failed, I upped the RAM to 14GB. The cluster still ran high on memory, and cbbackup failed in the same way. I then jumped to 24GB RAM. The cluster still ran high, using nearly all available RAM, and cbbackup still failed.

My application has a large number of small documents, most of which are historical and no longer accessed. There are approximately 200K new documents a day, which are rarely accessed after about two days. Although the active docs resident ratio is only about 1%, the cache miss ratio is also just 1%. Of note, the cluster is four EC2 m1.large instances running only couchbase (and only for this app), with all data in one couchbase bucket. There is also a very small memcached bucket (2GB).
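For what it’s worth, here is my back-of-envelope math on the active working set. The average document size (~2 KB) and per-key metadata overhead (~60 bytes) are assumptions on my part, not measured numbers from the cluster:

```python
# Rough working-set sizing sketch (assumed numbers, not measured).
DOCS_PER_DAY = 200_000      # new documents per day (from the app profile)
ACTIVE_DAYS = 2             # docs are rarely accessed after ~2 days
AVG_DOC_BYTES = 2 * 1024    # assumed average document size
META_BYTES_PER_KEY = 60     # assumed per-key metadata overhead

active_docs = DOCS_PER_DAY * ACTIVE_DAYS
working_set_bytes = active_docs * (AVG_DOC_BYTES + META_BYTES_PER_KEY)
working_set_gb = working_set_bytes / 1024**3

print(f"active docs: {active_docs:,}")
print(f"approx working-set RAM: {working_set_gb:.2f} GB")
```

By this estimate the truly active data is well under 1 GB, which is why I suspect the rest of the allocated RAM is being consumed by key metadata and cached historical documents rather than the working set.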

From my experimentation, it appears that couchbase must be holding on to too much of the historical data. It also seems that cbbackup must use a lot of memory to load data from disk and send it to the (separate) backup machine. This makes some sense, though I’m not sure how much memory headroom I need to allow for cbbackup to succeed without killing regular couchbase operations.

I’m curious if anyone can confirm my suspicions about what is happening or set me straight. Outside of getting backups, couchbase has worked fantastically well for my app. I just need to get backups working and things will be good. Any and all help is much appreciated.

Thanks,
Travis

#2

Have you been able to upgrade to 3.0?

When we last traded notes on one of those other threads, you wanted to complete a backup and then do the upgrade. In this posting, you don’t indicate version/platform.

#3

Hi Matt,

Thank you for your responsiveness, I really appreciate your help.

The app instance I described is running 2.2.0. We do have another app instance that is running 3.0.1. I will be testing cbbackup against that later today. However, the operating profiles are nearly identical and I expect the same failures to happen when I run cbbackup against the 3.0.1 instance.

Due to AWS maintenance, I’ll actually be swapping out new machine instances for the 2.2.0 cluster tonight. I would upgrade, but I thought I had read not to take an upgrade from 2.2.0 to 3.x lightly. And I don’t have the resources to test an upgrade simulating production traffic and load. Is it (relatively) safe to swap in new 3.x instances and load (cbrestore) a full backup from 2.2.0 instances?

As always, I appreciate your feedback,
Travis

#4

Hi Matt,

I was able to run the backup test against an app instance running 3.0.1 today. The test failed in the same way as the test against the 2.2.0 app instance, causing Temp OOM errors and errors on the clients (Java SDK 1.4.x).

It would be great to figure out if I’m doing something wrong, either procedurally or with our app architecture. In the absence of any more information, I feel my only option is to remove our historical data from couchbase in the hope that the smaller active data set will reduce pressure on RAM. Then, maybe, cbbackup will have the headroom it needs to run without triggering errors.

Of course, any insight or guidance is welcome.

Thanks,
Travis

#5

I am experiencing the same issue with couchbase server 3.0.1. I tried running a backup last night twice, and both times it failed. I used the command:

cbbackup http://<hostname>:8091 db_backup -u <couchbase username> -p <couchbase password>

We have three buckets, and the two smaller ones back up fine, but when it hits the larger bucket (~2 million documents) it runs for a while, slowly using more and more CPU and RAM, until I get an error in the couchbase console that the cluster can no longer connect to the node I’m running the backup on. I think the couchbase process must restart at that point, because shortly afterward everything is working fine again. The shell displaying the backup status information never shows an error; it just seems to be stopped or frozen. The first time I ran it, the backup failed after about an hour. I then tried running a diff backup; it took forever to start, then displayed an error that it couldn’t decode a JSON object:

Exception in thread w1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/couchbase/lib/python/pump.py", line 279, in run_worker
    curx)
  File "/opt/couchbase/lib/python/pump_bfd2.py", line 20, in check_spec
    getattr(opts, "mode", "diff"))
  File "/opt/couchbase/lib/python/pump_bfd.py", line 255, in find_seqno
    json_data = json.load(json_file)
  File "/opt/couchbase/lib/python/simplejson/__init__.py", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/opt/couchbase/lib/python/simplejson/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/opt/couchbase/lib/python/simplejson/decoder.py", line 335, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/couchbase/lib/python/simplejson/decoder.py", line 353, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

  [#######             ] 34.7% (659955/estimated 1904339 msgs)

It then ran until 34.7% and seems to have failed sometime after that (I was running it overnight since it slows down the server to the point that it can’t be used). Our nodes have 4 cores at 2.0GHz each, and 7GB of RAM each (with 4 GB allocated to each couchbase node for buckets). I’ve run the backup before on a Couchbase 2.5.1 server successfully, but it only had just over 1 million documents at the time. Is this an issue with couchbase 3.0.1? Is there a better way to run the backup than the general full cluster backup that might be more successful?
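In case it helps, here is what I’m planning to try next: backing up one bucket at a time with smaller transfer batches. This is only a sketch — the `-b` (bucket source) and `-x` tuning options (e.g. `batch_max_size`, `batch_max_bytes`) appear in cbbackup’s help output, but I haven’t confirmed they actually reduce memory pressure on 3.0.1, and the hostname, bucket names, and batch values below are placeholders:

```python
# Sketch: build per-bucket cbbackup invocations with smaller batches.
# Host, credentials, bucket names, and batch values are placeholders.
buckets = ["small_bucket_1", "small_bucket_2", "large_bucket"]

def backup_cmd(bucket):
    return [
        "cbbackup", "http://<hostname>:8091", "db_backup",
        "-u", "<couchbase username>", "-p", "<couchbase password>",
        "-b", bucket,  # back up a single bucket per run
        "-x", "batch_max_size=200,batch_max_bytes=2097152",  # smaller batches
    ]

for b in buckets:
    print(" ".join(backup_cmd(b)))
    # on the backup host, run e.g. subprocess.check_call(backup_cmd(b))
```

Running the large bucket on its own, off-hours, should at least narrow down whether the failure tracks total transfer volume or the batch sizes cbbackup uses.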