Using binary attachments at high scale

I am considering whether we can use Couchbase to store binary attachments of up to 2 MB in size, with an average size of 600 kB.

I understand that it’s recommended to use object storage for this purpose, but I’d like to know whether any benchmarks have been run that show how things could go wrong with heavy use of binary attachments. I expect to store 4,000 documents per second on average, and each document will have a binary attachment associated with it. Seven days of data will be kept in Couchbase and deleted afterwards. Using object storage would complicate the architecture significantly, so I’m trying to understand how bad it would be to use Couchbase for storing binary objects at this scale.

P.S.: My use case will be ~97% write and only 3% read. Read queries will generally be quite simple.

Hi alin,

There’s no issue with storing documents as binary blobs compared to JSON from a Data Service point of view - storing a 600 kB binary blob is broadly the same cost as storing a 600 kB JSON value.

A couple of things to consider in general, however:

  • Binary blobs often compress less well than JSON documents (it depends on the exact binary format, but binary data typically has higher entropy than JSON). That means 600 kB binary documents may actually take up more space in memory / on disk than 600 kB JSON documents. (Obviously a 600 kB JSON document may not carry the same information as a 600 kB binary document - the same information might take 600 kB in a binary format but 1024 kB in JSON, in which case the difference becomes moot - but I mention it in case you have any existing sizings based on JSON documents.)
  • Binary blobs cannot be manipulated in the same way as JSON documents in Couchbase - you are pretty much limited to key-value operations, and even then you can only read (GET) the entire document, SET the entire document, append/prepend binary bytes to it, or delete it.[*]
  • Large values (100 kB+) are not necessarily as efficient to store and retrieve as, say, 10 documents each of which is 10x smaller, for a couple of reasons (see the sketch after this list):
    1. Couchbase manages document values as single objects internally, so a 600 kB value will either take up ~zero memory if not resident, or 600 kB if resident - even if the application reading it logically only cares about the first byte. Compare this to slicing the document into 10 smaller 60 kB documents - Couchbase can then keep resident only the particular slice(s) which are actively being used.
    2. Similarly for updating documents - if a single byte of a 600 kB document needs to be modified, the entire 600 kB needs to be re-written to disk. With 10 smaller 60 kB documents, changing one byte only requires re-writing 60 kB (a single slice).
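To make the key-value-only and slicing points concrete, here is a minimal sketch using the Couchbase Python SDK - the cluster address, bucket name, document keys and the use of RawBinaryTranscoder are illustrative assumptions, not a prescription:

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, GetOptions, UpsertOptions
from couchbase.transcoder import RawBinaryTranscoder

# Hypothetical connection details - replace with your own.
cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("user", "password")))
collection = cluster.bucket("attachments").default_collection()

binary_tc = RawBinaryTranscoder()
payload = b"\x00" * 600 * 1024  # stand-in for a 600 kB binary attachment

# Option A: one key for the whole blob - every read/write moves all 600 kB.
collection.upsert("doc::1::attachment", payload,
                  UpsertOptions(transcoder=binary_tc))
blob = collection.get("doc::1::attachment",
                      GetOptions(transcoder=binary_tc)).content_as[bytes]

# Option B: slice the blob into 60 kB chunks - only the chunks actually touched
# need to be resident in memory or re-written on disk.
CHUNK = 60 * 1024
for i in range(0, len(payload), CHUNK):
    collection.upsert(f"doc::1::attachment::{i // CHUNK}",
                      payload[i:i + CHUNK],
                      UpsertOptions(transcoder=binary_tc))

# Reading just the first slice fetches 60 kB instead of 600 kB.
first_slice = collection.get("doc::1::attachment::0",
                             GetOptions(transcoder=binary_tc)).content_as[bytes]
```

The trade-off with slicing is that the application has to track the chunk keys itself and reassemble the value when it genuinely needs all of it.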

In terms of the numbers you quote - storing 4,000 documents per second at an average of 600 kB each is 2.4 GB of data per second. That by itself is non-trivial bandwidth - it would require ~20 Gb/s of network bandwidth simply to write that data to a Couchbase cluster.

If you are then keeping that data for 7 days, and assuming those writes are all inserts (as opposed to updates), you would be looking at storage requirements of:

2.4 GB/s * 60 * 60 * 24 * 7 = 1,451,520 GB

Or 1.4 PB, which is a significant quantity of data. Perhaps you could confirm the total number of documents expected?
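For anyone who wants to tweak the assumptions, here is the same back-of-envelope in a few lines of Python (decimal units, all writes assumed to be inserts, and ignoring replication, compression and metadata overhead):

```python
# Back-of-envelope for the originally quoted (peak) numbers.
docs_per_sec = 4_000
avg_doc_bytes = 600 * 1000                                  # 600 kB average value

ingest_bytes_per_sec = docs_per_sec * avg_doc_bytes         # 2.4 GB/s
ingest_gbit_per_sec = ingest_bytes_per_sec * 8 / 1e9        # ~19.2 Gb/s on the wire

retained_bytes = ingest_bytes_per_sec * 60 * 60 * 24 * 7    # 7 days of inserts
print(f"ingest  : {ingest_bytes_per_sec / 1e9:.1f} GB/s (~{ingest_gbit_per_sec:.0f} Gb/s)")
print(f"retained: {retained_bytes / 1e15:.2f} PB")          # ~1.45 PB
```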

[*] Strictly speaking you can extract data from the value with the view engine and possibly N1QL, but ultimately you need to fetch the entire value from the Data Service to be able to do anything with it.


Thank you very much @drigby. A few points regarding the calculation: I was wrong on the numbers I shared with you, as they were the peak numbers, not the average ones. The average is much lower:

100M objects per day with an average size of 100 B
5.6M objects per day with an average size of 600 kB
7 days total retention
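For a rough sense of scale, a quick back-of-envelope with these revised figures (same assumptions as above - decimal units, all inserts, no allowance for replication, compression or metadata):

```python
# Back-of-envelope for the revised (average) numbers.
small_per_day = 100_000_000 * 100        # 100M objects * 100 B  = 10 GB/day
large_per_day = 5_600_000 * 600 * 1000   # 5.6M objects * 600 kB = 3.36 TB/day

total_7_days = (small_per_day + large_per_day) * 7
print(f"7-day footprint: {total_7_days / 1e12:.1f} TB")  # ~23.6 TB
```

which is a very different proposition from the ~1.4 PB implied by the peak numbers.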

Apart from the infrastructure and required capacity, are there any other considerations for storing binary attachments if it’s only going to be plain store and get (acting as a key-value store)? Are there any benchmark results available showing how performance degrades as object size increases?

As I described above, broadly speaking the size of a document value doesn’t matter too much; KV-Engine mostly doesn’t care about the actual content of documents until you attempt to manipulate that data, say using the sub-document API for JSON documents.
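As a small illustration of that distinction (a sketch only, reusing the `collection` handle from the earlier example; the `order::123` key and `status` field are hypothetical): a JSON document can be partially read or updated through the sub-document API, while a binary value can only ever be fetched or replaced in full.

```python
import couchbase.subdocument as SD

# JSON document: the sub-document API lets KV-Engine return or modify a single
# path without shipping the whole value to the client.
status = collection.lookup_in("order::123", [SD.get("status")]).content_as[str](0)
collection.mutate_in("order::123", [SD.upsert("status", "shipped")])

# Binary value: no sub-document access - a read is always the full blob.
blob = collection.get("doc::1::attachment",
                      GetOptions(transcoder=RawBinaryTranscoder())).content_as[bytes]
```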

Most of our internal benchmarks focus on JSON documents, as that is what most people use, but we test binary as well. Here are some numbers for small (256 B) binary values - note that the most recent builds are still development ones, so you probably want to focus on the 7.0.x builds: ShowFast

Note these are with a much smaller number of documents than you are talking about - just 20M.

Thank you very much. Just a few points of clarification.

So to confirm: there is not much difference between a large JSON object of 600 kB and a small JSON object with a 600 kB attachment when it comes to writes. For reads, though, it could create a memory issue, because the entire 600 kB object has to be retrieved in the latter case, right?

I couldn’t find the benchmark result that refers to binary attachments in the provided link. There are so many impressive benchmark results that I’m not sure which one I should take into consideration.

Apologies - I meant to link a particular job. Take a look at: Daily Performance

That’s wonderful. Do you know what hardware was used for this?