Checking document size

whollycow007 · May 3, 2015, 7:09pm

Is there anything in the SDK that will return the size of a document?

Does CB compress data, and if so, does the 20 MB limit apply to the compressed size?

I’m trying to write code that will split a document if it’s larger than 5 MB. I’m currently using sys.getsizeof(object, [default]) to do it - is that the only method?

Thanks!

mnunberg · May 4, 2015, 3:11pm

sys.getsizeof will tell you how much memory the object occupies in Python, but not how large the JSON encoded size will be.

While CB may compress data internally, the 20MB limit is on the uncompressed size.

The simple approach to checking the document size would be to serialize the data yourself to JSON and check its resultant size:

size = len(json.dumps(data))

The size will then tell you how large your object is in bytes as far as Couchbase is concerned. “Splitting” JSON would be a bit complex since JSON is a structured data format; you can split the encoded form but then each section would not be readable without the other (this means it cannot be used with views or N1QL).
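
To make the size check and the "split the encoded form" idea concrete, here is a minimal sketch using only the standard library. The helper names (json_size_bytes, split_encoded, join_encoded) and the 5 MB threshold are assumptions for illustration and are not part of the Couchbase SDK. It counts UTF-8 bytes rather than characters, which only differs when the encoder emits non-ASCII output.

```python
import json

MAX_BYTES = 5 * 1024 * 1024  # assumed 5 MB threshold taken from the question


def json_size_bytes(data):
    """Return the size in bytes of data serialized as UTF-8 JSON."""
    return len(json.dumps(data).encode("utf-8"))


def split_encoded(data, chunk_bytes=MAX_BYTES):
    """Serialize data once and cut the encoded bytes into fixed-size chunks.

    The chunks are NOT valid JSON on their own; they only make sense
    once they are joined back together in order.
    """
    encoded = json.dumps(data).encode("utf-8")
    return [encoded[i:i + chunk_bytes]
            for i in range(0, len(encoded), chunk_bytes)]


def join_encoded(chunks):
    """Reassemble the chunks (in order) and decode the original object."""
    return json.loads(b"".join(chunks).decode("utf-8"))
```

Each chunk would have to be stored under its own key as raw bytes, and every chunk must be fetched before the document can be decoded, which is why this layout cannot be queried with views or N1QL.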

Another issue with the simple approach is, assuming data is a dictionary, the following will result in two calls to serialize the JSON:

size = len(json.dumps(data))
cb.upsert(key, data) # Serializes data implicitly

If you only use the library casually this may not be an issue, but JSON serialization in Python (using the default JSON libraries) can be expensive, especially for large objects. To avoid serializing twice, you can either:

Store your data as “raw bytes” (FMT_BYTES or FMT_UTF8) - and lose the ability to have the client automatically decode the JSON on get() calls
Implement a custom Transcoder (http://pythonhosted.org/couchbase/api/transcoder.html) to which you can indicate (via a wrapper object or a custom format) that you are passing an already-serialized JSON string (so it should be treated like JSON, except that it need not be serialized a second time). Something like:

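from couchbase import FMT_JSON, FMT_UTF8
from couchbase.transcoder import Transcoder

ALREADY_JSON = object()  # sentinel "format" meaning: value is already a JSON string

class MyTranscoder(Transcoder):
  def encode_value(self, value, format):
    if format is ALREADY_JSON:
      # Encode the string as UTF-8 bytes only, but flag the item as JSON
      value, _ = super(MyTranscoder, self).encode_value(value, FMT_UTF8)
      return value, FMT_JSON
    return super(MyTranscoder, self).encode_value(value, format)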

This works by defining a custom “format” (ALREADY_JSON) which can be used to signal to your custom transcoder that it should not encode to JSON again.

The above (with the transcoder) is just an optimization, however; using the default transcoder will still function correctly (if you pass the original object).
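
For the first option (storing the already-serialized string yourself), a minimal sketch might look like the following. The function name upsert_serialized_once and its parameters are hypothetical; bucket is assumed to be any object with an upsert(key, value, format=...) method like the cb object above, and string_format is assumed to be the SDK's FMT_UTF8 constant.

```python
import json


def upsert_serialized_once(bucket, key, data, string_format,
                           max_bytes=20 * 1024 * 1024):
    """Serialize data once, check its size, then store the string as-is.

    bucket is assumed to expose upsert(key, value, format=...);
    string_format is assumed to be the SDK's UTF-8 string format constant.
    """
    encoded = json.dumps(data)
    size = len(encoded.encode("utf-8"))
    if size > max_bytes:
        raise ValueError("document is %d bytes, limit is %d" % (size, max_bytes))
    # The value is passed as a plain string, so the client does not
    # run the JSON encoder a second time.
    bucket.upsert(key, encoded, format=string_format)
    return size
```

The trade-off is the one described above: get() would hand back a string, so the caller has to run json.loads on it.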

whollycow007 · May 5, 2015, 6:40am

ty!

This was super helpful

Do large documents affect the performance of the server in general?

We plan to only use large documents mainly for storing data that is retrieved rarely. Like history docs - just there in case we need it.

mnunberg · May 5, 2015, 2:25pm

I think that as long as the documents are retrieved rarely, it shouldn’t be an issue. Also note that you can use client-side compression if you wish (that is, you can manually compress the data before storing it and manually decompress it after retrieving it), though in that case you won’t be able to use views, N1QL, etc. on the data.
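
A minimal sketch of that client-side compression, assuming zlib from the standard library (the helper names pack and unpack are made up for illustration). The compressed value would have to be stored as raw bytes rather than as JSON.

```python
import json
import zlib


def pack(data):
    """Serialize data to JSON and compress it; returns bytes."""
    return zlib.compress(json.dumps(data).encode("utf-8"))


def unpack(blob):
    """Reverse pack(): decompress the bytes and decode the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Repetitive data such as history records tends to compress well, but the server only sees an opaque binary value.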

whollycow007 · May 6, 2015, 10:21am

Ty. That makes sense.