How to properly compress documents?


#1

Hi everyone,

I’ve just started using Couchbase Server with its .NET SDK. The data type I’m storing is a POCO C# class with a list inside. One document can be quite large: hundreds of kilobytes when serialized to JSON.

I’m only using Couchbase as a key-value store, and I don’t care about the internal structure of the documents on the server. So I would prefer to store them compressed, which can save a lot of storage space (I tried one of my actual documents: serialized to JSON it takes 237 KB, and after zipping it was 10 KB). I would like the compression to happen on the client side, so that I save not just on storage and memory on the server, but also on network traffic.

What is the idiomatic, simplest way to achieve this with the .NET SDK?

My code inserting documents is very simple.

bucket.Insert(
    new Document<MyCustomPoco>
    {
        Content = item,
        Id = key,
        Expiry = expiry
    });

In the configuration I only specify connection information; I haven’t customized the serialization in any way.

In this blog post I read that by default the SDK uses Json.NET for serialization, so I was expecting to see the inserted document on the server as JSON. However, if I look at the inserted document in the management console, I see that it is in a binary format.

What is the reason for this? Is the SDK using a binary serializer by default? Or is the document getting compressed? If it’s the latter, does it already get compressed on the client side?

Maybe I’m missing something, but I couldn’t find any clear, official information about this. I’m looking for the approach which is the most idiomatic, and requires the least amount of custom coding.
Thanks for any suggestions!


#2

@markvincze -

The easiest way to do this would probably be to store the content as a compressed byte array. A byte array will bypass the serialization process and be stored on the server as a binary blob.
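A minimal sketch of the compression step, assuming the document has already been serialized to a JSON string and that gzip (from `System.IO.Compression`) is an acceptable format. The `GzipBlob` class and its method names are made up for illustration and are not part of the Couchbase SDK.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of the Couchbase SDK.
public static class GzipBlob
{
    // Encodes the JSON text as UTF-8 and gzips it into a byte array.
    public static byte[] Compress(string json)
    {
        byte[] raw = Encoding.UTF8.GetBytes(json);
        using (var output = new MemoryStream())
        {
            // Dispose the GZipStream before reading the buffer,
            // otherwise the final gzip block is not flushed.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Reverses Compress: gunzips the bytes and decodes them as UTF-8 text.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The resulting `byte[]` would then be used as the document content in place of the POCO (for example a `Document<byte[]>` instead of `Document<MyCustomPoco>` in the insert call from the question), and the reading side would call `Decompress` before deserializing the JSON.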

As for the binary format you’re seeing in the console, that is odd: assuming the code shown above, the MyCustomPoco object should have been converted to a JSON document via serialization. The only way a document will be stored as a binary blob is if it’s not a valid JSON doc, meaning it’s syntactically incorrect.

-Jeff


#3

@markvincze

On another note, I’m not sure compressing the documents before sending them to Couchbase is necessary. Couchbase already compresses the data when it’s persisted to disk on the server.

So you could just serialize as JSON, and let the server pick up the processing load of compression/decompression. This has the advantage that it should have less performance impact on your application because:

  1. The compression/decompression load is on the Couchbase server
  2. It is cached in the server memory decompressed, so if you are reading multiple times it only decompresses once instead of on every read.
  3. The disk persistence layer does compression, and writing to Couchbase doesn’t wait for persistence to complete. So your write operations will complete before any compression is done (unless you specifically request to wait for persistence)

Caveat on these statements: I’m not part of the Couchbase team, just a community member. This is just my understanding from going to the conferences, etc. So if it’s critical to you, I’d inquire with someone on the Server team to make sure.

Thanks,
Brant


#4

Hi @jmorris,

Thanks for the info! I based my custom compressing serializer solution on the blog post you wrote about serializing with Jil, and it works nicely.
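For readers following along, here is a rough sketch of what a compressing serializer can look like. It is not the code from this thread: it assumes `System.Text.Json` instead of Jil, uses gzip, and the `CompressingJsonSerializer` class is a hypothetical standalone type that does not implement any SDK interface. Hooking it into the SDK would follow the approach in the blog post mentioned above.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text.Json;

// Hypothetical standalone class; it only shows the
// serialize-then-compress / decompress-then-deserialize round trip.
public sealed class CompressingJsonSerializer
{
    // Serializes the object to UTF-8 JSON, then gzips the bytes.
    public byte[] Serialize(object obj)
    {
        if (obj == null) throw new ArgumentNullException(nameof(obj));
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(obj, obj.GetType());
        using (var output = new MemoryStream())
        {
            // Fastest trades some compression ratio for lower client CPU cost.
            using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
            {
                gzip.Write(json, 0, json.Length);
            }
            return output.ToArray();
        }
    }

    // Gunzips the given slice of the buffer, then parses the JSON into T.
    public T Deserialize<T>(byte[] buffer, int offset, int length)
    {
        using (var input = new MemoryStream(buffer, offset, length))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var plain = new MemoryStream())
        {
            gzip.CopyTo(plain);
            return JsonSerializer.Deserialize<T>(plain.ToArray());
        }
    }
}
```

With a serializer like this the calling code can keep working with the POCO type directly, and compression stays an implementation detail of the serialization layer.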

Regarding the issue about the default binary serialization, I’m starting to suspect I missed something and was looking at the wrong objects in the management console, because in an isolated environment I couldn’t reproduce the issue.

Cheers,
Mark


#5

Hey @btburnett3,

Thanks for the hints!

How do you think the memory usage of the Couchbase server installation is affected by storing the entries compressed?
I’m not worried about disk space anyway, because I’m only using Couchbase as a temporary key-value store, and the expiry of my documents is never longer than 1 hour. (Also, when I insert something, usually it will soon be read as well, so probably most items will be loaded into memory.)
So what I’m really interested in is the server’s memory usage. I was expecting it to use much less memory if I store my items as compressed binary than if I store them as JSON. Is that a wrong assumption?

Cheers,
Mark


#6

@btburnett3 @jmorris,

I ran some benchmarks and wrote a blog post about them: http://blog.markvincze.com/simple-client-side-compression-for-couchbase-with-benchmarks/. I hope I didn’t overlook anything.
It seems that compression significantly decreases both the memory and disk usage (and we have to pay some penalty in computation on the client side, as you predicted).

Cheers,
Mark


#7

@markvincze -

Awesome blog post, thanks for sharing!

-Jeff


#8

@markvincze

I agree with @jmorris, awesome blog post!

It’s particularly interesting how much more effect it had on memory utilization than disk. Was this a memcached or persisted bucket?

If persisted, another factor to watch for on disk utilization might be letting the server idle for a while. I seem to recall hearing in a session at Couchbase Connect last year that the built-in disk compression is delayed until the server runs some maintenance in the background. Don’t hold me to that, though.

Thanks,
Brant