Uniform distribution of data

mr_x · April 17, 2021, 11:32pm

I understand that CB auto shards data into 1024 vbuckets. How do we know that data will be (more or less) evenly distributed across the cluster?

For example, say I have 3 data nodes. I insert 30K documents with random keys. Will there be approximately 10K documents per data node?

ianmccloy · April 18, 2021, 9:03am

Hello mr_x,

Couchbase Server shards data into what we call vBuckets. We do this by taking the document ID, known as the key, applying a hash function to it, and then taking the result modulo N, where N is the number of vBuckets. The vBuckets are split evenly across all nodes in a cluster, so in a cluster with 3 Data Nodes each node gets approximately 341 vBuckets (1024 / 3). The hash function we use is CRC32, which is a checksum rather than a cryptographic hash, but it distributes keys very evenly across the output space. While no hash function is exactly perfect, we find that CRC32 is within a single-digit percentage of an even distribution of data across the vBuckets. Each time you write a document to a Couchbase Server cluster, the client SDK computes the CRC32 hash, and that is how it knows which shard (vBucket) to place the data into. The article "The distribution of hash function outputs" has a good description of how hash functions distribute evenly.
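
To get a feel for the 3-node, 30K-document example from the question, here is a minimal Python sketch. It is a simplification under stated assumptions, not the SDK's actual code: it uses plain `zlib.crc32` modulo 1024 as described above (real SDKs may select bits from the CRC differently), and `node_for` is a hypothetical placement that hands each node a contiguous range of vBuckets, whereas a real cluster decides placement through its own vBucket map.

```python
import random
import zlib
from collections import Counter

NUM_VBUCKETS = 1024  # vBucket count mentioned in the thread
NUM_NODES = 3        # data nodes in the example from the question


def vbucket_for(key: str) -> int:
    """Simplified mapping: CRC32 of the key, modulo the vBucket count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def node_for(vbucket: int) -> int:
    """Hypothetical placement: each node owns a contiguous vBucket range."""
    return vbucket * NUM_NODES // NUM_VBUCKETS


def docs_per_node(keys) -> list:
    """Count how many keys land on each node."""
    tally = Counter(node_for(vbucket_for(k)) for k in keys)
    return [tally[n] for n in range(NUM_NODES)]


# Seeded generator so the run is repeatable; keys are random 16-digit hex strings.
rng = random.Random(42)
keys = [f"{rng.getrandbits(64):016x}" for _ in range(30_000)]
counts = docs_per_node(keys)
print(counts)
```

With random keys, each entry in `counts` should come out close to 10,000. Highly patterned keys may spread a little less evenly, so it is worth rerunning the sketch with keys shaped like your own.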

Thanks,
Ian McCloy (Principal Product Manager, Couchbase)

mr_x · April 20, 2021, 3:15am

Very nice. Thanks Ian.