Is there any way to distribute data among nodes in a cluster in a customized way?
What I want to achieve is this: if I have 10 million user records and only 10% of them are highly active users, can I distribute that 10% equally across the nodes?
This should happen automatically for you. Your keys are hashed, which distributes them reasonably evenly across the 1024 (by default) vbuckets, and the vbuckets are distributed evenly over your nodes.
Thanks. So is the distribution based on usage (how frequently the data is accessed), or does it just spread the data evenly, e.g. 10 million records across 5 nodes with each node holding 2 million records?
It’s closest to the latter. The mapping is totally static. The hash function takes your key, e.g. “user_02333423”, and emits the vbucket it should go to (one of the 1024), and it will always return the same vbucket for that key. Of course no hash function is perfect, but the one we use should distribute keys reasonably randomly over those 1024 vbuckets. As the vbuckets are distributed evenly over the nodes (ignoring advanced stuff like rack awareness), your keys also end up distributed pretty evenly over the nodes. Since the hash ignores access patterns, your active 10% should end up spread about as evenly as everyone else, as long as activity isn't somehow tied to the key itself.
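To make the key -> vbucket -> node idea concrete, here is a minimal sketch in Python. It is only an illustration: the CRC32-modulo hash and the round-robin vbucket map are assumptions made for the example, not necessarily what the server or SDKs do internally, and the names vbucket_for_key, build_vbucket_map and node_for_key are hypothetical.

```python
import zlib
from collections import Counter

NUM_VBUCKETS = 1024  # the default mentioned above


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a key to a vbucket id (assumed hash: CRC32 modulo the vbucket count)."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_vbucket_map(nodes, num_vbuckets: int = NUM_VBUCKETS):
    """Assign vbuckets to nodes round-robin (a stand-in for the real cluster map)."""
    return [nodes[vb % len(nodes)] for vb in range(num_vbuckets)]


def node_for_key(key: str, vbucket_map) -> str:
    """Look up the node that owns the key's vbucket."""
    return vbucket_map[vbucket_for_key(key, len(vbucket_map))]


nodes = [f"node{i}" for i in range(5)]
vbucket_map = build_vbucket_map(nodes)

# Count how many of 100,000 sample user keys land on each node.
counts = Counter(
    node_for_key(f"user_{i:08d}", vbucket_map) for i in range(100_000)
)
print(counts)
```

Nothing in the lookup knows how often a key is read or written, which is why the placement is the same for active and inactive users.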
That also means you can add or remove nodes, and on the next rebalance the vbuckets are redistributed across the new set of nodes. Because your data lives in those vbuckets, it ends up evenly spread again without the key-to-vbucket mapping ever changing.
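A rough sketch of what a rebalance means for the vbucket map when a sixth node joins a five-node cluster. This is not the actual rebalance algorithm; the rebalance function below is a hypothetical, simplified version that only evens out vbucket counts while moving as few vbuckets as it can.

```python
from collections import Counter


def rebalance(old_map, new_nodes):
    """Return a new vbucket map spread evenly over new_nodes (simplified sketch)."""
    n = len(old_map)
    base, extra = divmod(n, len(new_nodes))
    # Target vbucket count per node: the first `extra` nodes get one more.
    quota = {node: base + (1 if i < extra else 0)
             for i, node in enumerate(new_nodes)}
    new_map = [None] * n
    held = Counter()
    # Pass 1: leave a vbucket in place if its node remains and has quota left.
    for vb, node in enumerate(old_map):
        if node in quota and held[node] < quota[node]:
            new_map[vb] = node
            held[node] += 1
    # Pass 2: hand the remaining vbuckets to nodes still under quota.
    spare = [node for node in new_nodes
             for _ in range(quota[node] - held[node])]
    for vb in range(n):
        if new_map[vb] is None:
            new_map[vb] = spare.pop()
    return new_map


old_nodes = [f"node{i}" for i in range(5)]
old_map = [old_nodes[vb % 5] for vb in range(1024)]  # round-robin start

new_map = rebalance(old_map, old_nodes + ["node5"])
moved = sum(1 for a, b in zip(old_map, new_map) if a != b)
print(Counter(new_map), "vbuckets moved:", moved)
```

Only the vbucket-to-node assignment changes here. A key still hashes to the same vbucket as before, so clients just need the updated map to find it.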