Cluster still imbalanced after restore and rebalances


#1

I ran cbbackup against a single-bucket, 4-node, 2.2.0 Community Edition (build-837) cluster that was fairly imbalanced (718k, 565k, 383k, and 383k items per node), created a new 3-node cluster (same build), and ran cbrestore. The new cluster also came up imbalanced (1035k, 515k, and 516k items per node). To distribute the keys evenly, we added a node and ran a rebalance, but a significant imbalance remained, similar to the distribution in the old cluster. We removed the new node, rebalanced again, and the item distribution went back to what it was. We actually did this twice just to see if it would make a difference. All rebalances completed successfully. Note that the replica distributions are similarly skewed: one node has about twice as many items as the others.
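For reference, a quick way to reproduce per-node active item counts like the ones above, outside of the web console, is to query the bucket details over the REST API. The sketch below is only illustrative and assumes the bucket JSON exposes per-node interestingStats with a curr_items field; the hostname, bucket name, and credentials are placeholders.

```python
# Illustrative sketch: print active item counts per node for one bucket using
# the Couchbase REST API. Assumes the bucket JSON exposes "nodes" with
# "interestingStats"/"curr_items" (field names may differ between versions).
import requests

CLUSTER = "http://cb-node1:8091"      # any node in the cluster (placeholder host)
BUCKET = "default"                    # placeholder bucket name
AUTH = ("Administrator", "password")  # placeholder admin credentials

info = requests.get(f"{CLUSTER}/pools/default/buckets/{BUCKET}", auth=AUTH).json()
for node in info.get("nodes", []):
    stats = node.get("interestingStats", {})
    print(node.get("hostname"), "active items:", stats.get("curr_items"))
```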

Any ideas on how to correct the skew of data across the servers? Any idea why it would skew in the first place? Is there a risk of data loss if one node goes down?


#2

Can you check the number of active and replica vbuckets on each cluster node for your bucket? (Under ‘Data Buckets’, click the name of the bucket, scroll to the ‘VBucket Resources’ section, and finally click on the triangle and “show by server”.)
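If it's easier to check from a script than from the UI, the same information is available in the bucket's vBucketServerMap over the REST API. Here is a rough, untested sketch (hostname, bucket name and credentials are placeholders); each entry of vBucketMap is a chain of server indexes, active first, then replicas:

```python
# Illustrative sketch: count active and replica vbuckets per server using the
# vBucketServerMap from the bucket's REST API JSON.
import collections

import requests

CLUSTER = "http://cb-node1:8091"      # placeholder host
BUCKET = "default"                    # placeholder bucket name
AUTH = ("Administrator", "password")  # placeholder admin credentials

info = requests.get(f"{CLUSTER}/pools/default/buckets/{BUCKET}", auth=AUTH).json()
vbmap = info["vBucketServerMap"]
servers = vbmap["serverList"]

active, replica = collections.Counter(), collections.Counter()
for chain in vbmap["vBucketMap"]:     # chain = [active_index, replica_index, ...]
    active[servers[chain[0]]] += 1
    for idx in chain[1:]:
        if idx >= 0:                  # -1 means no replica assigned for that slot
            replica[servers[idx]] += 1

for s in servers:
    print(s, "active:", active[s], "replica:", replica[s])
```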

You should have an equal (or within one) number of vbuckets on each node, both for active vbuckets and replica vbuckets. If you do, then your item count skew must be due to some oddity with your keyspace hitting some pathological issue in CRC32 (which is how items are assigned to vbuckets).
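For reference, the key-to-vbucket mapping is just a CRC32 of the key, folded down and taken modulo the number of vbuckets, so you can check your own keyspace offline. The sketch below mirrors the mapping as commonly implemented in libvbucket-based clients; treat it as an approximation and verify against your actual SDK. The sample keys are placeholders for your real key pattern.

```python
# Illustrative sketch: the CRC32-based key -> vbucket mapping (libvbucket style),
# useful for checking whether a keyspace clumps into a subset of the 1024 vbuckets.
import collections
import zlib

NUM_VBUCKETS = 1024

def key_to_vbucket(key: bytes, num_vbuckets: int = NUM_VBUCKETS) -> int:
    # Standard CRC-32 of the key, keep 15 bits starting at bit 16, then modulo.
    return ((zlib.crc32(key) >> 16) & 0x7FFF) % num_vbuckets

# Tally a sample of keys per vbucket and look for hot spots.
sample_keys = [f"user::{i}".encode() for i in range(100000)]  # replace with real keys
per_vb = collections.Counter(key_to_vbucket(k) for k in sample_keys)
print("vbuckets hit:", len(per_vb),
      "min/max keys per vbucket:", min(per_vb.values()), max(per_vb.values()))
```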

If the vbucket counts aren't equal, then you should try another rebalance.

Note that the actual number of replica vbuckets/items per node shouldn't affect failover: as long as each active vbucket is replicated somewhere, you won't lose data. That said, the skew can obviously mean that some nodes will be more heavily loaded (and hence performance may vary between them).
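If you want to double-check the "each active vbucket is replicated somewhere" condition, you can scan the same vBucketMap; a replica slot of -1 means no replica is currently assigned. Another rough sketch with placeholder host, bucket and credentials:

```python
# Illustrative sketch: list any vbuckets whose chain has no replica assigned.
import requests

info = requests.get("http://cb-node1:8091/pools/default/buckets/default",
                    auth=("Administrator", "password")).json()
unreplicated = [vb for vb, chain in enumerate(info["vBucketServerMap"]["vBucketMap"])
                if all(idx < 0 for idx in chain[1:])]
print("vbuckets without a replica:", unreplicated or "none")
```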


#3

1024 total vbuckets, with 342, 341, and 341 vbuckets on each node, and the same numbers for the replicas as well. So yes, it looks like an oddity with the hashing. I guess there's nothing we can do about it, though it seems like a deeper issue that needs to be addressed by Couchbase.
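For completeness, one way to confirm that the keyspace really is the cause is to push a sample of real keys through the CRC32 mapping above and aggregate by node using the cluster's vBucketMap; if the sample reproduces the roughly 2:1 skew, the key pattern is what's driving it. A rough, untested sketch (placeholders for host, bucket, credentials and keys):

```python
# Illustrative sketch: estimate per-node active item counts from a sample of keys
# by combining the CRC32 key -> vbucket mapping with the cluster's vBucketMap.
import collections
import zlib

import requests

CLUSTER = "http://cb-node1:8091"      # placeholder host
BUCKET = "default"                    # placeholder bucket name
AUTH = ("Administrator", "password")  # placeholder admin credentials

info = requests.get(f"{CLUSTER}/pools/default/buckets/{BUCKET}", auth=AUTH).json()
vbmap = info["vBucketServerMap"]
servers, chains = vbmap["serverList"], vbmap["vBucketMap"]

def key_to_vbucket(key: bytes) -> int:
    return ((zlib.crc32(key) >> 16) & 0x7FFF) % len(chains)

sample_keys = [f"user::{i}".encode() for i in range(100000)]  # replace with real keys
per_node = collections.Counter(servers[chains[key_to_vbucket(k)][0]] for k in sample_keys)
for host, count in per_node.most_common():
    print(host, count)
```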