Rebalance/Architecture

This is an admittedly general question, but I'm looking for some advice. We have a three-node Couchbase 2.1 Community cluster with big-memory nodes and big buckets. Each node has 192GB of memory, and our largest bucket holds ~130 million keys and takes up about 100GB. We have never successfully failed over a node or rebalanced the cluster with the big bucket. We recently rebooted a node, and now that bucket can't warm up. It tries for 10 minutes or so, the status shows it loading keys, and then it throws a bunch of these:

Control connection to memcached on 'ns_1@cclnxcouch1.pfizer.com' disconnected: {{badmatch,{error,timeout}},
 [{mc_client_binary,stats_recv,4},
  {mc_client_binary,stats,4},
  {ns_memcached,has_started,1},
  {ns_memcached,handle_info,2},
  {gen_server,handle_msg,5},
  {ns_memcached,init,1},
  {gen_server,init_it,6},
  {proc_lib,init_p_do_apply,3}]}

and starts the warmup process over again. So the node appears to be stuck in pending status forever. Aside from this specific problem, we've found that for the large buckets we have a system that works very well UNTIL anything untoward happens. Is our mistake the big-memory nodes? The nodes are connected over a local gigabit interconnect; is that insufficient to support the cluster? Couchbase is running on RAIDed SSDs, and we seem to do pretty well I/O-wise. Any suggestions on how to make rebalance functional?

Thanks!

You are likely having an access log issue.
You can check by looking at the warmup stats with cbstats, like this:

/opt/couchbase/bin/cbstats localhost:11210 warmup -b bucketname -u Administrator -p bucket_password

When you run the command you will see the estimated time to load keys, how many keys have been loaded, and more. Run the command every few minutes for about 15 minutes. If the numbers do not change, or if they go up and down, then CB is having a problem with the access logs.
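If you don't want to re-run that by hand, a small loop can take the samples for you. This is only a rough sketch: the function name `poll_warmup` and the `CBSTATS`, `BUCKET`, `PASSWORD`, `INTERVAL` and `ROUNDS` variables are made up for illustration, and the defaults (5 samples, 3 minutes apart) are just one way to cover roughly 15 minutes. The cbstats invocation itself is the same one shown above.

```shell
#!/bin/sh
# Sketch: sample the warmup stats several times so the runs can be compared.
# All variable names and defaults below are illustrative assumptions.
CBSTATS="${CBSTATS:-/opt/couchbase/bin/cbstats}"
BUCKET="${BUCKET:-bucketname}"
PASSWORD="${PASSWORD:-bucket_password}"
INTERVAL="${INTERVAL:-180}"   # seconds between samples
ROUNDS="${ROUNDS:-5}"         # number of samples to take

poll_warmup() {
    i=1
    while [ "$i" -le "$ROUNDS" ]; do
        echo "--- sample $i at $(date)"
        # Same command as above; prints the warmup stats for the bucket.
        "$CBSTATS" localhost:11210 warmup -b "$BUCKET" -u Administrator -p "$PASSWORD"
        # Wait between samples, but not after the last one.
        if [ "$i" -lt "$ROUNDS" ]; then
            sleep "$INTERVAL"
        fi
        i=$((i + 1))
    done
}
```

Call `poll_warmup` on the node that is stuck, redirect the output to a file if you like, and compare the key counts between samples.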

Solution:
Shut down the whole cluster. Go to the CB data folder, /data/name_of_bucket … there you should see access.log and access.log.old. Delete these files on all the nodes.
Restart the cluster and it should come back up.
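If deleting the files outright makes you nervous, moving them out of the bucket folder should have the same effect and leaves you a copy. A minimal sketch, to be run on each node only while Couchbase is stopped there: `stash_access_logs` is a made-up helper name, and both arguments (the bucket's data folder and a backup folder) are placeholders you have to fill in for your own install.

```shell
#!/bin/sh
# Sketch: move a bucket's access logs aside instead of deleting them.
# Moving rather than deleting is an assumption made here for safety.
# Run only while Couchbase is shut down on this node.
stash_access_logs() {
    bucket_dir="$1"   # the bucket's data folder on this node (placeholder)
    backup_dir="$2"   # any folder outside the bucket's data folder (placeholder)
    mkdir -p "$backup_dir" || return 1
    for f in "$bucket_dir/access.log" "$bucket_dir/access.log.old"; do
        # Skip quietly if a file is not present on this node.
        if [ -f "$f" ]; then
            mv "$f" "$backup_dir/" && echo "moved $f"
        fi
    done
}
```

For example, `stash_access_logs <bucket data folder> /tmp/access_log_backup`, then restart the cluster as described above. Other files in the bucket folder are left untouched.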