We have been having a problem with one particular bucket getting stuck during rebalancing. We recently used the cbtransfer utility to copy as much data as we could from this bucket to a new bucket, then deleted the old bucket. The transfer got stuck at 99.5%, at which point we cancelled it. We then recreated the bucket and transferred all of the data back in using cbtransfer again. The bucket seemed OK afterwards, since cbbackup, which had also been hanging previously, now worked.
Today we removed a node to change the data directory and rebalanced. This rebalance completed normally. We made the configuration change and added the server back into the cluster and rebalanced. This time the rebalance hung almost immediately on this same bucket. There are 2 other buckets in the cluster that do not seem to be affected. Rebalancing moved no data from these other buckets to the node that was added.
We are running Couchbase Server Community Edition 3.0.1 in a 5-node cluster.
Here are the relevant log entries I could find on the server that was added back and was being rebalanced in:
memcached.log.1.txt
Tue Oct 6 15:20:29.801146 EDT 3: (PFM) Notified the timeout on checkpoint persistence for vbucket 285, id 13, cookie 0x66a3b00
…
debug.log
[rebalance:debug,2015-10-06T15:20:29.801,ns_1@192.168.20.117:<0.7112.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[ns_server:info,2015-10-06T15:20:38.990,ns_1@192.168.20.117:ns_config_rep<0.558.0>:ns_config_rep:do_pull:343]Pulling config from: 'ns_1@192.168.20.102'
[rebalance:debug,2015-10-06T15:20:41.829,ns_1@192.168.20.117:<0.5227.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[ns_server:debug,2015-10-06T15:20:45.696,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_scheduler_message:1288]Starting compaction for the following buckets:
[<<"PFM">>]
[ns_server:info,2015-10-06T15:20:45.697,ns_1@192.168.20.117:<0.8720.0>:compaction_new_daemon:try_to_cleanup_indexes:564]Cleaning up indexes for bucket PFM
[ns_server:info,2015-10-06T15:20:45.697,ns_1@192.168.20.117:<0.8720.0>:compaction_new_daemon:spawn_scheduled_views_compactor:494]Start compaction of indexes for bucket PFM with config:
[{database_fragmentation_threshold,{30,undefined}},
{view_fragmentation_threshold,{30,undefined}}]
[ns_server:debug,2015-10-06T15:20:45.698,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_compactors_exit:1329]Finished compaction iteration.
[ns_server:debug,2015-10-06T15:20:45.698,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_scheduler:schedule_next:60]Finished compaction too soon. Next run will be in 30s
[ns_server:debug,2015-10-06T15:20:45.739,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_scheduler_message:1288]Starting compaction for the following buckets:
[<<"PFM">>]
[ns_server:info,2015-10-06T15:20:45.740,ns_1@192.168.20.117:<0.8721.0>:compaction_new_daemon:spawn_scheduled_kv_compactor:468]Start compaction of vbuckets for bucket PFM with config:
[{database_fragmentation_threshold,{30,undefined}},
{view_fragmentation_threshold,{30,undefined}}]
[ns_server:debug,2015-10-06T15:20:45.744,ns_1@192.168.20.117:<0.8724.0>:compaction_new_daemon:bucket_needs_compaction:953]PFM
data size is 80003777, disk size is 90486274
[ns_server:debug,2015-10-06T15:20:45.744,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_compactors_exit:1329]Finished compaction iteration.
[ns_server:debug,2015-10-06T15:20:45.744,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_scheduler:schedule_next:60]Finished compaction too soon. Next run will be in 30s
[rebalance:debug,2015-10-06T15:20:55.228,ns_1@192.168.20.117:<0.8109.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[rebalance:debug,2015-10-06T15:21:00.811,ns_1@192.168.20.117:<0.7112.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[rebalance:debug,2015-10-06T15:21:12.830,ns_1@192.168.20.117:<0.5227.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
…
babysitter.log
[ns_server:info,2015-10-06T15:20:27.236,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Tue Oct 6 15:20:27.035630 EDT 3: (PFM) DCP (Consumer) eq_dcpq:replication:ns_1@192.168.20.102->ns_1@192.168.20.117:PFM - (vb 0) Attempting to add stream with start seqno 27, end seqno 18446744073709551615, vbucket uuid 230505328703891, snap start seqno 27, and snap end seqno 27
[ns_server:info,2015-10-06T15:20:30.002,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Tue Oct 6 15:20:29.801146 EDT 3: (PFM) Notified the timeout on checkpoint persistence for vbucket 285, id 13, cookie 0x66a3b00
[ns_server:info,2015-10-06T15:20:42.029,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Tue Oct 6 15:20:41.828789 EDT 3: (PFM) Notified the timeout on checkpoint persistence for vbucket 341, id 12, cookie 0x66a4100
…
On server 192.168.20.102, memcached.log.5.txt (last entry for the PFM bucket)
Tue Oct 6 15:20:26.916683 EDT 3: (PFM) DCP (Producer) eq_dcpq:replication:ns_1@192.168.20.102->ns_1@192.168.20.117:PFM - (vb 0) stream created with start seqno 27 and end seqno 18446744073709551615
…
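In case it helps anyone tally which vbuckets are affected, here is a minimal sketch for pulling the bucket and vbucket id out of the "Notified the timeout on checkpoint persistence" lines shown above. The regular expression is based only on the line format in these excerpts, and the function name `timed_out_vbuckets` is just a hypothetical helper, not part of any Couchbase tool.

```python
import re
from collections import Counter

# Pattern derived from the memcached log lines quoted above, e.g.
# "(PFM) Notified the timeout on checkpoint persistence for vbucket 285, id 13, cookie 0x66a3b00"
# Assumption: other Couchbase versions may word this message differently.
TIMEOUT_RE = re.compile(
    r"\((?P<bucket>[^)]+)\) Notified the timeout on checkpoint persistence "
    r"for vbucket (?P<vb>\d+), id (?P<ckpt>\d+)"
)


def timed_out_vbuckets(lines):
    """Count checkpoint-persistence timeouts per (bucket, vbucket id)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("bucket"), int(match.group("vb")))] += 1
    return counts


# The two timeout lines from the excerpts above.
sample = [
    "Tue Oct 6 15:20:29.801146 EDT 3: (PFM) Notified the timeout on "
    "checkpoint persistence for vbucket 285, id 13, cookie 0x66a3b00",
    "Tue Oct 6 15:20:41.828789 EDT 3: (PFM) Notified the timeout on "
    "checkpoint persistence for vbucket 341, id 12, cookie 0x66a4100",
]

for (bucket, vb), n in sorted(timed_out_vbuckets(sample).items()):
    print(bucket, vb, n)
```

To run it against a real file, pass an open file object instead of `sample`, for example `timed_out_vbuckets(open("memcached.log.1.txt"))`.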
Thanks for any help that you can provide.
Anthony