Rebalancing Hung


#1

We have been having a problem with one articular bucket getting stuck during rebalancing. We recently used the cbtransfer utility to copy as much data as we could from this bucket to a new bucket and deleted the old bucket. The transfer would get stuck at 99.5% when we cancelled it. Then we recreated the bucket and transferred all of the data back in using the transfer utility again. The bucket seemed ok, since we could use cbbackup, which was also hanging previously.

Today we removed a node to change the data directory and rebalanced. This rebalance completed normally. We made the configuration change and added the server back into the cluster and rebalanced. This time the rebalance hung almost immediately on this same bucket. There are 2 other buckets in the cluster that do not seem to be affected. Rebalancing moved no data from these other buckets to the node that was added.

We are using the Community Edition 3.0.1 in a 5 node cluster.
Here are some of the relevant log file entries that I could find on the server that was added and needed to be included in the rebalancing:
memcached.log.1.txt
Tue Oct 6 15:20:29.801146 EDT 3: (PFM) Notified the timeout on checkpoint persistence for vbucket 285, id 13, cookie 0x66a3b00

debug.log
[rebalance:debug,2015-10-06T15:20:29.801,ns_1@192.168.20.117:<0.7112.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[ns_server:info,2015-10-06T15:20:38.990,ns_1@192.168.20.117:ns_config_rep<0.558.0>:ns_config_rep:do_pull:343]Pulling config from: ‘ns_1@192.168.20.102’

[rebalance:debug,2015-10-06T15:20:41.829,ns_1@192.168.20.117:<0.5227.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[ns_server:debug,2015-10-06T15:20:45.696,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_scheduler_message:1288]Starting compaction for the following buckets:
[<<“PFM”>>]
[ns_server:info,2015-10-06T15:20:45.697,ns_1@192.168.20.117:<0.8720.0>:compaction_new_daemon:try_to_cleanup_indexes:564]Cleaning up indexes for bucket PFM
[ns_server:info,2015-10-06T15:20:45.697,ns_1@192.168.20.117:<0.8720.0>:compaction_new_daemon:spawn_scheduled_views_compactor:494]Start compaction of indexes for bucket PFM with config:
[{database_fragmentation_threshold,{30,undefined}},
{view_fragmentation_threshold,{30,undefined}}]
[ns_server:debug,2015-10-06T15:20:45.698,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_compactors_exit:1329]Finished compaction iteration.
[ns_server:debug,2015-10-06T15:20:45.698,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_scheduler:schedule_next:60]Finished compaction too soon. Next run will be in 30s
[ns_server:debug,2015-10-06T15:20:45.739,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_scheduler_message:1288]Starting compaction for the following buckets:
[<<“PFM”>>]
[ns_server:info,2015-10-06T15:20:45.740,ns_1@192.168.20.117:<0.8721.0>:compaction_new_daemon:spawn_scheduled_kv_compactor:468]Start compaction of vbuckets for bucket PFM with config:
[{database_fragmentation_threshold,{30,undefined}},
{view_fragmentation_threshold,{30,undefined}}]
[ns_server:debug,2015-10-06T15:20:45.744,ns_1@192.168.20.117:<0.8724.0>:compaction_new_daemon:bucket_needs_compaction:953]PFM data size is 80003777, disk size is 90486274
[ns_server:debug,2015-10-06T15:20:45.744,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_new_daemon:process_compactors_exit:1329]Finished compaction iteration.
[ns_server:debug,2015-10-06T15:20:45.744,ns_1@192.168.20.117:compaction_new_daemon<0.689.0>:compaction_scheduler:schedule_next:60]Finished compaction too soon. Next run will be in 30s
[rebalance:debug,2015-10-06T15:20:55.228,ns_1@192.168.20.117:<0.8109.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[rebalance:debug,2015-10-06T15:21:00.811,ns_1@192.168.20.117:<0.7112.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again
[rebalance:debug,2015-10-06T15:21:12.830,ns_1@192.168.20.117:<0.5227.0>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again

babysitter.log
[ns_server:info,2015-10-06T15:20:27.236,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Tue Oct 6 15:20
:27.035630 EDT 3: (PFM) DCP (Consumer) eq_dcpq:replication:ns_1@192.168.20.102->ns_1@192.168.20.117:PFM - (vb 0) Attempting to add stre
am with start seqno 27, end seqno 18446744073709551615, vbucket uuid 230505328703891, snap start seqno 27, and snap end seqno 27

[ns_server:info,2015-10-06T15:20:30.002,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Tue Oct 6 15:20:29.801146 EDT 3: (PFM) Notified the timeout on checkpoint persistence for vbucket 285, id 13, cookie 0x66a3b00

[ns_server:info,2015-10-06T15:20:42.029,babysitter_of_ns_1@127.0.0.1:<0.78.0>:ns_port_server:log:169]memcached<0.78.0>: Tue Oct 6 15:20:41.828789 EDT 3: (PFM) Notified the timeout on checkpoint persistence for vbucket 341, id 12, cookie 0x66a4100

on server 192.168.20.102, memcached.log.5.txt (Last entry for PFM bucket)
Tue Oct 6 15:20:26.916683 EDT 3: (PFM) DCP (Producer) eq_dcpq:replication:ns_1@192.168.20.102->ns_1@192.168.20.117:PFM - (vb 0) stream
created with start seqno 27 and end seqno 18446744073709551615

Thanks for any help that you can provide.
Anthony


#2

me too are having the same problem with couchbase 3.0.1:

babysitter.log:

[rebalance:debug,2015-11-29T8:35:25.940,ns_1@athpol16.polopoly.local:<0.19047.58>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again [rebalance:debug,2015-11-29T8:35:27.367,ns_1@athpol16.polopoly.local:<0.19073.58>:janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no persistence. Will try again

memcached.log

Sun Nov 29 08:33:23.365053 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 172, id 26156, cookie 0x65f7000 Sun Nov 29 08:33:33.435291 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 179, id 18922, cookie 0x3bf1000 Sun Nov 29 08:33:52.936887 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 186, id 41637, cookie 0x69ad900 Sun Nov 29 08:33:54.365511 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 172, id 26156, cookie 0x65f7000 Sun Nov 29 08:34:04.436043 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 179, id 18922, cookie 0x3bf1000 Sun Nov 29 08:34:23.937663 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 186, id 41637, cookie 0x69ad900 Sun Nov 29 08:34:25.366346 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 172, id 26156, cookie 0x65f7000 Sun Nov 29 08:34:35.437210 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 179, id 18922, cookie 0x3bf1000 Sun Nov 29 08:34:54.939542 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 186, id 41637, cookie 0x69ad900 Sun Nov 29 08:34:56.366859 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 172, id 26156, cookie 0x65f7000 Sun Nov 29 08:35:06.437682 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 179, id 18922, cookie 0x3bf1000 Sun Nov 29 08:35:25.940374 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 186, id 41637, cookie 0x69ad900 Sun Nov 29 08:35:27.367409 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 172, id 26156, cookie 0x65f7000 Sun Nov 29 08:35:37.438311 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 179, id 18922, cookie 0x3bf1000 Sun Nov 29 08:35:56.941156 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 186, id 41637, cookie 0x69ad900 Sun Nov 29 08:35:58.368150 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 172, id 26156, cookie 0x65f7000 Sun Nov 29 08:36:08.439453 CET 3: (cmbucket) Notified the timeout on checkpoint persistence for vbucket 179, id 18922, cookie 0x3bf1000


#3

last activity for vbucket 179 on bucket cmbucket:

node athpol13

``
Thu Nov 26 01:30:04.626750 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486
Fri Nov 27 01:30:04.216547 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486
Sat Nov 28 01:30:02.675927 CET 3: (cmbucket) DCP (Producer) eq_dcpq:cbbackup-PvNrEUFlTiSfBcMJ - (vb 93) Backfill complete, 179 items read from disk, last seqno read: 40941
Sat Nov 28 01:30:02.931153 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486

Sun Nov 29 08:02:30.831715 CET 3: (cmbucket) Deletion of vbucket 179 was completed.
Sun Nov 29 08:04:39.463333 CET 3: (cmbucket) Deletion of vbucket 179 was completed.
Sun Nov 29 08:09:02.901863 CET 3: (cmbucket) Deletion of vbucket 179 was completed.
Sun Nov 29 08:17:42.943254 CET 3: (cmbucket) Deletion of vbucket 179 was completed.
Sun Nov 29 08:18:22.899269 CET 3: (cmbucket) Deletion of vbucket 179 was completed.
Sun Nov 29 08:21:32.920246 CET 3: (cmbucket) Deletion of vbucket 179 was completed.
``

node athpol14
memcached.log.7.txt:Sun Nov 29 08:02:28.504468 CET 3: (cmbucket) DCP (Producer) eq_dcpq:replication:ns_1@athpol14.polopoly.local->ns_1@athpol16.polopoly.local:cmbucket - (vb 980) Stream closing, 0 items sent from disk, 179 items sent from memory, 67520 was last seqno sent

node athpol15:

memcached.log.5.txt:Sun Nov 29 07:58:43.438005 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486 memcached.log.5.txt:Sun Nov 29 08:02:30.968392 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486 memcached.log.5.txt:Sun Nov 29 08:04:39.540253 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486 memcached.log.5.txt:Sun Nov 29 08:09:14.961334 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486 memcached.log.5.txt:Sun Nov 29 08:11:53.880114 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486 memcached.log.5.txt:Sun Nov 29 08:17:58.967556 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486 memcached.log.5.txt:Sun Nov 29 08:21:26.507485 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486 memcached.log.5.txt:Sun Nov 29 08:21:37.915948 CET 3: (cmbucket) Backfill task (1 to 41486) finished for vb 179 disk seqno 41486 memory seqno 41486

node athpol16

memcached.log.7.txt:Sun Nov 29 08:11:32.928869 CET 3: (cmbucket) Deletion of vbucket 179 was completed.