GSI: different sporadic bugs + fragmentation up to 98%

[TL;DR] this thread shows (with graphs included):

  1. Fragmentation of GSI and “very specific” (+“not-likely-configurable”) conditions of defragmentation to begin (index disk size needs to be at least 1G for “compaction with defragmentation” to happen, fragmenation level with “normal defrag run” stays about 50-60%)
  2. There is a point of time (or, more likely, index-size-related conditions) after that index fragmentation grows up to 80-90% and defragmentation process unable to bring it back to 50-60%.
  3. There is a sporadic bug when GSI just stops to grow, ignoring all new upserting docs
  4. There is a sporadic bug that makes index size to “drop” to zero and then return back (or maybe stats just lie)
  5. And there is a possible problem with handling expired docs with GSI ( https://issues.couchbase.com/browse/MB-19781)

Hello,
4.1.X, see interesting graphs for index:


Is 1:50 normal proportion between index size on disk and index data size ?

@siri, @prataprc
more unexplainable GSI-related graphs:

To disappoint the ones, who think, “ah, that is just because of lack of data due to overload” , here is “no data due to overload” graph:

Keep in mind, that last three are not related to https://issues.couchbase.com/browse/MB-19714 (not patched version)

UPDATE: also looks interesting (in fact “select * from … where … is valued” says there are no docs at all):

UPDATE2: last repro:


Nothing helps to free disk space, except restarting nodes with indexes (for example: one-by-node):


The questions are “what is this” and “how to fix this” ?

@siri, @prataprc
1:50 = 1:(1/(1-fragmentation rate)) => 1:(1/(1-0.98)) = 1:(1/(0.02)) = 1:50
I was able to reproduce it on 4.1.1-GA (with result of 75.7% fragmentation for 2M docs)
The problem is described like: "GSI: incorrect handling of expired documents causes index fragmentation and enormous grow of used disk space"
See https://issues.couchbase.com/browse/MB-19781

Looks like @siri and @prataprc are both offline, @deepkaran_salooja maybe you could take a look ?

Hi @egrep,
are you seeing compaction kick in at all in indexer log?
thanks

@cihangirb,
wait a minute, i’ve got MUCH more interesting thing :wink: MB-XXXXX is on it’s way.

@cihangirb,
Now, we got 2 bugs (and i love to give them names):
https://issues.couchbase.com/browse/MB-19781 (“Expired death”; fragmentation caused by exp.docs)
https://issues.couchbase.com/browse/MB-19792 (“Bumpy ride”, see first 45-seconds fragmentation graph)

You’ve asked me about compaction log, but i had forced compaction many times and nothing happened.
I can look for something withing log now (nearest 30 mins, till last test configuration is alive).
But, as i think, most valuable is a reproduction schema, that is inside each MB-*. To blindly trust my observations is a possible waste of time, just reproduce both yourself :wink:
@siri, maybe you could comment those “fragmentation oddities” with GSI ?
@deepkaran_salooja, @prataprc ?

@egrep - compaction runs independently for data, index and views. If you have a topology where indexes having the issue is on a separate node and drive, you may be suffering from a compaction error that isn’t bubbling up on the indexes to get to this state. grep -i for “compact” and see if there is a compaction completion msg in indexer.log. I can look up the exact text if you are having issues.
thanks
-cihan

What log should i exaclty “grep” ?

/opt/couchbase/var/lib/couchbase/logs/indexer.log:2016-05-31T15:31:08.887+00:00 [Info] CompactionDaemon: Checking fragmentation, IndexInstId: 3680848504215299622 Data:3351772, Disk:28356608, ExtraSnapshotData:0, Fragmentation:88.17992617452694%
/opt/couchbase/var/lib/couchbase/logs/indexer.log:2016-05-31T15:31:38.888+00:00 [Info] CompactionDaemon: Checking fragmentation, IndexInstId: 3680848504215299622 Data:3351772, Disk:28356608, ExtraSnapshotData:0, Fragmentation:88.17992617452694%
/opt/couchbase/var/lib/couchbase/logs/indexer.log:2016-05-31T15:32:05.941+00:00 [Info] PeriodicStats = {“default:iQX:avg_scan_latency”:0,“default:iQX:avg_scan_wait_latency”:0,“default:iQX:avg_ts_interval”:200714057,“default:iQX:avg_ts_items_count”:21,“default:iQX:build_progress”:100,“default:iQX:data_size”:3351772,“default:iQX:delete_bytes”:0,“default:iQX:disk_size”:28356608,“default:iQX:flush_queue_size”:0,“default:iQX:frag_percent”:88,“default:iQX:get_bytes”:0,“default:iQX:insert_bytes”:10936,“default:iQX:items_count”:9897,“default:iQX:num_commits”:0,“default:iQX:num_compactions”:0,“default:iQX:num_completed_requests”:0,“default:iQX:num_docs_indexed”:20000,“default:iQX:num_docs_pending”:0,“default:iQX:num_docs_queued”:0,“default:iQX:num_flush_queued”:205,“default:iQX:num_last_snapshot_reply”:0,“default:iQX:num_requests”:0,“default:iQX:num_rows_returned”:0,“default:iQX:num_snapshot_waiters”:0,“default:iQX:num_snapshots”:9,“default:iQX:scan_bytes_read”:0,“default:iQX:scan_wait_duration”:0,“default:iQX:since_last_snapshot”:199612495,“default:iQX:timings/storage_clone_handle”:“0 0 0”,“default:iQX:timings/storage_commit”:“0 0 0”,“default:iQX:timings/storage_del”:“0 0 0”,“default:iQX:timings/storage_get”:“205 3273748 380521959320”,“default:iQX:timings/storage_handle_open”:“1 21790 474804100”,“default:iQX:timings/storage_info”:“19 163719 2898968793”,“default:iQX:timings/storage_iterator_next”:“0 0 0”,“default:iQX:timings/storage_meta_get”:“2 18225118 304963307518090”,“default:iQX:timings/storage_meta_set”:“0 0 0”,“default:iQX:timings/storage_new_iterator”:“0 0 0”,“default:iQX:timings/storage_set”:“206 1992214 28906881938”,“default:iQX:timings/storage_snapshot_close”:“8 304351 15061596145”,“default:iQX:timings/storage_snapshot_create”:“9 1757314 383319197322”,“default:iQX:total_scan_duration”:0,“default:mutation_queue_size”:0,“default:num_mutations_queued”:205,“default:num_nonalign_ts”:0,“default:ts_queue_size”:0,“memory_quota”:268435456,“memory_used”:25554944,“needs_restart”:false,“num_connections”:0,“timings/stats_response”:“6950 2041447151 3277796782181851”}
/opt/couchbase/var/lib/couchbase/logs/indexer.log:2016-05-31T15:32:08.888+00:00 [Info] CompactionDaemon: Checking fragmentation, IndexInstId: 3680848504215299622 Data:3351772, Disk:28356608, ExtraSnapshotData:0, Fragmentation:88.17992617452694%

And more:

/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_exp_mem_threshold 85
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_write_queue_cap 10000
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_mlog_compactor_runs 0
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_pending_compactions 0
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_exp_mem_threshold 85
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_write_queue_cap 10000
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_mlog_compactor_runs 0
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_pending_compactions 0
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_exp_mem_threshold 85
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_write_queue_cap 10000
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_mlog_compactor_runs 0
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_pending_compactions 0
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_exp_mem_threshold 85
/opt/couchbase/var/lib/couchbase/logs/stats.log:ep_compaction_write_queue_cap 10000

And more:

/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:08:48.546+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.check_period will updated to 30
/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:08:48.546+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.interval will updated to 00:00,00:00
/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:08:48.546+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.min_size will updated to 524288000
/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:08:48.546+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.min_frag will updated to 30
/opt/couchbase/var/lib/couchbase/logs/projector.log:{“indexer.settings.compaction.interval”:“00:00,00:00”,“indexer.settings.log_level”:“info”,“indexer.settings.persisted_snapshot.interval”:5000,“indexer.settings.compaction.min_frag”:30,“indexer.settings.inmemory_snapshot.interval”:200,“indexer.settings.max_cpu_percent”:400,“indexer.settings.recovery.max_rollbacks”:5,“indexer.settings.memory_quota”:268435456}
/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:20:05.388+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.interval will updated to 00:00,00:00
/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:20:05.388+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.min_size will updated to 524288000
/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:20:05.388+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.check_period will updated to 30
/opt/couchbase/var/lib/couchbase/logs/projector.log:2016-05-31T14:20:05.389+00:00 [Info] PROJ[:9999] settings indexer.settings.compaction.min_frag will updated to 30

And more:

/opt/couchbase/var/lib/couchbase/logs/reports.log: {name,compaction_new_daemon},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {mfargs,{compaction_new_daemon,start_link,}},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {name,compaction_daemon},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {mfargs,{compaction_daemon,start_link,}},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {name,compaction_new_daemon},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {mfargs,{compaction_new_daemon,start_link,}},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {name,compaction_daemon},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {mfargs,{compaction_daemon,start_link,}},
/opt/couchbase/var/lib/couchbase/logs/reports.log: {name,compaction_new_daemon},

And more:

Start compaction of vbuckets for bucket default with config:
/opt/couchbase/var/lib/couchbase/logs/info.log:[ns_server:info,2016-05-31T14:58:05.198Z,ns_1@81.node:<0.28436.0>:compaction_new_daemon:spawn_scheduled_views_compactor:493]Start compaction of indexes for bucket default with config:
/opt/couchbase/var/lib/couchbase/logs/info.log:[ns_server:info,2016-05-31T14:58:34.991Z,ns_1@81.node:<0.28793.0>:compaction_new_daemon:spawn_scheduled_kv_compactor:467]Start compaction of vbuckets for bucket default with config:
/opt/couchbase/var/lib/couchbase/logs/info.log:[ns_server:info,2016-05-31T14:58:35.204Z,ns_1@81.node:<0.28797.0>:compaction_new_daemon:spawn_scheduled_views_compactor:493]Start compaction of indexes for bucket default with config:
/opt/couchbase/var/lib/couchbase/logs/info.log:[ns_server:info,2016-05-31T14:59:04.994Z,ns_1@81.node:<0.29166.0>:compaction_new_daemon:spawn_scheduled_kv_compactor:467]Start compaction of vbuckets for bucket default with config:

And more:

/opt/couchbase/var/lib/couchbase/logs/debug.log:[ns_server:debug,2016-05-31T15:17:04.148Z,ns_1@81.node:compaction_new_daemon<0.402.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_kv too soon. Next run will be in 30s
/opt/couchbase/var/lib/couchbase/logs/debug.log:[ns_server:debug,2016-05-31T15:17:05.330Z,ns_1@81.node:compaction_new_daemon<0.402.0>:compaction_new_daemon:process_scheduler_message:1251]Starting compaction (compact_views) for the following buckets:
/opt/couchbase/var/lib/couchbase/logs/debug.log:[ns_server:info,2016-05-31T15:17:05.332Z,ns_1@81.node:<0.9632.1>:compaction_new_daemon:spawn_scheduled_views_compactor:493]Start compaction of indexes for bucket default with config:
/opt/couchbase/var/lib/couchbase/logs/debug.log:[ns_server:debug,2016-05-31T15:17:05.332Z,ns_1@81.node:compaction_new_daemon<0.402.0>:compaction_new_daemon:process_compactors_exit:1292]Finished compaction iteration.
/opt/couchbase/var/lib/couchbase/logs/debug.log:[ns_server:debug,2016-05-31T15:17:05.332Z,ns_1@81.node:compaction_new_daemon<0.402.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_views too soon. Next run will be in 30s

What exactly should i look for ?

2 hours later after “reference repro” of “Bumpy ride” ( https://issues.couchbase.com/browse/MB-19792) compaction settings changed, compaction forced, but nothing happened:

@siri,
for my MB-questions:

/opt/couchbase/etc/couchdb/default.ini

[database_compaction]
; larger buffer sizes can originate smaller files
doc_buffer_size = 524288 ; value in bytes
checkpoint_after = 5242880 ; checkpoint after every N bytes were written

512K and 5M. 5 points * 5M = 25M << 512M.

? maybe, it’s not “in bytes”, but “int kilobytes” ?

[compaction_daemon]
; The delay, in seconds, between each check for which database and view indexes
; need to be compacted.
check_interval = 60
; If a database or view index file is smaller then this value (in bytes),
; compaction will not happen. Very small files always have a very high
; fragmentation therefore it’s not worth to compact them.
min_file_size = 131072

128K ?

Mmmm… is there another configuration file i need to change for compaction configuration? (or maybe there is different way of configuration ?).
Or, the perfect case: there is a guide for configuration. But where ?

UPDATE: so strange: :Couchbase SDKs

4.1.1-EE, see only:

Nothing about “Circular compaction” or “Setting fragmentation level”

hehe, yeah, you wish!
I was asking about possibility to do some simple configuration through a config file, didn’t get any response. There is /opt/couchbase/etc/couchbase/static_config but even that one is not described properly in the documentation.
But I beg my pardon for interrupting… :wink:

@siri, @cihangirb,
is a little wonder in form of “minimal config file explanation” possible for
me, @dmitryb (and, i also think, that @moon0326 with his 1:1168 [ 99% fragmentation on indexes ] index would not object)

Hi folks - busy day, apologies for the delayed response.
Unfortunately we don’t document these config files and I’d not recommend changing them. For the compaction issue, we don’t kick compaction off until the file gets to a certain size for the index (I believe it is set to 500MB). It is hard to follow on the thread on a few occasions but I think what you are seeing that you don’t see compaction kicking off or finishing.
grep for “CompactionDaemon: Compacting index instance:” and “CompactionDaemon: Finished compacting index instance:” - if you are not seeing these we have not kicked off compaction yet.
do you have these in the indexer.log?

@cihangirb

and I’d not recommend changing them.

Something that is dependent on these params is hardcoded ?
Unpredictable side-effects are possible ?

grep -i “CompactionDaemon” /opt/couchbase/var/lib/couchbase/logs/*

shows nothing, seems you’re right, but i’ll generate 1G+ per index to ensure.

@siri, @cihangirb,
there definitely is a problem: index just stops to grow after 1.5G on 4.1.1-EE (5914):



Need a MB / repro ?

UPDATE: changed algorithm for keys and values to be more different. The result is the same: it stucks:




@siri, @cihangirb,
is it enough to make you sure that there is an error and it shoud be investigated on the level of “GSI logs and code inspection” ?

yes - pls open an issue - include the indexer.log so we can take a look. it may take a long time for the compaction to catch up - can you describe your HW and topology of the cluster as well for me so we can choose something similar here.
thanks
-cihan

byt he way, did the last repro have a compaction msg in the indexer.log?

grep -i “CompactionDaemon: Finished compacting index instance”

@cihangirb
nope. just this for “compaction”:

0x4995c4 github.com/couchbase/indexing/secondary/indexer.(*compactionDaemon).loop+0x954 /home/couchbase/jenkins/workspace/sherlock-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/compaction_manager.go:98
0x4995c4 github.com/couchbase/indexing/secondary/indexer.(*compactionDaemon).loop+0x954 /home/couchbase/jenkins/workspace/sherlock-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/compaction_manager.go:98

@cihangirb,
my last log (where “it stucks”).
indexer.log.zip (947.2 KB)
MB with repro after couple of hours, little bit busy now.
This is from one node with index (3 total, 2 have indexes)