Indexer process periodically eats all CPU and memory

Couchbase 6.0 Community Edition.
The indexer process periodically raises CPU usage to 100%, and we see a lot of goroutines and native processes…
We have to periodically remove the node from the cluster and rebuild the indices.
Is this a known issue with Community Edition 6.0?
What info should we gather?

Hi @kolchanov,

Please let me know the name of the OS, the number of CPU cores, and the RAM size.

Also, collect a CPU profile, a memory profile, and a goroutine dump at/around the time CPU usage reaches 100%.

The command to collect a CPU profile is:
curl -X GET -u <Cluster_admin-username>:<password> http://<hostname>:9102/debug/pprof/profile > cpu_profile.pprof

The command to collect a memory profile is:
curl -X GET -u <Cluster_admin-username>:<password> http://<hostname>:9102/debug/pprof/heap?debug=1 > memory_profile.pprof

The command to collect a goroutine dump is:
curl -X GET -u <Cluster_admin-username>:<password> http://<hostname>:9102/debug/pprof/goroutine?debug=1 > goroutine_dump.pprof

Also, check the indexer.log file to see whether any specific operation, for example ForestDB storage compaction, is running around the time CPU usage reaches 100%.
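If it is more convenient to grab all three profiles in one go, the same endpoints can also be fetched from a small program. The following is only a rough Go sketch, not an official tool; it reuses the endpoint paths from the commands above, and the hostname, credentials, and output file names are placeholders.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Placeholder base URL; point it at the indexer node you want to profile.
const baseURL = "http://<hostname>:9102/debug/pprof/"

// fetch downloads one pprof endpoint with basic auth and writes it to a file.
func fetch(path, outFile string) error {
	req, err := http.NewRequest("GET", baseURL+path, nil)
	if err != nil {
		return err
	}
	req.SetBasicAuth("<Cluster_admin-username>", "<password>") // placeholders
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	out, err := os.Create(outFile)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Note: the "profile" endpoint blocks while the CPU sample is taken
	// (about 30 seconds by default for Go's pprof handler).
	targets := map[string]string{
		"profile":           "cpu_profile.pprof",
		"heap?debug=1":      "memory_profile.pprof",
		"goroutine?debug=1": "goroutine_dump.pprof",
	}
	for path, file := range targets {
		if err := fetch(path, file); err != nil {
			fmt.Fprintln(os.Stderr, path, err)
		}
	}
}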

Thanks.

We have 3 nodes with 8 CPU cores and 96 GB RAM per node.
I'll upload the profile info soon.

We observed a lot of ForestDB native threads.

No success with the goroutine dump - the output file is 0 bytes.

curl -X GET -u :* http://172.31.200.16:9102/debug/pprof/goroutine?debug=1 > goroutine_dump.pprof
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:26 --:--:--     0

CPU profile
http://gofile.me/2R1hC/DZ6y4lI1F
CPU profile, 15 minutes later
http://gofile.me/2R1hC/iMqEPeYso
Goroutine dump
http://gofile.me/2R1hC/FW66RNSS2
Goroutine dump, 15 minutes later
http://gofile.me/2R1hC/fsR3udkdu
indexer.log
http://gofile.me/2R1hC/el5SzrMYu

Any recommendations?

New indexer problem

2020-06-07T23:27:30.625+09:00 [Info] memstats {"Alloc":109856784, "TotalAlloc":71316733712, "Sys":215991384, "Lookups":46105, "Mallocs":1095795547, "Frees":1094855168, "HeapAlloc":109856784, "HeapSys":197689344, "HeapIdle":84148224, "HeapInuse":113541120, "HeapReleased":63586304, "HeapObjects":940379, "GCSys":9625600, "LastGC":1591539973586314423, "PauseTotalNs":1500641774, "PauseNs":[905011], "NumGC":1039}
2020-06-07T23:27:04.523+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39704" doReceive() …
2020-06-07T23:27:31.244+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39746" doReceive() …
2020-06-07T23:27:47.121+09:00 [Info] [Queryport ":9101"] connection "172.31.200.15:31498" doReceive() …
2020/06/07 23:28:02 revrpc: Got error (EOF) and will retry in 1s
2020-06-07T23:27:09.783+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39692" doReceive() …

2020-06-07T23:37:42.144+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:41951" doReceive() …
2020-06-07T23:32:28.549+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39958" doReceive() …
2020-06-07T23:32:43.182+09:00 [Info] StorageMgr::handleCreateSnapshot Added New Snapshot Index: 10711316375339803713 PartitionId: 0 SliceId: 0 Crc64: 9887827889855871450 (SnapshotInfo: seqnos: 1745052, 1745052, 1499 committed:true) SnapCreateDur 4m20.163343113s SnapOpenDur 1m25.874407237s
2020/06/07 23:43:10 revrpc: Got error (EOF) and will retry in 1s
2020-06-07T23:43:12.797+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:45321" doReceive() …
2020-06-07T23:43:16.593+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:63649" doReceive() …

And the indexer was killed:

2020-06-07T23:44:12.327+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:48227" doReceive() …
2020-06-07T23:44:12.327+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:41744" doReceive() …
2020-06-07T23:32:46.727+09:00 [Info] [Queryport ":9101"] connection "172.31.200.15:31684" doReceive() …
[goport(/u01/app/couchbase/bin/indexer)] 2020/06/07 23:46:19 child process exited with status 143
2020-06-07T23:46:19.807+09:00 [Info] Indexer started with command line: [/u01/app/couchbase/bin/indexer.bin -vbuckets=1024 -cluster=127.0.0.1:8091 -adminPort=9100 -scanPort=9101 -httpPort=9102 -streamInitPort=9103 -streamCatchupPort=9104 -streamMaintPort=9105 -storageDir=/u01/app/couchbase/var/lib/couchbase/data/@2i -diagDir=/u01/app/couchbase/var/lib/couchbase/crash -nodeUUID=17b60be0c1ffb7f18fa42291e49d6843 -ipv6=false -isEnterprise=false]
2020-06-07T23:46:19.827+09:00 [Info] Indexer::NewIndexer Status Warmup

Hi @kolchanov,

Thanks for sharing the information.

I checked the CPU profiles and goroutine dumps you attached, and I did not see anything that stood out. When you collected these profiles, did you see CPU consumption reaching 100%?

Also, when you saw CPU consumption reaching 100%, were there a lot of new documents being indexed? Or were there a lot of scans being served by the indexer?

I will need more time to analyse the indexer.log file in detail.

Regarding the indexer process getting killed, did you check dmesg or /var/log/messages? Were there any logs about that? Sometimes we have seen that, if memory sizing is not done properly, the indexer can exit with an "Out Of Memory" error.

Thank you for your attention.
We have upgraded to 6.5.1 CE and this problem is gone.

I want to share a detailed analysis from my colleague:

It seems that during the last incidents the problem looked like this.
Couchbase tries to process some snapshots and initializes an iterator for the KV storage here:

32 @ 0x408a90 0x4bd770 0x4cbb6a 0x4ef352 0x501ac1 0x501852 0x5cb45b 0x5cadde 0x5c9f26 0x5ad7b0 0xb5da65 0x465261
#	0x4bd76f	github.com/couchbase/indexing/secondary/fdb._Cfunc_fdb_iterator_init+0x4f		??:0
#	0x4cbb69	github.com/couchbase/indexing/secondary/fdb.(*KVStore).IteratorInit+0x419		goproj/src/github.com/couchbase/indexing/secondary/fdb/iterator.go:234
#	0x4ef351	github.com/couchbase/indexing/secondary/indexer.(*ForestDBIterator).Seek+0xc1		goproj/src/github.com/couchbase/indexing/secondary/indexer/forestdb_iterator.go:128
#	0x501ac0	github.com/couchbase/indexing/secondary/indexer.(*fdbSnapshot).Iterate+0x180		goproj/src/github.com/couchbase/indexing/secondary/indexer/forestdb_snapshot_reader.go:208
#	0x501851	github.com/couchbase/indexing/secondary/indexer.(*fdbSnapshot).Range+0xb1		goproj/src/github.com/couchbase/indexing/secondary/indexer/forestdb_snapshot_reader.go:180
#	0x5cb45a	github.com/couchbase/indexing/secondary/indexer.scanSingleSlice+0x41a			goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:206
#	0x5caddd	github.com/couchbase/indexing/secondary/indexer.scanOne+0x11d				goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:148
#	0x5c9f25	github.com/couchbase/indexing/secondary/indexer.scatter+0xc5				goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:55
#	0x5ad7af	github.com/couchbase/indexing/secondary/indexer.(*IndexScanSource).Routine+0x80f	goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_pipeline.go:336
#	0xb5da64	github.com/couchbase/indexing/secondary/pipeline.(*Pipeline).runIt.func1+0x44	

This in turn calls the following C code (libforestdb.so, part of Couchbase):

#1  0x00007f2834faae5a in bcache_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#2  0x00007f2834fc2887 in filemgr_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#3  0x00007f2834fb3c97 in btreeblk_read(void*, unsigned long) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#4  0x00007f2834fad503 in _btree_next(btree_iterator*, void*, void*, int) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#5  0x00007f2834fada3e in _btree_next(btree_iterator*, void*, void*, int) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#6  0x00007f2834fb0ba0 in btree_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#7  0x00007f2834fddfea in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#8  0x00007f2834fde689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#9  0x00007f2834fde689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#10 0x00007f2834fdfed7 in hbtrie_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#11 0x00007f2834fe444a in fdb_iterator_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#12 0x00007f2834fe49e8 in fdb_iterator_init () from /u01/app/couchbase/bin/../lib/libforestdb.so
#13 0x0000000000cdc16c in _cgo_8f4f063755d2_Cfunc_fdb_iterator_init ()
#14 0x0000000000464050 in runtime.asmcgocall ()
#15 0x000000c428480688 in ?? ()
#16 0x0000000000408a6d in runtime.cgocall ()
#17 0x000000c428480698 in ?? ()
#18 0x000000c426d92100 in ?? ()
#19 0x000000c428480688 in ?? ()
#20 0x0000000000000950 in ?? ()
#21 0x000000c425e49520 in ?? ()
#22 0x0000000000462399 in runtime.systemstack ()
#23 0x0000000000436970 in _start ()

There are hundreds of similar threads, all burning CPU in a spin lock.
However, one of them is doing some useful work:

Thread 11 (Thread 0x7f05099fb700 (LWP 1206)):
#0  0x00007f077a480d69 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00007f077ada3ecc in bcache_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#2  0x00007f077adbb887 in filemgr_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#3  0x00007f077adacc97 in btreeblk_read(void*, unsigned long) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#4  0x00007f077ada7b71 in btree_init_from_bid () from /u01/app/couchbase/bin/../lib/libforestdb.so
#5  0x00007f077add70a3 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#6  0x00007f077add7689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#7  0x00007f077add7689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#8  0x00007f077add8ed7 in hbtrie_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#9  0x00007f077addd44a in fdb_iterator_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#10 0x00007f077addd9e8 in fdb_iterator_init () from /u01/app/couchbase/bin/../lib/libforestdb.so
#11 0x0000000000cdc16c in _cgo_8f4f063755d2_Cfunc_fdb_iterator_init ()
#12 0x0000000000464050 in runtime.asmcgocall ()
#13 0x000000c425f94688 in ?? ()
#14 0x0000000000408a6d in runtime.cgocall ()
#15 0x000000c425f94698 in ?? ()
#16 0x000000c4276a9f80 in ?? ()
#17 0x000000c425f94688 in ?? ()
#18 0x0000000000000950 in ?? ()
#19 0x000000c428554820 in ?? ()
#20 0x0000000000462399 in runtime.systemstack ()
#21 0x0000000000436970 in _start ()

I think there is high contention on a single lock. I suppose that when the RAM quota is not large enough, these cache operations become slower, so processing falls behind the rate at which work arrives from outside. Here is bcache_read() from ForestDB, where the contention happens:

int bcache_read(struct filemgr *file, bid_t bid, void *buf)
{
    struct hash_elem *h;
    struct bcache_item *item;
    struct bcache_item query;
    struct fnamedic_item *fname;

    // Note that we don't need to grab bcache_lock here as the block cache
    // is already created and binded when the file is created or opened for
    // the first time.
    fname = file->bcache;

    if (fname) {
        // file exists
        // set query
        query.bid = bid;
        // Update the access timestamp.
        struct timeval tp;
        gettimeofday(&tp, NULL); // TODO: Need to implement a better way of
                                 // getting the timestamp to avoid the overhead of
                                 // gettimeofday()
        atomic_store_uint64_t(&fname->access_timestamp,
                              (uint64_t) (tp.tv_sec * 1000000 + tp.tv_usec));

        size_t shard_num = bid % fname->num_shards;
        spin_lock(&fname->shards[shard_num].lock);

        // search shard hash table
        h = hash_find(&fname->shards[shard_num].hashtable, &query.hash_elem);
        if (h) {
            // cache hit
            item = _get_entry(h, struct bcache_item, hash_elem);
            if (item->flag & BCACHE_FREE) {
                spin_unlock(&fname->shards[shard_num].lock);
                DBG("Warning: failed to read the buffer cache entry for a file '%s' "
                    "because the entry belongs to the free list!\n",
                    file->filename);
                return 0;
            }

            // move the item to the head of list if the block is clean
            // (don't care if the block is dirty)
            if (!(item->flag & BCACHE_DIRTY)) {
                // TODO: Scanning the list would cause some overhead. We need to devise
                // the better data structure to provide a fast lookup for the clean list.
                list_remove(&fname->shards[shard_num].cleanlist, &item->list_elem);
                list_push_front(&fname->shards[shard_num].cleanlist, &item->list_elem);
            }

            memcpy(buf, item->addr, bcache_blocksize);
            _bcache_set_score(item);

            spin_unlock(&fname->shards[shard_num].lock);

            return bcache_blocksize;
        } else {
            // cache miss
            spin_unlock(&fname->shards[shard_num].lock);
        }
    }

    // does not exist .. cache miss
    return 0;
}

At first glance this code has at least one problem: the spin should be limited to a bounded number of iterations, with a fallback to another lock type that does not consume CPU while waiting.
There also seems to be a problem with throttling of tasks on the Golang side, to prevent a large number of native threads from being spawned at once; a rough sketch of that idea follows below.
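To illustrate the throttling idea, here is a minimal Go sketch (not indexer code; runScan, maxConcurrentScans, and the limit of 8 are made up for the example). A counting semaphore on the Go side could bound how many scan goroutines enter cgo/ForestDB at the same time, which would also bound the number of threads spinning on the bcache shard locks.

package main

import "sync"

const maxConcurrentScans = 8 // hypothetical limit, e.g. roughly one per core

// scanSlots is a counting semaphore implemented as a buffered channel.
var scanSlots = make(chan struct{}, maxConcurrentScans)

// runScan takes a slot before calling the (hypothetical) native scan, so at
// most maxConcurrentScans goroutines are inside ForestDB at any moment.
func runScan(scan func()) {
	scanSlots <- struct{}{}        // blocks when all slots are in use
	defer func() { <-scanSlots }() // free the slot when the scan finishes
	scan()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runScan(func() {
				// a real implementation would call into fdb_iterator_init /
				// snapshot Range here
			})
		}()
	}
	wg.Wait()
}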

@sduvuru Could you please look into this from the ForestDB perspective? Thank you!

Unfortunately, the 6.5.1 indexer hung again.

cpu_profile_2020-06-11_15-07-56.pprof
http://gofile.me/2R1hC/e3y44ySXg

cpu_profile_2020-06-11_15-09-00.pprof
http://gofile.me/2R1hC/9H753A0oM

goroutine_dump_2020-06-11_15-07-56.pprof
http://gofile.me/2R1hC/eC1K3phyb

goroutine_dump_2020-06-11_15-08-56.pprof
http://gofile.me/2R1hC/uuGDDAI9P