Cannot communicate with indexer process

Hi,

The server suddenly went into warmup while it was working. How can that happen?
Since then, I have been getting:

Warning: Cannot communicate with indexer process. Information on indexes may be stale. Will retry.
Indexer Logs:

goroutine 45392476 [IO wait, 4 minutes]:
runtime.gopark(0xeb0a70, 0x7f24038b2590, 0xdf3de4, 0x7, 0x64bd1b, 0x5)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/proc.go:259 +0x13a fp=0xc4231078d0 sp=0xc4231078a0
runtime.netpollblock(0x7f24038b2568, 0x72, 0xb)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/netpoll.go:341 +0xeb fp=0xc423107918 sp=0xc4231078d0
net.runtime_pollWait(0x7f24038b2568, 0x72, 0xf)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/netpoll.go:160 +0x59 fp=0xc423107940 sp=0xc423107918
net.(*pollDesc).wait(0xc431138d80, 0x72, 0xc4231079d0, 0xc4200800a0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/fd_poll_runtime.go:73 +0x38 fp=0xc423107968 sp=0xc423107940
net.(*pollDesc).waitRead(0xc431138d80, 0x15a2a60, 0xc4200800a0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/fd_poll_runtime.go:78 +0x34 fp=0xc423107998 sp=0xc423107968
net.(*netFD).Read(0xc431138d20, 0xc480fdc000, 0x1000, 0x1000, 0x0, 0x15a2a60, 0xc4200800a0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/fd_unix.go:243 +0x1a1 fp=0xc423107a08 sp=0xc423107998
net.(*conn).Read(0xc484006008, 0xc480fdc000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/net.go:173 +0x70 fp=0xc423107a78 sp=0xc423107a08
net/http.(*persistConn).Read(0xc4c9e64300, 0xc480fdc000, 0x1000, 0x1000, 0x6b0890, 0xc423107b58, 0x40a9dd)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/http/transport.go:1261 +0x154 fp=0xc423107af0 sp=0xc423107a78
bufio.(*Reader).fill(0xc45e380300)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/bufio/bufio.go:97 +0x10c fp=0xc423107b50 sp=0xc423107af0
bufio.(*Reader).Peek(0xc45e380300, 0x1, 0xc423107bbd, 0x1, 0x0, 0xc4d49fc180, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/bufio/bufio.go:129 +0x62 fp=0xc423107b68 sp=0xc423107b50
net/http.(*persistConn).readLoop(0xc4c9e64300)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/http/transport.go:1418 +0x1a1 fp=0xc423107fa8 sp=0xc423107b68
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc423107fb0 sp=0xc423107fa8
created by net/http.(*Transport).dialConn
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/http/transport.go:1062 +0x4e9

goroutine 45418873 [select]:
runtime.gopark(0xeb0af8, 0x0, 0xdf37ee, 0x6, 0x18, 0x2)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/proc.go:259 +0x13a fp=0xc423195378 sp=0xc423195348
runtime.selectgoImpl(0xc423195730, 0x0, 0x18)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/select.go:423 +0x11d9 fp=0xc4231955a0 sp=0xc423195378
runtime.selectgo(0xc423195730)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/select.go:238 +0x1c fp=0xc4231955c8 sp=0xc4231955a0
github.com/couchbase/indexing/secondary/indexer.(*ServiceMgr).wait(0xc4230b8240, 0xc4aba1ebb0, 0x8, 0x9, 0xc43cd49b60, 0x0, 0x0, 0x0, 0x0, 0x0, …)
/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/rebalance_service_manager.go:1537 +0x4e2 fp=0xc4231957$
github.com/couchbase/indexing/secondary/indexer.(*ServiceMgr).GetTaskList(0xc4230b8240, 0xc4aba1ebb0, 0x8, 0x9, 0xc43cd49b60, 0xc44b7b4938, 0x43a7ae, 0xc44b7b4910)
/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/rebalance_service_manager.go:282 +0x134 fp=0xc4231958b$
github.com/couchbase/cbauth/service.serviceAPI.GetTaskList.func1(0xc43cd49b60)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/cbauth/service/revrpc.go:99 +0x7e fp=0xc423195938 sp=0xc4231958b8
github.com/couchbase/cbauth/service.withTimeout(0x7530, 0xc4231959c0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/cbauth/service/revrpc.go:46 +0x89 fp=0xc423195990 sp=0xc423195938
github.com/couchbase/cbauth/service.serviceAPI.GetTaskList(0x15aec80, 0xc4230b8240, 0xc4aba1ebb0, 0x8, 0x9, 0x7530, 0xc4725196e0, 0xc44b7b4a98, 0x4177d5)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/cbauth/service/revrpc.go:103 +0xc0 fp=0xc423195a18 sp=0xc423195990
github.com/couchbase/cbauth/service.(*serviceAPI).GetTaskList(0xc42317e5a0, 0xc4aba1ebb0, 0x8, 0x9, 0x7530, 0xc4725196e0, 0x0, 0x0)
:11 +0x90 fp=0xc423195a70 sp=0xc423195a18
runtime.call64(0xc423180bd0, 0xc420036318, 0xc424feeb00, 0x3000000040)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/asm_amd64.s:480 +0x4c fp=0xc423195ac0 sp=0xc423195a70
reflect.Value.call(0xc423170ea0, 0xc420036318, 0x13, 0xdf1045, 0x4, 0xc423195ee0, 0x3, 0x3, 0x0, 0x0, …)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/reflect/value.go:434 +0x5c8 fp=0xc423195e10 sp=0xc423195ac0
reflect.Value.Call(0xc423170ea0, 0xc420036318, 0x13, 0xc423195ee0, 0x3, 0x3, 0xc4b34779b8, 0xc471827798, 0x80f829)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/reflect/value.go:302 +0xa4 fp=0xc423195e78 sp=0xc423195e10
net/rpc.(*service).call(0xc4211f5c00, 0xc4211f5bc0, 0xc42317e620, 0xc4230bb480, 0xc4232061e0, 0xd103c0, 0xc424e680c0, 0x199, 0xc6a040, 0xc4725196e0, …)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/rpc/server.go:383 +0x148 fp=0xc423195f38 sp=0xc423195e78
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc423195f40 sp=0xc423195f38
created by net/rpc.(*Server).ServeCodec
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/net/rpc/server.go:477 +0x421

I had to restart the machine.
It's very disappointing to have so many issues with such a mature database…
I am on Community Edition 5.1.1 build 5723.

Thanks,
Ady.

@ady.shimony, that certainly is not the expected behavior. Would you be able to share the cbcollect from the indexer node for us to investigate?
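
On a default Linux install it should be a single command run on that node (a sketch assuming the standard install path under /opt/couchbase; the output filename below is only an example):

/opt/couchbase/bin/cbcollect_info /tmp/cbcollect-indexer-node.zip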

This morning I woke up to find the node warming up again…
I will check how to run cbcollect and share the output here.

Here is the cbcollect zip file

https://drive.google.com/open?id=1-cD4ZB94Xsbqmo_l3x99RkGD9oJRq6f-

Thanks,

Ady.

I see memcached is getting killed by the OOM killer frequently, which would cause the node to go into warmup.
As all the services are running on a single node, it may not have enough resources to take the load. You may want to add an extra node to distribute the load. What is the memory quota for the bucket tx-history? (See the example after the log below for one way to check it.)

[Tue Oct 2 08:45:21 2018] Out of memory: Kill process 11061 (memcached) score 429 or sacrifice child
[Tue Oct 2 08:47:23 2018] Out of memory: Kill process 11104 (memcached) score 438 or sacrifice child
[Tue Oct 2 08:49:38 2018] Out of memory: Kill process 11147 (memcached) score 445 or sacrifice child
[Tue Oct 2 08:51:39 2018] Out of memory: Kill process 11187 (memcached) score 446 or sacrifice child
[Tue Oct 2 08:54:43 2018] Out of memory: Kill process 11232 (memcached) score 447 or sacrifice child
[Tue Oct 2 08:58:02 2018] Out of memory: Kill process 11279 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:01:01 2018] Out of memory: Kill process 11322 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:03:01 2018] Out of memory: Kill process 11365 (memcached) score 436 or sacrifice child
[Tue Oct 2 09:05:00 2018] Out of memory: Kill process 11405 (memcached) score 445 or sacrifice child
[Tue Oct 2 09:08:53 2018] Out of memory: Kill process 11448 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:12:19 2018] Out of memory: Kill process 11493 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:14:19 2018] Out of memory: Kill process 11535 (memcached) score 432 or sacrifice child
[Tue Oct 2 09:16:50 2018] Out of memory: Kill process 11578 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:19:37 2018] Out of memory: Kill process 11622 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:23:05 2018] Out of memory: Kill process 11671 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:27:17 2018] Out of memory: Kill process 11714 (memcached) score 447 or sacrifice child
[Tue Oct 2 09:30:03 2018] Out of memory: Kill process 11761 (memcached) score 447 or sacrifice child
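
To check the bucket quota, a REST call like the following should return the bucket details, including its RAM quota (a sketch assuming the default admin port 8091 and your admin credentials; the quota appears in the quota section of the returned JSON, if memory serves):

curl -s -u Administrator:password http://127.0.0.1:8091/pools/default/buckets/tx-history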

The bucket resident ratio looks fine. At the time of the OOM kill, memcached was using 23GB of RAM. You may want to check whether anything else running on the system is consuming memory, as just before the OOM the system had only 300MB of free memory left (one way to check is sketched below). Or else you could add another node to ease the system load.
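
For example, plain Linux tooling (nothing Couchbase-specific) will show what is holding the memory:

free -m
ps aux --sort=-rss | head -n 15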

I can't add nodes. This is supposed to be a single-machine project.

I added more memory; the machine now has 64 GB of RAM and 16 cores.

I restarted the machine and left it running without any application.

Couchbase is taking more than 45 GB, even though my settings are 30 GB for data and 4 GB for the indexer. How is that even possible?

[screenshots of memory usage]

After the restart, Couchbase was at 32 GB.
Its memory just keeps growing… after a few hours, it is at 46 GB, 72% of 64 GB.

Please advise.
Thanks,

Ady.

Indexer memory keeps growing, even though all index builds are at 100% and the indexer memory quota is set to 4 GB.

It is now at 40% of 64 GB, roughly 26 GB!

  • I am not using the database at all; after a reboot, that is the memory state a few hours later.

I have now set it manually, following https://docs.couchbase.com/server/4.0/rest-api/rest-index-memory-quota.html:

curl -X POST -u Administrator:password http://127.0.0.1:8091/pools/default -d 'indexMemoryQuota=4000'
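
To confirm the new quota took effect, the value can be read back (a sketch, assuming the field is named indexMemoryQuota in the /pools/default response):

curl -s -u Administrator:password http://127.0.0.1:8091/pools/default | grep -o '"indexMemoryQuota":[0-9]*'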

So again, the indexer's memory use is growing without control :frowning:
After a few minutes it is already at 10%, that is 6.4 GB… instead of the 4 GB limit I set.

Now 11%… and growing.

An hour later: 19% = 12 GB…
Update: 2 hours later, 26.4% and growing…
Update: 4 hours later, 35% :cry:

After reaching almost 60%, it is now back at 18.5%.
What is going on here…

Update: it is growing like before… 23% now.

@deepkaran.salooja

Can you please run the commands below when you see high memory usage and share the output files?

curl http://<host>:9102/debug/pprof/heap -u <username>:<password> > indexer.mprof

curl http://<host>:9102/debug/pprof/goroutine?debug=1 -u <username>:<password> > indexer.gr

Once you have captured the above information, you can try applying the settings below, restart the indexer, and see whether that helps the system behavior:

curl -X POST -u <user:pass> http://<ip>:9102/settings --data '{"indexer.dataport.dataChanSize": 1000}'

curl -X POST -u <user:pass> http://<ip>:9102/settings --data '{"indexer.stream_reader.fdb.workerBuffer": 1000}'

curl -X POST -u <user:pass> http://<ip>:9102/settings --data '{"indexer.stream_reader.fdb.mutationBuffer": 1000}'
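
As a quick sanity check, the same endpoint should return the current values with a GET, so you can confirm the settings were applied:

curl -u <user:pass> http://<ip>:9102/settings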

It's now at 45%.

The files you asked for are here:

https://drive.google.com/open?id=19aGSkzH8vbe5OQn2JueGiMovYDfmESZk

I ran the commands, got OK on all of them, and restarted the machine.
I will update soon.

Thanks a lot for your help.

It didn't help. The indexer is now at 10% and growing… (6.4 GB instead of the 4 GB limit).
After 30 minutes, it is at 14%…

From the memory profile, the memory usage is coming from the storage engine. There are not enough granular stats in the Community Edition storage engine to get more details. It looks like the earlier crashes may have caused the storage to go into a bad state. You can try dropping and recreating the indexes (one way to do that is sketched below).
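
For example (a sketch only; the index name and field here are placeholders, not your actual definitions), an index can be dropped and recreated through the query service on port 8093:

curl -u <user:pass> http://<ip>:8093/query/service -d 'statement=DROP INDEX `tx-history`.`idx_placeholder`'

curl -u <user:pass> http://<ip>:8093/query/service -d 'statement=CREATE INDEX `idx_placeholder` ON `tx-history`(`some_field`)'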

Also, as the total indexed data for the 4 indexes is around 160GB, it would be good to raise the indexer RAM quota to 10GB. That should help it function better.
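
A sketch of that change via the REST API (the same endpoint you used earlier; 10240 is 10 GB expressed in MB):

curl -X POST -u Administrator:password http://127.0.0.1:8091/pools/default -d 'indexMemoryQuota=10240'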

I deleted all indexes. No indexes at all.
I restarted the machine.
I waited 2 hours to check the memory footprint: 0.5% - good.

Trying to create one index again, I am getting:
"msg": "GSI CreateIndex() - cause: Encounter errors during create index. Error=Indexer In Recovery\n."

This is unbelievable.

This is actually unbelievable! Can you capture the cbcollect again and share it? Sorry for the trouble.

cbcollect:

https://drive.google.com/open?id=1raK7C9yYO0xwTbtHWO6OvioGxgQpRIyV

Thank you.