High indexer CPU on index server after move to different VMware cluster


#1

We have a 3-node cluster running 4.5.1-2884. After restoring the VMs from a failed VMware cluster to a new cluster, we are seeing constant problems with the indexer service consuming high CPU and eventually grinding to a halt. So far our only fix has been to restart the server hosting the indexer service (running Global Indexes), after which the problem slowly recurs, usually within four hours. This configuration had been rock-solid for over a year before the move.

The servers are provisioned with 6 CPUs, 16 GB, and 500 GB drives, which have roughly 90% free space.

The indexer.log shows a few somewhat generic errors:
couchbase Err projector.topicMissing
status:(error = Index scan timed out)
2018-04-11T09:35:18.788-06:00 [Error] StartDcpFeedOver(): MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg: Not found

What additional information would be helpful to troubleshoot?

Thanks,


#2

Please make sure all the required ports in your new cluster are open. https://developer.couchbase.com/documentation/server/current/install/install-ports.html

The above errors seem to point to configuration issues related to ports not being open. If it doesn’t help, you can share the full log.


#3

The firewalls on all servers are currently disabled while we troubleshoot. Should all of these ports be visible on each Couchbase server? It seems that only the server running the indexer role exposes them, while the others only have tcp port 999 visible for cluster communication.

Since I am new to the forum, I am unable to upload the indexer.log at this time.

Thanks,


#4

From the index service perspective, all ports listed for “Indexer Service” need to be open on nodes where the index service has been enabled, except port 9999, which needs to be open on all data service nodes.
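
One rough way to verify this from another node is a small reachability check. The sketch below assumes bash with `/dev/tcp` support and the coreutils `timeout` command. The range 9100-9105 for the Indexer Service is an assumption based on the ports page linked above and should be confirmed there, and the host names in the usage line are placeholders.

```shell
# Report whether a TCP port on a host accepts connections.
# "unreachable" covers closed, filtered, and timed-out ports alike.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port unreachable"
  fi
}

# Check the index node's ports, then port 9999 on each data node.
# Usage: check_index_ports INDEX_NODE DATA_NODE...
check_index_ports() {
  local index_node=$1 port node
  shift
  # Assumed Indexer Service ports; verify against the ports documentation.
  for port in 9100 9101 9102 9103 9104 9105; do
    check_port "$index_node" "$port"
  done
  for node in "$@"; do
    check_port "$node" 9999
  done
}
```

For example, `check_index_ports index1.example.com data1.example.com data2.example.com` run from each node in turn would show whether any path between nodes is blocked.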

Has the memory quota for index service been set correctly on the new cluster?

You can upload the log file using:
curl -X PUT -T indexer.log https://forumlogs.s3-us-west-1.amazonaws.com/indexer.log


#5

I just uploaded indexer.log file.


#6

I don’t see a lot of activity in this log snippet. Do you know the time window when the indexer process was consuming a lot of CPU?

A couple of things you can try:

  1. Change the compaction setting in the UI so that it runs only on Sunday, rather than on every day as it is currently set.
  2. Increase the RAM quota to 2GB.
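
The quota in step 2 can also be read and changed over the REST API instead of the UI. This is a minimal sketch, assuming the admin REST port 8091, the `indexMemoryQuota` field of `/pools/default` (in MB), and placeholder credentials in `CB_USER` and `CB_PASS`. The JSON is parsed with `grep` only to avoid depending on `jq`.

```shell
CB_HOST=${CB_HOST:-localhost}          # placeholder: any cluster node
CB_USER=${CB_USER:-Administrator}      # placeholder credentials
CB_PASS=${CB_PASS:-password}

# Extract the indexMemoryQuota value (MB) from /pools/default JSON on stdin.
index_quota_from_json() {
  grep -oE '"indexMemoryQuota":[[:space:]]*[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}

# Print the current index service memory quota in MB.
get_index_quota() {
  curl -s -u "$CB_USER:$CB_PASS" "http://$CB_HOST:8091/pools/default" \
    | index_quota_from_json
}

# Set the index service memory quota; the argument is in MB.
set_index_quota() {
  curl -s -u "$CB_USER:$CB_PASS" -X POST \
    "http://$CB_HOST:8091/pools/default" -d "indexMemoryQuota=$1"
}
```

Running `get_index_quota` first shows what the restored cluster is actually using, and `set_index_quota 2048` would then apply the 2 GB suggested above.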

#7

We had another failure today, so I’ll get the relevant logs and add them to the case early tomorrow MDT.


#8

I just uploaded yesterday’s indexer.log


#9

There is nothing that stands out from the log file. Can you indicate the time window when you observe the issue? How much CPU usage do you see for the indexer process?

Next time it happens, you can capture a CPU profile and share it:

go tool pprof -seconds=60 -svg /opt/couchbase/bin/indexer http://localhost:9102/debug/pprof/profile > cpu_prof.svg

You may need to install graphviz on your machine.
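
To pin down the time window and the CPU figures, it may also help to log the indexer's CPU usage continuously so the spike is timestamped even if nobody is watching. The sketch below is Linux-only and reads `utime` and `stime` from `/proc/<pid>/stat`. The function names are made up for illustration, and the interval must be a whole number of seconds.

```shell
# Print total CPU ticks (utime + stime) consumed so far by a process.
proc_ticks() {
  local stat
  stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
  # Drop "pid (comm) " so a process name containing spaces cannot shift fields.
  stat=${stat##*) }
  # Remaining fields start at field 3 (state), so utime (14) is $12, stime (15) is $13.
  set -- $stat
  echo $(( ${12} + ${13} ))
}

# Print CPU usage over an interval as a percentage of one core,
# the same scale top uses per process (400 means four cores busy).
# Usage: cpu_percent PID [SECONDS]
cpu_percent() {
  local pid=$1 interval=${2:-5}
  local hz t1 t2
  hz=$(getconf CLK_TCK)
  t1=$(proc_ticks "$pid") || return 1
  sleep "$interval"
  t2=$(proc_ticks "$pid") || return 1
  echo $(( (t2 - t1) * 100 / (hz * interval) ))
}
```

Left running in the background, something like `while true; do echo "$(date -Is) $(cpu_percent "$(pgrep -xo indexer)" 10)"; done >> indexer_cpu.log` would record one sample every 10 seconds, assuming the process is named `indexer` as in the pprof command above.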


#10

The time frame is from about 9:00 AM to 9:30 AM. I wasn’t able to view processes this time, but in the previous incident top showed aggregate CPU at 99% while the indexer process ran at 400 to 500% (per-process figure in top).

I’ll look at running pprof tomorrow if it’s possible. Since this is servicing a clinical app we can’t let it just hang.


#11

The log file you shared has logs from 2018-04-17T23:52:44 till 2018-04-18T08:25:41.
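
A quick way to confirm which window a log file covers before uploading it is to print its first and last timestamps. This sketch assumes each entry begins with an ISO-style timestamp, as in the `[Error]` line quoted in the first post; the function name is only illustrative.

```shell
# Print the first and last timestamps found at the start of lines in a log file.
# Lines without a leading timestamp (e.g. continuation lines) are ignored.
# A file with a single timestamped line prints that timestamp twice.
log_window() {
  grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}' "$1" \
    | sed -n '1p;$p'
}
```

Running `log_window indexer.log` on each rotated file makes it easy to pick the one that spans the incident.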


#12

I must have grabbed the wrong version. I’ll find the correct one.


#13

We did make the changes that you suggested, and will keep you informed about the stability over the next few days.