Indexer service looping between Warmup and Ready state, cannot run queries after VM restart

Hi everyone, after a hard shutdown (power loss) of the VM running Couchbase Server, queries on our single-node Couchbase cluster do not run because the Indexer service continuously loops between the “Warming Up” and “Ready” states.
This may be because the indexes have some problem in that situation.

Question: apart from making the VMs more reliable so they do not get turned off abruptly, and eventually adding other nodes, what other approach do you recommend for making indexes more reliable? I would expect Couchbase to be reliable in these kinds of scenarios, even if they are rare, so perhaps you have a recommendation on the best configuration to use.

We use Couchbase Community 5.1.1.

Thank you!

@gmaggini,

The indexer process is not expected to loop between the “Warmup” and “Ready” states. Usually, it comes back to the “Ready” state and stays there unless there is a crash. Can you kindly share the indexer log so that we can try to root-cause the issue?

You can find the indexer logs at /opt/couchbase/var/lib/couchbase/logs/

Having index replicas would ease your problem. If the indexes on one node are not available, a query will use the replicas on the other available nodes. Please refer to this blog article for more details on index replicas: https://blog.couchbase.com/couchbase-index-replicas/

Thanks,
Varun

@gmaggini,

Index replicas are not supported on Couchbase Community Edition. Instead, you can create equivalent indexes on other index nodes to get better index availability.
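
As an illustration of what "equivalent indexes" could look like, here is a minimal Python sketch that builds one CREATE INDEX statement per index node, using the `nodes` option of the N1QL `WITH` clause for placement. The index name, field, and host names are hypothetical placeholders, not values from this thread, and the helper only generates the statement text; it does not run it.

```python
import json


def equivalent_index_statements(index_name, bucket, fields, nodes):
    """Build one CREATE INDEX statement per node, each with a distinct name.

    All arguments are illustrative; adapt them to your own bucket and nodes.
    """
    statements = []
    for i, node in enumerate(nodes, start=1):
        # The WITH clause takes a JSON object; "nodes" pins the index to a node.
        with_clause = json.dumps({"nodes": [node]})
        statements.append(
            f"CREATE INDEX `{index_name}_{i}` ON `{bucket}`({', '.join(fields)}) "
            f"USING GSI WITH {with_clause};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical index name, field, and hosts.
    for stmt in equivalent_index_statements(
        "idx_status", "mybucket", ["status"],
        ["node1.example.com:8091", "node2.example.com:8091"],
    ):
        print(stmt)
```

Because the indexes have the same definition but different names, the query service can use whichever one is online. This only helps once the cluster has more than one index node.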

Thanks,
Varun

@varun.velamuri thank you for your answer!
Here is additional information after research.

When the problem described happens (indexes moving between Ready and Warm Up), there are 2 categories of errors in indexer.log.

The first one (repeating) is:

2020-05-13T10:41:00.514-07:00 [Warn] Indexer::validateIndexInstMap Bucket mybucket is not yet ready (err = MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg: ) Retrying(5)..
2020-05-13T10:41:00.713-07:00 [Error] StartDcpFeedOver(): MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg: 

The second type of error (also repeating) is:

'd:\Couchbase\var\lib\couchbase\data\@2i\business_idx_ic_terminal_status_10511548744874879119_0.index\data.fdb.188', errno = 32: 'The process cannot access the file because it is being used by another process.'
2020-05-13T10:47:05.722-07:00 [ERRO][FDB] Successfully used partially compacted file 'd:\Couchbase\var\lib\couchbase\data\@2i\business_idx_ic_terminal_status_10511548744874879119_0.index\data.fdb.189' for recovery replacing old file d:\Couchbase\var\lib\couchbase\data\@2i\business_idx_ic_terminal_status_10511548744874879119_0.index\data.fdb.188.
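
To get a rough idea of how often each message repeats, a small Python sketch like the following can tally them. It assumes the log lines look like the excerpts above; the matched substrings are copied from those excerpts, while the category labels are illustrative names of my own.

```python
from collections import Counter

# Substrings copied from the log excerpts; the labels are illustrative.
PATTERNS = {
    "bucket_not_ready": "Indexer::validateIndexInstMap",
    "dcp_feed_error": "StartDcpFeedOver()",
    "fdb_file_in_use": "errno = 32",
    "fdb_recovery": "Successfully used partially compacted file",
}


def tally(lines):
    """Count how many lines contain each known message substring."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts
```

Usage would be along the lines of `with open(path, encoding="utf-8", errors="replace") as f: print(tally(f))`, where `path` points to your indexer.log.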

Observations:

  • System has 16 GB of RAM and sufficient disk space
  • Data service 6 GB
  • Index service 2 GB
  • Full-text service 4 GB
  • 2 GB for Windows
  • No sign of performance degradation other than the error discussed here
  • 5 buckets were configured
  • Buckets are usually not very large (typically < 10.000 documents), only 1 bucket has 700.000 documents
  • It has 1 large GSI secondary index of 700 MB and many others of 20 MB each
  • GSI indexing is set to “circular write”, compaction set every day 00:00 (default)
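
For reference, the memory figures in the list add up as follows. This is a quick sketch that assumes the per-service figures are the configured quotas in GB and that the 2 GB for Windows is a fixed reservation.

```python
# Figures in GB, copied from the observations above.
TOTAL_RAM_GB = 16
QUOTAS_GB = {"data": 6, "index": 2, "full_text": 4}
OS_RESERVED_GB = 2


def headroom_gb(total, quotas, os_reserved):
    """Return the RAM left after all service quotas and the OS reservation."""
    return total - sum(quotas.values()) - os_reserved


if __name__ == "__main__":
    print(headroom_gb(TOTAL_RAM_GB, QUOTAS_GB, OS_RESERVED_GB))  # prints 2
```

So under these assumptions about 2 GB remains unallocated, which is the most the Index service quota could grow without reducing another quota.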

We are currently looking into dropping the large secondary index, but we are not sure whether it is the cause of the problem.
We are also considering increasing the memory for the Index service, but we could not find any sizing recommendations for it (only Data service sizing is well explained here: https://docs.couchbase.com/server/current/install/sizing-general.html).

Do you have any recommendations? And, from your experience, what could be causing the errors reported?

Thank you

The log message
Successfully used partially compacted file 'd:\Couchbase\var\lib\couchbase\data\@2i\business_idx_ic_terminal_status_10511548744874879119_0.index\data.fdb.189' for recovery replacing old file d:\Couchbase\var\lib\couchbase\data\@2i\business_idx_ic_terminal_status_10511548744874879119_0.index\data.fdb.188.
indicates that a compacted file was created successfully, but an issue occurred before the switch-over to that file, so recovery had to be done. When recovery opened the last ForestDB file, it found the successfully compacted file and switched over to it, so the compacted file will be used as the current file. However, it looks like the open is being called again and again, and I am not sure why that is happening.

@gmaggini,

I think we need the complete indexer logs to analyze the sequence of events to debug this issue further.

Also, if you have multiple nodes in the cluster, you can try to fail over this node and rebalance it back into the cluster. This would get the indexer out of this situation. Note that the existing indexes on this node are dropped when the failover and rebalance-in happen, so you will have to create the indexes again.

Thanks,
Varun

@varun.velamuri thank you. Is it possible to send you the log files privately instead of attaching them?
Regarding your note: we have only a single node per cluster.

@sduvuru do you think this problem may be somehow linked to this commit? (coincidentally, by you! :slight_smile: )

As this commit is from August 2019, for sure it’s not included in CB Community 5.1.1, right? In which version would it be included?

Thank you!

@gmaggini,

You can send a mail to varun [dot] velamuri [at] couchbase [dot] com, attaching the log files. If the file size is too large, please let me know. I will create a Google Drive link and share it with you.

Thanks,
Varun