Upgrade from 6.6 to 6.6.1: FTS dead

Hi,

On our test cluster we did a rolling upgrade from Couchbase EE 6.6 to EE 6.6.1.

We have
3 x data
2 x index/query
2 x fts

The data and index/query nodes were updated successfully, but for FTS the rebalance failed, and we saw this in the logs:

2021-03-10T09:03:23.912+00:00 [FATA] main: ctl.StartCtl, err: planner: CfgGetPlanPIndexes err: cfg_metakv_lean: getLeanPlan, hash mismatch between plan hash: 6b761b50f7182ab87d713caf010cd507 contents: {"uuid":"7a372552c357cd1c","planPIndexes": …

Removing both search nodes completely and re-adding them did not help; we always got the same error message.

@sreeks I then found your cbft cookbook (very nice document!) and followed the suggested steps:

step1. curl -i http://<hostname:8091>/diag/eval -d 'metakv:set(<<"/fts/cbgt/cfg/curMetaKvPlanKey">>, <<"">>).' -u<uname>:<pwd>

step2. curl -i http://<hostname:8091>/diag/eval -d 'metakv:set(<<"/fts/cbgt/cfg/planPIndexes">>, <<"{\"ImplVersion\":\"5.5.0\"}">>).' -u<uname>:<pwd>

And finally, FTS was back and the cluster rebalanced successfully. Of course I lost the FTS index definitions, but that's not a big problem.

Now all is good, but… what happened here, and what did I do with these "magic commands" from @sreeks' cookbook?

Thanks, Pascal

Hey @gizmo74 ,

We are really sorry to hear that your cluster got corrupted, and glad that you managed to recover it yourself.
This can happen if index definitions are updated concurrently while a rebalance is in progress. For example, index definition creates/updates/deletes (CUD) issued during a rebalance, or many threads performing index CUD operations from different nodes at around the same time, could result in this situation.

These concurrent updates could result in inconsistencies in the eventual partition-node layout plan while resolving the conflicts on the layout plans from the distributed planners in the system.

Was that the scenario in your case? If so, we are working on a fix for the upcoming software version. For the next release, we are also thinking about a better way to unblock the cluster with minimum damage (skipping the index rebuild) if such a situation occurs.

Any feedback on your cluster context/cbcollect info would be helpful for us.

Thanks,
Sreekanth

Hi @sreeks ,

Thanks for your quick answer. And no problem, that’s the goal of test clusters, finding problems and learning new things :slight_smile:

We already opened a support ticket (# 38792) with the collected logs, but since this is not really an enterprise ticket (no licenses for the test cluster), we tried to fix the cluster ourselves :slight_smile:

It was just a replacement of all nodes, one by one, so nobody did CUD operations on index definitions. But since this was done with scripts, maybe something went wrong and a script tried to add or remove a node during a running rebalance, or something similar.
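A guard like the following could keep such scripts from touching the node list while a rebalance is still running. This is only a sketch: it assumes the cluster manager's `/pools/default/rebalanceProgress` REST endpoint reports `{"status":"none"}` when no rebalance is active, and `CB_HOST`, `CB_USER` and `CB_PASS` are hypothetical environment variables, not something from our actual scripts.

```shell
#!/bin/sh
# Sketch only: wait until no rebalance is running before the next node swap.

# Return 0 if the given rebalanceProgress JSON reports status "none".
rebalance_idle() {
    printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"none"'
}

# Poll the cluster manager every 5 seconds until it reports an idle state.
# CB_HOST, CB_USER and CB_PASS are assumed (hypothetical) environment variables.
wait_for_rebalance_idle() {
    while :; do
        body=$(curl -s -u "$CB_USER:$CB_PASS" \
            "http://$CB_HOST:8091/pools/default/rebalanceProgress") || return 1
        if rebalance_idle "$body"; then
            return 0
        fi
        sleep 5
    done
}
```

Calling `wait_for_rebalance_idle` before every add/remove step would at least rule out overlapping rebalances as a cause.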

Just to be sure: if that happens again when we update the production cluster, is the way we fixed it the recommended way, and will all be good afterwards? No hidden problems when we do that?

Thanks,
Pascal

@gizmo74 ,

FTS occasionally prints the partition-node layout plan to the logs, and this can be used to resurrect the cluster with minimum service outage.
If you trace up from the "hash mismatch" error in the FTS logs, you should see log entries similar to these:

[INFO] cfg_metakv_lean: setLeanPlan, val: "large plan contents in json}

[INFO] cfg_metakv_lean: setLeanPlan, curMetaKvPlanKey set, val: {"path":"/fts/cbgt/cfg/planPIndexesLean/planPIndexesLean-4e6f3436c9042a1c8eb3948bd6079079-1615465985291/","uuid":"414de85530a0cc7f","implVersion":"5.5.0"}

If you reset the value of "curMetaKvPlanKey" to the last value logged immediately before the hash mismatch error, it should all work well without any index partition builds.
E.g.:

curl -i http://localhost:9000/diag/eval -d 'metakv:set(<<"/fts/cbgt/cfg/curMetaKvPlanKey">>, <<"{\"path\":\"/fts/cbgt/cfg/planPIndexesLean/planPIndexesLean-4e6f3436c9042a1c8eb3948bd6079079-1615465985291/\",\"uuid\":\"414de85530a0cc7f\",\"implVersion\":\"5.5.0\"}">>).'  -uuser:pwd

But if you try this only some time after the error occurred, then resetting the key to an empty value, as you did, is the recommended approach.
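The log search described above could be scripted roughly as follows. This is a minimal sketch, not an official tool: the function names `last_plan_key` and `metakv_expr` are made up for illustration, and it assumes the log lines look like the samples in this post (a `curMetaKvPlanKey set, val: ` prefix followed by the JSON value on the same line).

```shell
#!/bin/sh
# Sketch only: recover the last curMetaKvPlanKey value logged before the
# first "hash mismatch" error and print the matching diag/eval expression.

# $1: path to an FTS log file (assumption: one log entry per line).
# Prints the JSON after "curMetaKvPlanKey set, val: " from the last such line
# seen before the first "hash mismatch" line; prints nothing if none is found.
last_plan_key() {
    awk '
        /hash mismatch/ { exit }
        /curMetaKvPlanKey set, val: / {
            sub(/^.*curMetaKvPlanKey set, val: /, "")
            last = $0
        }
        END { if (last != "") print last }
    ' "$1"
}

# $1: JSON value. Escapes backslashes and double quotes so the value can sit
# inside an Erlang binary literal, then wraps it in the metakv:set call.
metakv_expr() {
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf 'metakv:set(<<"/fts/cbgt/cfg/curMetaKvPlanKey">>, <<"%s">>).\n' "$escaped"
}
```

The output of `metakv_expr "$(last_plan_key <fts log file>)"` could then be passed to `curl ... /diag/eval -d` as in the example above. If `last_plan_key` prints nothing, no earlier value was found in that log file, and the empty-value reset remains the fallback.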

Having said that, this is a distributed system with many moving parts, such as index definitions and node definition changes like the ones in your cluster. So points of failure are aplenty.

We are working to improve the robustness here in the next release.