How to catch "Could not auto-failover node"

George-hash-K · March 31, 2022, 4:19pm

I’m trying to detect a critical state of the cluster.
The idea is that once enough nodes have failed, the cluster issues "Could not auto-failover node" and that message could be captured.
Is there an API that can help detect the situation where the cluster is unable to auto-failover any further?

YoshiyukiKono · April 11, 2022, 1:05am

When checking cluster health through the API, you cannot assume that the server you call is alive. You must either call the nodes in sequence until one returns a response, or call all nodes.
For the former, you may use the Cluster API. For the latter, you may use the Node API.
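
As a rough illustration of the "call in sequence until a node answers" approach, here is a minimal Python sketch. It assumes the REST port 8091 and HTTP basic auth; the helper names `http_get_json` and `first_response`, the host names, and the credentials are all hypothetical. The fetch function is passed in so the loop can be tried without a live cluster.

```python
import base64
import json
import urllib.request


def http_get_json(url, user, password, timeout=5):
    """Fetch a URL with HTTP basic auth and decode the JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def first_response(hosts, path, fetch):
    """Try each host in order; return (host, payload) from the first that answers."""
    errors = {}
    for host in hosts:
        try:
            # Port 8091 is assumed here; adjust if your cluster uses another port or TLS.
            return host, fetch(f"http://{host}:8091{path}")
        except (OSError, ValueError) as exc:
            # Network failure or unparsable body: remember it and try the next node.
            errors[host] = exc
    raise RuntimeError(f"no node responded: {errors}")
```

Usage would look something like `first_response(["cb1.example.com", "cb2.example.com"], "/pools/default", lambda u: http_get_json(u, "Administrator", "password"))`, with placeholder hosts and credentials.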

Or, as a more sophisticated alternative, you may use the Prometheus integration.

Auto-failover policy depends on the service. You need your own rule for determining the state (“if it is unable to further auto-failover”), depending on your cluster’s composition.
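
One possible shape for such a rule, as a sketch only: it assumes the cluster details response (`/pools/default`) carries a `nodes` list whose entries have `hostname`, `status`, and `clusterMembership` fields. The function names and the `max_unavailable` threshold are hypothetical and should be set from your own cluster layout.

```python
def unavailable_nodes(pool_info):
    """Return hostnames of nodes that are not both healthy and active."""
    return [
        node.get("hostname", "?")
        for node in pool_info.get("nodes", [])
        if node.get("status") != "healthy"
        or node.get("clusterMembership") != "active"
    ]


def is_critical(pool_info, max_unavailable=1):
    """Hypothetical rule: critical once more nodes are out than the threshold allows."""
    return len(unavailable_nodes(pool_info)) > max_unavailable
```

The threshold is deliberately a parameter, since what counts as "critical" depends on how many nodes run each service.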

I suspect my reaction isn’t what you expected but hope it helps in some way given the current situation: no response in 10 days.

George-hash-K · April 11, 2022, 2:18pm

I appreciate the response. In my experience, auto-failover works only once, when a node goes down or becomes unavailable. If another node then fails, the cluster issues “Could not auto-failover”.
It doesn’t seem to be related to the size of the cluster, because AFAICT auto-failover works only once.
Does that make sense?

Kevin.Cherkauer · April 11, 2022, 4:24pm

@George-hash-K There is a setting in the UI for how many separate auto-failover events you wish to allow. I believe it defaults to 1 and can be raised to 3. In the upcoming 7.1.0 release, I believe the limit may have been removed entirely. Setting it to a number higher than 1 allows further auto-failovers after the first one.

Caveat – in Community Edition the limit is always 1. Multiple auto-failovers are an Enterprise Edition-specific feature.
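
For the original question, one way to check whether the allowed auto-failover events have been used up is to compare the counters in the auto-failover settings. This Python sketch assumes the JSON returned by `GET /settings/autoFailover` contains `enabled`, `count`, and `maxCount` fields, which is how I recall it; please verify against your server version. The function name is hypothetical, and treating a missing `maxCount` as 1 is an assumption based on the Community Edition limit.

```python
def auto_failover_exhausted(settings):
    """True when auto-failover is off or its event quota appears used up."""
    if not settings.get("enabled", False):
        return True
    max_count = settings.get("maxCount")
    if max_count is None:
        # Assumption: no maxCount field means the limit is 1 (Community Edition).
        max_count = 1
    # count is assumed to be the number of auto-failovers since the last reset.
    return settings.get("count", 0) >= max_count
```

When this returns True, a further node failure would presumably produce the "Could not auto-failover" message, so it could serve as an early warning rather than waiting for the log line.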