Kubernetes Autonomous Operator (GKE): Node (apparently?) crashed, and another spun up with a new name


#1

I have a 3-node Couchbase cluster running in Google Kubernetes Engine. Apparently, one of the nodes crashed. When the Autonomous Operator spun up a new node, it had a different DNS name ("couchbase://cb-cluster-member-0003.cb-cluster-member.default.svc").

The problem is, my three nodes’ DNS names are hard-coded in my (.NET Core) application’s configuration file. When "cb-cluster-member-0002" was no longer available, I ran into all sorts of seemingly random problems. One such problem was that a N1QL query my application relied on stopped working. I tried running this query through the Web Console, and it gave me an error about the index not existing, even though I had created it several days earlier. When I checked “Bucket Insights” > “Queryable on Indexed Fields” > BucketName, I saw it complaining about not being able to determine the schema based on the existing documents. But the documents had not changed! I ended up having to flush the bucket (thankfully this was a development system).

How do I compensate for this? First, I’d like to know how to avoid manually hard-coding DNS names of my nodes, and second, I’d like to know why I encountered that weirdness with just one bucket.

EDIT

I have dug through the logs, and I found some odd things.

Out of nowhere, I see this occur today:

IP address seems to have changed. Unable to listen on 'ns_1@cb-cluster-member-0001.cb-cluster-member.default.svc'. (POSIX error code: 'nxdomain')

This is followed by something similar to this, but for each node in the cluster, 0000, 0001, 0002:

IP address seems to have changed. Unable to listen on 'ns_1@cb-cluster-member-0001.cb-cluster-member.default.svc'. (POSIX error code: 'nxdomain') (repeated 6 times)

Followed by several of these for each node:

Failed to add node cb-cluster-member-0002.cb-cluster-member.default.svc:8091 to cluster. Node already exists in cluster: ns_1@cb-cluster-member-0002.cb-cluster-member.default.svc (repeated 8 times)

Eventually, this happens, and is then followed by a rebalance:

Starting rebalance, KeepNodes = ['ns_1@cb-cluster-member-0000.cb-cluster-member.default.svc',
                                 'ns_1@cb-cluster-member-0001.cb-cluster-member.default.svc'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@cb-cluster-member-0002.cb-cluster-member.default.svc']; no delta recovery nodes

Then, I get this some more:

Failed to add node cb-cluster-member-0003.cb-cluster-member.default.svc:8091 to cluster. Failed to reach erlang port mapper. Failed to resolve address for "cb-cluster-member-0003.cb-cluster-member.default.svc". The hostname may be incorrect or not resolvable.

And then 0003 comes online, followed by a rebalance:

Node ns_1@cb-cluster-member-0003.cb-cluster-member.default.svc joined cluster

And then it fails to add the node:

Failed to add node cb-cluster-member-0003.cb-cluster-member.default.svc:8091 to cluster. Prepare join failed. Could not connect to "cb-cluster-member-0003.cb-cluster-member.default.svc" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers. (repeated 1 times)

Failed to add node cb-cluster-member-0003.cb-cluster-member.default.svc:8091 to cluster. Prepare join failed. Could not connect to "cb-cluster-member-0003.cb-cluster-member.default.svc" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers. (repeated 2 times)


#2

@foxtrotuniform6969

Hi, if you deploy the cluster with persistent volumes, the hostnames remain the same whenever a crash or outage occurs. Without persistent volumes, a down Pod is replaced by a new Pod after auto-failover, because the down Pod cannot be recovered: https://docs.couchbase.com/operator/1.1/persisted-volumes-setup.html

As for the reason why the nodes went down, it looks like your kube-dns service became unresponsive. This has more to do with the networking resources of the GKE cluster itself than with the operator. You can debug it with an nslookup of a Pod's hostname:

kubectl create -f https://k8s.io/examples/admin/dns/busybox.yaml
kubectl exec -ti busybox -- nslookup cb-example-0000.cb-example.default.svc.cluster.local

The nslookup should also return an NXDOMAIN error, indicating that the problem is with DNS. Otherwise, please attach the operator logs and we can take another look.

(See also https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/)
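
To repeat that lookup for every member rather than one Pod at a time, something along these lines may help. It is only a sketch: member_fqdn and check_members are hypothetical helper names, the cluster name cb-cluster-member and the default namespace are taken from the log lines earlier in this thread, and the index list 0 to 3 is an assumption you would adjust to your own cluster. It also assumes the busybox Pod from the command above is still running and that its nslookup exits non-zero when a name does not resolve.

```shell
# Hypothetical helper: build the in-cluster DNS name of a Couchbase Pod.
# Pattern assumed from this thread: <cluster>-<4-digit index>.<cluster>.<namespace>.svc
member_fqdn() {
  cluster="$1"; index="$2"; namespace="${3:-default}"
  printf '%s-%04d.%s.%s.svc\n' "$cluster" "$index" "$cluster" "$namespace"
}

# Hypothetical helper: look up each expected member from inside the busybox Pod.
# The index list (0 1 2 3) is an assumption; change it to match your members.
check_members() {
  for i in 0 1 2 3; do
    host="$(member_fqdn cb-cluster-member "$i")"
    if kubectl exec busybox -- nslookup "$host" >/dev/null 2>&1; then
      echo "OK      $host"
    else
      echo "FAILED  $host"
    fi
  done
}
```

Running check_members after sourcing the snippet prints one line per member. If every name fails, the problem is more likely kube-dns than any single Pod.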


#3

Thanks for the quick reply, Tommy. I’m attempting to add PVs now, but am having trouble. I’ve followed the guide you referenced, but it looks like the PVs are not being created after I apply the configuration. I’ve confirmed that the configuration was applied (one of my new buckets was created) and that I specified a valid storageClassName.


#4

Ok, I recommend verifying your StorageClass, because it’s actually possible to select a storage class that doesn’t dynamically provision volumes. For GKE specifically, the process is documented here:
https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/ssd-pd
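
As a rough way to run that check from the command line, the sketch below reads the provisioner of a StorageClass and lists the claims in a namespace. The function names is_dynamic_provisioner and check_storage_class are hypothetical, and the only rule it applies is an assumption on my part: a class whose provisioner is empty or kubernetes.io/no-provisioner will not create volumes on demand.

```shell
# Succeeds when the provisioner name suggests dynamic provisioning.
# "kubernetes.io/no-provisioner" marks classes that never create volumes on demand;
# an empty value means the class could not be read at all.
is_dynamic_provisioner() {
  case "$1" in
    ""|kubernetes.io/no-provisioner) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: inspect one StorageClass, then list the claims in a namespace.
check_storage_class() {
  class="$1"; namespace="${2:-default}"
  provisioner="$(kubectl get storageclass "$class" -o jsonpath='{.provisioner}')"
  if is_dynamic_provisioner "$provisioner"; then
    echo "$class uses provisioner $provisioner"
  else
    echo "$class does not provision volumes dynamically"
  fi
  # Claims that stay in Pending usually point at a StorageClass problem.
  kubectl get pvc -n "$namespace"
}
```

Call check_storage_class with the storageClassName from your cluster configuration. If the claims show as Pending, kubectl describe pvc on one of them normally shows the event explaining why no volume was created.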