Helm Chart - Cluster Setup Failed

Hello,

I’m using both the couchbase-operator and couchbase-cluster charts as subcharts in my application. This means that couchbasecluster object is already present when the operator pod starts up. On some installs, the cluster setup fails with this message in the operator logs:

time=“2019-06-24T18:16:33Z” level=info msg=“couchbase-operator v1.2.0 (release)” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Obtaining resource lock” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Starting event recorder” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Attempting to be elected the couchbase-operator leader” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Event(v1.ObjectReference{Kind:“Endpoints”, Namespace:“txt”, Name:“couchbase-operator”, UID:“326a87a4-96ac-11e9-890f-eabc2b90ead3”, APIVersion:“v1”, ResourceVersion:“962566”, FieldPath:”"}): type: ‘Normal’ reason: ‘LeaderElection’ txt-couchbase-operator-5969b769bb-cn88q became leader" module=event_recorder
time=“2019-06-24T18:16:33Z” level=info msg=“I’m the leader, attempt to start the operator” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Creating the couchbase-operator controller” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“CRD initialized, listening for events…” module=controller
time=“2019-06-24T18:16:33Z” level=info msg=“Watching new cluster” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:16:33Z” level=info msg=“Janitor process starting” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:16:33Z” level=info msg=“Setting up client for operator communication with the cluster” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:16:33Z” level=info msg=“Cluster does not exist so the operator is attempting to create it” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:17:04Z” level=error msg=“Cluster setup failed: cluster create: failed to update cluster phase (Creating): Timeout: request did not complete within requested timeout 30s” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:17:04Z” level=warning msg=“Fail to handle event: ignore failed cluster (txt-couchbase-cluster). Please delete its CR”

Other times, the cluster is created successfully. It seems that there is a race condition between the operator and the admission controller pods. In cases where the admission controller pod starts before the operator pod, the cluster is created successfully. In cases where the operator pod starts first, I always see the above error. Once I see that error I can manually remove the couchbasecluster and recreate and the cluster creates successfully. This is obviously not a good solution for production.

Should there be some sort of retry performed if the cluster fails due to what I assume is an issue communicating with the admission controller?

Hi Tom,

The admission controller mainly validates the cluster, although this also means that the admission controller will try to inject default values that may be missing from the cluster spec. We’ve tried to ensure that the default cluster values wouldn’t be dependent on having the admission controller running, which means the cluster should still start if you set :
deployments.admissionController=false.
However, if the cluster never starts without the admission controller (even with retrying), then I suspect the issue is some value isn’t being injected in time due to a race at startup. And if that’s the case I’ll need to see your chart values+overrides to look further.

What also seems telling is this timeout:
Timeout: request did not complete within requested timeout 30
Which can occur in cluster using many Persistent volumes or slow network pulling images. Try increasing pod-create timeout by adding the following to values file:

couchbaseOperator:
   # name of the couchbase operator
  name: "couchbase-operator"
....
   commandArgs:
     # Set to false if you are installing the CRD manually
     create-crd: true
     pod-create-timeout: 120
1 Like

Thanks. I already had the pod-create-timeout setting at 120 seconds as that was my first thought on what the problem might be. As you can see from the logs, the timeout still occurs at 30 seconds which leads me to believe that the issue is not with the pod creation, but some part of the process further upstream.

I disabled the admission-controller on your recommendation and that does seem to solve the immediate problem. The cluster is consistently created now. However, this does seem like a poor work around as having the additional validation of the admission-controller would be a good thing. If this is something that is being tracked in a bug I would like to follow it so that I can restore that functionality when the issue is resolved.

Thanks again for your help.

Ok thanks Tom,
Passing with admission controller disabled is actually the opposite of what I expected. From the original post it sounded like Pod creation fails unless the admission controller starts first.

The workaround was just to help understand the problem. We don’t have a ticket for this yet, but if you can share your charts I can try and reproduce the issue and see what’s happening here.
Also, I’m assuming everything works non-helm to exclude anything from operator itself.

Yes - when I deploy without helm, things work just fine. It seems to be the fact that when the operator starts up, there is already a couchbasecluster defined and if the admission-controller isn’t there, things fail. Specifically, it seems like this is the key error message:

Cluster setup failed: cluster create: failed to update cluster phase (Creating)

My chart is using our own custom containers for the application side so I can’t send it over as-is. Let me put together a more simplistic example and post back here.

I’ve attached a super simple example chart. I it installed with this command:

helm install -n cbtest --namespace cbtest cbtest-0.1.0.tgz

It failed to create the cluster with the same error as I posted earlier on the first attempt but succeeded on the second.

cbtest-0.1.0.tgz.zip (8.8 KB)

Thanks Tom,

Looks like the pod-timeout indentation is off a few spaces and the timeout wasn’t being set in your example here.
Running as is, I did get a similar timeout on my setup, but fixed it by correcting the indentation in your file. Everything created in one shot. Here’s the snippet I fixed, and full value file attached,

couchbase-operator:
  deployments:
    admissionController: true
  rbac:
    clusterRoleAccess: false
  couchbaseOperator:
    commandArgs:
      create-crd: false
      pod-create-timeout: 120s

I also shifted couchbaseOperator to the left, previously it was nested under rbac.

Hopefully this works in your setup as well
values.yaml.zip (692 Bytes)

Whoops - that’s a mistake on my part. However, I updated my test chart with your values.yaml and I’m still seeing the same problem:

time=“2019-06-27T00:41:38Z” level=info msg=“couchbase-operator v1.2.0 (release)” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Obtaining resource lock” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Starting event recorder” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Attempting to be elected the couchbase-operator leader” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“I’m the leader, attempt to start the operator” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Creating the couchbase-operator controller” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Event(v1.ObjectReference{Kind:“Endpoints”, Namespace:“cbtest”, Name:“couchbase-operator”, UID:“52caf658-9874-11e9-8d44-4e5271016bf8”, APIVersion:“v1”, ResourceVersion:“1456953”, FieldPath:”"}): type: ‘Normal’ reason: ‘LeaderElection’ cbtest-couchbase-operator-76f7996cbb-srj58 became leader" module=event_recorder
time=“2019-06-27T00:41:38Z” level=info msg=“Watching new cluster” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:41:38Z” level=info msg=“Janitor process starting” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:41:38Z” level=info msg=“Setting up client for operator communication with the cluster” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:41:38Z” level=info msg=“Cluster does not exist so the operator is attempting to create it” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:42:08Z” level=error msg=“Cluster setup failed: cluster create: failed to update cluster phase (Creating): Timeout: request did not complete within requested timeout 30s” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:42:08Z” level=warning msg=“Fail to handle event: ignore failed cluster (cbtest-couchbase-cluster). Please delete its CR”

On the second attempt with no changes everything worked.

Ok, thanks for giving it a try.
Which version of kubernetes is this? And also which platform are you running on? ie… AKS, AWS, minikube?

After verifying the chart you sent works - at least in AWS, I’m wondering if something environment related is happening. Specifically because the cause of error is due to the operator attempting to update the cluster status to ‘Creating’ which is a trivial task being sent over k8s rest api that shouldn’t timeout. This is pre-Pod creation.
Also, assuming just 1 operator is running in 1 namespace at a time?

Only thing I can think of as a fix on our end is to apply the timeout to this section of code that updates status. Because, if you have it set to 120s then it shouldn’t fail ‘within requested timeout 30s’. Currently timeout only applies to pod creation process.

No problem. I’m running this in AKS, and yes, just one operator per namespace. Right now, as a work around, I’m just running without the admission-controller which seems to work every time. Thanks for your help tracking this down.

Sure, and what is the kubernetes version you’re using in AKS… 1.12, 1.13?
We’ll see what we can find, and If this becomes really problematic to workaround please also provide output from https://docs.couchbase.com/operator/1.2/cbopinfo.html and I’ll take another look.
thanks!

Sorry for the delay - I was out of the office for a few days. All my testing has been done on Kubernetes 1.13.5.

I captured the cbopinfo output from both a good and bad run. I’m not totally sure if there is any sensitive information included, so there is there some way I can send those to you privately?

Thanks,
-Tom