Issues with "cluster-autoscaler" and Autonomous Operator


#1

Hi!

A little background: Our cluster is set up with “cluster-autoscaler” to manage the number of instances running at a given time, and we have several “dynamic” environments used by our Dev team, that get created and deleted several times a day.

I’m evaluating the Operator on our platform, but I’m facing an issue when trying to create a CBC: the Operator fails the cluster immediately when its pods become “Unschedulable” because no servers are available. In practice, if it just waited a few minutes, a new node would become available. Is there a way to increase this timeout?

Another option I thought about is to create a kind of Meta-Operator that makes sure resources are available before creating the CBC CustomResource, but I was not able to find the “go-client” Client definitions needed to manage the CR directly from the API. Are they available? If not, can they be made available? :slight_smile: Or, worst case, could you provide the Spec and Status definitions so I can create my own?

Thanks!
Fran


#2

Hi Fran,
Thanks for trying out the operator. We’re planning to address Pod-scheduling issues in an upcoming release of the operator, which will also add a way to customize the Pod creation timeout; that should help resolve this problem. In the meantime, you may want to pursue the workaround you mentioned and check the status of the Kubernetes nodes before deploying a cluster. It sounds like you want to know whether a Node’s status is Ready. You could use the client-go CoreV1 interface to list Nodes and check their conditions that way:
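Here is a minimal sketch of that check. To keep it runnable standalone, the condition logic is shown against simplified stand-in types; in real code you would use the `k8s.io/api/core/v1` types returned by `clientset.CoreV1().Nodes().List(...)`:

```go
package main

import "fmt"

// Simplified stand-ins for corev1.NodeCondition / corev1.Node;
// with client-go you would use the real types from k8s.io/api/core/v1.
type NodeCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

type Node struct {
	Name       string
	Conditions []NodeCondition
}

// isNodeReady reports whether the node has a Ready condition with status True.
func isNodeReady(n Node) bool {
	for _, c := range n.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

// allNodesReady reports whether every node in the list is Ready.
// In a real pre-check you would obtain the list via something like:
//   nodes, err := clientset.CoreV1().Nodes().List(metav1.ListOptions{})
func allNodesReady(nodes []Node) bool {
	for _, n := range nodes {
		if !isNodeReady(n) {
			return false
		}
	}
	return true
}

func main() {
	nodes := []Node{
		{Name: "node-a", Conditions: []NodeCondition{{Type: "Ready", Status: "True"}}},
		{Name: "node-b", Conditions: []NodeCondition{{Type: "Ready", Status: "False"}}},
	}
	fmt.Println(allNodesReady(nodes)) // false: node-b is not Ready
}
```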

The operator doesn’t do any pre-checks on the environment; instead it attempts deployments and keeps retrying until the issues are resolved or a timeout occurs. So in this case, Unschedulable errors will simply lead to retries later on instead of exiting.


#3

Hey Tommie, thanks for your response!

I am looking for the same kind of “autogenerated typed client”, but for the CouchbaseCluster CRD.

That way, from my own “meta-operator”, I can tell when the Couchbase Cluster is up and running, and mark my “meta-CRD” as Ready once the CBC one is.

The ideal workflow for my operator would be:

  1. Create the Instance Groups on Kops (we are using it to manage the cluster itself)
  2. Wait for the IG-managed nodes to be Ready (using the corev1 API you mentioned)
  3. Create the CouchbaseCluster resource
  4. Wait for it to be up
  5. Mark the meta-resource as Ready, so apps can pick it up and start using Couchbase

I understand this is beyond the initial purview of the Autonomous Operator, but with the client definitions available it should be possible for us users to build more complex workflows on top of it. For example, between steps 4 and 5 above we could:

  1. Add application specific users
  2. Restore data from backups, e.g. to copy prod data into dev/staging/qa environments.

Does that make sense?
Thanks!


#4

Hey Fran, thanks for sharing more info about your use case.

The steps you’ve outlined are possible, but currently you have to build your own client for the CBC type. Here’s a good resource on how to do that: https://www.martin-helmich.de/en/blog/kubernetes-crd-client.html
We’ll also consider providing a generated clientset, as it would be very helpful for scenarios like yours.
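As a starting point, the hand-written types for that pattern could look something like the sketch below. The field names here are illustrative assumptions, not the real CRD schema — check the actual CouchbaseCluster resources (e.g. via `kubectl get couchbaseclusters -o yaml`) before relying on them:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hand-written Go types for the CouchbaseCluster resource, following the
// pattern from the blog post above. Field names are illustrative only.
type CouchbaseClusterSpec struct {
	Version string `json:"version"`
	Size    int    `json:"size"`
}

type ClusterCondition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

type CouchbaseClusterStatus struct {
	Conditions []ClusterCondition `json:"conditions"`
}

type CouchbaseCluster struct {
	Spec   CouchbaseClusterSpec   `json:"spec"`
	Status CouchbaseClusterStatus `json:"status"`
}

// isClusterReady is the check that would let a meta-operator mark its own
// resource Ready once the CBC one is.
func isClusterReady(c CouchbaseCluster) bool {
	for _, cond := range c.Status.Conditions {
		if cond.Type == "Ready" && cond.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	// In practice this JSON would come from the apiserver via a REST client
	// built as in the linked post; here it is inlined for illustration.
	raw := `{"spec":{"version":"5.5.0","size":3},
	         "status":{"conditions":[{"type":"Ready","status":"True"}]}}`
	var c CouchbaseCluster
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		panic(err)
	}
	fmt.Println(isClusterReady(c)) // true
}
```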

Another option would be to check the status of the Couchbase cluster over an exposed LoadBalancer Service. You could use the gocb library to check node status directly from the cluster’s REST API: https://godoc.org/github.com/couchbase/gocb#ClusterManagerInternal.GetNodesMetadata
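If you’d rather not depend on gocb, you can also hit the REST API directly. A sketch, assuming the standard `/pools/default` endpoint (which reports a status string per node, `"healthy"` when the node is good) — the parsing is separated out so it runs standalone here, with the HTTP call shown in a comment:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// poolsDefault models the subset of the /pools/default REST response we care
// about: each node reports a status string ("healthy" when good).
type poolsDefault struct {
	Nodes []struct {
		Hostname string `json:"hostname"`
		Status   string `json:"status"`
	} `json:"nodes"`
}

// allNodesHealthy parses a /pools/default response body and reports whether
// every node is "healthy" (and there is at least one node).
func allNodesHealthy(body []byte) (bool, error) {
	var p poolsDefault
	if err := json.Unmarshal(body, &p); err != nil {
		return false, err
	}
	for _, n := range p.Nodes {
		if n.Status != "healthy" {
			return false, nil
		}
	}
	return len(p.Nodes) > 0, nil
}

func main() {
	// In practice the body would come from something like:
	//   http.Get("http://<lb-address>:8091/pools/default") with basic auth.
	body := []byte(`{"nodes":[
		{"hostname":"cb-0001:8091","status":"healthy"},
		{"hostname":"cb-0002:8091","status":"warmup"}]}`)
	ok, _ := allNodesHealthy(body)
	fmt.Println(ok) // false: cb-0002 is still warming up
}
```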


#5

Yup, thanks Tommie, that’s the approach I’ve been taking (I’ve built the client out far enough to get the cluster running; now I’m working on status monitoring and the like…)

As you said, having this provided alongside each operator version would be a great start for solutions like mine!

Again, thanks for the responses!