Couchbase Cluster on Kubernetes failing readiness probe

edavidj · July 10, 2019, 7:47pm

I’ve been working through the following guide and its prerequisite steps, but I’ve consistently run into problems deploying the Couchbase cluster itself, across multiple platforms now. I’ve created the operator and it seems to have deployed correctly, but the Couchbase Server pod fails its readiness probe. On Minishift (with the suggested CPU/memory settings), using the files from the provided Couchbase package, it always spawns one cb-example pod instead of three, and that pod never reaches a ready state.

NAME                                  READY   STATUS    RESTARTS   AGE
cb-example-0000                       0/1     Running   0          6s

There don’t seem to be any error logs in /opt/couchbase/var/lib/couchbase/logs; otherwise I would include them. I understand this isn’t a lot of information to work from, but I’m not sure what else would be relevant. Here’s the output of describing the failing pod:

Name:               cb-example-0000
Namespace:          myproject
Priority:           0
PriorityClassName:  <none>
Node:               localhost/10.0.2.15
Start Time:         Wed, 10 Jul 2019 15:36:11 -0400
Labels:             app=couchbase
                    couchbase_cluster=cb-example
                    couchbase_node=cb-example-0000
                    couchbase_node_conf=all_services
                    couchbase_service_analytics=enabled
                    couchbase_service_data=enabled
                    couchbase_service_eventing=enabled
                    couchbase_service_index=enabled
                    couchbase_service_query=enabled
                    couchbase_service_search=enabled
Annotations:        openshift.io/scc: restricted
                    operator.couchbase.com/version: 1.2.0
                    server.couchbase.com/version: enterprise-6.0.1
Status:             Running
IP:                 <ip>
Controlled By:      CouchbaseCluster/cb-example
Containers:
  couchbase-server:
    Container ID:   docker://d97a07fa7e96f16da65e589f62993c1d4d03f0f350fd976fe0e9cc3f5317bd60
    Image:          couchbase/server:enterprise-6.0.1
    Image ID:       docker-pullable://docker.io/couchbase/server@sha256:b5ab04755fa2844196e34eb4e2e551c309b3632ce01f3d080153f29db76e9376
    Ports:          8091/TCP, 8092/TCP, 8093/TCP, 8094/TCP, 8095/TCP, 8096/TCP, 9100/TCP, 9101/TCP, 9102/TCP, 9103/TCP, 9104/TCP, 9105/TCP, 9110/TCP, 9111/TCP, 9112/TCP, 9113/TCP, 9114/TCP, 9115/TCP, 9116/TCP, 9117/TCP, 9118/TCP, 9119/TCP, 9120/TCP, 9121/TCP, 9122/TCP, 11207/TCP, 11210/TCP, 11211/TCP, 11214/TCP, 11215/TCP, 18091/TCP, 18092/TCP, 18093/TCP, 18094/TCP, 18095/TCP, 18096/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Wed, 10 Jul 2019 15:36:12 -0400
    Ready:          False
    Restart Count:  0
    Readiness:      exec [test -f /tmp/ready] delay=10s timeout=5s period=20s #success=1 #failure=1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cwbxb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-cwbxb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cwbxb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason     Age   From                Message
  ----     ------     ----  ----                -------
  Normal   Scheduled  27s   default-scheduler   Successfully assigned myproject/cb-example-0000 to localhost
  Normal   Pulled     27s   kubelet, localhost  Container image "couchbase/server:enterprise-6.0.1" already present on machine
  Normal   Created    27s   kubelet, localhost  Created container
  Normal   Started    26s   kubelet, localhost  Started container
  Warning  Unhealthy  6s    kubelet, localhost  Readiness probe failed:

And the kubectl logs output:

whoami: cannot find name for user ID 1000140000
Starting Couchbase Server -- Web UI available at http://<ip>:8091
and logs available in /opt/couchbase/var/lib/couchbase/logs
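The whoami line above is probably a symptom rather than the failure itself, but it is a useful clue: whoami prints that message when the process’s UID has no entry in the container’s /etc/passwd. Under OpenShift’s restricted SCC (note the openshift.io/scc: restricted annotation in the pod description), containers typically run as an arbitrary UID from the project’s allocated range, such as 1000140000 here. A minimal Python sketch of the same lookup, for checking from inside a container whether the running UID resolves to a user name; the function name is hypothetical:

```python
import os
import pwd
from typing import Optional


def user_name_for_uid(uid: int) -> Optional[str]:
    """Return the passwd name for uid, or None if there is no entry.

    A missing entry is the same condition that makes whoami print
    "cannot find name for user ID ...".
    """
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No /etc/passwd entry: typical when OpenShift assigns an arbitrary UID.
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = user_name_for_uid(uid)
    if name is None:
        print(f"UID {uid} has no passwd entry (arbitrary UID?)")
    else:
        print(f"UID {uid} is {name}")
```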

I’ve now tried the guide on an OKD cluster, Docker for Desktop, and Minishift, and run into the same error every time, which leads me to believe I’m missing something fundamental to making the guide work. Any suggestions or help would be greatly appreciated. Thanks!

tommie · July 10, 2019, 9:29pm

Hi Ethan,

If you’re running on Minishift or any flavor of OpenShift, you should use the Couchbase images from the Red Hat catalog. The installation steps are in the docs here: https://docs.couchbase.com/operator/1.2/install-openshift.html#create-an-imagepullsecret
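The linked step comes down to giving the cluster credentials for the Red Hat registry through an image pull secret. For reference, a kubernetes.io/dockerconfigjson secret (the kind oc create secret docker-registry produces) has a small, well-defined shape, and the Python sketch below builds that manifest as a dict. The secret name, registry host, and credentials are placeholders, not values taken from the Couchbase docs, and the secret still has to be referenced as the linked docs describe:

```python
import base64
import json


def b64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def pull_secret_manifest(name: str, namespace: str, server: str,
                         username: str, password: str) -> dict:
    """Build a kubernetes.io/dockerconfigjson Secret as a Python dict."""
    docker_config = {
        "auths": {
            server: {
                "username": username,
                "password": password,
                # "auth" is base64("username:password"), as docker login stores it.
                "auth": b64(f"{username}:{password}"),
            }
        }
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        # Values under "data" must themselves be base64-encoded.
        "data": {".dockerconfigjson": b64(json.dumps(docker_config))},
    }


if __name__ == "__main__":
    # Placeholder values only; substitute your own registry and credentials.
    manifest = pull_secret_manifest("rh-catalog", "myproject",
                                    "registry.example.com", "user", "secret")
    print(json.dumps(manifest, indent=2))
```

Since JSON is valid YAML, the printed manifest can be piped to oc apply -f - if you prefer that over oc create secret.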

simon.murray · July 11, 2019, 9:28am

Also, pods don’t become ready until Couchbase data is fully balanced across the cluster and the cluster can tolerate a pod being evacuated (this is for Kubernetes upgrade support). Check the operator logs for anything going wrong that would prevent the readiness check from being updated.
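The readiness probe in the pod description is just exec [test -f /tmp/ready] with #success=1 and #failure=1, so the pod is reported Ready only while that file exists, and a single failed check is enough to mark it not ready again. Going by the explanation above, it is the operator that makes the check pass once the node is balanced (presumably by creating that file; this is an inference from the probe command, not something stated in the thread). The Python sketch below models those probe semantics only; it is not Kubernetes or operator code, and the class and function names are hypothetical:

```python
import os
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ExecReadinessProbe:
    """Probe settings as shown in the pod description above."""
    path: str = "/tmp/ready"      # the probe command is: test -f /tmp/ready
    period_s: int = 20            # how often the kubelet runs the check
    success_threshold: int = 1    # consecutive successes needed to become Ready
    failure_threshold: int = 1    # consecutive failures needed to become NotReady

    def check_once(self) -> bool:
        # Like test -f: true only if the path exists and is a regular file.
        return os.path.isfile(self.path)


def readiness_timeline(results: Iterable[bool],
                       probe: ExecReadinessProbe) -> List[bool]:
    """Pod readiness after each probe result (the pod starts NotReady)."""
    ready = False
    successes = failures = 0
    timeline = []
    for ok in results:
        if ok:
            successes, failures = successes + 1, 0
            if successes >= probe.success_threshold:
                ready = True
        else:
            failures, successes = failures + 1, 0
            if failures >= probe.failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


if __name__ == "__main__":
    probe = ExecReadinessProbe()
    print("ready file present:", probe.check_once())
    # With #failure=1, one missed check drops the pod out of Ready.
    print(readiness_timeline([True, True, False, True], probe))
    # [True, True, False, True]
```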

edavidj · July 11, 2019, 3:42pm

Hi Tommie,

Thanks for the response. Is that a requirement for every flavour of OpenShift? We’re trying to integrate Couchbase into an operator guide for our organization, and the authentication requirements of the Red Hat catalog aren’t ideal for that purpose. It would be a lot smoother for a wider audience if we could use the open source images, although I understand that wouldn’t be best practice for a production use case.

edavidj · July 11, 2019, 7:32pm

Hi Simon,

Sorry for the late reply. I looked into the operator logs as suggested, and they do give some additional context. The cluster seems to be failing to start in general, although I’m not sure exactly why.

Here’s the attempt to create the cluster:

time="2019-07-10T18:29:44Z" level=info msg="Creating the couchbase-operator controller" module=main
time="2019-07-10T18:39:44Z" level=info msg="Watching new cluster" cluster-name=cb-example module=cluster
time="2019-07-10T18:39:44Z" level=info msg="Janitor process starting" cluster-name=cb-example module=cluster
time="2019-07-10T18:39:44Z" level=info msg="Setting up client for operator communication with the cluster" cluster-name=cb-example module=cluster
time="2019-07-10T18:39:44Z" level=info msg="Cluster does not exist so the operator is attempting to create it" cluster-name=cb-example module=cluster
time="2019-07-10T18:39:44Z" level=info msg="Creating headless service for data nodes" cluster-name=cb-example module=cluster
time="2019-07-10T18:39:44Z" level=info msg="Created service cb-example-ui for admin console" cluster-name=cb-example module=cluster
time="2019-07-10T18:39:44Z" level=info msg="Creating a pod (cb-example-0000) running Couchbase enterprise-6.0.1" cluster-name=cb-example module=cluster
time="2019-07-10T18:49:44Z" level=warning msg="member cb-example-0000 creation failed" cluster-name=cb-example module=cluster
...

I believe the relevant error message from further into the logs is:

time="2019-07-10T18:49:45Z" level=warning msg="  phase: Running" cluster-name=cb-example module=cluster
time="2019-07-10T18:49:45Z" level=warning msg="  qosClass: BestEffort" cluster-name=cb-example module=cluster
time="2019-07-10T18:49:45Z" level=warning msg="  startTime: \"2019-07-10T18:39:45Z\"" cluster-name=cb-example module=cluster
time="2019-07-10T18:49:45Z" level=warning cluster-name=cb-example module=cluster
time="2019-07-10T18:49:45Z" level=info msg="deleted pod (cb-example-0000)" cluster-name=cb-example module=cluster
time="2019-07-10T18:49:45Z" level=error msg="Cluster setup failed: context deadline exceeded: Connection error - dial tcp <ip>:8091: connect: connection refused" cluster-name=cb-example module=cluster

I’ve tried adjusting my setup but can’t seem to avoid this error. I imagine the cause isn’t clear from that alone, but I figured there might be an obvious issue I’m missing.
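One detail worth pulling out of these logs is the timing: the operator logged "Creating a pod" at 18:39:44 and "member cb-example-0000 creation failed" at 18:49:44, so it waited about ten minutes for Couchbase Server to answer on port 8091 before giving up with "context deadline exceeded". A small Python sketch that extracts that gap from log lines like the ones above; the helper names are hypothetical, and the regex assumes the time="..." prefix format shown in this thread:

```python
import re
from datetime import datetime
from typing import Iterable

# Matches the time="2019-07-10T18:39:44Z" prefix used in the operator logs above.
TIME_RE = re.compile(r'^time="([^"]+)"')


def log_time(line: str) -> datetime:
    match = TIME_RE.match(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    # The operator logs UTC timestamps with a trailing Z.
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")


def seconds_between(first_marker: str, second_marker: str,
                    lines: Iterable[str]) -> float:
    """Seconds from the first line containing first_marker to the next
    line containing second_marker."""
    start = None
    for line in lines:
        if start is None and first_marker in line:
            start = log_time(line)
        elif start is not None and second_marker in line:
            return (log_time(line) - start).total_seconds()
    raise ValueError("markers not found in order")


LOG = [
    'time="2019-07-10T18:39:44Z" level=info msg="Creating a pod (cb-example-0000) running Couchbase enterprise-6.0.1" cluster-name=cb-example module=cluster',
    'time="2019-07-10T18:49:44Z" level=warning msg="member cb-example-0000 creation failed" cluster-name=cb-example module=cluster',
]

if __name__ == "__main__":
    print(seconds_between("Creating a pod", "creation failed", LOG))  # 600.0
```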

simon.murray · July 12, 2019, 1:31pm

From what I can tell, the packets are making it to the pod, so at least L3 and DNS are working. It seems that Couchbase Server isn’t coming up for some reason and so never binds to port 8091. As Tommie says, this may be due to the FOSS image doing something that OpenShift’s strict security setup (which runs containers as a random UID) inhibits. I’d definitely try the RHEL images first if at all possible, as those can run as any random UID.
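This reasoning relies on how a TCP connect attempt fails: a name-resolution error points at DNS, a timeout usually means packets are being dropped on the way, while "connection refused" means the host answered but nothing was listening on that port. A minimal Python sketch that makes the same distinction for a given host and port; the function name and result strings are hypothetical:

```python
import socket


def classify_connect(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Try one TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"          # something is listening on the port
    except socket.gaierror:
        return "dns-failure"       # the host name did not resolve
    except ConnectionRefusedError:
        return "refused"           # host reachable, but nothing bound to the port
    except socket.timeout:
        return "timeout"           # no reply at all: packets likely dropped
    except OSError as exc:
        return f"other-error: {exc}"
```

Run from another pod against the Couchbase pod’s IP and port 8091, it should keep returning "refused" for as long as Couchbase Server has not bound that port, matching the operator log.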

edavidj · July 12, 2019, 4:47pm

I’ll give the RHEL images a try. Thanks for the help!

Dexter11A · July 23, 2019, 11:07am

Once the application and cluster are all set up on Kubernetes, we will test some scaling and failure scenarios.

ruslantkachuk · October 1, 2019, 9:48pm

Just run:
oc adm policy add-scc-to-user anyuid system:serviceaccount::default

More information:
https://docs.okd.io/latest/admin_guide/manage_scc.html#enable-dockerhub-images-that-require-root
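In OpenShift and Kubernetes, a service account is addressed as a user named system:serviceaccount:<namespace>:<serviceaccount>, so the user argument in the command above should have a project name between the two colons (it appears to have been lost from the post). A Python sketch that assembles the arguments, assuming the project is myproject (the namespace shown in the pod description earlier in this thread) and the service account is default; it only builds and prints the command rather than invoking oc, and the function names are hypothetical:

```python
import shlex
from typing import List


def service_account_user(namespace: str, name: str) -> str:
    """Format the user name OpenShift uses for a service account."""
    if not namespace or not name:
        raise ValueError("namespace and service account name are required")
    return f"system:serviceaccount:{namespace}:{name}"


def add_anyuid_args(namespace: str, service_account: str = "default") -> List[str]:
    """Arguments for granting the anyuid SCC to one service account."""
    return ["oc", "adm", "policy", "add-scc-to-user", "anyuid",
            service_account_user(namespace, service_account)]


if __name__ == "__main__":
    # "myproject" is the namespace from the pod description in this thread.
    print(shlex.join(add_anyuid_args("myproject")))
    # oc adm policy add-scc-to-user anyuid system:serviceaccount:myproject:default
```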