"failed to create persistent volume claim: context deadline exceeded" error on creating Couchbase operator with Persistent storage

jerome · December 3, 2018, 1:35pm

I was trying to set up the Couchbase Operator on IBM Cloud Kubernetes, but ran into an issue while adding persistent storage to the cluster. After running the cbopctl create command, the Couchbase services are created but the Couchbase pods are not. The Persistent Volume stays in the Pending state and then gets deleted on its own. Here’s the error in the operator logs:

time="2018-11-23T09:54:39Z" level=info msg="deleted pod (cb-example-0000)" cluster-name=cb-example module=cluster
time="2018-11-23T09:54:39Z" level=error msg="Cluster setup failed: fail to create member's pod (cb-example-0000): failed to create persistent volume claim: context deadline exceeded" cluster-name=cb-example module=cluster
time="2018-11-23T09:54:39Z" level=warning msg="Fail to handle event: ignore failed cluster (cb-example). Please delete its CR"

Here is the YAML that I used to create the Couchbase cluster:

apiVersion: couchbase.com/v1
kind: CouchbaseCluster
metadata:
  name: cb-example
  namespace: test-op
spec:
  baseImage: couchbase/server
  version: enterprise-5.5.1
  authSecret: cb-example-auth
  exposeAdminConsole: true
  adminConsoleServices:
    - data
  cluster:
    dataServiceMemoryQuota: 256
    indexServiceMemoryQuota: 256
    searchServiceMemoryQuota: 256
    eventingServiceMemoryQuota: 256
    analyticsServiceMemoryQuota: 1024
    indexStorageSetting: memory_optimized
    autoFailoverTimeout: 120
    autoFailoverMaxCount: 3
    autoFailoverOnDataDiskIssues: true
    autoFailoverOnDataDiskIssuesTimePeriod: 120
    autoFailoverServerGroup: false
  buckets:
    - name: default
      type: couchbase
      memoryQuota: 128
      replicas: 1
      ioPriority: high
      evictionPolicy: fullEviction
      conflictResolution: seqno
      enableFlush: true
      enableIndexReplica: false
  servers:
    - size: 3
      name: all_services
      services:
        - data
        - index
        - query
        - search
        - eventing
        - analytics
      pod:
        volumeMounts:
          default: couchbase
          data:  couchbase
          index: couchbase
  securityContext:
    fsGroup: 1000
  volumeClaimTemplates:
    - metadata:
        name: couchbase
      spec:
        storageClassName: "default"
        resources:
          requests:
            storage: 1Gi

The deployment works fine when Persistent Volumes are not added to the YAML. I tried Couchbase Operator 1.0 and 1.1 and got the same error with both.

jerome · December 5, 2018, 8:11am

This is kinda urgent. Any help here would be greatly appreciated.

simon.murray · December 5, 2018, 9:18am

Hi Jerome.

This is a known issue for clouds/storage providers that have poor performance characteristics. At present in Operator <=1.1.0 we have a timeout set for 5 minutes, which is evidently not long enough for IBM Cloud.

We have a fix planned for Operator 1.2.0 (to be released early 2019) that will allow you to override the default timeout.

To my mind 5 minutes to create a persistent volume is somewhat excessive. I’d be interested to know IBM’s take on why this is taking so long. They may be able to offer some workarounds to improve performance in the short term and allow your deployment.

jerome · December 6, 2018, 7:33pm

Thanks for replying.

It’s a lot less than 5 minutes. Here’s the entire log -


time="2018-12-06T19:17:49Z" level=info msg="Janitor process starting" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Setting up client for operator communication with the cluster" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Cluster does not exist so the operator is attempting to create it" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Creating headless service for data nodes" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Creating NodePort UI service (cb-example-ui) for data nodes" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Creating a pod (cb-example-0000) running Couchbase enterprise-5.5.1" cluster-name=cb-example module=cluster
time="2018-12-06T19:19:49Z" level=info msg="deleted pod (cb-example-0000)" cluster-name=cb-example module=cluster
time="2018-12-06T19:19:49Z" level=error msg="Cluster setup failed: fail to create member's pod (cb-example-0000): failed to create persistent volume claim: context deadline exceeded for pvc-couchbase-cb-example-0000-00-index" cluster-name=cb-example module=cluster
time="2018-12-06T19:19:49Z" level=warning msg="Fail to handle event: ignore failed cluster (cb-example). Please delete its CR"

Looking at the logs it times out after exactly 2 minutes. Is this parameter configurable?

jerome · December 10, 2018, 1:59pm

Has this issue been seen before? Should it be raised as an issue in Jira?

raju · December 10, 2018, 7:54pm

@jerome Sorry you are having an issue. Yes, please raise it in Jira and we can follow up there.

jerome · February 18, 2019, 7:37am

@raju Thanks for your response. I don’t think I have permission to create this issue. Could you raise it or help me get the required permissions?

jerome · February 19, 2019, 4:02am

I see a new 1.2-DP operator image. Has this parameter been made configurable yet? I don’t have access to documentation for 1.2. It still times out after 2 minutes.

simon.murray · February 19, 2019, 9:13am

Hi Jerome,

Very observant! That image is for a developer preview release, so the documentation is not public yet. To answer your question: yes, we have addressed your issue.

The 1.2.0 GA is scheduled for release in approximately a month. Keep an eye on our blog and we’ll link to all the documentation and download resources when the time comes.

Regards Si

alston_dmello · February 19, 2019, 1:28pm

@simon.murray
Thank you so much for replying. We are really excited to hear this.
Our problem is that we have a release coming up this month and we need Couchbase setup by the end of this week.

If it’s already a part of the DP build, could you send me the parameter name? (I’ll modify the CRDs accordingly)
If there’s any other way that I could get the fix or if you have any alternate workaround that would be great too.

simon.murray · February 20, 2019, 11:32am

I’ll do my best to help you achieve your milestone then.

So first up in your operator deployment add --pod-create-timeout=10m as an argument. It will accept anything that time.ParseDuration() will consume in golang.

Second, there are a few new attributes in the CouchbaseCluster resource that need to be filled in. Sane defaults are:

spec.adminConsoleServiceType: NodePort
spec.exposedFeatureServiceType: NodePort
spec.buckets[*].CompressionMode: passive

Let me know how you get on!

alston_dmello · February 20, 2019, 1:41pm

Thank you @simon.murray. I really appreciate you helping me with this.

I’ve tried deploying it with the new changes, but I’m not sure what I’m doing wrong. It still times out after 2 minutes.

I’m attaching the operator and Couchbase cluster YAML files along with the operator logs. Could you take a look and let me know if you spot anything?

operator.yaml -

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: couchbase-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: couchbase-operator
  template:
    metadata:
      labels:
        app: couchbase-operator
    spec:
      containers:
      - name: couchbase-operator
        image: couchbase/operator:1.2.0-DP
        command:
        - couchbase-operator
        args:
        - -create-crd
        - -pod-create-timeout=10m
#        - --pod-create-timeout=10m
#        - -enable-upgrades=true # Disable experimental upgrade feature
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        ports:
          - name: readiness-port
            containerPort: 8080
        readinessProbe:
          httpGet:
            path: /readyz
            port: readiness-port
          initialDelaySeconds: 3
          periodSeconds: 3
          failureThreshold: 19
      serviceAccountName: couchbase-operator-test-op

Couchbase-cluster.yaml

apiVersion: couchbase.com/v1
kind: CouchbaseCluster
metadata:
  name: cb-example
  namespace: test-op
spec:
  baseImage: couchbase/server
#  version: 5.5.1
  version: 6.0.1
  authSecret: cb-example-auth
  adminConsoleServiceType: NodePort
  exposedFeatureServiceType: NodePort
  buckets[*]:
    CompressionMode: passive
  exposeAdminConsole: true
  disableBucketManagement: false
  adminConsoleServices:
    - data
  cluster:
    dataServiceMemoryQuota: 256
    indexServiceMemoryQuota: 256
    searchServiceMemoryQuota: 256
    eventingServiceMemoryQuota: 256
    analyticsServiceMemoryQuota: 1024
    indexStorageSetting: memory_optimized
    autoFailoverTimeout: 3600
    autoFailoverMaxCount: 3
    autoFailoverOnDataDiskIssues: false
    autoFailoverOnDataDiskIssuesTimePeriod: 3600
    autoFailoverServerGroup: false
  buckets:
    - name: default
      type: couchbase
      memoryQuota: 128
      replicas: 1
      ioPriority: high
      evictionPolicy: fullEviction
      conflictResolution: seqno
      enableFlush: true
      enableIndexReplica: false
  servers:
    - size: 1
      name: all_services
      services:
        - search
        - eventing
        - analytics
#      pod:
#        volumeMounts:
#          default: couchbase
    - size: 3
      name: data_service
      services:
        - data
      pod:
        volumeMounts:
          default: couchbase
    - size: 2
      name: index_service
      services:
        - index
      pod:
        volumeMounts:
          default: couchbase
    - size: 2
      name: query_service
      services:
        - query
#      pod:
#        volumeMounts:
#          default: couchbase
#          data:  couchbase
#          index: couchbase
  volumeClaimTemplates:
    - metadata:
        name: couchbase
      spec:
        storageClassName: "ibmc-file-gold"
        resources:
          requests:
            storage: 20Gi

operator.logs
time="2019-02-20T13:21:51Z" level=info msg="couchbase-operator v1.2.0 (release)" module=main
time="2019-02-20T13:21:51Z" level=info msg="Obtaining resource lock" module=main
time="2019-02-20T13:21:51Z" level=info msg="Starting event recorder" module=main
time="2019-02-20T13:21:51Z" level=info msg="Attempting to be elected the couchbase-operator leader" module=main
time="2019-02-20T13:22:08Z" level=info msg="I'm the leader, attempt to start the operator" module=main
time="2019-02-20T13:22:08Z" level=info msg="Creating the couchbase-operator controller" module=main
time="2019-02-20T13:22:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"test-op\", Name:\"couchbase-operator\", UID:\"5604a51f-eef4-11e8-81b6-96f8cfb4c54c\", APIVersion:\"v1\", ResourceVersion:\"64054135\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' couchbase-operator-6ff9589d49-gt7hm became leader" module=event_recorder
time="2019-02-20T13:22:08Z" level=info msg="CRD initialized, listening for events..." module=controller
time="2019-02-20T13:23:46Z" level=info msg="Watching new cluster" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Janitor process starting" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Setting up client for operator communication with the cluster" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Cluster does not exist so the operator is attempting to create it" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Creating headless service for data nodes" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Created service cb-example-ui for admin console" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Creating a pod (cb-example-0000) running Couchbase 6.0.1" cluster-name=cb-example module=cluster
time="2019-02-20T13:25:47Z" level=info msg="deleted pod (cb-example-0000)" cluster-name=cb-example module=cluster
time="2019-02-20T13:25:47Z" level=error msg="Cluster setup failed: fail to create member's pod (cb-example-0000): failed to create persistent volume claim: context deadline exceeded for cb-example-0000-default-00" cluster-name=cb-example module=cluster
time="2019-02-20T13:25:47Z" level=warning msg="Fail to handle event: ignore failed cluster (cb-example). Please delete its CR"

I apologize for the length of the post. I do not have permissions to attach files.

simon.murray · February 20, 2019, 2:09pm

No worries, it shows me all I need to know. I see your problem: it seems the PVC wait code is hard-coded to 2 minutes and doesn’t honor the global pod creation timeout. I’ll raise a defect and get it fixed straight away. I’ll have a chat with our project management team to see if there’s anything we can do to help you by Friday.

anil · February 20, 2019, 4:53pm

Hi @jerome , @alston_dmello,

As Simon mentioned, we are working on fixing that issue in the 1.2 release. Just wanted to let you guys know that 1.2 GA is tentatively planned for the April - May timeframe. I would not recommend using a DP version for your release; I would wait for the final GA version instead.

Can you please contact me at anil@couchbase.com? I can assist by giving you an early drop with the fix for testing purposes.

Thanks!

Anil Kumar