Couchbase frequently gives BucketNotFoundError

Hi there!

TL;DR:
Since last week, our ETL pipelines, written in Python 3, have frequently raised a bucket-not-found error while trying to connect to our two-node Couchbase cluster, which runs on a local k8s cluster:

couchbase.exceptions._BucketNotFoundError_0x19 (generated, catch BucketNotFoundError): <RC=0x19[The bucket requested does not exist], There was a problem while trying to send/receive your request over the network. This may be a result of a bad network or a misconfigured client or server, C Source=(src/bucket.c,1071)>

The problem can be temporarily solved by completely restarting the Couchbase cluster. Both the k8s cluster and Couchbase have plenty of unused resources and run on the same machines as the ETL pipeline. Queries from the web UI and from cbq inside the server container work normally. I tried drilling into the logs but haven't found any useful clues. We also tried upgrading Couchbase from 6.0.3 to 6.5.1, but the problem is still there. Any hints or help will be greatly appreciated. The k8s and Couchbase configs are as follows:

Kubernetes:

  • two nodes running Ubuntu 18.04, each with about 50% CPU and 60% RAM unused
  • k8s v1.16.3

Couchbase:

  • 2-node cluster
  • server: enterprise 6.5.1 (also tried enterprise 6.0.3)
  • client: Couchbase Python SDK 2.5.1, Python 3.7

Long version:

We have a two-node Couchbase cluster set up on our Kubernetes dev cluster. The Couchbase cluster was set up manually by applying several k8s YAML files, as we don't have a k8s volume provisioner around. Going to a public cloud could make life easier, but that is not an option for us due to security concerns. The cluster was set up at the beginning of this year and has run into this issue a few times before. Typically the ETL pipelines and Couchbase work fine for roughly two months, then the error starts to appear and keeps appearing until we do a complete restart of Couchbase. The current wave started around a week ago, and the error now shows up almost every day.

We use Couchbase as the main storage for a few Apache Airflow ETL pipelines under development. The pipelines crawl PDF files from the internet and store the metadata, as well as the text extracted from the PDFs, in Couchbase. We have three active buckets. The one that causes problems has around 12k documents and a few indexes on different short string fields. This bucket does not receive many new documents per day, but each document can be up to several MB, and our post-processing revisits a document multiple times after it is created. Other pipelines, which connect to other buckets holding a larger number of smaller documents, are working normally. It would be a great help if someone could point me to the relevant log or code.
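
To help tell a transient failure from a persistent one, a retry-and-log wrapper around the bucket-open call could look roughly like the sketch below. This is not our actual pipeline code: the function name open_with_retry, the logger name, and all parameters are made up for illustration, and the helper is deliberately SDK-agnostic (it just takes a callable). With the 2.x Python SDK I would expect to pass something like lambda: cluster.open_bucket("my_bucket") and retry_on=(couchbase.exceptions.BucketNotFoundError,), where "my_bucket" is a placeholder.

```python
import logging
import time

# Hypothetical logger name, chosen only for this sketch.
log = logging.getLogger("etl.couchbase")


def open_with_retry(open_bucket, retry_on=(Exception,), attempts=5,
                    base_delay=1.0, sleep=time.sleep):
    """Call open_bucket() until it succeeds or the attempts run out.

    open_bucket: zero-argument callable that opens and returns a bucket.
    retry_on:    tuple of exception types that should trigger a retry.
    attempts:    maximum number of calls to open_bucket.
    base_delay:  delay in seconds before the second attempt; doubles each time.
    sleep:       injectable sleep function, so tests do not have to wait.
    """
    for attempt in range(1, attempts + 1):
        try:
            return open_bucket()
        except retry_on as exc:
            # Log every failure with a timestamped record so the failures
            # can later be lined up against the server-side logs.
            log.warning("bucket open failed (attempt %d/%d): %r",
                        attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
```

If the error clears after one or two retries, that would point toward something transient on the network or bootstrap side. If every attempt fails until the cluster is restarted, it would point toward server-side state.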

Thank you in advance.

-Jeff