Suddenly unable to connect to cluster

Hello,

I am suddenly no longer able to connect to my Couchbase cluster; the only workaround I found was to roll my codebase back to couchbase-sdk 2.x.
I recently migrated to 3.0, which was working very well, but now it suddenly fails in every scenario: both on my local dev setup with Node.js and Couchbase and on my local docker-compose setup.
I have already tried debugging, including with sdk-doctor. General debugging gives me no meaningful information, and sdk-doctor reports no errors:

NodeJS Error: Error: cluster object was closed

cluster: Cluster {
    _connStr: 'couchbase://localhost/',
    _trustStorePath: undefined,
    _kvTimeout: undefined,
    _kvDurableTimeout: undefined,
    _viewTimeout: undefined,
    _queryTimeout: undefined,
    _analyticsTimeout: undefined,
    _searchTimeout: undefined,
    _managementTimeout: undefined,
    _auth: { username: 'xxx', password: 'xxx' },
    _closed: false,
    _clusterConn: null,
    _conns: {
        fwdisplay: Connection {
            _inst: CbConnection {},
            _closed: true,
            _pendOps: [],
            _pendBOps: [],
            _connected: false,
            _opened: true
        }
    },
    _transcoder: DefaultTranscoder {},
    _logFunc: undefined
}

SDK Doctor summary:
[WARN] Your connection string specifies only a single host. You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
[WARN] Could not test Analytics service on 127.0.0.1 as it was not in the config

This is the connection setup:

this.cluster = new couchbase.Cluster(`couchbase://${COUCHBASE_HOSTNAME}/`, {
    username: COUCHBASE_USERNAME,
    password: COUCHBASE_PASSWORD
})
this.bucket = this.cluster.bucket(COUCHBASE_BUCKET)
this.collection = this.bucket.defaultCollection()
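
For what it’s worth, here is a variant of that setup I have been experimenting with: it allows listing several static nodes in the connection string (as the sdk-doctor warning above suggests) and rebuilds the cluster object when an operation fails with the "cluster object was closed" error. This is only a minimal sketch under my own assumptions; the retry-once logic is a workaround idea, not a confirmed fix from the SDK docs.

const couchbase = require('couchbase')

// The sdk-doctor warning suggests listing more than one static node, e.g.
// couchbase://host1,host2,host3 instead of a single hostname.
const CONN_STR = `couchbase://${COUCHBASE_HOSTNAME}/`

let cluster, bucket, collection

function connect () {
    // Same SDK 3.x constructor as above
    cluster = new couchbase.Cluster(CONN_STR, {
        username: COUCHBASE_USERNAME,
        password: COUCHBASE_PASSWORD
    })
    bucket = cluster.bucket(COUCHBASE_BUCKET)
    collection = bucket.defaultCollection()
}

connect()

// Wrap operations so that a "cluster object was closed" error triggers a
// one-off reconnect instead of leaving the app permanently broken.
async function withReconnect (op) {
    try {
        return await op()
    } catch (err) {
        if (/cluster object was closed/i.test(err.message)) {
            connect()       // drop the broken cluster object and rebuild it
            return op()     // retry once against the fresh connection
        }
        throw err
    }
}

// Usage: await withReconnect(() => collection.get('some-key'))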

Does anyone have any ideas as to what to change to fix it?

Had the same issue and had to roll back.

Is anyone else having this? Is it resolved in 3.1?

Did your setup work at first on 3.x and then stop working? In my case, it did.

I now tried the sample https://github.com/couchbaselabs/try-cb-nodejs/tree/6.5-collections, but it crashes as well, leading me to believe that the error is not on my end.

Same here: initially it worked, but once it stops working my API is essentially broken. I can’t restart Couchbase or Sync Gateway periodically just because the Node.js library is broken, hence my rollback to the previous release.

Hello, I am assuming your problem aligns with the topic "Cluster closed - reinitialize connection"?

This seems to be a bug in the underlying libcouchbase and will be fixed in the upcoming release.

There seem to be plenty of bug reports for the infamous "cluster object was closed" issue around here for the Node SDK… I just wanted to chime in on this thread with my own report.

Context

  • Couchbase (6.6.0 and 6.6.1), installed via the Kubernetes operator
  • Node SDK 3.0.4 to 3.1.1 (tried all versions), used with intra-cluster DNS and the recommended connection string (the my-cluster-srv DNS service)
  • ~4 pods with the same nodejs app connected to the cluster, and 2 sync gateway pods

Behaviour

Things seem to work fine for a few minutes, until suddenly, on one given container (but NOT the others), the logs (enabled with DEBUG=couchnode:lcb:error) are flooded with these errors:

2021-01-21T07:46:53.722Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
2021-01-21T07:46:53.726Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
2021-01-21T07:46:53.730Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)

Until it all comes to a stop with this error:

FATAL ERROR:
    libcouchbase experienced an unrecoverable error and terminates the program
    to avoid undefined behavior.
    The program should have generated a "corefile" which may used
    to gather more information about the problem.
    If your system doesn't create "corefiles" I can tell you that the
    assertion failed in ../deps/lcb/src/mcserver/negotiate.cc at line 50

This does not crash the container, but it seems to make it hang somehow: there are no more logs (including application logs), and the port the app listens on becomes unresponsive, causing my livenessProbe to fail and Kubernetes to eventually kill and restart the container.
Other pods seem fine at the same time, but they will also randomly fail in the same way.
Sync Gateway is fine all along.
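
One stopgap I have been considering for the hanging behaviour is a small liveness endpoint that runs a cheap KV call with a short timeout, so the probe fails quickly once the SDK wedges and Kubernetes restarts the pod sooner. A rough sketch of the idea, assuming the SDK 3.x per-operation timeout option (in milliseconds); the port, path and document key are placeholders of mine:

const http = require('http')
const couchbase = require('couchbase')

// `collection` is assumed to be the Collection opened at startup, e.g.:
//   const cluster = new couchbase.Cluster(connStr, { username, password })
//   const collection = cluster.bucket(bucketName).defaultCollection()

// Minimal liveness endpoint: /healthz answers 200 only if a trivial KV call
// completes within 2 seconds. 'liveness-probe-doc' is a placeholder key that
// does not need to exist, since a "document not found" error still proves the
// SDK is responsive (error class name assumed for SDK 3.x).
http.createServer(async (req, res) => {
    if (req.url !== '/healthz') { res.writeHead(404); return res.end() }
    try {
        await collection.get('liveness-probe-doc', { timeout: 2000 })
        res.writeHead(200); res.end('ok')
    } catch (err) {
        if (err instanceof couchbase.DocumentNotFoundError) {
            res.writeHead(200); res.end('ok')
        } else {
            res.writeHead(503); res.end('couchbase unresponsive: ' + err.message)
        }
    }
}).listen(8080)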

Diags:

Couchbase UI is fine.

sdk-doctor never seems to complain:

|====================================================================|
|          ___ ___  _  __   ___   ___   ___ _____ ___  ___           |
|         / __|   \| |/ /__|   \ / _ \ / __|_   _/ _ \| _ \          |
|         \__ \ |) | ' <___| |) | (_) | (__  | || (_) |   /          |
|         |___/___/|_|\_\  |___/ \___/ \___| |_| \___/|_|_\          |
|                                                                    |
|====================================================================|

Note: Diagnostics can only provide accurate results when your cluster
 is in a stable state.  Active rebalancing and other cluster configuration
 changes can cause the output of the doctor to be inconsistent or in the
 worst cases, completely incorrect.

08:54:32.016 INFO ▶ Parsing connection string `couchbase://oaf-couchbase-srv.default.svc.cluster.local/fs-bucket-v0`
08:54:32.016 INFO ▶ Connection string was parsed as a potential DNS SRV record
08:54:32.020 INFO ▶ Connection string identifies the following CCCP endpoints:
08:54:32.020 INFO ▶   1. 10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local:11210
08:54:32.020 INFO ▶   2. 10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local:11210
08:54:32.020 INFO ▶   3. 10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local:11210
08:54:32.020 INFO ▶ Connection string identifies the following HTTP endpoints:
08:54:32.020 INFO ▶ Connection string specifies bucket `fs-bucket-v0`
08:54:32.027 WARN ▶ The hostname specified in your connection string resolves both for SRV records, as well as A records.  This is not suggested as later DNS configuration changes could cause the wrong servers to be contacted
08:54:32.027 INFO ▶ Performing DNS lookup for host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local`
08:54:32.029 INFO ▶ Bootstrap host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local` refers to a server with the address `10.32.0.19`
08:54:32.030 INFO ▶ Performing DNS lookup for host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local`
08:54:32.031 INFO ▶ Bootstrap host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local` refers to a server with the address `10.36.0.7`
08:54:32.032 INFO ▶ Performing DNS lookup for host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local`
08:54:32.034 INFO ▶ Bootstrap host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local` refers to a server with the address `10.35.0.39`
08:54:32.034 INFO ▶ Attempting to connect to cluster via CCCP
08:54:32.035 INFO ▶ Attempting to fetch config via cccp from `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local:11210`
08:54:32.042 INFO ▶ Attempting to fetch config via cccp from `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local:11210`
08:54:32.050 INFO ▶ Attempting to fetch config via cccp from `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local:11210`
08:54:32.054 WARN ▶ Bootstrap host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0005.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
08:54:32.054 WARN ▶ Bootstrap host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0003.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
08:54:32.054 WARN ▶ Bootstrap host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0004.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
08:54:32.054 INFO ▶ Selected the following network type: external
08:54:32.054 INFO ▶ Identified the following nodes:
08:54:32.054 INFO ▶   [0] 95.216.208.78
08:54:32.054 INFO ▶                  mgmtSSL: 30971,    eventingAdminPort: 30535,                 mgmt: 31351
08:54:32.054 INFO ▶                     n1ql: 30386,                  fts: 30561,          eventingSSL: 31810
08:54:32.054 INFO ▶                     cbas: 30104,                 capi: 30103,                   kv: 31941
08:54:32.054 INFO ▶                    kvSSL: 31297,              capiSSL: 32655,              n1qlSSL: 30074
08:54:32.054 INFO ▶                   ftsSSL: 31779,              cbasSSL: 32761
08:54:32.054 INFO ▶   [1] 95.217.218.135
08:54:32.054 INFO ▶              eventingSSL: 30673,              n1qlSSL: 31678,                kvSSL: 31871
08:54:32.054 INFO ▶                  capiSSL: 30075,                 n1ql: 30863,                 cbas: 31413
08:54:32.054 INFO ▶                  cbasSSL: 30705,              mgmtSSL: 30953,               ftsSSL: 30924
08:54:32.054 INFO ▶                       kv: 32210,                 capi: 30896,                  fts: 30585
08:54:32.054 INFO ▶        eventingAdminPort: 31922,                 mgmt: 31705
08:54:32.054 INFO ▶   [2] 135.181.30.248
08:54:32.054 INFO ▶                     n1ql: 32549,    eventingAdminPort: 32752,          eventingSSL: 31661
08:54:32.054 INFO ▶                  capiSSL: 32329,                   kv: 31872,                 capi: 30976
08:54:32.054 INFO ▶                  n1qlSSL: 32370,                  fts: 30763,              cbasSSL: 31852
08:54:32.054 INFO ▶                     mgmt: 32453,                kvSSL: 30068,               ftsSSL: 32228
08:54:32.054 INFO ▶                  mgmtSSL: 32355,                 cbas: 30578
08:54:32.054 INFO ▶ Fetching config from `http://95.216.208.78:31351`
08:54:32.090 INFO ▶ Received cluster configuration, nodes list:
[
  {
    "addressFamily": "inet",
    "alternateAddresses": {
      "external": {
        "hostname": "95.216.208.78",
        "ports": {
          "capi": 30103,
          "capiSSL": 32655,
          "kv": 31941,
          "mgmt": 31351,
          "mgmtSSL": 30971
        }
      }
    },
    "clusterCompatibility": 393222,
    "clusterMembership": "active",
    "configuredHostname": "oaf-couchbase-0003.oaf-couchbase.default.svc:8091",
    "couchApiBase": "http://oaf-couchbase-0003.oaf-couchbase.default.svc:8092/",
    "couchApiBaseHTTPS": "https://oaf-couchbase-0003.oaf-couchbase.default.svc:18092/",
    "cpuCount": 8,
    "externalListeners": [
      {
        "afamily": "inet",
        "nodeEncryption": false
      },
      {
        "afamily": "inet6",
        "nodeEncryption": false
      }
    ],
    "hostname": "oaf-couchbase-0003.oaf-couchbase.default.svc:8091",
    "interestingStats": {
      "cmd_get": 0,
      "couch_docs_actual_disk_size": 4752118309,
      "couch_docs_data_size": 3594756877,
      "couch_spatial_data_size": 0,
      "couch_spatial_disk_size": 0,
      "couch_views_actual_disk_size": 12589014,
      "couch_views_data_size": 12589014,
      "curr_items": 1091813,
      "curr_items_tot": 2185369,
      "ep_bg_fetched": 0,
      "get_hits": 0,
      "mem_used": 1808067560,
      "ops": 0,
      "vb_active_num_non_resident": 656502,
      "vb_replica_curr_items": 1093556
    },
    "mcdMemoryAllocated": 25088,
    "mcdMemoryReserved": 25088,
    "memoryFree": 13474230272,
    "memoryTotal": 32884228096,
    "nodeEncryption": false,
    "nodeUUID": "bccd30747f9e69e0269c24020361c680",
    "os": "x86_64-unknown-linux-gnu",
    "otpNode": "ns_1@oaf-couchbase-0003.oaf-couchbase.default.svc",
    "ports": {
      "direct": 11210,
      "distTCP": 21100,
      "distTLS": 21150,
      "httpsCAPI": 18092,
      "httpsMgmt": 18091
    },
    "recoveryType": "none",
    "services": [
      "cbas",
      "eventing",
      "fts",
      "index",
      "kv",
      "n1ql"
    ],
    "status": "healthy",
    "systemStats": {
      "allocstall": 0,
      "cpu_cores_available": 8,
      "cpu_stolen_rate": 0,
      "cpu_utilization_rate": 31.35483870967742,
      "mem_free": 13474230272,
      "mem_limit": 32884228096,
      "mem_total": 32884228096,
      "swap_total": 0,
      "swap_used": 0
    },
    "thisNode": true,
    "uptime": "44946",
    "version": "6.6.1-9213-enterprise"
  },
  {
    "addressFamily": "inet",
    "alternateAddresses": {
      "external": {
        "hostname": "95.217.218.135",
        "ports": {
          "capi": 30896,
          "capiSSL": 30075,
          "kv": 32210,
          "mgmt": 31705,
          "mgmtSSL": 30953
        }
      }
    },
    "clusterCompatibility": 393222,
    "clusterMembership": "active",
    "configuredHostname": "oaf-couchbase-0004.oaf-couchbase.default.svc:8091",
    "couchApiBase": "http://oaf-couchbase-0004.oaf-couchbase.default.svc:8092/",
    "couchApiBaseHTTPS": "https://oaf-couchbase-0004.oaf-couchbase.default.svc:18092/",
    "cpuCount": 8,
    "externalListeners": [
      {
        "afamily": "inet",
        "nodeEncryption": false
      },
      {
        "afamily": "inet6",
        "nodeEncryption": false
      }
    ],
    "hostname": "oaf-couchbase-0004.oaf-couchbase.default.svc:8091",
    "interestingStats": {
      "cmd_get": 0,
      "couch_docs_actual_disk_size": 4599740551,
      "couch_docs_data_size": 3572644462,
      "couch_spatial_data_size": 0,
      "couch_spatial_disk_size": 0,
      "couch_views_actual_disk_size": 11864147,
      "couch_views_data_size": 11864147,
      "curr_items": 1091273,
      "curr_items_tot": 2181978,
      "ep_bg_fetched": 0,
      "get_hits": 0,
      "mem_used": 1846524952,
      "ops": 0,
      "vb_active_num_non_resident": 640880,
      "vb_replica_curr_items": 1090705
    },
    "mcdMemoryAllocated": 25088,
    "mcdMemoryReserved": 25088,
    "memoryFree": 9966014464,
    "memoryTotal": 32884191232,
    "nodeEncryption": false,
    "nodeUUID": "00583abf725fca65006ff32e80185f0c",
    "os": "x86_64-unknown-linux-gnu",
    "otpNode": "ns_1@oaf-couchbase-0004.oaf-couchbase.default.svc",
    "ports": {
      "direct": 11210,
      "distTCP": 21100,
      "distTLS": 21150,
      "httpsCAPI": 18092,
      "httpsMgmt": 18091
    },
    "recoveryType": "none",
    "services": [
      "cbas",
      "eventing",
      "fts",
      "index",
      "kv",
      "n1ql"
    ],
    "status": "healthy",
    "systemStats": {
      "allocstall": 0,
      "cpu_cores_available": 8,
      "cpu_stolen_rate": 0,
      "cpu_utilization_rate": 76.33289986996098,
      "mem_free": 9966014464,
      "mem_limit": 32884191232,
      "mem_total": 32884191232,
      "swap_total": 0,
      "swap_used": 0
    },
    "uptime": "44130",
    "version": "6.6.1-9213-enterprise"
  },
  {
    "addressFamily": "inet",
    "alternateAddresses": {
      "external": {
        "hostname": "135.181.30.248",
        "ports": {
          "capi": 30976,
          "capiSSL": 32329,
          "kv": 31872,
          "mgmt": 32453,
          "mgmtSSL": 32355
        }
      }
    },
    "clusterCompatibility": 393222,
    "clusterMembership": "active",
    "configuredHostname": "oaf-couchbase-0005.oaf-couchbase.default.svc:8091",
    "couchApiBase": "http://oaf-couchbase-0005.oaf-couchbase.default.svc:8092/",
    "couchApiBaseHTTPS": "https://oaf-couchbase-0005.oaf-couchbase.default.svc:18092/",
    "cpuCount": 8,
    "externalListeners": [
      {
        "afamily": "inet",
        "nodeEncryption": false
      },
      {
        "afamily": "inet6",
        "nodeEncryption": false
      }
    ],
    "hostname": "oaf-couchbase-0005.oaf-couchbase.default.svc:8091",
    "interestingStats": {
      "cmd_get": 0,
      "couch_docs_actual_disk_size": 4690227191,
      "couch_docs_data_size": 3548986919,
      "couch_spatial_data_size": 0,
      "couch_spatial_disk_size": 0,
      "couch_views_actual_disk_size": 12224473,
      "couch_views_data_size": 12224473,
      "curr_items": 1090141,
      "curr_items_tot": 2179107,
      "ep_bg_fetched": 0,
      "get_hits": 0,
      "mem_used": 1873142368,
      "ops": 0,
      "vb_active_num_non_resident": 739779,
      "vb_replica_curr_items": 1088966
    },
    "mcdMemoryAllocated": 25088,
    "mcdMemoryReserved": 25088,
    "memoryFree": 20557443072,
    "memoryTotal": 32884228096,
    "nodeEncryption": false,
    "nodeUUID": "ae669a001fa9bf0f31524b8c5aef9195",
    "os": "x86_64-unknown-linux-gnu",
    "otpNode": "ns_1@oaf-couchbase-0005.oaf-couchbase.default.svc",
    "ports": {
      "direct": 11210,
      "distTCP": 21100,
      "distTLS": 21150,
      "httpsCAPI": 18092,
      "httpsMgmt": 18091
    },
    "recoveryType": "none",
    "services": [
      "cbas",
      "eventing",
      "fts",
      "index",
      "kv",
      "n1ql"
    ],
    "status": "healthy",
    "systemStats": {
      "allocstall": 0,
      "cpu_cores_available": 8,
      "cpu_stolen_rate": 0,
      "cpu_utilization_rate": 36.88946015424165,
      "mem_free": 20557443072,
      "mem_limit": 32884228096,
      "mem_total": 32884228096,
      "swap_total": 0,
      "swap_used": 0
    },
    "uptime": "42851",
    "version": "6.6.1-9213-enterprise"
  }
]
08:54:32.093 INFO ▶ Successfully connected to Key Value service at `95.216.208.78:31941`
08:54:32.099 INFO ▶ Successfully connected to Management service at `95.216.208.78:31351`
08:54:32.103 INFO ▶ Successfully connected to Views service at `95.216.208.78:30103`
08:54:32.105 INFO ▶ Successfully connected to Query service at `95.216.208.78:30386`
08:54:32.106 INFO ▶ Successfully connected to Search service at `95.216.208.78:30561`
08:54:32.108 INFO ▶ Successfully connected to Analytics service at `95.216.208.78:30104`
08:54:32.109 INFO ▶ Successfully connected to Key Value service at `95.217.218.135:32210`
08:54:32.118 INFO ▶ Successfully connected to Management service at `95.217.218.135:31705`
08:54:32.119 INFO ▶ Successfully connected to Views service at `95.217.218.135:30896`
08:54:32.121 INFO ▶ Successfully connected to Query service at `95.217.218.135:30863`
08:54:32.121 INFO ▶ Successfully connected to Search service at `95.217.218.135:30585`
08:54:32.124 INFO ▶ Successfully connected to Analytics service at `95.217.218.135:31413`
08:54:32.131 INFO ▶ Successfully connected to Key Value service at `135.181.30.248:31872`
08:54:32.137 INFO ▶ Successfully connected to Management service at `135.181.30.248:32453`
08:54:32.142 INFO ▶ Successfully connected to Views service at `135.181.30.248:30976`
08:54:32.144 INFO ▶ Successfully connected to Query service at `135.181.30.248:32549`
08:54:32.149 INFO ▶ Successfully connected to Search service at `135.181.30.248:30763`
08:54:32.155 INFO ▶ Successfully connected to Analytics service at `135.181.30.248:30578`
08:54:32.163 INFO ▶ Memd Nop Pinged `95.216.208.78:31941` 10 times, 0 errors, 0ms min, 1ms max, 0ms mean
08:54:32.169 INFO ▶ Memd Nop Pinged `95.217.218.135:32210` 10 times, 0 errors, 0ms min, 0ms max, 0ms mean
08:54:32.182 INFO ▶ Memd Nop Pinged `135.181.30.248:31872` 10 times, 0 errors, 0ms min, 1ms max, 0ms mean
08:54:32.182 INFO ▶ Diagnostics completed

Summary:
[WARN] The hostname specified in your connection string resolves both for SRV records, as well as A records.  This is not suggested as later DNS configuration changes could cause the wrong servers to be contacted
[WARN] Bootstrap host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0005.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
[WARN] Bootstrap host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0003.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
[WARN] Bootstrap host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0004.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.

Found multiple issues, see listing above.

I haven’t tried downgrading to 2.x - not sure it works with TypeScript…

Hmm… I’ve been waiting on this for 2 or 3 months now… I’m pretty sure this issue isn’t on my end: many people report similar bugs, and my setup worked for weeks. Now it no longer works on any of my previous commits, on which it definitely worked before.

Hmm. It looks like my local dev setup is working, but my Docker setup is not, even though it’s essentially the same.

@brett19 there seem to be a lot of similar reports for 3.x versions… What do you suggest? This keeps crashing even under moderate load, and it’s really blocking us…

Should we downgrade to 2.x?

I’m not sure how to retrieve the corefile in my Kubernetes setup, since the containers are terminated as soon as a crash occurs, but does this help:

FATAL ERROR:
    libcouchbase experienced an unrecoverable error and terminates the program
    to avoid undefined behavior.
    The program should have generated a "corefile" which may used
    to gather more information about the problem.
    If your system doesn't create "corefiles" I can tell you that the
    assertion failed in ../deps/lcb/src/mcserver/negotiate.cc at line 50

Hello @Yann_Jouanique, are you facing this problem with 3.1.1 only when you are on K8s pods, or also when you are running the application from a local machine? And are you saying that with 2.x the same environment setup worked fine?

No, we never tried 2.x, so we’re not sure if that addresses the issue, but I’ve seen this suggested as a working solution in the forums. We’re reluctant to do this because I don’t think 2.x has typings that would work in our TS environment.

We have however tried all 3.x versions.

We cannot really reproduce this locally since Couchbase in Kubernetes is really hard to access externally, so we need the clients to be in k8s as well.

Could this be linked to https://issues.couchbase.com/projects/JSCBC/issues/JSCBC-837? Although that seems to affect only 3.1.1.

Note that this only seems to happen under moderate load. We never saw it during development, only now during pre-production tests… This is very problematic for us… It works, but our pods are taking turns crashing every ~1 minute…

The failing assertion does not help us track down the issue much, but it’s here: https://github.com/couchbase/libcouchbase/blob/master/src/mcserver/negotiate.cc#L50

Also, the errors logged before the crash are here: https://github.com/couchbase/libcouchbase/blob/master/src/bucketconfig/bc_cccp.cc#L187

2021-01-21T07:46:53.722Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)

Thanks @Yann_Jouanique, no, I didn’t mean to ask you to switch to 2.x; that’s not advisable.
You are working with containers, which I don’t think has been fully tried/tested. Having said that, it should not behave any differently unless something is blocking attempts to ping the server. @brett19 / @ericb, thoughts?

Well, the error messages do indeed tend to indicate some network issue; however:

  • This is observed in two very different network environments (our QA is a self-managed k8s cluster at a small German hosting provider; our Production is Azure)
  • The crashes do not occur at the same time across all replicas of the pod. They alternate fairly regularly (one different pod crashes roughly every minute). And Sync Gateway (2 pods) reports no issues at that time.

So I don’t really think this is network-related. I suspect something related to concurrency (since this only happens under some sustained load, ~6 queries/s/pod).
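
In case it helps anyone reproduce or mitigate this, I am experimenting with capping the number of in-flight SDK calls per pod, on the (unproven) assumption that concurrency is a factor. A minimal sketch using a plain promise-based semaphore, no extra library; the limit of 4 is an arbitrary guess on my side, not a recommendation:

// Tiny promise-based semaphore to cap concurrent SDK calls per process.
class Semaphore {
    constructor (max) {
        this.max = max
        this.active = 0
        this.queue = []
    }

    _acquire () {
        if (this.active < this.max) {
            this.active++
            return Promise.resolve()
        }
        // At capacity: wait until a running call hands its slot over.
        return new Promise(resolve => this.queue.push(resolve))
    }

    _release () {
        const next = this.queue.shift()
        if (next) {
            next()          // pass the slot directly to the next waiter
        } else {
            this.active--
        }
    }

    async run (fn) {
        await this._acquire()
        try {
            return await fn()
        } finally {
            this._release()
        }
    }
}

const sdkLimit = new Semaphore(4) // arbitrary per-pod cap, tune as needed

// Usage: route every query/KV call through the semaphore, e.g.
// const result = await sdkLimit.run(() => cluster.query('SELECT 1'))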

Hello…

I’m wondering if there are any thoughts on this. We are still experiencing very frequent libcouchbase crashes in Production (~1/hour on each of our ~8 containers). Since we’re a paying customer we have also opened a support ticket, but that hasn’t really progressed either. We’re seeing this on our other Node-based apps as well, but not on our C# apps so far, though the volume is different, so it’s hard to compare.
Our first app is basically just a cache, so crashes are acceptable (our cache will simply be slightly outdated), but we’re now working on transactional apps where this instability clearly won’t be acceptable for us.

Is there any upcoming SDK release that might have a chance of fixing these? @AV25242 @brett19?

Hello @Yann_Jouanique, do you know if the support team determined this to be an SDK bug and raised an SDK ticket, by any chance? That would be the way to find out whether a fix is lined up alongside SDK releases.