Suddenly unable to connect to cluster anymore

Hello,

I am suddenly no longer able to connect to my Couchbase cluster; the only workaround I found was to roll back my codebase to couchbase-sdk 2.x.
I recently migrated to 3.0, which was working very well, but now it suddenly fails in every scenario: both on my local dev setup with Node.js and Couchbase, and on my local docker-compose setup.
I have already tried debugging, including with sdk-doctor. General debugging gives me no meaningful information, and sdk-doctor reports no errors:

NodeJS Error: Error: cluster object was closed

cluster: Cluster {
    _connStr: 'couchbase://localhost/',
    _trustStorePath: undefined,
    _kvTimeout: undefined,
    _kvDurableTimeout: undefined,
    _viewTimeout: undefined,
    _queryTimeout: undefined,
    _analyticsTimeout: undefined,
    _searchTimeout: undefined,
    _managementTimeout: undefined,
    _auth: { username: 'xxx', password: 'xxx' },
    _closed: false,
    _clusterConn: null,
    _conns: {
        fwdisplay: Connection {
            _inst: CbConnection {},
            _closed: true,
            _pendOps: [],
            _pendBOps: [],
            _connected: false,
            _opened: true
        }
    },
    _transcoder: DefaultTranscoder {},
    _logFunc: undefined
}

SDK Doc Summary:
[WARN] Your connection string specifies only a single host. You should consider adding additional static nodes from your cluster to this list to improve your applications fault-tolerance
[WARN] Could not test Analytics service on 127.0.0.1 as it was not in the config

This is the connection setup:

// Connect to the cluster (SDK 3.x constructor form) and open the default collection
this.cluster = new couchbase.Cluster(`couchbase://${COUCHBASE_HOSTNAME}/`, {
    username: COUCHBASE_USERNAME,
    password: COUCHBASE_PASSWORD
})
this.bucket = this.cluster.bucket(COUCHBASE_BUCKET)
this.collection = this.bucket.defaultCollection()
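The only mitigation I've found short of rolling back is to recreate the Cluster when the error shows up. A minimal sketch of that retry shape (`withReconnect` and `makeConnection` are my own hypothetical helpers, not SDK API):

```javascript
// Hypothetical helper: runs an operation, and if it fails with the
// "cluster object was closed" error, rebuilds the connection once and retries.
// `makeConnection` is any async factory that returns a fresh connection object.
async function withReconnect(makeConnection, op) {
  let conn = await makeConnection()
  try {
    return await op(conn)
  } catch (err) {
    // Only handle the specific "closed" failure; rethrow everything else
    if (!/cluster object was closed/i.test(err.message)) throw err
    conn = await makeConnection() // rebuild and retry once
    return await op(conn)
  }
}
```

In the real app, `makeConnection` would run the `new couchbase.Cluster(...)` setup above and return the collection; this only shows the retry shape, and I'm not sure it's the right long-term fix.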

Does anyone have any ideas as to what to change to fix it?

Had the same issue and had to roll back.

Is anyone else having this? Is it resolved with 3.1?

Did your setup work at first on 3.x and then stop working? In my case, it did.

I now tried the sample https://github.com/couchbaselabs/try-cb-nodejs/tree/6.5-collections, but it crashes as well, leading me to believe that the error is not on my end.

Same for me: it worked initially, but once it stops working my API is essentially broken. I can’t keep restarting Couchbase or the Sync Gateway periodically just because the Node.js library is broken, hence my rollback to the previous release.

Hello, I am assuming your problem statement aligns with Cluster closed - reinitialize connection?

This seems to be a bug with the underlying libcouchbase and will be fixed in the upcoming release.

There seem to be plenty of bug reports for the infamous cluster object was closed issue around here for the Node SDK… I just wanted to chime in on this thread with my own report.

Context

  • Couchbase (6.6.0 and 6.6.1), installed via the kubernetes operator
  • Node SDK 3.0.4 to 3.1.1 - tried all versions, used with intra-cluster DNS with the recommended connection string (my-cluster-srv DNS service)
  • ~4 pods with the same nodejs app connected to the cluster, and 2 sync gateway pods

Behaviour

Things seem to work fine for a few minutes, until suddenly on one given container (but NOT the others), the logs (activated with DEBUG=couchnode:lcb:error) are flooded with these errors:

2021-01-21T07:46:53.722Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
2021-01-21T07:46:53.726Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
2021-01-21T07:46:53.730Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)

Until it all comes to a stop with this error:

FATAL ERROR:
    libcouchbase experienced an unrecoverable error and terminates the program
    to avoid undefined behavior.
    The program should have generated a "corefile" which may used
    to gather more information about the problem.
    If your system doesn't create "corefiles" I can tell you that the
    assertion failed in ../deps/lcb/src/mcserver/negotiate.cc at line 50

This does not crash the container, but it seems to make it hang somehow: there are no further logs (including application logs), and the port the app listens on becomes unresponsive, causing my livenessProbe to fail and Kubernetes to eventually kill and restart the container.
Other pods seem fine at the same time, but will also randomly fail in the same way.
Sync Gateway is fine all along.
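As a stopgap, our liveness endpoint now wraps the SDK call in a plain timeout, so a wedged library turns into a fast, explicit failure instead of a silent hang (`withTimeout` is our own helper, not an SDK feature):

```javascript
// Hypothetical helper: races a promise against a timer so a hung SDK call
// turns into a fast, explicit failure that a livenessProbe can act on.
function withTimeout(promise, ms) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  })
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}
```

The /healthz handler then does something like `withTimeout(collection.get(probeKey), 2000)` and returns 500 on rejection, so Kubernetes restarts the pod as soon as the library wedges.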

Diags:

Couchbase UI is fine.

sdk-doctor never seems to complain:

|====================================================================|
|          ___ ___  _  __   ___   ___   ___ _____ ___  ___           |
|         / __|   \| |/ /__|   \ / _ \ / __|_   _/ _ \| _ \          |
|         \__ \ |) | ' <___| |) | (_) | (__  | || (_) |   /          |
|         |___/___/|_|\_\  |___/ \___/ \___| |_| \___/|_|_\          |
|                                                                    |
|====================================================================|

Note: Diagnostics can only provide accurate results when your cluster
 is in a stable state.  Active rebalancing and other cluster configuration
 changes can cause the output of the doctor to be inconsistent or in the
 worst cases, completely incorrect.

08:54:32.016 INFO ▶ Parsing connection string `couchbase://oaf-couchbase-srv.default.svc.cluster.local/fs-bucket-v0`
08:54:32.016 INFO ▶ Connection string was parsed as a potential DNS SRV record
08:54:32.020 INFO ▶ Connection string identifies the following CCCP endpoints:
08:54:32.020 INFO ▶   1. 10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local:11210
08:54:32.020 INFO ▶   2. 10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local:11210
08:54:32.020 INFO ▶   3. 10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local:11210
08:54:32.020 INFO ▶ Connection string identifies the following HTTP endpoints:
08:54:32.020 INFO ▶ Connection string specifies bucket `fs-bucket-v0`
08:54:32.027 WARN ▶ The hostname specified in your connection string resolves both for SRV records, as well as A records.  This is not suggested as later DNS configuration changes could cause the wrong servers to be contacted
08:54:32.027 INFO ▶ Performing DNS lookup for host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local`
08:54:32.029 INFO ▶ Bootstrap host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local` refers to a server with the address `10.32.0.19`
08:54:32.030 INFO ▶ Performing DNS lookup for host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local`
08:54:32.031 INFO ▶ Bootstrap host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local` refers to a server with the address `10.36.0.7`
08:54:32.032 INFO ▶ Performing DNS lookup for host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local`
08:54:32.034 INFO ▶ Bootstrap host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local` refers to a server with the address `10.35.0.39`
08:54:32.034 INFO ▶ Attempting to connect to cluster via CCCP
08:54:32.035 INFO ▶ Attempting to fetch config via cccp from `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local:11210`
08:54:32.042 INFO ▶ Attempting to fetch config via cccp from `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local:11210`
08:54:32.050 INFO ▶ Attempting to fetch config via cccp from `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local:11210`
08:54:32.054 WARN ▶ Bootstrap host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0005.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
08:54:32.054 WARN ▶ Bootstrap host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0003.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
08:54:32.054 WARN ▶ Bootstrap host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0004.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
08:54:32.054 INFO ▶ Selected the following network type: external
08:54:32.054 INFO ▶ Identified the following nodes:
08:54:32.054 INFO ▶   [0] 95.216.208.78
08:54:32.054 INFO ▶                  mgmtSSL: 30971,    eventingAdminPort: 30535,                 mgmt: 31351
08:54:32.054 INFO ▶                     n1ql: 30386,                  fts: 30561,          eventingSSL: 31810
08:54:32.054 INFO ▶                     cbas: 30104,                 capi: 30103,                   kv: 31941
08:54:32.054 INFO ▶                    kvSSL: 31297,              capiSSL: 32655,              n1qlSSL: 30074
08:54:32.054 INFO ▶                   ftsSSL: 31779,              cbasSSL: 32761
08:54:32.054 INFO ▶   [1] 95.217.218.135
08:54:32.054 INFO ▶              eventingSSL: 30673,              n1qlSSL: 31678,                kvSSL: 31871
08:54:32.054 INFO ▶                  capiSSL: 30075,                 n1ql: 30863,                 cbas: 31413
08:54:32.054 INFO ▶                  cbasSSL: 30705,              mgmtSSL: 30953,               ftsSSL: 30924
08:54:32.054 INFO ▶                       kv: 32210,                 capi: 30896,                  fts: 30585
08:54:32.054 INFO ▶        eventingAdminPort: 31922,                 mgmt: 31705
08:54:32.054 INFO ▶   [2] 135.181.30.248
08:54:32.054 INFO ▶                     n1ql: 32549,    eventingAdminPort: 32752,          eventingSSL: 31661
08:54:32.054 INFO ▶                  capiSSL: 32329,                   kv: 31872,                 capi: 30976
08:54:32.054 INFO ▶                  n1qlSSL: 32370,                  fts: 30763,              cbasSSL: 31852
08:54:32.054 INFO ▶                     mgmt: 32453,                kvSSL: 30068,               ftsSSL: 32228
08:54:32.054 INFO ▶                  mgmtSSL: 32355,                 cbas: 30578
08:54:32.054 INFO ▶ Fetching config from `http://95.216.208.78:31351`
08:54:32.090 INFO ▶ Received cluster configuration, nodes list:
[
  {
    "addressFamily": "inet",
    "alternateAddresses": {
      "external": {
        "hostname": "95.216.208.78",
        "ports": {
          "capi": 30103,
          "capiSSL": 32655,
          "kv": 31941,
          "mgmt": 31351,
          "mgmtSSL": 30971
        }
      }
    },
    "clusterCompatibility": 393222,
    "clusterMembership": "active",
    "configuredHostname": "oaf-couchbase-0003.oaf-couchbase.default.svc:8091",
    "couchApiBase": "http://oaf-couchbase-0003.oaf-couchbase.default.svc:8092/",
    "couchApiBaseHTTPS": "https://oaf-couchbase-0003.oaf-couchbase.default.svc:18092/",
    "cpuCount": 8,
    "externalListeners": [
      {
        "afamily": "inet",
        "nodeEncryption": false
      },
      {
        "afamily": "inet6",
        "nodeEncryption": false
      }
    ],
    "hostname": "oaf-couchbase-0003.oaf-couchbase.default.svc:8091",
    "interestingStats": {
      "cmd_get": 0,
      "couch_docs_actual_disk_size": 4752118309,
      "couch_docs_data_size": 3594756877,
      "couch_spatial_data_size": 0,
      "couch_spatial_disk_size": 0,
      "couch_views_actual_disk_size": 12589014,
      "couch_views_data_size": 12589014,
      "curr_items": 1091813,
      "curr_items_tot": 2185369,
      "ep_bg_fetched": 0,
      "get_hits": 0,
      "mem_used": 1808067560,
      "ops": 0,
      "vb_active_num_non_resident": 656502,
      "vb_replica_curr_items": 1093556
    },
    "mcdMemoryAllocated": 25088,
    "mcdMemoryReserved": 25088,
    "memoryFree": 13474230272,
    "memoryTotal": 32884228096,
    "nodeEncryption": false,
    "nodeUUID": "bccd30747f9e69e0269c24020361c680",
    "os": "x86_64-unknown-linux-gnu",
    "otpNode": "ns_1@oaf-couchbase-0003.oaf-couchbase.default.svc",
    "ports": {
      "direct": 11210,
      "distTCP": 21100,
      "distTLS": 21150,
      "httpsCAPI": 18092,
      "httpsMgmt": 18091
    },
    "recoveryType": "none",
    "services": [
      "cbas",
      "eventing",
      "fts",
      "index",
      "kv",
      "n1ql"
    ],
    "status": "healthy",
    "systemStats": {
      "allocstall": 0,
      "cpu_cores_available": 8,
      "cpu_stolen_rate": 0,
      "cpu_utilization_rate": 31.35483870967742,
      "mem_free": 13474230272,
      "mem_limit": 32884228096,
      "mem_total": 32884228096,
      "swap_total": 0,
      "swap_used": 0
    },
    "thisNode": true,
    "uptime": "44946",
    "version": "6.6.1-9213-enterprise"
  },
  {
    "addressFamily": "inet",
    "alternateAddresses": {
      "external": {
        "hostname": "95.217.218.135",
        "ports": {
          "capi": 30896,
          "capiSSL": 30075,
          "kv": 32210,
          "mgmt": 31705,
          "mgmtSSL": 30953
        }
      }
    },
    "clusterCompatibility": 393222,
    "clusterMembership": "active",
    "configuredHostname": "oaf-couchbase-0004.oaf-couchbase.default.svc:8091",
    "couchApiBase": "http://oaf-couchbase-0004.oaf-couchbase.default.svc:8092/",
    "couchApiBaseHTTPS": "https://oaf-couchbase-0004.oaf-couchbase.default.svc:18092/",
    "cpuCount": 8,
    "externalListeners": [
      {
        "afamily": "inet",
        "nodeEncryption": false
      },
      {
        "afamily": "inet6",
        "nodeEncryption": false
      }
    ],
    "hostname": "oaf-couchbase-0004.oaf-couchbase.default.svc:8091",
    "interestingStats": {
      "cmd_get": 0,
      "couch_docs_actual_disk_size": 4599740551,
      "couch_docs_data_size": 3572644462,
      "couch_spatial_data_size": 0,
      "couch_spatial_disk_size": 0,
      "couch_views_actual_disk_size": 11864147,
      "couch_views_data_size": 11864147,
      "curr_items": 1091273,
      "curr_items_tot": 2181978,
      "ep_bg_fetched": 0,
      "get_hits": 0,
      "mem_used": 1846524952,
      "ops": 0,
      "vb_active_num_non_resident": 640880,
      "vb_replica_curr_items": 1090705
    },
    "mcdMemoryAllocated": 25088,
    "mcdMemoryReserved": 25088,
    "memoryFree": 9966014464,
    "memoryTotal": 32884191232,
    "nodeEncryption": false,
    "nodeUUID": "00583abf725fca65006ff32e80185f0c",
    "os": "x86_64-unknown-linux-gnu",
    "otpNode": "ns_1@oaf-couchbase-0004.oaf-couchbase.default.svc",
    "ports": {
      "direct": 11210,
      "distTCP": 21100,
      "distTLS": 21150,
      "httpsCAPI": 18092,
      "httpsMgmt": 18091
    },
    "recoveryType": "none",
    "services": [
      "cbas",
      "eventing",
      "fts",
      "index",
      "kv",
      "n1ql"
    ],
    "status": "healthy",
    "systemStats": {
      "allocstall": 0,
      "cpu_cores_available": 8,
      "cpu_stolen_rate": 0,
      "cpu_utilization_rate": 76.33289986996098,
      "mem_free": 9966014464,
      "mem_limit": 32884191232,
      "mem_total": 32884191232,
      "swap_total": 0,
      "swap_used": 0
    },
    "uptime": "44130",
    "version": "6.6.1-9213-enterprise"
  },
  {
    "addressFamily": "inet",
    "alternateAddresses": {
      "external": {
        "hostname": "135.181.30.248",
        "ports": {
          "capi": 30976,
          "capiSSL": 32329,
          "kv": 31872,
          "mgmt": 32453,
          "mgmtSSL": 32355
        }
      }
    },
    "clusterCompatibility": 393222,
    "clusterMembership": "active",
    "configuredHostname": "oaf-couchbase-0005.oaf-couchbase.default.svc:8091",
    "couchApiBase": "http://oaf-couchbase-0005.oaf-couchbase.default.svc:8092/",
    "couchApiBaseHTTPS": "https://oaf-couchbase-0005.oaf-couchbase.default.svc:18092/",
    "cpuCount": 8,
    "externalListeners": [
      {
        "afamily": "inet",
        "nodeEncryption": false
      },
      {
        "afamily": "inet6",
        "nodeEncryption": false
      }
    ],
    "hostname": "oaf-couchbase-0005.oaf-couchbase.default.svc:8091",
    "interestingStats": {
      "cmd_get": 0,
      "couch_docs_actual_disk_size": 4690227191,
      "couch_docs_data_size": 3548986919,
      "couch_spatial_data_size": 0,
      "couch_spatial_disk_size": 0,
      "couch_views_actual_disk_size": 12224473,
      "couch_views_data_size": 12224473,
      "curr_items": 1090141,
      "curr_items_tot": 2179107,
      "ep_bg_fetched": 0,
      "get_hits": 0,
      "mem_used": 1873142368,
      "ops": 0,
      "vb_active_num_non_resident": 739779,
      "vb_replica_curr_items": 1088966
    },
    "mcdMemoryAllocated": 25088,
    "mcdMemoryReserved": 25088,
    "memoryFree": 20557443072,
    "memoryTotal": 32884228096,
    "nodeEncryption": false,
    "nodeUUID": "ae669a001fa9bf0f31524b8c5aef9195",
    "os": "x86_64-unknown-linux-gnu",
    "otpNode": "ns_1@oaf-couchbase-0005.oaf-couchbase.default.svc",
    "ports": {
      "direct": 11210,
      "distTCP": 21100,
      "distTLS": 21150,
      "httpsCAPI": 18092,
      "httpsMgmt": 18091
    },
    "recoveryType": "none",
    "services": [
      "cbas",
      "eventing",
      "fts",
      "index",
      "kv",
      "n1ql"
    ],
    "status": "healthy",
    "systemStats": {
      "allocstall": 0,
      "cpu_cores_available": 8,
      "cpu_stolen_rate": 0,
      "cpu_utilization_rate": 36.88946015424165,
      "mem_free": 20557443072,
      "mem_limit": 32884228096,
      "mem_total": 32884228096,
      "swap_total": 0,
      "swap_used": 0
    },
    "uptime": "42851",
    "version": "6.6.1-9213-enterprise"
  }
]
08:54:32.093 INFO ▶ Successfully connected to Key Value service at `95.216.208.78:31941`
08:54:32.099 INFO ▶ Successfully connected to Management service at `95.216.208.78:31351`
08:54:32.103 INFO ▶ Successfully connected to Views service at `95.216.208.78:30103`
08:54:32.105 INFO ▶ Successfully connected to Query service at `95.216.208.78:30386`
08:54:32.106 INFO ▶ Successfully connected to Search service at `95.216.208.78:30561`
08:54:32.108 INFO ▶ Successfully connected to Analytics service at `95.216.208.78:30104`
08:54:32.109 INFO ▶ Successfully connected to Key Value service at `95.217.218.135:32210`
08:54:32.118 INFO ▶ Successfully connected to Management service at `95.217.218.135:31705`
08:54:32.119 INFO ▶ Successfully connected to Views service at `95.217.218.135:30896`
08:54:32.121 INFO ▶ Successfully connected to Query service at `95.217.218.135:30863`
08:54:32.121 INFO ▶ Successfully connected to Search service at `95.217.218.135:30585`
08:54:32.124 INFO ▶ Successfully connected to Analytics service at `95.217.218.135:31413`
08:54:32.131 INFO ▶ Successfully connected to Key Value service at `135.181.30.248:31872`
08:54:32.137 INFO ▶ Successfully connected to Management service at `135.181.30.248:32453`
08:54:32.142 INFO ▶ Successfully connected to Views service at `135.181.30.248:30976`
08:54:32.144 INFO ▶ Successfully connected to Query service at `135.181.30.248:32549`
08:54:32.149 INFO ▶ Successfully connected to Search service at `135.181.30.248:30763`
08:54:32.155 INFO ▶ Successfully connected to Analytics service at `135.181.30.248:30578`
08:54:32.163 INFO ▶ Memd Nop Pinged `95.216.208.78:31941` 10 times, 0 errors, 0ms min, 1ms max, 0ms mean
08:54:32.169 INFO ▶ Memd Nop Pinged `95.217.218.135:32210` 10 times, 0 errors, 0ms min, 0ms max, 0ms mean
08:54:32.182 INFO ▶ Memd Nop Pinged `135.181.30.248:31872` 10 times, 0 errors, 0ms min, 1ms max, 0ms mean
08:54:32.182 INFO ▶ Diagnostics completed

Summary:
[WARN] The hostname specified in your connection string resolves both for SRV records, as well as A records.  This is not suggested as later DNS configuration changes could cause the wrong servers to be contacted
[WARN] Bootstrap host `10-36-0-7.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0005.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
[WARN] Bootstrap host `10-32-0-19.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0003.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.
[WARN] Bootstrap host `10-35-0-39.oaf-couchbase-srv.default.svc.cluster.local` is not using the canonical node hostname of `oaf-couchbase-0004.oaf-couchbase.default.svc`.  This is not neccessarily an error, but has been known to result in strange and challenging to diagnose errors when DNS entries are reconfigured.

Found multiple issues, see listing above.

I haven’t tried downgrading to 2.x - not sure it works with TypeScript…

Hmm… I’ve been waiting on this for 2 or 3 months now… I’m pretty sure this issue isn’t on my end: many people report similar bugs, and my setup worked for weeks. Now it no longer works on any of my previous commits, on which it definitely worked before.

Hmm. It looks like my local dev setup is working, but my Docker setup is not, even though they are essentially the same.

@brett19 there seem to be a lot of similar reports on the 3.x versions… What do you suggest? This keeps crashing even under moderate load, and it’s really blocking us…

Should we downgrade to 2.x?

I’m not sure how to retrieve the corefile in my Kubernetes setup, since the containers are terminated as soon as a crash occurs, but does this help:

FATAL ERROR:
    libcouchbase experienced an unrecoverable error and terminates the program
    to avoid undefined behavior.
    The program should have generated a "corefile" which may used
    to gather more information about the problem.
    If your system doesn't create "corefiles" I can tell you that the
    assertion failed in ../deps/lcb/src/mcserver/negotiate.cc at line 50

Hello @Yann_Jouanique, are you facing this problem with 3.1.1 only on Kubernetes pods, or also when running the application from a local machine? And are you saying that with 2.x, the same environment setup worked?

No, we never tried 2.x, so we’re not sure whether it addresses the issue, but I’ve seen it suggested as a working solution on the forums. We’re reluctant to do this because I don’t think 2.x has typings that would work in our TS environment.

We have however tried all 3.x versions.

We cannot really reproduce this locally since Couchbase in Kubernetes is really hard to access externally, so we need the clients to be in k8s as well.

Could this be linked to https://issues.couchbase.com/projects/JSCBC/issues/JSCBC-837 ? Although that seems to be affecting only 3.1.1.

Note that this only seems to happen under moderate load. We never saw it during development, only now during pre-production tests… This is very problematic for us… It works, but our pods are taking turns crashing every ~1 minute…

The failing assertion does not help us track down the issue much, but it’s here: https://github.com/couchbase/libcouchbase/blob/master/src/mcserver/negotiate.cc#L50

Also, the errors logged before the crash are here: https://github.com/couchbase/libcouchbase/blob/master/src/bucketconfig/bc_cccp.cc#L187

2021-01-21T07:46:53.722Z couchnode:lcb:error (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:187) <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)

Thanks @Yann_Jouanique, no, I didn’t mean to ask you to switch to 2.x; that’s not advisable.
You’re working with containers, which I don’t think has been much tried/tested. That said, it should not behave any differently unless something is blocking attempts to ping the server - @brett19 / @ericb, thoughts?

Well the error messages indeed tend to indicate some network issue, however:

  • This is observed on 2 very different network environments (our QA is a home-made k8s cluster with a small German hosting provider, our Production is Azure)
  • The crashes do not occur at the same time across all the replicas of the pod. They alternate fairly regularly (1 different pod crashes roughly every minute). And Sync Gateway (2 pods) reports no issue at that time.

So I don’t really think this is network-related. I suspect something related to concurrency (since this only happens under some sustained load, ~6 queries/s/pod).
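To test that concurrency theory, we’re experimenting with a crude in-process cap on the number of SDK operations in flight (a hand-rolled limiter of our own, nothing from the SDK):

```javascript
// Hypothetical helper: limits how many SDK operations run concurrently.
// Returns a function that schedules an async task, starting it only when
// fewer than `max` tasks are in flight.
function makeLimiter(max) {
  let active = 0
  const queue = []
  const next = () => {
    if (active >= max || queue.length === 0) return
    active++
    const { fn, resolve, reject } = queue.shift()
    // Run the task; when it settles, free the slot and start the next one
    fn().then(resolve, reject).finally(() => { active--; next() })
  }
  return (fn) => new Promise((resolve, reject) => {
    queue.push({ fn, resolve, reject })
    next()
  })
}
```

Each KV/query call then goes through something like `limit(() => collection.get(key))`; if the LCB_ERR_TIMEOUT flood stops with a low `max`, that would at least support the load theory.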

Hello…

I’m wondering if there are any thoughts about this. We are still experiencing very frequent libcouchbase crashes in Production (~1/h on each of our ~8 containers). Since we’re a paying customer we have also opened a support ticket, but that hasn’t really progressed either. We’re seeing this on our other Node-based apps as well, but not on our C# apps so far, though the volume is different so it’s hard to compare.
Our first app is basically just a cache, so crashes are acceptable (our cache will simply be slightly outdated), but we’re now working on transactional apps where this instability clearly won’t be acceptable for us.

Any upcoming SDK release that might have a chance of fixing these? @AV25242 @brett19 ?

Hello @Yann_Jouanique, do you know if the support team determined this to be an SDK bug and raised an SDK ticket, by any chance? That would be the way to find out whether it was lined up alongside SDK releases.

I’m kind of giving up on this. Almost 4 months later and I’m still having the same issue in Docker. I’ll probably switch to other database solutions in the future. And I feel like I’m being ignored.

Hey @Yann_Jouanique and @EtzBetz, I’m sorry you’re frustrated. We’ll certainly get someone to have a look at this; it’s not expected. While we do help on the forums as time allows, if you are a subscriber and have escalated the priority on this through support, it’d be great if you could direct message @AV25242 or me some details to follow up internally.

Looking at the code, that assertion is in a place that hasn’t changed for about 5 years, and it seems to be related to closing a connection. At least to me, it looks rather odd since it’s just doing some basic cleanup work with a socket.

One question @Yann_Jouanique, can you describe the base image you’re using for the node pods? By chance is there anything special in terms of networking, like running Envoy or another service mesh? I’m not saying that’s directly causing the problem, but it might help us isolate it since this isn’t happening in other environments.

Also, @geraldapeoples, are you saying Sync Gateway is affected in your case? That would tend to point to something a bit more general, as Sync Gateway does not use nodejs/libcouchbase.

Hi, no, I don’t think it is related to the Sync Gateway… the Node.js app connects to the Couchbase server as per the docs. I am back on v2, so I haven’t looked at this in a while; I’m just not comfortable moving to v3 in case I get the same errors in production, which would be a disaster. JSCBC-837 being open indicates to me that v3 has significant bugs at present; hopefully v3 will in future be stable enough to risk moving to.

Hello!

We’re using the node:12 base image. We have been working with support to provide the core dumps, which I think they’re going through now, although it looks like they don’t open well in gdb without access to the original process binary (but are fine in lldb).

I wouldn’t expect that we’re doing anything out of the ordinary - this application is our first deployment on a sparkling new AKS cluster, where not much is running besides Couchbase, Sync Gateway, and our app. We did deploy the Kong API gateway recently, which I guess does some messing around with services (?), but that happened long after we started noticing these issues.

Sync Gateway appears to be doing fine whenever our app crashes, which would indicate the network is fine anyway - although we have faced different issues with Sync Gateway, notably that it doesn’t seem to pick up new network configurations whenever we add/remove Couchbase nodes with the operator. But that is a completely separate issue.