Trouble connecting with nodejs

Hello! We are using the NodeJS SDK, and having trouble connecting to the cluster on one of our machines. We get:

  couchnode:lcb:info (instance @ ../deps/lcb/src/instance.cc:509) Version=3.0.1-njs, Changeset=76262e72018abc83c1131c625bd016f080321d37 +0ms
  couchnode:lcb:info (instance @ ../deps/lcb/src/instance.cc:510) Effective connection string: couchbase://localhost/si?client_string=couchnode%2F3.0.2%20(node%2F14.4.0%3B%20v8%2F8.1.307.31-node.33%3B%20ssl%2F1.1.1g). Bucket=si +0ms
  couchnode:lcb:debug (instance @ ../deps/lcb/src/instance.cc:374) Applying initial cntl client_string=couchnode/3.0.2 (node/14.4.0; v8/8.1.307.31-node.33; ssl/1.1.1g) +0ms
  couchnode:lcb:info (instance @ ../deps/lcb/src/instance.cc:202) DNS SRV lookup failed: LCB_ERR_UNKNOWN_HOST (1049). Ignore this if not relying on DNS SRV records +0ms
  couchnode:lcb:debug (instance @ ../deps/lcb/src/instance.cc:139) Adding host localhost:8091 to initial HTTP bootstrap list +0ms
  couchnode:lcb:debug (instance @ ../deps/lcb/src/instance.cc:139) Adding host localhost:11210 to initial CCCP bootstrap list +0ms
  couchnode:lcb:trace (instance @ ../deps/lcb/src/instance.cc:182) Bootstrap hosts loaded (cccp:1, http:1) +0ms
  couchnode:lcb:info (bootstrap @ ../deps/lcb/src/bootstrap.cc:259) Requested network configuration: heuristic +0ms
  couchnode:lcb:debug (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:92) Preparing providers (this may be called multiple times) +0ms
  couchnode:lcb:debug (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:99) Provider CCCP is ENABLED +0ms
  couchnode:lcb:debug (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:99) Provider HTTP is ENABLED +0ms
  couchnode:lcb:trace (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:299) Refreshing current cluster map +0ms
  couchnode:lcb:trace (confmon @ ../deps/lcb/src/bucketconfig/confmon.cc:285) Attempting to retrieve cluster map via CCCP +0ms
  couchnode:lcb:info (cccp @ ../deps/lcb/src/bucketconfig/bc_cccp.cc:160) Requesting connection to node localhost:11210 for CCCP configuration +0ms
  couchnode:lcb:debug (lcbio_mgr @ ../deps/lcb/src/lcbio/manager.cc:481) <localhost:11210> (HE=0x55f817971e30) Creating new connection because none are available in the pool +0ms
  couchnode:lcb:trace (lcbio_mgr @ ../deps/lcb/src/lcbio/manager.cc:393) <localhost:11210> (HE=0x55f817971e30) New pool entry: I=0x55f817963770 +0ms
  couchnode:lcb:info (connection @ ../deps/lcb/src/lcbio/connect.cc:487) <localhost:11210> (SOCK=fb8b505b323bcb04) Starting. Timeout=2000000us +0ms
  couchnode:lcb:error (bootstrap @ ../deps/lcb/src/bootstrap.cc:196) Failed to bootstrap client=0x55f817aaf920. Error=LCB_ERR_TIMEOUT (201), Message=Failed to bootstrap in time +0ms
  couchnode:lcb:debug (lcbio_mgr @ ../deps/lcb/src/lcbio/manager.cc:503) <localhost:11210> (HE=0x55f817971e30) Request=0x55f817aaf860 has no connection.. yet +0ms

When we run Wireshark, we see the connection, and we see the response, and it looks good. But no dice.

This only seems to happen on one machine, but we’re totally at a loss for whats happening. Tips on troubleshooting further?

which version of the SDK are you using ? Are you able to connect to other nodes in the cluster and is this a problem with only one machine ? do you have any logs ?

Thanks for the reply! We’re using the 3.0.2 SDK. It is only happening on the one machine, which is also talking to the cluster via the Rust SDK (which, oddly enough, is also using the C bindings under the hood… so one imagines an error would happen in both places?)

It appears to work fine on other systems. Just this one . What logs would be helpful?

@ellis.breen can you please help?

I have been facing a very similar issue. Our containerized app (hence exact same environment, binaries, configuration, network…) behaves well when run on some nodes, and fails on others with a similar message, i.e. a LCB_ERR_TIMEOUT after just 1ms.

After MUCH troubleshooting I still don’t understand the exact underlying root cause but I believe it is due to the way we use the SDK in our code, rather than environment issues, and that we actually receive a valid response but can’t invoke a handler to process it.

Here’s the full outcome of my research:

  • Node SDK version: 3.0.4
  • OS: docker image node:12-slim

First, sdk-doctor reports no issues when run from the container, so it’s unlikely that environmental issues would cause problems (network, DNS…)… There are some warnings but they’re very unlikely to be relevant in this case (warning about potential issues if DNS entries are updated, which isn’t likely here).Much more interesting are the SDK traces, which show the following pattern:

  • Initial connectivity is successful (at SDK startup, before our app even finishes loading and starts receiving traffic, and we are successfully retrieving cluster and bucket information
  • However when our first request is sent, the connection seems to fail and close, and from then on we never see any further SDK traces. The error happens precisely between those log entries, reporting a timeout after just 1ms (while the configured timeout is 2s):
2020-07-11T00:25:04.142Z couchnode:lcb:info (connection @ ../deps/lcb/src/lcbio/connect.cc:487) <10-36-0-4.oaf-cluster-srv.default.svc.cluster.local:11210> (SOCK=370ec60b9851eb41) Starting. Timeout=2000000us
2020-07-11T00:25:04.143Z couchnode:lcb:error (bootstrap @ ../deps/lcb/src/bootstrap.cc:196) Failed to bootstrap client=0x35944b0. Error=LCB_ERR_TIMEOUT (201) (Last=LCB_ERR_TIMEOUT (201)), Message=Failed to bootstrap in time
2020-07-11T00:25:04.147Z couchnode failed to connect to bucket: [Error: LCB_ERR_TIMEOUT (201): The request was not completed by the user-defined timeout] {
code: 201
}
  • The logs also show the exact IP/port being used, which are definitely reachable using netcat
  • But what is particularly interesting is that when I run the following trivial code from the same container , things work out perfectly:
const couchbase = require('couchbase');
let cluster = new couchbase.Cluster('couchbase://oaf-cluster-srv.default.svc.cluster.local', {
username: 'Administrator',
password: 'password',
});let query = async function() {
let bucket = await cluster.bucket('fs-bucket-v0');
let collection = await bucket.defaultCollection();
let results = await cluster.query('select * from `fs-bucket-v0`;'); console.dir(results);
// console.dir(bucket);
// console.dir(collection);
};query();
  • Not only I’m getting the expected data, but we can also see a lot of further SDK traces going on long after the response is received, showing there’s some keepalive going on.
  • The fact that we don’t see such keepalive traces in our own application logs indicates that somehow our SDK instance may have died/been closed for some reason.
  • This is consistent with the fact that the next log entry in the successful case (which we do NOT see in our app logs) mentions Received completion handler

So, TL;DR: I believe some issue in our own code prevents a completion handler to be available / executed to process what would otherwise be a successful call .
I don’t understand why exactly, and in particular why this only happens on some environments. One theory could be that environments with lower memory may simply be garbage-collection out-of-scope objects more frequently… this could explain why things started working on my local minikube after I made some otherwise unrelated cleanup, or why it tends to work better where there’s only 1 single couchbase node in the cluster.

Yann,
I was running into this exact issue. Even had the same results when using the trivial code. I ended up finding that the problem was due to an import statement. We utilize index files to handle exporting our classes. The original code that didn’t work, referenced an index file that was one directory up from the code I was trying to execute. The code that worked, referenced the index file that was in the same directory as the code that I was trying to execute. For example:

├ src
│   |- repositories 
|      |-index.ts
|      |-test
|        |-index.ts
|        |-myClass.ts

When I used import { MyClass } from 'src/repositories'; I had the same problem as you with the connection failing. When I changed my import to be import { MyClass } from 'src/repositories/test I was able to get a connection and the code worked.

Hope this helps.

Hmm interesting!
In our case we ended up solving the issue as follows:
This seems to be caused by a combination of 2 things:

  • This bugreport hints that the client SDK does not retry after initial failing connections: https://issues.couchbase.com/browse/JSCBC-738
  • Our initial connection attempt was failing because our initial call to Cluster.connect was made asynchronously, and we’re not awaiting its completion. This means that one some fast environments the initial connection call could complete before our first queries, but not on others. Queries would therefore fail on those environments…

So we solved this by doing an asynchronous initialization that awaits the completion of the first Cluster.connect calls…

Another update. I did more testing with other logic and was not able to get a connection using the fix I mentioned. However, the bug report for JSCBC-738 has some example connection logic using couchbase.Cluster.connect() which is not documented anywhere in the Node SDK documentation but is a method that can be used to connect to a cluster. The key with this method is that it’s an async method so I can add an await clause or using the traditional promise then clause to wait for the connection to be made. Implementing this change has worked for everything I’ve tried.
Thanks for pointing me in the right direction, Yann.