Serverless Provisioned Concurrency Issue with Couchbase Connection

I have just attempted to implement AWS Lambda Provisioned Concurrency to remedy the “cold-start” latency we see in each Lambda function. Although the provisioned instances appear to become available much more quickly than before, the connection to the db is either not made, not maintained, or otherwise drops soon after the PC instance is spawned.

In short, the PC Lambda functions do not appear to connect to the Couchbase cluster as they do under normal circumstances. My configuration is as follows:

Couchbase Server 6.0.4 - 3-nodes - public IP access
Node SDK 2.6.9
Lambda env - Node 10.x
AmazonLinux2

I am using what I believe is a static connection reference, with a failover that reconnects if no connection is detected during a query. Attached is a screen-cap of my connection configuration:
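In case the screen-cap does not come through, here is a minimal, runnable sketch of the pattern being described: a module-level cached connection with a reconnect fallback. The names `openBucket`, `getBucket`, and `cachedBucket` are hypothetical stand-ins, and `openBucket` is stubbed here so the pattern runs as-is; in real Node SDK 2.x code it would wrap something like `cluster.openBucket(...)`.

```javascript
// Hypothetical sketch: a module-level cached connection with a
// reconnect fallback. `openBucket` is a stub standing in for the real
// SDK connect call (e.g. cluster.openBucket in Node SDK 2.x).
let cachedBucket = null;

function openBucket(callback) {
  // Replace this stub with the real connect call in practice.
  process.nextTick(function () {
    callback(null, { connected: true });
  });
}

function getBucket(callback) {
  // Reuse the connection cached across warm invocations; only
  // reconnect when none exists (the check the AWS docs recommend).
  if (cachedBucket && cachedBucket.connected) {
    return callback(null, cachedBucket);
  }
  openBucket(function (err, bucket) {
    if (err) return callback(err);
    cachedBucket = bucket;
    callback(null, bucket);
  });
}
```

Because `cachedBucket` lives at module scope, it survives between invocations of a warm instance, which is exactly the state that Freeze/Thaw can leave half-open.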

If this can be improved, I would appreciate learning how.

UPDATE: I have gathered some additional information while attempting to resolve this issue and posted it in the Serverless Framework issue tracker, although it does not appear that the Serverless Framework or the provisioned Lambda is defective in any way.

Could it be I need to adjust or reconfigure the connection config?

Regards,

JG

It’s very hard to say what this is without more information. We have recently been discussing a similar situation, and one suspect is that the Freeze/Thaw cycle in AWS Lambda is at issue.

Specifically, the AWS docs say that when using a database, you should verify you are connected before trying to use it:
“if your Lambda function establishes a database connection, instead of reestablishing the connection, the original connection is used in subsequent invocations. We suggest adding logic in your code to check if a connection exists before creating one.”

What I believe is happening in the other case is that the process is ‘frozen’ in such a way that the TCP connection goes half-open. When it’s ‘thawed’ later to process an event, it tries to use a TCP connection it believed was in the ESTABLISHED state, gets a RST, and may then return an error when your app tries to use it.

In most deployments, we have some background processing that exercises those TCP connections at a very low rate (2.5 seconds between requests, eventually visiting all connections). Of course, that doesn’t happen if someone is freezing the process.

The workaround, in the other case, is to leverage the SDK’s ping() function to do what AWS recommends: ping to ensure connectivity before trying to use the client instance. That workaround has proven effective there, but it adds latency.
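As a sketch of that workaround, assuming Node SDK 2.x where `Bucket#ping` takes a callback: check the connection first, and only reconnect when the ping fails. The bucket objects below are stubs so the control flow is runnable, and `withHealthyBucket`/`makeStubBucket` are hypothetical names, not SDK API.

```javascript
// Sketch of the ping-before-use workaround. The `ping` method here
// mirrors the callback style of Bucket#ping in Node SDK 2.x, but the
// buckets are stubs so the flow can run without a cluster.
function makeStubBucket(healthy) {
  return {
    ping: function (callback) {
      process.nextTick(function () {
        healthy ? callback(null, { services: [] })
                : callback(new Error('socket reset'));
      });
    }
  };
}

function withHealthyBucket(bucket, reconnect, callback) {
  // Verify the (possibly thawed) connection before using it; if the
  // ping fails, discard it and reconnect instead of surfacing the error.
  bucket.ping(function (err) {
    if (!err) return callback(null, bucket);
    reconnect(callback);
  });
}
```

The cost of this check is the added round-trip latency mentioned above, paid on every request.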

We don’t specifically test/support AWS Lambda and there are probably some changes we’d like to make for that environment when we get a chance.

Questions for you: Does this freeze/thaw sound plausible? Do you know of any way to detect it?

It may be possible to intercept requests after the ‘thaw’ and check and fix the state of the connections before handling the request, but only if we know this cycle has occurred. I posted a question on one of the AWS blogs, but they’ve never posted my comment or a reply.

@brett19 may have some other thoughts.

Thank you @ingenthr for your reply.

I will review the documentation for ping(), attempt to leverage it, and determine whether that will remedy the issue.

Before I answer your questions or provide additional context, I want to be clear about what I believe is needed at this point. The code snippet I have posted to this topic accurately illustrates our connection configuration strategy. Although checking for an active connection is requisite, correctly reconnecting after a dropped connection is equally important. If the configuration I have posted here does not do both of these things, the issue I have reported may be resolved by correcting or improving our current connection implementation.

To your questions:

Does this freeze/thaw sound plausible?

Response:
It’s hard to say in this configuration. I can tell you with confidence that we had solved all previous intermittent connectivity issues by following the instructions available in the Couchbase documentation, assistance provided in this forum, and the advice and illustrations in numerous blogs and articles.

To be clear, we haven’t experienced “connection timeouts”, “…shutdown bucket”, or any other network or connectivity failures for some time. We have also successfully applied retry strategies where appropriate. The code above and the aforementioned improvements have resolved request failures related to “cold-starts” for some time. Connectivity has been very stable.

With connection-stability achieved, we focused our efforts on improving the response-time of requests made to idle serverless functions (“cold-starts”). Prior to implementing the recently provided AWS Lambda Provisioned Concurrency feature, responses rarely failed, even when they were slow.

Lambda Provisioned Concurrency was the logical choice to improve the performance of responses related to the cold-start (“freeze/thaw”) reality. It is this effort which spawned this topic.

Do you know of any way to detect it?

Yes, and we have. Unfortunately, it appears the nature of the “cure” is responsible for the “symptom” I am reporting here.

Lambda Provisioned Concurrency promises to ensure a configurable number of “warm” instances of the function are always available. It definitely appears to do just that. Unfortunately, connectivity to our cluster has been the challenge. Figuring out why we cannot maintain a stable connection, reconnect, or otherwise reinitialize a stable connection to our CB cluster is what brought us to where we are now.

I assumed that, given our connection configuration was to spec and appeared to be robust, we should not have issues connecting, reconnecting, or even re-initializing communications to CB from any active instance. Our connection logic has always resided in a separate “store.js” file, “required” into each Lambda file dependent upon the CB cluster.

We believe we have implemented our connection logic as soundly as recommended. If not, then we are seeking to learn what improvements or adjustments we need to make to it to allow a “warm” instance of a serverless function to work appropriately and robustly.