Couchbase server not responding to some concurrent kv queries

hi all,

I’m working on a legacy application that uses couchbase server “4.0.0-4051-community” and recently I noticed an issue after performing an instance upgrade + reshuffle.

some context info:

  • the couchbase servers are running on AWS EC instances
  • couchbase cluster consists of 8 instances
  • instances were upgraded (instance type upgrade) 2 at a time
  • consumer web application (using couchbase java client 2.7.0) sends kv queries and view queries.

we noticed a significant latency increase after all the reshuffling was completed. after some investigation I found some issues on the web application side (related to timeouts), but there is one issue that I think is caused by the couchbase server:

when many (on the order of hundreds) kv-queries are issued concurrently, the couchbase server does not respond to all of them.
e.g. when issuing 283 (actual number from the application) asynchronous (using this method) kv queries concurrently, I got 255 responses, while the remaining 28 just hang indefinitely, without any exceptions.
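
to pin down exactly which keys never get a response, a small probe along the following lines may help. this is only a minimal sketch using plain JDK classes, not the actual application code: `BatchGetProbe`, `fetchAll` and the `asyncGet` function are hypothetical names, and `asyncGet` is a stand-in for whatever asynchronous get call the application makes through the SDK. it fires all requests first, then waits against one shared deadline so that hanging requests show up as an explicit list instead of blocking forever.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical helper: issues many async gets and reports which keys never answered.
class BatchGetProbe {

    /** Outcome of one batch. */
    static final class Result {
        final Map<String, String> found = new TreeMap<>();   // key -> value returned
        final List<String> timedOut = new ArrayList<>();     // keys with no response before the deadline
        final Map<String, String> failed = new TreeMap<>();  // key -> error description
    }

    /**
     * asyncGet is an assumption: a function that starts an asynchronous
     * lookup for a key and returns a future for its value.
     */
    static Result fetchAll(List<String> keys,
                           Function<String, CompletableFuture<String>> asyncGet,
                           long timeoutMillis) {
        // Start every request before waiting on any, so they are in flight concurrently.
        Map<String, CompletableFuture<String>> inFlight = new LinkedHashMap<>();
        for (String key : keys) {
            inFlight.put(key, asyncGet.apply(key));
        }

        Result result = new Result();
        // One deadline for the whole batch rather than a fresh timeout per key.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        for (Map.Entry<String, CompletableFuture<String>> entry : inFlight.entrySet()) {
            String key = entry.getKey();
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                result.found.put(key, entry.getValue().get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException e) {
                // The request is still pending: this is the "hanging" case.
                result.timedOut.add(key);
            } catch (ExecutionException e) {
                // The request completed, but with an error.
                result.failed.put(key, String.valueOf(e.getCause()));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                result.timedOut.add(key);
            }
        }
        return result;
    }
}
```

calling `fetchAll` with the same list of keys from both environments and comparing the `timedOut` lists would show whether the same keys hang on every run, and whether they ever complete with an error rather than staying silent.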

some details:

  • when the application is deployed to the same EC network as the couchbase server (low network latency, on the order of single-digit milliseconds), this issue happens consistently for the same set of keys.
  • for the keys that are hanging, issuing a single synchronous get (using this method) will return document not found, but from the web console (UI), I was able to query the document by id.
  • when the same application is running from my laptop (non-EC network with slightly higher latency, about 50ms network latency), all 283 concurrent asynchronous calls get a response.
  • for the keys that caused the issue on the EC network, the documents are returned correctly when requested individually with single get calls from the locally-running application.
  • with the application running at the same time on both the EC network and locally, the same key can be fetched locally via a single get but not on the cloud (same API).

the above observations lead me to believe that there are multiple issues (or possibly one underlying issue) on the couchbase server side (4.0.0-4051-community):

  • there is a race condition in handling incoming commands / requests. the same set of calls works fine over a higher-latency network but not over a low-latency network.
  • certain keys are not answered consistently. the response is valid over the high-latency network but invalid over the low-latency network.

I figure this could be an issue that has already been fixed in a later version. Can someone from the Couchbase server team briefly go over the related changes and let me know whether that is the case?

Many thanks in advance!