Java SDK to protect against Internal Couchbase Errors

zoltan.zvara · October 8, 2021, 4:31pm

Due to a possible bug in Couchbase 7.0.1 (both CE and EE) [1], we experience frequent crashes of the indexing service. After the indexer and query services crash, the Java SDK reports the following error.

Internal Couchbase Server error
{
   "completed":true,
   "coreId":"0x3dc04a9600000002",
   "errors":[
      {
         "code":5000,
         "message":" dial tcp 10.233.82.91:9101: connect: connection refused from [seven-couchbase-2.seven-couchbase.development.svc.sigma:9101] - cause:  dial tcp 10.233.82.91:9101: connect: connection refused from [seven-couchbase-2.seven-couchbase.development.svc.sigma:9101]"
      }
   ],
   "idempotent":true,
   "lastDispatchedFrom":"10.240.0.10:41586",
   "lastDispatchedTo":"seven-couchbase-2.seven-couchbase.development.svc.sigma:8093",
   "requestId":202,
   "requestType":"QueryRequest",
   "retried":0,
   "service":{
      "operationId":"960c557d-95f8-4ccf-919d-889440cf8857",
      "statement":"SELECT b.* FROM `bucket`.`default`.`sescus-A_e` b WHERE clientID = $byField",
      "type":"query"
   },
   "timeoutMs":40000,
   "timings":{
      "dispatchMicros":7488504,
      "totalDispatchMicros":7488504
   }
}

The query is executed as follows:

val f = $.cluster.query(
  s"SELECT b.* FROM ${$.scalaEntityCollection.from} b " +
    "WHERE " + name + " = $byField",
  QueryOptions()
    .readonly(true)
    .parameters(
      Named(
        "byField" -> value
      )
    )
)

We mark the SELECT query as read-only, so the SDK treats it as idempotent ("idempotent":true in the error context above). However, the SDK does not attempt any retries ("retried":0), possibly because this type of error is not eligible for retry by the SDK. Therefore, we retry outside of the SDK, as suggested by the SDK documentation. Below is an experimental implementation of the blocking retry mechanism:

protected[storage] def withComplementaryRetry[R](
    f: => R
)(implicit couchbaseConfiguration: Couchbase.Configuration): R = {
  // R is erased at runtime, so this check is effectively always true for a
  // non-null result; in practice only a failed Future triggers another attempt.
  val querySuccess = new retry.Success[R](_.isInstanceOf[R])
  // Block the caller until an attempt succeeds or the overall timeout elapses.
  Await.result(
    retry.JitterBackoff(
      Int.MaxValue, // effectively unlimited attempts; bounded by the timeout below
      couchbaseConfiguration.complementaryRetryStrategy.baseDelay
    )(odelay.Timer.default, retry.Jitter.full(cap = 3.seconds)) {
      // Each attempt runs the blocking query on the dedicated thread pool.
      val future = Future(f)(Couchbase.complementaryRetryingThreadPool)
      future.onComplete {
        case Failure(exception) =>
          log.warn(
            "Could not complete query successfully due to error [{}] with message [{}]!",
            exception.getClass.getName,
            exception.getMessage
          )
        case Success(_) => ()
      }
      future
    }(querySuccess, Couchbase.complementaryRetryingThreadPool),
    couchbaseConfiguration.complementaryRetryStrategy.timeout
  )
}
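For readers without the retry library at hand, the idea behind the snippet above (retry a blocking call with exponential backoff and full jitter) can be sketched with the Scala standard library alone. This is only a minimal illustration, not the library's implementation: the names FullJitterRetry, maxDelay and retryBlocking are hypothetical, and it assumes a bounded number of attempts instead of an overall timeout.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._
import scala.util.{Failure, Random, Success, Try}

// Hypothetical helper, not part of the Couchbase SDK or the retry library.
object FullJitterRetry {

  // Upper bound of the pause before retry number `attempt` (0-based):
  // base * 2^attempt, capped at `cap`.
  def maxDelay(attempt: Int, base: FiniteDuration, cap: FiniteDuration): FiniteDuration = {
    val shift = math.min(attempt, 30) // keep the multiplier small enough to avoid overflow
    val exponential = base.toMillis * (1L << shift)
    math.min(exponential, cap.toMillis).millis
  }

  // Runs `f` until it succeeds or `maxAttempts` attempts have failed.
  // Between attempts the calling thread sleeps for a random duration in
  // [0, maxDelay), which is what "full jitter" means here.
  @tailrec
  def retryBlocking[R](
      maxAttempts: Int,
      base: FiniteDuration,
      cap: FiniteDuration,
      attempt: Int = 0
  )(f: => R): R =
    Try(f) match {
      case Success(result) => result
      case Failure(error) if attempt + 1 >= maxAttempts => throw error
      case Failure(_) =>
        val bound = maxDelay(attempt, base, cap).toMillis
        Thread.sleep((Random.nextDouble() * bound).toLong)
        retryBlocking(maxAttempts, base, cap, attempt + 1)(f)
    }
}
```

A call such as FullJitterRetry.retryBlocking(10, 100.millis, 3.seconds)(runQuery()) would then stand in for withComplementaryRetry, where runQuery is a placeholder for the blocking query call.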

When the above Internal Couchbase Server error is observed, withComplementaryRetry retries the query (because the SDK does not). However, even though Couchbase recovers from the indexer crash quickly, the existing SDK instance does not recover from the Internal Couchbase Server error, even after 300 seconds. While the 300 seconds are ticking down and withComplementaryRetry keeps retrying, a new SDK instance constructed in the meantime can complete queries successfully. This leads me to think that the SDK cannot recover from these Internal Couchbase Server errors on its own.
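Since a freshly constructed SDK instance succeeds where the existing one keeps failing, one conceivable stopgap is to replace the instance after several consecutive failures. The following is a minimal, SDK-agnostic sketch of that idea and not a confirmed fix: RefreshingClient, create, close and threshold are hypothetical names, and in our case create would build a new cluster connection while close would disconnect the old one.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical wrapper: holds a client of type C and replaces it with a fresh
// one after `threshold` consecutive failed operations.
final class RefreshingClient[C](create: () => C, close: C => Unit, threshold: Int) {
  private var client: C = create()
  private var consecutiveFailures = 0

  // Runs `op` against the current client. Calls are serialized by the lock,
  // which keeps the sketch simple but limits throughput.
  def use[R](op: C => R): Try[R] = synchronized {
    val result = Try(op(client))
    result match {
      case Success(_) =>
        consecutiveFailures = 0
      case Failure(_) =>
        consecutiveFailures += 1
        if (consecutiveFailures >= threshold) {
          Try(close(client)) // best effort; errors while closing are ignored
          client = create()
          consecutiveFailures = 0
        }
    }
    result
  }
}
```

The failed call itself is still returned as a Failure, so an outer retry such as withComplementaryRetry would pick up the fresh instance on its next attempt.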

Protecting against these cases is now important because Couchbase 7.0.1 CE and EE have an issue with the indexing service [1].

Is there any way to force the SDK programmatically to recover from these errors for idempotent queries?

[1] Couchbase 7 release date - #29 by yogendra.acharya

graham.pople · October 11, 2021, 4:49pm

Hi @zoltan.zvara
I’ve not got a concrete reply at this point, but just wanted to let you know that I’ve raised this internally for discussion. The initial thought is that since you’ve set the readonly flag then this should indeed be retried, but please don’t take that as definitive until we’ve had a chance to look further.