Couchbase Java SDK & circuit breaker issue

We are seeing a problem when failing over to the secondary server using the circuit breaker pattern. We stopped the Couchbase service and issued a get call. The call waited 75 seconds before moving to the secondary. This is our Couchbase config:

DefaultCouchbaseEnvironment
        .builder()
        .autoreleaseAfter(Duration.ofMinutes(5).toMillis())
        .build();

This is our failover code using the circuit breaker:

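/**
 * Fetches the document from the primary bucket, decorated with the circuit breaker,
 * and falls back to the secondary bucket on failure.
 *
 * @param key     The key to fetch the JsonDocument from Couchbase
 * @param traceId The trace id used for logging
 * @return The JsonDocument for the provided key
 */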
public Flowable<JsonDocument> get(final String key, final String traceId) {
    final long startTime = System.currentTimeMillis();
    return primaryByKey(key, traceId) // Fetch from primary
            .compose(circuitBreakerOperatorForKey)
            .onErrorResumeNext(t -> {
                log.error("Exception/Failure condition to fallback to secondary: {}", t.getMessage());
                if (t instanceof DocumentDoesNotExistException || t instanceof NoSuchElementException) {
                    return Flowable.error(t);
                } else {
                    return secondaryByKey(key, traceId);
                }
            })
            .doOnNext(t -> {
                log.info("name=\"get\" event_id=\"{}\" key=\"{}\" timeTaken={}", traceId, key, System.currentTimeMillis() - startTime);
            });
}
/**
 * Gets the JsonDocument for the key from the secondary bucket when the primary fails. In this condition
 * the circuit will be in the OPEN or HALF_OPEN state.
 * <p>
 * Note: there can be a slight delay in switching from CLOSED to OPEN, so the CLOSED state may still be logged here.
 *
 * @param key     The key to fetch from the far Couchbase instance
 * @param traceId The trace id used for logging
 * @return The JsonDocument
 */
private Flowable<JsonDocument> secondaryByKey(String key, final String traceId) {
    log.info("{}", new LogEvent("secondaryByKey", "IN")
            .addEntry("key", key)
            .addEntry("traceId", traceId)
            .addEntry("circuit_state", circuitBreakerForCouchbaseReadKey.getState().name()));
    return RxJavaInterop.toV2Flowable(secondaryAsyncBucket
            .get(key)
            .doOnError(error -> log.error("Error in secondaryByKey", error)));
}

Attached are the logs for the same.

server_logs.zip (177.7 KB)

Please read the logs backward, starting from line 9477.

As per the SDK, if it cannot connect to the cluster or a node is down, it should throw an error immediately rather than after 75 seconds. We can’t set a smaller get timeout: if the Couchbase server is flooded with requests, we risk opening the circuit and moving to the secondary, which adds a delay of its own.

One other observation: if you change the bucket password after the server has started, the exception is thrown instantaneously.

As per the Javadoc for AsyncBucket (Couchbase Java SDK), there are 4 exceptions associated with the get call, but none is thrown even when the SDK cannot connect to the Couchbase cluster. We can’t wait for x seconds, as the SLA agreed with consumers is below 60 ms, and we can’t fail over just because an odd request takes more than 50-55 ms due to a cross data center request.

@daschl can you please assist here

@daschl can you please assist.

@david.nault @graham.pople @AV25242 Can we get some eyes on this?

@himanshu.mps you are using the best effort retry strategy, which is why the client retries the operation until the timeout hits. If you want “fail fast” behavior, you need to enable the fail fast retry strategy. Note that in SDK 2 you can only enable it globally; in SDK 3 you can also override it on a per-request basis if needed.

With the best effort strategy, the behavior you described is the expected one.
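
For SDK 2, a minimal sketch of enabling the fail fast strategy globally could look like the following. It reuses the environment config from the question and adds one builder call. The class name FailFastEnvironment is hypothetical and only there to make the snippet compile; the Couchbase Java SDK 2.x must be on the classpath.

```java
import com.couchbase.client.core.retry.FailFastRetryStrategy;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

import java.time.Duration;

// Hypothetical holder class, used only for illustration.
public class FailFastEnvironment {

    public static CouchbaseEnvironment create() {
        return DefaultCouchbaseEnvironment
                .builder()
                // Same setting as in the original config.
                .autoreleaseAfter(Duration.ofMinutes(5).toMillis())
                // Replace the default best effort strategy so requests that
                // cannot be dispatched fail instead of being retried.
                .retryStrategy(FailFastRetryStrategy.INSTANCE)
                .build();
    }
}
```

With this in place, a get against an unreachable primary should surface an error on the Flowable quickly instead of being retried until the timeout, so the existing onErrorResumeNext branch can move to the secondary. Since the setting is global in SDK 2, it applies to every operation that uses this environment, so it is worth checking how other calls behave when they fail fast.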