Reconfiguration problems on node information after failover (SDK V2.0.3)


#1

I’m testing SDK V2.0.3.

There is a reconfiguration problem after failover.

Exception

[com.couchbase.client.core.endpoint.Endpoint] - [search-easter-cache10/10.41.117.140:11210][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096ms:
java.net.ConnectException: Connection refused: search-easter-cache10/10.41.117.140:11210

Test Results

Load Server : 15
Client Server : 6 (L7)
Data Seed
40 keys per request

Normal Performance
Concurrency (per load server) : 1

  • QPS : 3045 (4ms)
  • Response time : 4 ms (no fluctuation)
  • OPS : 90K (fluctuating: 100~120K)

Concurrency (per load server) : 4

  • QPS : 9705 (5ms)
  • Response time : 4 ms (fluctuating: 5~12 ms)
  • OPS : 90K (fluctuating: 0~400K)

Node Down
Concurrency (per load server) : 4

  • QPS : 1095 (53 ms)
  • Response time : 53 ms (no fluctuation)
  • OPS : 36K (no fluctuation)

Failover => If the client server is not rebooted, throughput is NOT recovered.

  • QPS : 1350 (45ms) => recovered to 9500(ops) after client reboot
  • Response time: 45ms (no shaking) => recovered to 5~6ms after client reboot
  • OPS: 48K => recovered to 9650(ops) after client reboot

Code

import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.StringDocument;

import rx.Observable;
import rx.functions.Func1;

// Fetches one document by key; on any error (including timeout) it logs and emits nothing.
public class HammerGather implements Func1<String, Observable<StringDocument>> {
    private final Logger logger = LoggerFactory.getLogger(HammerGather.class);

    private final Bucket bucket;
    private final long masterTimeout;
    private final long replicaTimeout; // not used in the code shown here

    public HammerGather(Bucket bucket, long masterTimeout, long replicaTimeout) {
        this.bucket = bucket;
        this.masterTimeout = masterTimeout;
        this.replicaTimeout = replicaTimeout;
    }

    @Override
    public Observable<StringDocument> call(final String imageHash) {
        return bucket.async().get(imageHash, StringDocument.class)
            .timeout(masterTimeout, TimeUnit.MILLISECONDS)
            .onErrorResumeNext(new Resumer(imageHash));
    }

    // Replaces a failed get with an empty Observable so the caller sees no value for this key.
    class Resumer implements Func1<Throwable, Observable<? extends StringDocument>> {
        private final String imageHash;

        public Resumer(final String imageHash) {
            this.imageHash = imageHash;
        }

        @Override
        public Observable<? extends StringDocument> call(Throwable throwable) {
            logger.warn(String.format("master error %s, timeout(%s), %s", imageHash, masterTimeout, throwable.toString()));
            return Observable.<StringDocument>empty();
        }
    }
}

#2

Hi @Sam_K,

So are you saying that after a failover performed on the UI (or through auto failover), your throughput drops to 0?

  • Can you share the exact steps you are performing to get to this result (maybe along with the code you’re using to drive the workload)?

We are performing extensive tests in our QE lab and of course test the failover scenario as well. There can always be bugs, but in order to track it I need a closer look at the workload.

  • Can you please also share DEBUG logs during your test run so we can identify odd behaviour?

Thanks!


#3

OK, I’ll share DEBUG logs with you.

I performed the failover on the UI. After the failover, OPS did not recover to the normal level seen before the node failure.

In the test results above, you can see the following OPS changes:
Normal OPS : 90,000
Node Down : 36,000
Failover: 48,000 (I expected 90,000)
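
To put those three figures in proportion, here is a small sketch that expresses each phase as a percentage of the normal OPS. The class and method names are made up for illustration; only the three OPS values come from the test.

```java
public class OpsRatio {
    // Returns the given OPS as a percentage of the baseline OPS.
    static double percentOfBaseline(long ops, long baselineOps) {
        return 100.0 * ops / baselineOps;
    }

    public static void main(String[] args) {
        long normal = 90_000;   // OPS before the node went down
        long nodeDown = 36_000; // OPS while the node was down, before failover
        long failover = 48_000; // OPS after failover, without a client reboot

        System.out.printf("node down: %.1f%% of normal%n", percentOfBaseline(nodeDown, normal));
        System.out.printf("failover:  %.1f%% of normal%n", percentOfBaseline(failover, normal));
    }
}
```

In other words, throughput sat at 40% of normal while the node was down and only came back to roughly 53% after the failover.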

Test steps
15 load servers start sending requests to the cbclients.

  • each load server has 4 request threads
  • 1 request has 40 keys for the Couchbase server
  • response timeout is 100 ms

  1. Warm up the cbclients for 20 minutes while monitoring OPS and response time.

  2. Record the normal status.

  3. Take a Couchbase node down:
    - sudo /etc/init.d/couchbase-server stop

  4. Monitor for 20 minutes with the node down and without failover.
    • OPS dropped

  5. Fail over on the UI.

  6. Monitor after failover.
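
The request shape used in these steps (40 keys per request, a 100 ms response timeout, and keys that miss the deadline simply dropped, which is what the Resumer in post #1 does by returning an empty Observable) can be sketched without the SDK. This is only a stand-in built on plain java.util.concurrent, not the Couchbase API: BatchGet, fetchBatch and the lookup function are hypothetical names, and the fake lookup in main just simulates one key that never answers in time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

public class BatchGet {
    // Looks up every key in parallel and keeps only the values that arrive
    // before one shared deadline. Keys that fail or time out are skipped.
    static Map<String, String> fetchBatch(List<String> keys,
                                          Function<String, String> lookup, // hypothetical stand-in for a KV get
                                          long timeoutMs,
                                          ExecutorService pool) {
        List<Future<String>> futures = new ArrayList<>();
        for (String key : keys) {
            Callable<String> task = () -> lookup.apply(key);
            futures.add(pool.submit(task));
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        Map<String, String> found = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                String value = futures.get(i).get(remaining, TimeUnit.NANOSECONDS);
                if (value != null) {
                    found.put(keys.get(i), value);
                }
            } catch (TimeoutException | ExecutionException e) {
                futures.get(i).cancel(true); // give up on this key only
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(40);
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            keys.add("key-" + i);
        }
        // Fake lookup: "key-7" simulates a key on a node that does not answer in time.
        Function<String, String> lookup = key -> {
            if (key.equals("key-7")) {
                try {
                    Thread.sleep(1_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return "value-of-" + key;
        };
        Map<String, String> result = fetchBatch(keys, lookup, 100, pool);
        System.out.println(result.size() + " of " + keys.size() + " keys answered in time");
        pool.shutdownNow();
    }
}
```

With one slow key out of 40, the batch still returns after about 100 ms, just with that key missing, so a single unreachable node costs every batch that touches it the full timeout.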


#4

Thanks for the steps. I think if you don’t fail over for 20 minutes, the request ring buffer gets quite full. How long did you wait to see it recover properly?
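
As a rough illustration of that point, here is a toy model of a bounded request buffer. It is not the SDK's actual ring buffer: the class name, the capacity of 8 and the idea that nothing drains the stuck requests are all assumptions made for the example. It only shows the general effect that requests for an unreachable node, if they stay in a fixed-size buffer, leave fewer slots for requests that could succeed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedBufferDemo {
    // Tries to enqueue `count` requests and returns how many were accepted.
    static int offerAll(BlockingQueue<String> buffer, String label, int count) {
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            // offer() returns false instead of blocking when the buffer is full
            if (buffer.offer(label + "-" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Capacity is an arbitrary toy value, not an SDK default.
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(8);

        // Requests aimed at the stopped node: nothing drains them in this model.
        int stuck = offerAll(buffer, "down-node", 6);
        // Requests aimed at healthy nodes now compete for the remaining slots.
        int healthy = offerAll(buffer, "healthy-node", 6);

        // prints: stuck=6, healthy accepted=2, free slots=0
        System.out.println("stuck=" + stuck + ", healthy accepted=" + healthy
                + ", free slots=" + buffer.remainingCapacity());
    }
}
```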


#5

I think I waited about 5 minutes after the failover for it to recover.
I’m sorry, I can’t remember the exact waiting time.

Isn’t the ring buffer cleared even though the async get timeout occurred?
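
The distinction behind this question can be shown with plain java.util.concurrent. This is a general Java illustration, not a statement about how the SDK behaves: a timeout seen by the caller and the removal of the queued work are two separate events. All names below are hypothetical, and the blocked worker stands in for a node that never answers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutVsQueue {
    // Returns {queued tasks after the caller timed out, queued tasks after cancel + purge}.
    static int[] run() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Occupy the only worker, like a node that never answers.
            pool.submit(() -> { release.await(); return null; });
            // This request can only wait in the queue.
            Future<String> pending = pool.submit(() -> "value");

            int afterTimeout;
            try {
                pending.get(50, TimeUnit.MILLISECONDS);
                throw new IllegalStateException("expected a timeout");
            } catch (TimeoutException e) {
                // The caller has given up, but the task is still queued.
                afterTimeout = pool.getQueue().size();
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }

            pending.cancel(false); // mark the task as cancelled
            pool.purge();          // drop cancelled tasks from the queue
            int afterCancel = pool.getQueue().size();
            return new int[] {afterTimeout, afterCancel};
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] sizes = run();
        System.out.println("queued after timeout: " + sizes[0]); // prints 1
        System.out.println("queued after cancel:  " + sizes[1]); // prints 0
    }
}
```

Here the timeout alone leaves the task queued; it only goes away after an explicit cancel and purge. Whether the SDK's request buffer behaves like the first or the second case after an async get times out is exactly the open question.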