Node Offline and GetReplica failure


#1

Hi.

I am still finding my way with the ‘new’ official api, and couchbase client behaviour in general.

However, I would expect that if I attempt a Get on a key in a vBucket of a node that has gone offline (through network or VM failure, not failed over), it fails as expected with a timeout.

In this case I try GetReplica (replica index 0 to attempt all replicas, though I only have one), which also fails with "The operation has timed out."

Is GetReplica useful in this case? Should I expect to receive the dirty read that I want?

It looks similar to this issue in the .NET client: http://review.couchbase.org/#/c/49270/

This is in a two-node cluster with 1 replica. I've generated 1024 keys, one hashing to each of the default initial vBuckets (initially due to a previous rebalance bug, but that's a different story).
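The pattern I'm attempting can be sketched roughly like this. The fetch functions here are stand-ins for the real client calls (the actual gocb Get/GetReplica signatures differ; this is just the fallback logic, stubbed so it runs without a cluster):

```go
package main

import (
	"errors"
	"fmt"
)

// fetchFn stands in for a client read call such as a primary Get or a
// replica read; injected so the sketch is self-contained.
type fetchFn func(key string) (string, error)

var errTimeout = errors.New("the operation has timed out")

// getWithReplicaFallback tries the primary read first and, if it
// fails (e.g. the node is offline), accepts a possibly stale value
// from a replica instead - the dirty read described above.
func getWithReplicaFallback(key string, primary, replica fetchFn) (string, error) {
	val, err := primary(key)
	if err == nil {
		return val, nil
	}
	return replica(key)
}

func main() {
	// Simulate a downed primary node and a healthy replica.
	downPrimary := func(key string) (string, error) { return "", errTimeout }
	replica := func(key string) (string, error) { return "value-from-replica", nil }

	val, err := getWithReplicaFallback("some-key", downPrimary, replica)
	fmt.Println(val, err)
}
```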


#2

Hey danmux,

You indeed should expect this to work. What replicaIdx are you passing to the GetReplica method?

Cheers, Brett


#3

0, which looks like it tries any replica index it can.

EDIT - OK, actually it looks like it is recovering a replica, but slowly. It was hidden because I am reading 1024 values in goroutines, and 512 are failing, so it was hard to see the successful replica reads. I think the timeouts were all due to the initial Get.

When I pick one key I know to be in a primary vBucket on the downed node, it is read from the replica.

However, when I put back the 1024 reads I get a "Queue overflow" error, implying the replica read is indeed slow?

EDIT 2 - OK, since you told me it should work, and after looking at it properly, I can see the replica reads are working as expected and they are not slow (3 ms over loopback). It is the failed original reads that fill the queue, because they are held in the queue for a long time before timing out, I suppose. A replica read issued while getting the queue-full error also works.
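For what it's worth, bounding the in-flight reads avoids flooding the client's queue with 1024 simultaneous operations while the primary is timing out slowly. A minimal sketch with a semaphore channel (readKey is a stub for the real primary-get-with-replica-fallback):

```go
package main

import (
	"fmt"
	"sync"
)

// readKey stands in for a primary Get with replica fallback; stubbed
// here so the sketch runs without a cluster.
func readKey(key string) error {
	return nil
}

// readAll issues reads for all keys with at most limit in flight at
// once, and counts successes and failures separately so failed
// primary reads don't hide successful replica reads.
func readAll(keys []string, limit int) (ok, failed int) {
	sem := make(chan struct{}, limit)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, k := range keys {
		wg.Add(1)
		sem <- struct{}{} // block while limit reads are in flight
		go func(key string) {
			defer wg.Done()
			defer func() { <-sem }()
			err := readKey(key)
			mu.Lock()
			if err != nil {
				failed++
			} else {
				ok++
			}
			mu.Unlock()
		}(k)
	}
	wg.Wait()
	return ok, failed
}

func main() {
	keys := make([]string, 1024)
	for i := range keys {
		keys[i] = fmt.Sprintf("key-%d", i)
	}
	ok, failed := readAll(keys, 32)
	fmt.Println(ok, failed)
}
```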

So in summary - the code is working as designed.

Thanks for your inspirational reply, Brett.

Dan.


#4

So it now looks like once the node is down and the queue has filled up (even after setting opTimeout to 500 ms), it never recovers. Once the node is back up I still get the "Queue overflow." error.

Why does the queue not empty? Surely the old requests time out and are then dropped from the queue. New requests are good (now that the node is up) and could be served quickly; I'm only sending 1024 reads every 4 seconds.

Or is it that, because the connection to the node for those vBuckets has been severed, nothing is actually reconnected, so they will always time out now until I reconnect? And is that something I should do in my own client code, and if so, how - just the normal OpenBucket?
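If reconnecting from client code turns out to be necessary, one possible workaround is to reopen the bucket after a run of consecutive overflow errors. This is only a sketch of that idea - the reopen function stands in for whatever the normal OpenBucket call is, stubbed so it runs without a cluster:

```go
package main

import (
	"errors"
	"fmt"
)

var errQueueOverflow = errors.New("queue overflow")

// retryWithReopen retries op and, after maxOverflows consecutive
// "queue overflow" errors, calls reopen (standing in for reopening
// the bucket connection) once before continuing. maxAttempts bounds
// the total number of tries so we never loop forever.
func retryWithReopen(op func() error, reopen func() error, maxOverflows, maxAttempts int) error {
	overflows := 0
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = op()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errQueueOverflow) {
			return err // some other error: give up immediately
		}
		overflows++
		if overflows >= maxOverflows {
			if rerr := reopen(); rerr != nil {
				return rerr
			}
			overflows = 0
		}
	}
	return err
}

func main() {
	calls := 0
	op := func() error {
		calls++
		if calls <= 3 {
			return errQueueOverflow // queue stuck until "reconnect"
		}
		return nil
	}
	reopened := false
	reopen := func() error { reopened = true; return nil }

	err := retryWithReopen(op, reopen, 3, 10)
	fmt.Println(err, reopened)
}
```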


#5

Hey danmux,

This sounds like a bug. When a node recovers, the queue for it should continue and the operations should be dispatched. I’ll take a look!

Cheers, Brett


#6

Hey danmux,

Can you confirm whether the Queue Overflow goes away eventually following the successful return of a node, or does it just stay with an overflowed queue indefinitely?

Cheers, Brett


#7

Will confirm when the pressure of the release has subsided.

(Off topic: if I have a suggestion/improvement, what's the procedure - a PR, or can I create a ticket in your bug tracker? Or suggest it here in the forum first?)


#8

Hey danmux,

You can create a ticket in our bug tracker. You can additionally submit a PR to our GitHub repo, although this option will soon involve submitting your PR through Gerrit instead once we’ve managed to move the project there.

Cheers, Brett