Map-reduce views not available during failover


#1

I am experiencing problems with availability of map-reduce views after node failure. When node fails, any query to a view from control panel timeouts after 60 seconds and our application client (in node.js, using couchbase@2.2.4 ) reports cryptic error “error : parsing failed” after 15 seconds.

We are running couchbase-community-4.1.0 (dockerized) with 5 nodes. Bucket configured with:

  • Replicas enabled, number: 2
  • View index replicas enabled
  • Auto-failover is enabled with timeout of 30 seconds

Documentation on Views and High Availability led me to believe that after a failover of node view queries should be served from a replica; this is not happening. Just after triggering failover cluster begins to rebuild index on this bucket and view queries fail.

Said queries are issued with:

couchbase.ViewQuery.from("bucket", "history")
            .on_error(couchbase.ViewQuery.ErrorMode.STOP)
            .stale(couchbase.ViewQuery.Update.NONE)
            .reduce(true);

Is there a configuration option on high availability that we are missing? Right now failure of single node disables our whole application, which is a game-breaker for us.


24x7 Uptime during a fail-over
#2

Can you check if you see this problem directly from the view engine REST API, or if it’s something related to your client application / SDK?

You can go to the Admin UI, Indexes -> Views, select your production view, then either query using the UI, or click on the generated URL at the bottom and test with curl or whatever from your app servers. For example, to query the sample bucket beer-sample using the brewery_beers view, showing the first 50 results:

curl "http://127.0.0.1:8092/beer-sample/_design/beer/_view/brewery_beers?limit=50"

#3

When querying view directly from Admin UI I get following error:

from:  http://<host>:8092/bucket/_design/bucket/_view/history?inclusive_end=true&stale=false&connection_timeout=60000&limit=10&skip=0
reason: unexpected

It looks like it’s not SDK-related.


#4

@asingh @vmx any suggetions?


#5

Did you get that output from curl call or you copied the message from Admin UI? Usually error message will be more verbose. You could just share the dump of curl call output.

You point about leveraging replica indexes on node down scenario - it would work fine for stale=ok queries but for stale=false queries which basically triggers index update first and then returns the collated result back to client/user. On node failover, vbuckets move in and out from nodes within the cluster - for that reason view updater has to make sure it index/cleanup new vbuckets that moved in/moved out. More details if you’re interested are documented - View Engine Internals

From your application perspective, do you really need stale=false query?

FYI, in this case looks like you’re using aggregation(reduce) but non-agreegate queries you would get much better/consistent performance using N1QL and Secondary Indexes(primarily because of the design differences b/w view-engine and 2i).


#6

This is from Admin UI; curl output claimed timeout on two of the nodes:

$ curl -s "http://<creds>@moon-paw:8092/bucket/_design/bucket/_view/history?inclusive_end=true&stale=ok&connection_timeout=15000&limit=10&skip=0"
{
    "rows":[...],
    "errors":[
        {"from":"loyal-hound:8092","reason":"timeout"},
        {"from":"moon-moon:8092","reason":"timeout"}
    ]
}

These nodes did show up ~90% CPU usage at that time, but 15 seconds to read a stale index (which has replicas) seems excessive.

No I don’t and my application actually doesn’t use it:

couchbase.ViewQuery.from("bucket", "history")
            .on_error(couchbase.ViewQuery.ErrorMode.STOP)
            .stale(couchbase.ViewQuery.Update.NONE)
            .reduce(true);

This is our healthcheck query, and this one is timing out.


#7

These nodes did show up ~90% CPU usage at that time, but 15 seconds to read a stale index (which has replicas) seems excessive.

I can’t say for sure if your cluster is properly sized for the workload. View Indexer doesn’t use CPU firepower aggressively, I suppose it would use just one core - even in that case if you’re hitting 90% CPU util, then it’s a problem. Probably worth checking for the process that’s eating most CPU as well.

Now about the point you made w.r.t. 15 seconds just to read data from index file - please note it isn’t just reading, it’s indexing/cleaning new/old data as well(for all the new partitions that would have come to loyal-hound and moon-moon node). And that time would vary on your dataset.

No I don’t and my application actually doesn’t use it:

I’m not sure what purpose you want to serve with health check query, probably you could share some info and then we could suggest appropriate alternatives. Certainly firing stale=false query isn’t a sane way to verify if nodes are up or not.

I also hope you haven’t created views on your cluster just for doing health check.


#8

Cluster does not have much problems when in normal operation. It holds ~10M documents and can accept consistently ~15k/s writes in normal operation. Views are all available in these conditions; problems arise after failover or during rebalance.

I will check with single indexer thread and monitor workload.

This particular view serves event statistic over time; it serves a number of events registered between given timestamps and some basic statistics of them, grouped by event source. This check just tests top-level reduce (which would usually output total number of events recorded and timestamp of last recorded event) to check if statistics is available and how old it is, so application will not bother sending detailed queries.

[quote=“asingh, post:7, topic:10968”]I also hope you haven’t created views on your cluster just for doing health check.
[/quote]
Certainly not.


#9

Maybe you could reassess if you health check query could just do stale=ok calls and certainly cluster sizing review needs to be prioritised(it includes bunch of things like cpu/ram firepower, some essential os settings, # of ddocs/views, different couchbase you’re running in your setup, disk throughput and more - some of common ones should be present in official docs). In many cases, cluster sizing review would improve working state of CB cluster.


#10

I managed to recreate this problem with in our staging environment. We have 5 servers running couchbase community-4.1.1, 4GB RAM each, cluster RAM quotas are set to 1536MB for data and 1024MB for index. Bucket has ~1e7 documents, 1024MB per node RAM quota, replicas and view index replicas enabled with number of replicas set to 1.

Auto-failover enabled with default timeout of 120 seconds.

Operation is tested with:

curl "http://<creds>@couchbase:8092/bucket/_design/bucket/_view/history?inclusive_end=true&stale=ok&connection_timeout=15000&limit=10&skip=0"
  1. Normal operation

  2. Initiated shutdown of one node (with docker stop)

    1. Cluster gives error to queries: econnrefused (expected)
  3. After 120s - auto failover kicks in (as expected)

    1. Cluster immidietly begun indexing bucket/_design/bucket (not expected, should use replica view index)
    2. Cluster is not responding to queries, including stale queries (timeout, NOT expected)
    3. Index is rebuilt after ~1 hour.
    4. Cluster begun responding to queries nomally

It seems that after auto failover, cluster does not use view index replicas and start re-indexing bucket apparently from scratch, which we did not expect. Shouldn’t the view index replicas be used in this scenario? I was lead to believe that bucket with view index replicas enabled has… well, replicas of view data that should serve queries after failover:

When you failover a node, the partitions from that node are not accessible anymore. The corresponding replicas that are split across the cluster will become active and serve the missing data. The end user will still get all the data back he expects.

Why this is not happening?


#11

This lack of auto failover of the indexing service was also unexpected for our application when this occurred.
After manually failing over the node via the couchbase web ui, the view queries began working again.

This document has a single line that mentions why this happens, but newcomers would have no idea of the consequence of the statement: "Not designed to failover the Index Service by default"
https://developer.couchbase.com/documentation/server/current/clustersetup/automatic-failover.html

Our application needs two things that I’m not sure Couchbase can currently provide:

  1. auto failover of index service
  2. java sdk where connected application can be notified on failure and notified when cluster is no longer in error, as in the interim period, no query results can be trusted and at present I don’t see an api to be alerted to that (I could be wrong, please let me know if there is such an api)

#12

I assumed that this was a reference to view replicas not being enabled by default.


#13

@leon.chadwick That sentence you quoted specifically calls out the index service, which is GSI (global secondary indexes). View indexes are maintained by the data service and the replication and failover behavior is described here. Hope that helps. -Will


#14

I have read page you have pointed me to, and it states:

Couchbase Server can optionally create replica indexes on data nodes that contain replicated data.

You can query a view index during cluster rebalance and node failover operations.

Problem is, we don’t see that happening. If a node fails, or there is rebalance in progress, our indexes are not available. We have confirmed this in our staging and even a completely independent cluster we have set up at Digital Ocean just to run these tests (5 8GB/4CPU dropplets) to rule out any hardware-based issues.

At this point CouchBase seems to be unsuitable for our use case (despite documentation stating that it should be suitable) and we have moved to another DB solutions.


#15

Thank you Will for your response. My second point on the problem was making the client aware of issues when doing the view query. For example, when I shut down a node of the cluster, the view service is still providing data without erroring, depending upon whether the ViewQuery.onError() was set to STOP, or CONTINUE, I will get 0 results or only the results from the nodes that are still up respectively. It would be helpful to have the option to have OnError.ERROR, to actually raise an error, as in our scenario an incomplete data set would be far worse than being told there was an error.
As soon as the node goes down, the logs in the client code start spamming warnings, so somewhere in the Java SDK it must immediately know there is an issue. This was the point I was making that it would be useful to allow the client to observe the change in cluster health as it happens (in addition to the aforementioned failure when the cluster is actually queried).
Below is the warning example that is being logged over and over:
2017-02-07 17:16:11,643 WARN [cb-io-1-4] c.c.c.c.e.Endpoint - [myserver.myorg.com/1.2.3.4:8093:8093][QueryEndpoint]: Socket connect took longer than specified timeout.
2017-02-07 17:16:11,646 WARN [cb-io-1-4] c.c.c.c.e.Endpoint - Error during reconnect:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: connection timed out: xldn6386dap.ldn.swissbank.com/14.64.153.73:8093
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:218)
at com.couchbase.client.deps.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at com.couchbase.client.deps.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
2017-02-07 17:16:11,646 WARN [cb-io-1-4] c.c.c.c.e.Endpoint - [myserver.myorg.com/1.2.3.4:8093][QueryEndpoint]: Could not connect to endpoint, retrying with delay 32 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: connection timed out: xldn6386dap.ldn.swissbank.com/14.64.153.73:8093
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:218)
at com.couchbase.client.deps.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at com.couchbase.client.deps.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)


#16

Our operating model here in the Java SDK @leon.chadwick is that we try to get the request serviced inside the timeout value. Since views requests are idempotent, as you see the Java SDK will keep retrying, but logging warnings in case we need to diagnose something from logs.

For observing changes in cluster health, the Java SDK does have an Event Bus which is covered in the documentation. It depends a bit on the nature of the failure, but that may give you the observe behavior you want. @daschl or @subhashni may be able to answer whether you’ll see events in a particular instance.

Also, the OnError enums are really to set the params when sending a view request, not so much an instruction to the SDK. The errors described here are errors received when the cluster is processing a request, not an error in the request/response of a request. For that, most paths end at TimeoutException because of the idempotent retries mentioned above.

Hope that helps!


#17

Thanks for the feedback. I am looking at how the EventBus could be of use.
The raw problem here is that with a cluster with 1 or more nodes out of service, most paths do NOT appear to end in TimeoutException, but in fact return with an AsyncViewResult.success == TRUE ! At the same time, AsyncViewResult.error != null. I have now stopped paying attention to the success flag and just look at the error and am not able to judge correctly that the query has actually not been a success.


#18

We see the same kind of behaviour in our setup. We enabled view index replication because our customer noticed error responses during a rebalance/failover/node failure. We had hoped that this would solve our problems but this does not seem to have any affect.

Anybody care to elaborate on this ? Why do the views need to rebuild themselves even though we have enabled view index replication ?


#19

The AsyncViewResult.success true sounds like a bug of some sort. @daschl: any thoughts on what might be happening here?


#20

I’m not sure why views would need to rebuild even though view index replication is on. Maybe @anil can help or point to someone who can?