Long-running N1QL queries completing prematurely without exceptions

query
connections
n1ql

#1

Hello!

I'm encountering an issue that I'm not sure how to resolve, seen on both Couchbase 4.1 and 4.5 DP1 Enterprise, using Java SDK 2.2.3 and 2.2.5 (tested on both).

I've got a bucket (single-server cluster) with about 1.7 million docs, no map/reduce views, and 3 GSI indexes.
Bucket stats:
RAM usage: 2.52GB / 7.4GB
Data usage: 1.4GB / 1.98GB
Data RAM quota: 8192MB
Index RAM quota: 5120MB

I’m running an async N1QL query that, when 100% successful, returns about 170k docs.

When I run this query under any sort of load (such as running the same query from another instance of the client on a different VM), the subscriber's onComplete fires before all 170k rows are retrieved, typically after about 270-300 seconds, having processed about 100-110k documents. No exceptions occur: onError is never triggered on the observable, and onComplete is invoked.

Here's the query, with the GSI-indexed parts in bold (where some names have been changed to protect the innocent).

SELECT * FROM mybucket AS thelog WHERE order_a = $order_a AND order_b = $order_b AND time BETWEEN $startTime AND $endTime AND type = $type ORDER BY time ASC

On a successful run, when all rows are retrieved, it takes about 5-6 minutes (which is OK; we're running this async on a lower-priority thread).

I've set the serverTimeout to 30 minutes in the query params, and the environment has queryEndpoints set to 4 (the number of threads I expect to process requests concurrently).

The observable takes the AsyncN1qlQueryResult and flatMaps to AsyncN1qlQueryRow, then maps to the JsonObject inside; the subscriber takes each JsonObject and writes it to a file on disk after some light formatting. The subscriber's onComplete counts down a latch that the main thread is waiting on.
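In case it helps to see the shape of it: here's a minimal sketch of that latch-based completion pattern, using only plain java.util.concurrent types as a stand-in for the SDK's observables (the streamRows helper and row strings are purely illustrative, not the actual SDK API):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class LatchPipelineSketch {
    // Stand-in for the SDK's async row stream: delivers each row to onRow on a
    // worker thread, then signals completion via onComplete (like onCompleted).
    static void streamRows(List<String> rows,
                           java.util.function.Consumer<String> onRow,
                           Runnable onComplete) {
        new Thread(() -> {
            rows.forEach(onRow);
            onComplete.run();
        }).start();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> written = new CopyOnWriteArrayList<>();
        CountDownLatch latch = new CountDownLatch(1);

        streamRows(List.of("{\"id\":1}", "{\"id\":2}", "{\"id\":3}"),
                row -> written.add(row.trim()),   // "light formatting" + write step
                latch::countDown);                // onComplete releases the main thread

        latch.await();                            // main thread blocks here
        System.out.println("rows written: " + written.size());
    }
}
```

The key point for this bug: the latch is released by onComplete whether or not all rows actually arrived, which is why a masked server-side error looks like a clean early finish.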


#2

Very interesting. There must be something the load is inducing. One guess is that there's a timeout somewhere further back in the system and the error is being masked at some level. Did you note anything in the N1qlMetrics for the seemingly truncated queries? It'd be interesting, for instance, if they all showed exactly the same execution time.

I’d recommend a couple things to diagnose further.

Check the cbq-engine and indexer logs on the nodes. Is there anything that indicates a timeout or an error occurred? The log locations are described in the Administration Guide.

Also, turn up the logging level on the Java program using whatever logging framework you’ve selected. We have some details on setting this up in the documentation.

If neither of those turns up anything suspicious, it may be worth getting a packet capture between the application and the cluster (port 8093) to isolate the problem to either the SDK-to-cbq-engine interaction or the cbq-engine-to-indexer/bucket interaction.
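For the capture itself, something along these lines on the application host should do, assuming tcpdump is available (adjust the interface and filename as needed; open the resulting file in Wireshark afterwards):

```shell
# Capture full packets on the N1QL service port (8093) for later analysis.
# -s 0 keeps entire payloads, -w writes a pcap file, -i any listens on all interfaces.
sudo tcpdump -i any -s 0 -w n1ql-8093.pcap tcp port 8093
```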


#3

Thanks for the ideas.

I also thought about timeouts, but one interesting behavior I found is that a few times when I tested 3 concurrent long-running queries (all exactly the same query), the first two would "fail" after ~270-300 seconds, while the last one would run for another ~200 seconds, complete, and return all my data; total run time for that last, successful one was about 500 seconds.

I'll try increasing the log level in the Java SDK and grab the server logs. I collected the logs earlier this morning, but I have no idea what to look for and the files are quite large. I'll look specifically at the cbq-engine log for likely candidates.

If I do get a packet capture, do you know what I should look for? Perhaps whatever signals the end of an N1QL data stream from Couchbase?


#4

OK, the /opt/couchbase/var/lib/couchbase/logs/query.log shows something that may be it.

I ran 3 concurrent queries again: 2 "failed", one after 290 seconds and another after 295 seconds. The last one completed after 537 seconds and has all the rows I expect.

There are two "error" lines in the log that indicate a GSI index scan timeout, even though the successful run went on far longer. No exception on the client.

2016-03-30T22:25:10.468-05:00 [Info] connected with 1 indexers
2016-03-30T22:25:10.469-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:25:10.469-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:25:10.469-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:25:10.47-05:00 [Info] client load stats {"12522202708307348729": 2.457756976e+09,"13836329514760950447": 1.09334126766e+11}
2016-03-30T22:26:10.47-05:00 [Info] connected with 1 indexers
2016-03-30T22:26:10.47-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:26:10.47-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:26:10.47-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:26:10.471-05:00 [Info] client load stats {"13836329514760950447": 1.09334126766e+11,"12522202708307348729": 2.457756976e+09}
2016-03-30T22:26:22.327-05:00 [Error] [GsiScanClient:"IPADDRESSHERE:9101"] Range(40cf16b6-b914-456d-aa06-e0e18f672278) response failed Index scan timed out
2016-03-30T22:26:23.117-05:00 [Error] [GsiScanClient:"IPADDRESSHERE:9101"] Range(27935db7-b340-49a2-ac66-053a838e01c6) response failed Index scan timed out
2016-03-30T22:27:10.469-05:00 [Info] connected with 1 indexers
2016-03-30T22:27:10.47-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:27:10.47-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:27:10.47-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:27:10.47-05:00 [Info] client load stats {"13836329514760950447": 1.09334126766e+11,"12522202708307348729": 2.457756976e+09}
2016-03-30T22:28:10.558-05:00 [Info] connected with 1 indexers
2016-03-30T22:28:10.558-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:28:10.558-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:28:10.558-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:28:10.558-05:00 [Info] client load stats {"13836329514760950447": 1.09334126766e+11,"12522202708307348729": 2.457756976e+09}
2016-03-30T22:29:10.469-05:00 [Info] connected with 1 indexers
2016-03-30T22:29:10.469-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:29:10.469-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:29:10.469-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:29:10.469-05:00 [Info] client load stats {"12522202708307348729": 2.457756976e+09,"13836329514760950447": 1.09334126766e+11}
2016-03-30T22:30:10.468-05:00 [Info] connected with 1 indexers
2016-03-30T22:30:10.469-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:30:10.469-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:30:10.469-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:30:10.469-05:00 [Info] client load stats {"12522202708307348729": 2.457756976e+09,"13836329514760950447": 1.09334126766e+11}
2016-03-30T22:31:10.468-05:00 [Info] connected with 1 indexers
2016-03-30T22:31:10.468-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:31:10.469-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:31:10.469-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:31:10.469-05:00 [Info] client load stats {"13836329514760950447": 1.09334126766e+11,"12522202708307348729": 2.457756976e+09}
2016-03-30T22:32:10.469-05:00 [Info] connected with 1 indexers
2016-03-30T22:32:10.469-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:32:10.469-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:32:10.469-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:32:10.47-05:00 [Info] client load stats {"13836329514760950447": 1.09334126766e+11,"12522202708307348729": 2.457756976e+09}
2016-03-30T22:33:10.468-05:00 [Info] connected with 1 indexers
2016-03-30T22:33:10.469-05:00 [Info] index 12522202708307348729 has 1 replicas
2016-03-30T22:33:10.469-05:00 [Info] index 13836329514760950447 has 1 replicas
2016-03-30T22:33:10.469-05:00 [Info] index 8215952061795496058 has 1 replicas
2016-03-30T22:33:10.469-05:00 [Info] client load stats {"13836329514760950447": 1.09334126766e+11,"12522202708307348729": 2.457756976e+09}


#5

That definitely helps narrow it down. Since this is something we can see at development time, can you log N1qlMetrics.asJsonObject() for both the happy and unhappy cases? This will help us identify whether we need to file an issue against the SDK for handling the error, or file one somewhere between query and 2i.


#6

Is there any way to tweak the index scan timeout value on the server?

Here are two sets of unhappy metrics:
{"executionTime":"5m8.959192566s","resultCount":167051,"sortCount":167051,"errorCount":1,"resultSize":369402851,"elapsedTime":"5m8.959432245s"}

{"executionTime":"5m4.585986595s","resultCount":163922,"sortCount":163922,"errorCount":1,"resultSize":362480091,"elapsedTime":"5m4.58609267s"}

And the happy metrics:
{"executionTime":"4m14.208411232s","resultCount":178532,"sortCount":178532,"resultSize":394791598,"elapsedTime":"4m14.208576703s"}


#7

The error object in the metrics agrees with the server's query.log:
{"msg":"Index scan timed out - cause: Index scan timed out","code":12015}

It's problematic that this does not surface as an exception on the client, because the results are not complete.

I suppose I can check the metrics object for errors at some point and cancel the rest of the process flow, but I'd still at least like to be able to tweak the index scan timeout.
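To sketch the kind of guard I mean (hasErrors is a hypothetical helper; a plain string check stands in for the SDK's JsonObject, purely for illustration):

```java
public class MetricsErrorGuard {
    // Decide whether to continue the workflow based on the errorCount field of
    // the N1QL metrics JSON. Judging by the metrics above, the server omits
    // errorCount entirely when there were no errors, so a present, non-zero
    // errorCount signals an incomplete result set.
    static boolean hasErrors(String metricsJson) {
        return metricsJson.contains("\"errorCount\"")
                && !metricsJson.contains("\"errorCount\":0");
    }

    public static void main(String[] args) {
        String unhappy = "{\"executionTime\":\"5m8.9s\",\"resultCount\":167051,\"errorCount\":1}";
        String happy = "{\"executionTime\":\"4m14.2s\",\"resultCount\":178532}";
        // Only continue the downstream flow when the metrics show no errors.
        System.out.println(hasErrors(unhappy) + " " + hasErrors(happy));
    }
}
```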

To get it to work in the first place I had to increase N1qlParams.serverSideTimeout; perhaps that setting could be taken to indicate a willingness to wait on the index scan as well.


#8

I’ve opened JCBC-942 about the error handling problem.

I don't know how to adjust the 2i scan timeout off the top of my head. I thought the timeout sent for query processing would be passed down, but the docs indicate that the "indexer timeout setting" needs to be adjusted. They don't say how to adjust it, though, so I've filed DOC-1248.


#9

Thanks ingenthr!

I've adjusted the index scan timeout using the REST method, and I am now able to complete my long-running queries successfully.
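For anyone finding this later, the call looks roughly like the following; the key name, indexer admin port (9102), and millisecond units are my understanding of the indexer settings REST API, so verify them against the docs for your Couchbase version:

```shell
# Raise the GSI index scan timeout to 20 minutes (value is in milliseconds).
# INDEXER_NODE is a placeholder for the host running the index service.
curl -X POST http://INDEXER_NODE:9102/settings \
     -d '{"indexer.settings.scanTimeout": 1200000}'
```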

As I mentioned before, it would be useful if the index scan timeout matched the server-side timeout value, especially when I have explicitly set it on the query.


#10

@jberglund can you confirm that you see an error JsonObject in the errors() Observable (or List, if you're using the sync API)?

The way both AsyncViewQueryResult and AsyncN1qlQueryResult deal with errors reported by the services is to directly expose them as JsonObjects, not throw an Exception.

That is something we are looking at improving, but we'd need to maintain some sort of backward compatibility. So we'd probably want a helper class or a request flag to switch from "errors as a separate errors() Observable" semantics to "errors as onError in the rows() observable" semantics.


#11

Yes I do see the JsonObject in the errors() observable.

Based on what I found at http://blog.couchbase.com/error-handling-java-client-library, I was under the impression that I would see an exception in the event of an error: "In the Java client library, there are exceptions, no errors."

Other than the original issue (the index timeout), the way this works means the async error requires me to also subscribe to, and block on, the errors observable before continuing my application's workflow. From a purely practical perspective, if we can't get an exception, I would prefer errors in the rows observable, since we are already streaming the rows back; my application's flow needs the rows to complete successfully before continuing. Anyway, I added awareness of the async errors observable to our application, so it's not such a high priority now.

As for the index scan timeout, I think it would be nice if it could inherit what I pass in for the N1qlParams.serverSideTimeout with the query.


#12

@geraldss @keshav_m what do you think of the last point above? (forwarding the server-side timeout sent with the request to the indexer service as well)

@jberglund yes, we might re-unify these multiple Observables into a single Observable<N1qlRow> in which errors are represented as exceptions triggering onError.

What about other metadata that only comes once all rows have been streamed from the server, like the metrics? We were thinking it could be made part of the last N1qlRow, with a special sub-interface like LastN1qlRow on which the metrics can also be retrieved… But if you do e.g. a take(1) on the row stream, you'll lose that information :frowning:
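To make that take(1) concern concrete, here's a plain-Java sketch in which a List stands in for the row stream and the Row/LastRow interfaces are just illustrative names, not actual SDK types:

```java
import java.util.List;

public class TakeOneSketch {
    interface Row { String json(); }
    // Illustrative terminal row subtype that would also carry the metrics.
    interface LastRow extends Row { String metrics(); }

    public static void main(String[] args) {
        Row r1 = () -> "{\"id\":1}";
        LastRow last = new LastRow() {
            public String json() { return "{\"id\":2}"; }
            public String metrics() { return "{\"resultCount\":2}"; }
        };
        List<Row> stream = List.of(r1, last);

        // take(1): only the first element is consumed, so the terminal row
        // carrying the metrics is never observed by the subscriber.
        List<Row> taken = stream.subList(0, 1);
        System.out.println("saw metrics: "
                + (taken.get(taken.size() - 1) instanceof LastRow));
    }
}
```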

Also, this can't happen outside of a MAJOR version number change, as it would completely change the API, sadly.


#13

@simonbasle @jberglund in the future, we may add a parameter for index scan timeout and then pass that along to the indexer.