Is there any way to improve this query? Could the scan timeout be due to the time required to send all the IDs to the query client? The current scan timeout is 10 minutes; when the scan times out, the client returns only a portion of the IDs, around 2.5 million.
The following index uses a covered scan and returns results ordered by META().id:
CREATE INDEX ixthread ON general(META().id) WHERE type = "thread";
id = ""
limit = 50,000
loop forever
    SELECT META(general).id AS id FROM general WHERE type = "thread" AND META().id > $id LIMIT $limit;
    if result size < $limit then
        break
    else
        id = last id in the results
    end if
end loop
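The keyset-pagination loop above can be sketched in Python. This is a simulation only: `run_query` is a hypothetical stand-in for executing the N1QL statement through the SDK, using an in-memory set of IDs.

```python
def run_query(ids, last_id, limit):
    """Simulate: SELECT META().id FROM general WHERE type = "thread"
    AND META().id > $id LIMIT $limit (covered index returns sorted keys)."""
    return sorted(i for i in ids if i > last_id)[:limit]

def fetch_all_ids(ids, limit):
    """Keyset pagination: resume each page after the last key seen."""
    result, last_id = [], ""
    while True:
        page = run_query(ids, last_id, limit)
        result.extend(page)
        if len(page) < limit:   # fewer rows than LIMIT: this was the last page
            break
        last_id = page[-1]      # next page starts strictly after this key
    return result

docs = {"thread::1", "thread::2", "thread::3", "thread::4", "thread::5"}
print(fetch_all_ids(docs, limit=2) == sorted(docs))  # True
```

Because the covering index already yields keys in META().id order, resuming with `META().id > $id` never skips or repeats a document, and each query touches only one page of the index.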
Also, since you are fetching so many documents, reduce the bytes transferred between the client and the query service:
1) Use RAW. This returns an array of document IDs instead of an array of objects:
SELECT RAW META().id FROM .....
2) Use pretty=false to strip the formatting whitespace from the query response.
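A rough local illustration of why both tips matter, using plain JSON serialization (not the SDK): the default response is an array of pretty-printed objects, while RAW plus pretty=false yields a compact array of strings.

```python
import json

# 1,000 hypothetical document IDs
ids = ["thread::%d" % n for n in range(1000)]

# Default shape: array of objects, pretty-printed with indentation
objs_pretty = json.dumps([{"id": i} for i in ids], indent=4)

# RAW + pretty=false shape: flat array of strings, no whitespace
raw_compact = json.dumps(ids, separators=(",", ":"))

print(len(objs_pretty), len(raw_compact))
```

At 2.5 million IDs the per-row overhead of field names, braces, and indentation adds up to a substantial amount of extra network traffic, which is exactly what the covered scan was meant to avoid.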
1a. I have modified the indexer scan timeout on each node.
1b. I have also set queryEndpoints to 15 (defaults to 1) in the client. (No increase in throughput has been observed.)
Could the second solution you presented be executed in parallel? I'm using Spark to fetch all IDs into a single partition, repartition the IDs across the cluster, then bulk-get the documents at higher throughput. That works, but the initial fetch of IDs is a huge bottleneck. I will try to implement your solution to achieve a faster fetch.
(The current limitation of the Couchbase Spark Connector is that it can only execute a query via the SDK in a single JVM; it cannot scale horizontally.)
However, since the general bucket has 1024 virtual buckets, is there any way to read the entire contents of a vBucket through any client API? I could read each vBucket into a Spark partition in parallel, data included, and filter for threads during processing. @vsr1, what do you think?
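One way the keyset-paginated query itself could be parallelized, independent of vBucket access: split the META().id keyspace into disjoint ranges and let each Spark partition run the paginated query within its own range (adding `AND META().id <= $hi` to the WHERE clause). The sketch below is hypothetical; the boundary keys here come from a local sample list, whereas in practice you would need to sample real keys or use known key prefixes.

```python
def split_ranges(sample_ids, parts):
    """Split a sorted key sample into `parts` contiguous (low, high]
    ranges covering the whole keyspace. Each range can be scanned
    independently with: META().id > low AND META().id <= high."""
    keys = sorted(sample_ids)
    bounds = [keys[len(keys) * i // parts] for i in range(1, parts)]
    lows = [""] + bounds                 # "" sorts below every real key
    highs = bounds + ["\uffff"]          # sentinel above any real key
    return list(zip(lows, highs))

sample = ["thread::%04d" % n for n in range(100)]
for low, high in split_ranges(sample, 4):
    print(low, "->", high)
```

Each worker would then run the same pagination loop as before, but bounded to its (low, high] slice, so the single-client fetch bottleneck is spread across partitions while no ID is read twice.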