Index scan timed out when getting all IDs of a document type

A simple query of

SELECT ALL META(general).id AS id FROM general WHERE type = "thread"

returns with

{"msg":"Index scan timed out - cause: Index scan timed out","code":12015}

on a bucket of 25 million documents spread across 4 nodes.

There are several indexes on general bucket, and only two “types” of documents exist: thread and post.

Existing indexes:
CREATE INDEXgeneral_typeONgeneral(type)

Is there any way to improve this query? Does the scan timeout might be due to the time required to send all IDs to the query client? The current scan timeout is 10 minutes. When the scan times out, the client returns with a portion of IDs, around 2.5 million.

Thanks,
Zoltán

You have following options.

  1. Increase the indexer level scan timeout setting @deepkaran.salooja

  2. The following index uses covered Scan and gives results in order by META().id
    
    CREATE INDEX ixthread ON general(META().id) WHERE type = "thread";
    
     id = ""
     limit = 50,000
    loop for ever
    SELECT META(general).id  AS id FROM general WHERE type = "thread" AND META().id > $id LIMIT $limit;
    If result size < $limit value then 
         break
    else
         id = last id in the results
    end if
    end loop
    

Also as you are getting so many documents reduce bytes transfer between client and query service

 1) Use RAW  This gives array of document ids instead of array of objects
           SELECT RAW META().id  FROM .....
  2) Use pretty=false
1 Like

1a. I have indexer scan timeout modified on each node.
1b. I have also modified queryEndpoints to 15 (defaults to 1) in the client. (No increase in throughput have been observed.)

  1. Does the second solution you have presented might be subjected to parallel execution? I’m using Spark to get all IDs to a single partition, repartition IDs across the cluster, then bulk-get documents with a higher throughput. It would work, however the initial fetch of IDs is a huge bottleneck. I will attempt to implement your solution to achieve a faster fetch.
    (The current limitation of the Couchbase Spark Connector is that it can only execute a query via the SDK in a single JVM - can not scale horizontally.)

However if there are 1024 virtual buckets in the general bucket, is there any way to access and acquire the content of a whole vBucket via any client API? I could read each vBucket to a Spark partition in parallel, including the data as well, and filter by threads during data processing. @vsr1 what do you think?

Thanks for your kind help!

Zoltán

@zoltan.zvara, you can try to increase the index service memory quota for better response(UI->Settings).

@deepkaran.salooja Tried that as well, there is just too much data transferred to a single the query client.