How to quickly list 1M document identifiers?

towler · March 13, 2019, 4:22pm

I have a bucket containing 1 million small documents. What I would like to do is quickly list all the document identifiers, either in a single query or a query that can be quickly paginated. By quickly I would expect this to run in <100ms or so.

The simplistic query is SELECT meta(t).id FROM testbucket order by meta(t).id - this takes around 4s, which seems very slow to me. Limiting to the first 1000 results is only a few microseconds, but if I use OFFSET to select a later page it takes around 4s again. Explain suggests it is an “primaryscan” - which I initially thought was an index scan, but now I am not so sure.

{
“plan”: {
“#operator”: “Sequence”,
“~children”: [
{
“#operator”: “Sequence”,
“~children”: [
{
“#operator”: “PrimaryScan3”,
“index”: “#primary”,
“index_projection”: {
“primary_key”: true
},
“keyspace”: “testbucket”,
“namespace”: “default”,
“using”: “gsi”
},
{
“#operator”: “Fetch”,
“keyspace”: “testbucket”,
“namespace”: “default”
},
{
“#operator”: “Parallel”,
“~child”: {
“#operator”: “Sequence”,
“~children”: [
{
“#operator”: “InitialProject”,
“result_terms”: [
{
“expr”: “(meta((testbucket.t)).id)”
}
]
}
]
}
}
]
},
{
“#operator”: “Order”,
“sort_terms”: [
{
“expr”: “(meta((testbucket.t)).id)”
}
]
},
{
“#operator”: “FinalProject”
}
]
},
“text”: “SELECT meta(t).id FROM testbucket order by meta(t).id”
}

I found OFFSET and Keyset Pagination in N1QL Query | Couchbase which discusses this, what I am unsure about though is how to create an effective index on meta.id - as they explain shows it is an index scan anyhow I am unsure if I can improve on this?

The query I would really like to run is

SELECT meta(t).id FROM testbucket AS t where doctype='CrystalStructure' order by meta(t).id

I tried creating the following index, but it didn’t change the speed.

CREATE INDEX DocumentTypeAndIDIndex ON testbucket(doctype, metadata.id)

I also tried the simpler SELECT doctype FROM testbucket order by doctype, given that I have an index on doctype alone, and that is still similarly slow.

I have wondered whether queries on meta.id are just not very optimal, and I should change my document to contain its own id so I can index and query on that - I am interested to know if anyone feels this would help?

vsr1 · March 13, 2019, 4:37pm

SELECT RAW meta(t).id FROM testbucket AS t ;

If you need Paginate give OFFSET x LIMIT x;
As there is no predicate it uses primary index which is ordered by document key.

If still want to use keyset pagination use following replace predicate last value from previous results.

SELECT RAW meta(t).id 
FROM testbucket AS t 
WHERE META(t).id > "xxxx" 
ORDER BY META(t).id 
LIMIT 100;

You can also use multiple queries with all possible values in LIKE (“a%”, “b%”, “A%”,…)
SELECT RAW meta(t).id
FROM testbucket AS t
WHERE META(t).id LIKE “a%”;

towler · March 14, 2019, 8:59am

Thanks. SELECT RAW means I don’t have to parse lots of JSON when the result is returned which is helpful, but it doesn’t make the query any faster - it’s still 4.4s for listing 1M identifiers (using query SELECT RAW meta(t).id FROM testbucket as t order by meta(t).id) despite the query plan stating it is an IndexScan3.

I can work round this in my application with a bit of thought, though I worry it is indicative of general query speeds which would make this database not very usable for my purposes.

mreiche · June 22, 2022, 4:43pm

It shouldn’t take 4.4s for 1M ids. I get 250ms for the query execution, and 1.350 seconds for the retrieval for a total of 1.6 seconds. Also, the first id will be returned in less than 1ms.
The sort is not necessary as id’s are already sorted.
Another option is creating a Map Reduce View as described here : python - how to get all the key from one bucket in couchbase? - Stack Overflow