How to quickly list 1M document identifiers?


#1

I have a bucket containing 1 million small documents. What I would like to do is quickly list all the document identifiers, either in a single query or a query that can be quickly paginated. By quickly I would expect this to run in <100ms or so.

The simplistic query is SELECT meta(t).id FROM testbucket order by meta(t).id - this takes around 4s, which seems very slow to me. Limiting to the first 1000 results is only a few microseconds, but if I use OFFSET to select a later page it takes around 4s again. Explain suggests it is an “primaryscan” - which I initially thought was an index scan, but now I am not so sure.

{
“plan”: {
#operator”: “Sequence”,
“~children”: [
{
#operator”: “Sequence”,
“~children”: [
{
#operator”: “PrimaryScan3”,
“index”: “#primary”,
“index_projection”: {
“primary_key”: true
},
“keyspace”: “testbucket”,
“namespace”: “default”,
“using”: “gsi”
},
{
#operator”: “Fetch”,
“keyspace”: “testbucket”,
“namespace”: “default”
},
{
#operator”: “Parallel”,
“~child”: {
#operator”: “Sequence”,
“~children”: [
{
#operator”: “InitialProject”,
“result_terms”: [
{
“expr”: “(meta((testbucket.t)).id)”
}
]
}
]
}
}
]
},
{
#operator”: “Order”,
“sort_terms”: [
{
“expr”: “(meta((testbucket.t)).id)”
}
]
},
{
#operator”: “FinalProject”
}
]
},
“text”: “SELECT meta(t).id FROM testbucket order by meta(t).id”
}

I found https://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/ which discusses this, what I am unsure about though is how to create an effective index on meta.id - as they explain shows it is an index scan anyhow I am unsure if I can improve on this?

The query I would really like to run is

SELECT meta(t).id FROM testbucket AS t wheredoctype='CrystalStructure' order by meta(t).id

I tried creating the following index, but it didn’t change the speed.

CREATE INDEX DocumentTypeAndIDIndex ON testbucket(doctype, metadata.id)

I also tried the simpler SELECT doctype FROM testbucket order by doctype, given that I have an index on doctype alone, and that is still similarly slow.

I have wondered whether queries on meta.id are just not very optimal, and I should change my document to contain its own id so I can index and query on that - I am interested to know if anyone feels this would help?


#2

SELECT RAW meta(t).id FROM testbucket AS t ;

If you need Paginate give OFFSET x LIMIT x;
As there is no predicate it uses primary index which is ordered by document key.

If still want to use keyset pagination use following replace predicate last value from previous results.

SELECT RAW meta(t).id 
FROM testbucket AS t 
WHERE META(t).id > "xxxx" 
ORDER BY META(t).id 
LIMIT 100;

You can also use multiple queries with all possible values in LIKE (“a%”, “b%”, “A%”,…)
SELECT RAW meta(t).id
FROM testbucket AS t
WHERE META(t).id LIKE “a%”;


#3

Thanks. SELECT RAW means I don’t have to parse lots of JSON when the result is returned which is helpful, but it doesn’t make the query any faster - it’s still 4.4s for listing 1M identifiers (using query SELECT RAW meta(t).id FROM testbucket as t order by meta(t).id) despite the query plan stating it is an IndexScan3.

I can work round this in my application with a bit of thought, though I worry it is indicative of general query speeds which would make this database not very usable for my purposes.