Explanation on Primary Scan / Secondary Scan


#1

Hi,

I am a little perplexed by the execution time of queries like SELECT META(myBucket).id FROM myBucket WHERE META(myBucket).id LIKE "prefix:%"…

The execution time of this query explodes as the number of documents in the bucket grows, as if it were doing a full scan of the bucket instead of using the primary index, as I would expect.

My test goes as follows:

  • Create 500 documents whose ids start with prefix:
  • Running the SELECT above takes around 800 ms on average on my local one-node vagrant cluster
  • Now add 100000 documents whose ids start with a different, non-shared prefix, anotherPrefix:
  • The query above now takes 12 s on average

I would expect the query to execute in roughly the same amount of time; at least, that is my understanding of how a primary index on the document id should work, allowing it to quickly find the matching ids by pattern.

The explain on the query shows it uses the primary index:
[{"~children":[{"namespace":"default","keyspace":"myBucket","index":"primaryindex","#operator":"PrimaryScan","using":"gsi"},{"~child":{"~children":[{"namespace":"default","keyspace":"myBucket","#operator":"Fetch"},{"#operator":"Filter","condition":"((meta(myBucket).id) like \"prefix:%\")"},{"#operator":"InitialProject","result_terms":[{"expr":"(meta(myBucket).id)"}]},{"#operator":"FinalProject"}],"#operator":"Sequence"},"#operator":"Parallel"}],"#operator":"Sequence"}]

Am I doing something completely wrong?

Interestingly, I also added the document id as a field in my documents and created a secondary index on it. In this case, the query SELECT id FROM myBucket WHERE id LIKE "prefix:%" only takes 200 ms on average.
The explain is similar to the primary one:
[{"~children":[{"keyspace":"myBucket","spans":[{"Range":{"High":["\"prefix;\""],"Low":["\"prefix:\""],"Inclusion":1},"Seek":null}],"#operator":"IndexScan","using":"gsi","limit":9.223372036854776E18,"namespace":"default","index":"secondaryIndexOnId"},{"~child":{"~children":[{"namespace":"default","keyspace":"myBucket","#operator":"Fetch"},{"#operator":"Filter","condition":"((myBucket.id) like \"prefix:%\")"},{"#operator":"InitialProject","result_terms":[{"expr":"(myBucket.id)"}]},{"#operator":"FinalProject"}],"#operator":"Sequence"},"#operator":"Parallel"}],"#operator":"Sequence"}]

The main difference is the range scan added to the index scan, which makes the query run in a constant time of about 200 ms, with or without the extra 100000 documents…

Why is the range scan not added when using the META().id query style?

Many thanks for your help.


#2

Hi, this optimization will be added. For now, the workaround is to create a secondary index, as you observed. But yes, you are right that a range scan should be performed on the primary index. Thanks.
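For reference, a minimal sketch of that workaround, assuming (as in the original post) that the document key is duplicated into an id field on a bucket named myBucket:

```sql
-- Secondary index on the duplicated id field (index name matches the
-- one shown in the earlier explain output).
CREATE INDEX secondaryIndexOnId ON myBucket(id) USING GSI;

-- The planner can turn this LIKE into an index range scan
-- (Low = "prefix:", High = "prefix;"), unlike META().id on 4.0.
SELECT id FROM myBucket WHERE id LIKE "prefix:%";
```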


#3

Many thanks.
Any idea when this will be added to the query engine? Because this renders the primary index useless if it can only do full scans!
Regards.


#4

Primary indexes are optional. So if you don’t need primary indexes, you can just create the secondary indexes that you need. For that reason, this particular optimization is not urgent, so I would say several months away.

Thanks,
Gerald


#5

Hi @geraldss,

I am working on an employee-scheduling application, and I have a use case where employees can add their daily availability to a calendar; I then want to query this data by date range.

The availability unique key looks like this: employeeId::date (this guarantees that a user can create or overwrite only one availability, either available/unavailable, per day, which avoids duplicates). The availability object looks like this:

{
  "availability": "available",
  "note": "Lorem ipsum1",
  "type": "availability"
}

I didn't store the employeeId and date attributes in the availability object because I think this would just duplicate data, as these two fields are already stored in the unique key.

My question is about the best practice for querying a date range and indexing this data. I have employeeId and date in the compound key, so my date range query looks like this (startDate and endDate come from the query API call, e.g. /api/availabilities?startDate=20151108&endDate=20151219):

SELECT META().id AS id, SPLIT(META().id,'::')[1] AS date, SPLIT(META().id,'::')[0] AS employeeId, `${db.bucketName}`.* FROM `${db.bucketName}` WHERE type = '${DOC_TYPE}' AND SPLIT(META().id,'::')[1] BETWEEN '${startDate}' AND '${endDate}'

I have a secondary index on the type attribute, and I rely on the primary index for the date, as the date is stored as part of the compound key. Is my query correct for best performance if I keep the date in the compound key and query it with string functions, or should I store the employeeId and date attributes in the availability object and use secondary indexes on those fields? I like option one because I don't need to duplicate the employeeId and date data, but I am not sure it is the best solution for query performance. Thanks

Option 1:

key b87ebd2d-0967-49a9-b196-56bfa73f5191::20151212

{
  "availability": "available",
  "note": "Lorem ipsum1",
  "type": "availability"
}

query: use the secondary index on the type attribute and the primary index on the key

SELECT META().id AS id, SPLIT(META().id,'::')[1] AS date, SPLIT(META().id,'::')[0] AS employeeId, `${db.bucketName}`.* FROM `${db.bucketName}` WHERE type = '${DOC_TYPE}' AND SPLIT(META().id,'::')[1] BETWEEN '${startDate}' AND '${endDate}'

Option 2:

key b87ebd2d-0967-49a9-b196-56bfa73f5191::20151212

{
  "availability": "available",
  "note": "Lorem ipsum1",
  "employeeId": "b87ebd2d-0967-49a9-b196-56bfa73f5191",
  "date": "20151212",
  "type": "availability"
}

query: use secondary indexes on the type, employeeId, and date attributes

SELECT META().id AS id, date, employeeId, `${db.bucketName}`.* FROM `${db.bucketName}` WHERE type = '${DOC_TYPE}' AND date BETWEEN '${startDate}' AND '${endDate}'


#6

I would recommend option 2. Good call on the secondary indexes. You can also use composite indexes on (type, date) or (date, type).

Finally, if using 4.1, you can use composite indexes that contain all the attributes in your query. This bypasses the document lookup, and you can compare this performance vs. simpler indexes.
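A minimal sketch of those suggestions for option 2 (index names and the myBucket keyspace are placeholders, and literal values stand in for the application's template variables):

```sql
-- Composite index on (type, date): the leading type equality and the
-- date range can both be pushed into the index scan.
CREATE INDEX idx_type_date ON myBucket(type, date) USING GSI;

-- On 4.1, including employeeId lets the index cover the query below,
-- so the document Fetch step can be skipped entirely.
CREATE INDEX idx_type_date_emp ON myBucket(type, date, employeeId) USING GSI;

SELECT META().id AS id, date, employeeId
FROM myBucket
WHERE type = 'availability' AND date BETWEEN '20151108' AND '20151219';
```

Note the covered variant only works if the SELECT list sticks to indexed fields (plus META().id); selecting `myBucket`.* would still require the fetch.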


#7

I have a follow up question…

I have added 10000 documents with ids following the pattern "prefix:". Each of these documents has a field "key" containing its document id, and I also have a secondary index on the field "key".

Running the N1QL query SELECT key FROM myBucket WHERE key LIKE "prefix:%"

This takes roughly 4 seconds using the Java driver, and nearly the same directly in cbq (a couple of hundred milliseconds quicker at most)…

Running an explain shows it uses the secondary index with a range scan.

Is this to be expected? I would have expected this query to be quick, as it should only touch the secondary index in memory and never need to access the documents. The more documents I add, the slower this gets. Any ideas for improvement, or is there something I am doing wrong?

I tried using SELECT META(myBucket).id FROM myBucket WHERE key LIKE "prefix:%" in case the value returned from the index was the document id, but this seems to take the same amount of time.

Thanks,
Laurent.


#8

This is much slower than expected. How many results are you getting?

Are you using 4.1? 4.1 would definitely be faster for this query.


#9

I am getting 10000 keys out of 10000 documents…

And I also tried 10000 keys out of 50000 documents, where I had added 40000 documents whose key does not follow the pattern…

And in both cases this is consistent: it takes 4 s on average… so it does not grow with the number of keys.

In both cases, the explain looks like this:
cbq> explain select key from myBucket where key like "prefix:%";
{
    "requestID": "6cac0796-3b71-464c-a471-e274b044c129",
    "signature": "json",
    "results": [
        {
            "#operator": "Sequence",
            "~children": [
                {
                    "#operator": "IndexScan",
                    "index": "indexOnKey",
                    "keyspace": "myBucket",
                    "limit": 9.223372036854776e+18,
                    "namespace": "default",
                    "spans": [
                        {
                            "Range": {
                                "High": [
                                    "\"prefix;\""
                                ],
                                "Inclusion": 1,
                                "Low": [
                                    "\"prefix:\""
                                ]
                            },
                            "Seek": null
                        }
                    ],
                    "using": "gsi"
                },
                {
                    "#operator": "Parallel",
                    "~child": {
                        "#operator": "Sequence",
                        "~children": [
                            {
                                "#operator": "Fetch",
                                "keyspace": "myBucket",
                                "namespace": "default"
                            },
                            {
                                "#operator": "Filter",
                                "condition": "((myBucket.key) like \"prefix:%\")"
                            },
                            {
                                "#operator": "InitialProject",
                                "result_terms": [
                                    {
                                        "expr": "(myBucket.key)"
                                    }
                                ]
                            },
                            {
                                "#operator": "FinalProject"
                            }
                        ]
                    }
                }
            ]
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "11.250005ms",
        "executionTime": "11.146432ms",
        "resultCount": 1,
        "resultSize": 2086
    }
}

And in both cases, running the query directly in cbq gives similar execution times:
with 10000 documents in total:
"status": "success",
"metrics": {
    "elapsedTime": "3.479584237s",
    "executionTime": "3.4795336s",
    "resultCount": 10000,
    "resultSize": 528894
}
and with 50000 documents in total:
"status": "success",
"metrics": {
    "elapsedTime": "4.111441218s",
    "executionTime": "4.111224571s",
    "resultCount": 10000,
    "resultSize": 528894
}

This is all done with 4.0… Any ideas for improvement?

Thanks.


#10

You need to use 4.1.

Download the bits tomorrow.


#11

Thank you.
I saw 4.1 is now GA so will try.

But while running these tests many times, I encountered something new regarding index compaction. See Index Fragmentation.

Laurent.


#12

Hi,

I installed 4.1 GA to test covering indexes and I am a little baffled by some results…

My documents are something like:

{
  "key": "12344",
  "value": <could be anything in terms of type: string, number, JSON object, JSON array>
}

I create an index on key: CREATE INDEX idxkey ON myBucket(key) USING GSI;
The query EXPLAIN SELECT key FROM myBucket WHERE key LIKE "prefix:%" shows that it uses the covering index and will not do the fetch.

But the real query I would like to run is SELECT key FROM myBucket WHERE key LIKE "prefix:%" AND value IS VALUED.
So I created another index: CREATE INDEX idxkeywithvalue ON myBucket(key) WHERE value IS VALUED USING GSI;
In this case the explain shows it uses the newly created index, but does not treat it as a covering index, even though I am selecting only key, which is part of the index!

So finally, I created an index covering both key and value: CREATE INDEX idxkeyvalue ON myBucket(key, value) WHERE value IS VALUED USING GSI… And in this case the explain shows the expected result, which is to use the covering index for the query.

But in my last case, as value could be anything, I do not want it to be indexed!

Is it a flaw in the query engine that it does not take into account that the filter is already covered by the WHERE clause of the index?

I am not sure how to make this work, except maybe by adding another field, such as a boolean indicating whether the value is valued, and indexing on that.

Thanks.


#13

The following index should work:

CREATE INDEX idx ON mybucket(`key`, `value` IS VALUED) WHERE `value` IS VALUED;
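For reference, a sketch of a query this index should cover (backticks are needed because key and value are N1QL keywords):

```sql
-- With idx in place, EXPLAIN should show an IndexScan with no Fetch:
-- both `key` and the `value` IS VALUED expression come from the index,
-- without indexing the (arbitrarily typed) value itself.
SELECT `key` FROM mybucket WHERE `key` LIKE "prefix:%" AND `value` IS VALUED;
```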


#14

Great, this works, and I can see that the Fetch part has now disappeared from the explain. It only uses the covering index.

I will keep in mind for future reference that we can index expressions as well as values!

It has slightly improved my execution time (closer to 3 s on average rather than 4 s) on my vagrant test machine; cbq also runs in roughly the same amount of time. I guess I am at the limit of my little one-node test vagrant machine here.

Many thanks for your help.