Initial index performance

I have 3M documents in a bucket and I’m doing a simple query filtering on one indexed string field with an equality operator. The query returns only document IDs; the result count is usually less than 2000.
It’s a simple query like SELECT RAW META().id FROM bucket WHERE some_field = "some_value". Server version: 4.5.0-2601

When I try to query for a new value (something that hasn’t been queried before), then the query takes several seconds to complete (5.5 seconds for 1500 results). When I run the exact same query again, it takes less than 50ms to complete.

My question is - can I do something to improve query performance when filtering on new values, like an index warmup or something like that?

Thank you

You can use a composite covering index.

CREATE INDEX idx ON default(a,b,c);
SELECT b,c FROM default WHERE a = "some_value";

Thank you, I’ve tried that, but it still doesn’t help me. The first "uncached" query is still more than 10x slower than subsequent queries using the same filter value.

my index:
CREATE INDEX idx ON default(a,b,c) WHERE type = "x";

query:
SELECT c FROM default WHERE type = "x" AND a = "something" AND b = "something";

I’ve tried using meta().id and also some other field from the document as c, it’s about the same performance.

Can you check whether the query is using the newly created index:
EXPLAIN SELECT c FROM default WHERE type = "x" AND a = "something" AND b = "something";

Yes, it is. If it weren’t using the index, I think the query would take much longer or fail; there is no other index on these fields and no primary index. Subsequent queries run fine (response times in the tens of milliseconds); it’s the first query for a given filter value that takes 1-5 seconds (depending on result size). One user action consists of multiple queries like this, so the response times add up when querying for new values.

Here is the output of explain:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "IndexScan",
          "covers": [
            "cover ((`default`.`a`))",
            "cover ((`default`.`b`))",
            "cover ((`default`.`c`))",
            "cover ((meta(`default`).`id`))"
          ],
          "filter_covers": {
            "cover ((`default`.`type`))": "x"
          },
          "index": "idx",
          "index_id": "5cd62c8fec528092",
          "keyspace": "default",
          "namespace": "default",
          "spans": [
            {
              "Range": {
                "High": [
                  "\"something_A\"",
                  "successor(\"something_B\")"
                ],
                "Inclusion": 1,
                "Low": [
                  "\"something_A\"",
                  "\"something_B\""
                ]
              }
            }
          ],
          "using": "gsi"
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Filter",
                "condition": "(((cover ((`default`.`type`)) = \"x\") and (cover ((`default`.`a`)) = \"something_A\")) and (cover ((`default`.`b`)) = \"something_B\"))"
              },
              {
                "#operator": "InitialProject",
                "result_terms": [
                  {
                    "expr": "cover ((`default`.`c`))"
                  }
                ]
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        }
      ]
    },
    "text": "select c from default\nwhere type = \"x\"\nand a = \"something_A\"\nand b = \"something_B\";"
  }
]

Got a >10x performance improvement on uncached queries when using an SSD to store the indexes (from 5-10 seconds on queries returning 1500+ results down to 500ms). Not ideal, but much better than before. Subsequent queries are under 100ms. I guess MOIs would help, but those are not available in the Community Edition, unfortunately.

What is the type of index you have: Global Secondary Index or Memory-Optimized Index?

/cc @jliang @deepkaran.salooja

It’s global secondary index (can’t create MOI in community edition)

What is the indexer memory quota specified (UI -> Settings)? You can look up the "data size" of your index on the UI statistics page and make sure the memory quota specified is greater than that.

The index quota was 1024MB. I have created 2 indexes with the following sizes:

  1. data size 362MB, disk size 903MB (this is the one I’m mentioning in this post)
  2. data size 380MB, disk size 1050MB (same documents, different fields)

After dropping all indexes, increasing the memory quota to 2500 MB and recreating only the first index, the performance of the first query was the same as subsequent queries (it looks like the index fit entirely in memory), so that’s good.
However, after creating the second (different) index on the same node (without changing the quota), the performance of both indexes dropped significantly for uncached queries (30-60 seconds uncached vs 120ms cached on a query with 4300 results). After dropping one index and restarting the node, the performance of the first query was good again.
I tried the same scenario but created the second index on a different node, and the performance of both indexes was OK.

My question then is - what’s the difference between index data size and index disk size? How should I scale the index memory quota in order to fit all indexes in memory? Even when I set a higher quota than the sum of all index sizes, the index was still fetching data from disk. I did some more benchmarks, and indexes stored on SSDs have consistently better performance on first, uncached queries. Swap is always turned off.

I looked at index stats on nodes with indexes (2 nodes, each node holding 1 index, 2500MB index RAM quota):

  • node with index 1 - 362MB index data size, 594MB index disk size, ~5% max index RAM used (2.4G remaining), 40% index fragmentation
  • node with index 2 - 380MB index data size, 1.12GB index disk size, ~48% max index RAM used (1.33G remaining), 65% index fragmentation

I wonder why different nodes with similar indexes use very different amounts of RAM. All the nodes in the cluster have the same amount of RAM. I restarted both nodes just to be sure, and after a few minutes the result was the same again: ~5% max index RAM used on one node, ~45% on the other (the only difference was index disk size, which in the case of index 2 dropped to 595MB). The node with index 1 is CentOS + HDD; the node with index 2 is Debian + SSD. Not sure whether that’s relevant.

My question then is - what's the difference between index data size and index disk size?

Disk size includes the data size plus all the stale index/data blocks (due to fragmentation). The higher the fragmentation ratio, the greater the disk size.

The fragmentation on index 2 is 65%. Compaction may not have run, or may still be running in the background. This can impact scan performance and cause sub-optimal use of the cache.
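As a rough sanity check, the relationship between the sizes reported earlier in the thread can be reproduced, assuming the fragmentation percentage in the UI is the stale fraction of the on-disk file (my assumption, not something stated in this thread):

```python
# Rough sanity check of "disk size = data size + stale blocks",
# assuming the UI's fragmentation figure is stale_bytes / disk_bytes.
def estimated_fragmentation(data_mb, disk_mb):
    """Estimated fraction of the on-disk file that is stale (reclaimable)."""
    return 1 - data_mb / disk_mb

# Figures reported for index 2 above: 380 MB data size, 1.12 GB disk size.
frag = estimated_fragmentation(380, 1120)
print(f"index 2 estimated fragmentation: {frag:.0%}")
```

The result (~66%) lines up with the 65% fragmentation shown on the stats page for index 2, which suggests that interpretation of the numbers is about right.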

For 4.5.x, it is recommended to use circular block reuse to avoid compaction issues.
More details here.
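For reference, this is a sketch of how circular block reuse might be enabled cluster-wide via the REST API. The endpoint and parameter names are my recollection of the 4.5-era auto-compaction settings API, and the credentials, host, and schedule are placeholders; verify against the documentation for your exact version before using this:

```shell
# Assumed 4.5-era cluster-wide auto-compaction endpoint; parameter names
# may differ per server version -- treat this as a sketch, not gospel.
curl -u Administrator:password -X POST \
  http://localhost:8091/controller/setAutoCompaction \
  -d 'indexCompactionMode=circular' \
  -d 'indexCircularCompaction[daysOfWeek]=Monday,Wednesday,Friday' \
  -d 'indexCircularCompaction[interval][fromHour]=2' \
  -d 'indexCircularCompaction[interval][toHour]=6'
```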

How should I scale the index memory quota in order to fit all indexes in memory?

Only 40-60% of the memory quota goes to the cache (below a 4GB quota it is 40%, otherwise 60%). Ideally, the sum of the data of all indexes, when fully compacted, should fit in the cache; also allow some additional memory for overheads.
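The sizing rule above can be turned into a back-of-the-envelope calculation. This sketch assumes only what is stated in the thread (40% cache below a 4GB quota, 60% above); the 20% overhead margin is my own placeholder, not a documented figure:

```python
# Back-of-the-envelope indexer memory quota sizing, using the rule above:
# 40% of the quota is cache below a 4 GB quota, 60% at or above it.
def cache_mb(quota_mb):
    """Approximate cache available for index data at a given quota."""
    return quota_mb * (0.40 if quota_mb < 4096 else 0.60)

def min_quota_mb(total_data_mb, overhead_ratio=0.20):
    """Smallest quota whose cache holds all fully-compacted index data,
    plus a safety margin (overhead_ratio is an assumed fudge factor)."""
    needed = total_data_mb * (1 + overhead_ratio)
    quota = needed / 0.40            # try the sub-4 GB regime first
    if quota >= 4096:                # at 4 GB and above, 60% is cache
        quota = max(needed / 0.60, 4096)
    return quota

# The two indexes in this thread: 362 MB + 380 MB of compacted data.
print(f"quota needed: ~{min_quota_mb(362 + 380):.0f} MB")
```

By this estimate the original 1024 MB quota gave only ~410 MB of cache, not enough for the ~740 MB of combined index data, which is consistent with the behaviour described earlier in the thread.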