How create or optimize Analytics Index with array values:

Hi Guys,

Need advice and suggestions. 

How to create an Analytics Index when you have array values in IN Clause

// My Query looks like this:

SELECT
 AVG(averageOrderSize) AS averageOrderSize,
 SUM(sales) AS totalSales,
 SUM(salesNotShipped) AS salesNotShipped,
 SUM(discountsGiven) AS totalDiscounts,
 SUM(qtyOrdered) AS totalOrders,
 ROUND((SUM(discountsGiven) / SUM(sales)) * 100, 0) AS percentDiscounts
FROM company_analytics.`product-aggregates`
WHERE `type` = 'product-aggregate'
AND entityId IN [
       "entityId::0000",
       "entityId::territory::0001",
       "entityId::territory::0002",
       ….
  ]
AND periodAt BETWEEN '2019-04-04T00:00:00.000Z' AND '2020-05-04T00:00:00.000Z'
GROUP BY `type`

So, the entityId can have more than 50+. So, all the docs that have id’s there will be retrieve.
Do you have any idea how can optimize my query also?

First, I’d create a dataset with a filter on “type” field:

CREATE DATASET company_analytics.`product-aggregates` ON ... WHERE `type` = 'product-aggregate'

Then let’s remove WHERE and GROUP BY from the query because the dataset would only contain documents of “product-aggregate” type. So the query would be:

SELECT
 AVG(averageOrderSize) AS averageOrderSize,
 SUM(sales) AS totalSales,
 SUM(salesNotShipped) AS salesNotShipped,
 SUM(discountsGiven) AS totalDiscounts,
 SUM(qtyOrdered) AS totalOrders,
 ROUND((SUM(discountsGiven) / SUM(sales)) * 100, 0) AS percentDiscounts
FROM company_analytics.`product-aggregates`
WHERE entityId IN [
       "entityId::0000",
       "entityId::territory::0001",
       "entityId::territory::0002"
  ]
AND periodAt BETWEEN '2019-04-04T00:00:00.000Z' AND '2020-05-04T00:00:00.000Z'

Then create an index on “periodAt” field and try running the query.

CREATE INDEX idx_periodAt ON company_analytics.`product-aggregates`(periodAt:STRING)

Alternatively you could try creating an index on entityId (remove index on periodAt first)

CREATE INDEX idx_entityId ON company_analytics.`product-aggregates`(entityId:STRING)

Hi @dmitry.lychagin,

Sorry for the late response, I already applied the index you suggested. Thanks.
Can you help me optimized my Analytics Query ?

So, the query above took 11s to execute:
Execution_Time

Do you have any idea how to optimized it. ?

Thanks :slight_smile:

  1. Which index was used for this query? idx_periodAt or idx_entityId?
  2. Could you run the following queries and paste the result?

SELECT COUNT(*) FROM company_analytics.product-aggregates;

SELECT COUNT(*) FROM company_analytics.product-aggregates WHERE entityId IN [
“entityId::0000”, “entityId::territory::0001”, “entityId::territory::0002”, … remaining values …
]

SELECT COUNT(*) FROM company_analytics.product-aggregates WHERE periodAt BETWEEN ‘2019-04-04T00:00:00.000Z’ AND ‘2020-05-04T00:00:00.000Z’

SELECT COUNT(*) FROM company_analytics.product-aggregates WHERE entityId IN [
“entityId::0000”, “entityId::territory::0001”, “entityId::territory::0002”, … remaining values …
] AND periodAt BETWEEN ‘2019-04-04T00:00:00.000Z’ AND ‘2020-05-04T00:00:00.000Z’

Thanks!

Hi @dmitry.lychagin

SELECT COUNT(*) FROM company_analytics.product-aggregates;
[
  {
    "$1": 1102315
  }
]
SELECT COUNT(*) FROM company_analytics.product-aggregates WHERE entityId IN [
“entityId::0000”, “entityId::territory::0001”, “entityId::territory::0002”, … remaining values …
]
[
  {
    "$1": 288651
  }
]
SELECT COUNT(*) FROM company_analytics.product-aggregates WHERE periodAt BETWEEN ‘2019-04-04T00:00:00.000Z’ AND ‘2020-05-04T00:00:00.000Z’
[
  {
    "$1": 957832
  }
]
SELECT COUNT(*) FROM company_analytics.product-aggregates WHERE entityId IN [
“entityId::0000”, “entityId::territory::0001”, “entityId::territory::0002”, … remaining values …
] AND periodAt BETWEEN ‘2019-04-04T00:00:00.000Z’ AND ‘2020-05-04T00:00:00.000Z’
[
  {
    "$1": 250053
  }
]

Your query’s WHERE clause (entityId IN … AND periodAt BETWEEN …) retains about 22% of the dataset (250,053/1,102,315). If you expect this kind of ratio in production then perhaps the full dataset scan would be faster than going through the index lookup. What’s the execution time of your query if you don’t have any indexes created? I.e this query:

SELECT
 AVG(averageOrderSize) AS averageOrderSize,
 SUM(sales) AS totalSales,
 SUM(salesNotShipped) AS salesNotShipped,
 SUM(discountsGiven) AS totalDiscounts,
 SUM(qtyOrdered) AS totalOrders,
 ROUND((SUM(discountsGiven) / SUM(sales)) * 100, 0) AS percentDiscounts
FROM company_analytics.product-aggregates
WHERE entityId IN [
       "entityId::0000",
       "entityId::territory::0001",
       "entityId::territory::0002"
        ....
  ]
AND periodAt BETWEEN '2019-04-04T00:00:00.000Z' AND '2020-05-04T00:00:00.000Z'

Hi @dmitry.lychagin,

Thanks man! I've tried your suggestion removing the **INDEX** of my DATASET

and run the above query again.
Here’s the execution time:
EXECTIME01
from 11s to 6s max. That’s Cool and weird :slight_smile:

Why is this so? I don’t fully understand removing the index is much faster :slight_smile:

Hi @dmitry.lychagin,

Thanks man, just want to add some clarification
regarding on the dataset index. So, the product-aggregates dataset
is used on other queries which needs Indexing.

It is possible and advisable to create another DATASET i.e (product-aggregates-2) with the same type with product-aggregates ?

Need some advice.

Thanks :slight_smile:

You can tell the query optimizer to avoid idx_periodAt by using /* +skip-index */ hint in your WHERE clause.

WHERE periodAt /* +skip-index */ BETWEEN ‘2019-04-04T00:00:00.000Z’ AND ‘2020-05-04T00:00:00.000Z’

Unfortunately I could not find a way to skip idx_entityId because it’s used in the IN operator. If you need to skip that index then you could create a second dataset (product-aggregates-2) with the same content (as you suggested) and not have any indexes on that dataset.

As for your previous question (performance of index access vs full dataset scan). Accessing a dataset via an index has some overhead, so it only pays off if your query fetches a small subset of all documents in the dataset. However if your query fetches a significant number of documents (22% in your case) then going through the index will just make it slower than getting those documents from the dataset directly.

Hi @lyndondonz,

I think I found a way to skip idx_entityId (in addition to skipping idx_periodAt index). Here’s the query:

SELECT
 AVG(averageOrderSize) AS averageOrderSize,
 SUM(sales) AS totalSales,
 SUM(salesNotShipped) AS salesNotShipped,
 SUM(discountsGiven) AS totalDiscounts,
 SUM(qtyOrdered) AS totalOrders,
 ROUND((SUM(discountsGiven) / SUM(sales)) * 100, 0) AS percentDiscounts
FROM company_analytics.`product-aggregates`
WHERE tostring(entityId) IN ["entityId::0000","entityId::territory::0001","entityId::territory::0002"]
AND tostring(periodAt) BETWEEN '2019-04-04T00:00:00.000Z' AND '2020-05-04T00:00:00.000Z'  

This query does not use any indexes, so with this query you don’t need to create the second dataset (product-aggregates-2).

Hi @dmitry.lychagin,

wow, cool.. thanks man :) 

but I’m a bit confused why TOSTRING() ?

Hi @dmitry.lychagin,

Sorry, to bother you. :frowning: Need some advice.

I think still gonna need the INDEX for small sub-set of data.
this query will just take 34.3ms to execute - (with index entityId and periodAt)

SELECT
     AVG(averageOrderSize) AS averageOrderSize,
     SUM(sales) AS totalSales,
    SUM(salesNotShipped) AS salesNotShipped,
    SUM(discountsGiven) AS totalDiscounts,
    SUM(qtyOrdered) AS totalOrders,
    ROUND((SUM(discountsGiven) / SUM(sales)) * 100, 0) AS percentDiscounts
FROM company_analytics.`product-aggregates`
WHERE entityId IN ['territory::000432']
AND periodAt BETWEEN '2020-01-01T04:01:00.000Z' AND '2020-07-01T13:50:33.155Z'

but, this query - (NO NEED TO USE INDEX) in order to optimised it.
the entityId will have 100+ more of docKeys soon.

SELECT
 AVG(averageOrderSize) AS averageOrderSize,
 SUM(sales) AS totalSales,
 SUM(salesNotShipped) AS salesNotShipped,
 SUM(discountsGiven) AS totalDiscounts,
 SUM(qtyOrdered) AS totalOrders,
 ROUND((SUM(discountsGiven) / SUM(sales)) * 100, 0) AS percentDiscounts
FROM company_analytics.`product-aggregates`
WHERE entityId IN ["entityId::0000","entityId::territory::0001","entityId::territory::0002".... 100+ more]
AND  periodAt BETWEEN '2019-04-04T00:00:00.000Z' AND '2020-05-04T00:00:00.000Z'

My question is do you have any idea or approach we can manage to use the INDEX for both query scenario ?
really need help here.

Thanks :slight_smile:

Hi @lyndondonz,

Why do you want to use the index for the second query? I thought we established earlier that the second query would run faster with full dataset scan. Therefore we wanted to have indexes so the first query would use them, and I suggested to use tostring() workaround for the second query to ignore those indexes so it’d do full dataset scan.

Hi @dmitry.lychagin

Thanks for the TOSTRING() info. Appreciated.
Already use the TOSTRING()

Sorry, to confused you :slight_smile:

the Second Query is already good now. INDEXES should not be use to optimised
base on we established earlier :slight_smile:

So, regarding on the First Query I post.
there would be a case that WHERE entityId IN ['territory::000432'] is just 1 or 4 doc ids.
So, WHERE entityId IN ['territory::000432'] would use INDEXES so that execution time would be fast.
Because, that one doesn’t need a full scan.

So, my question is do you have any idea or approach we can manage to use the INDEX for both query scenario ?

Thanks and sorry for the confusion :frowning:

Hi @lyndondonz,

Sorry, but I’m still confused. When you say “… use the INDEX for both query scenario”, which two queries are you referring to? Could you just paste their respective WHERE clauses?

Currently if we have two above indexes idx_periodAt and idx_entityId then the query with the WHERE clause:

WHERE entityId IN [‘territory::000432’]
AND periodAt BETWEEN ‘2020-01-01T04:01:00.000Z’ AND ‘2020-07-01T13:50:33.155Z’

will use both indexes. If you look at the query plan then you’d see that the query searches idx_entityId and fetches document ids from it, searches idx_periodAt and fetches documents ids from it, then intersects those document ids, then fetches documents from the main dataset storage using these document ids. So both indexes are probed by this query.

Note that if

WHERE entityId IN [‘territory::000432’]

results in a small number of doc ids (1 or 4 as you said), then we can further optimize this query by not searching idx_periodAt and intersecting with its doc ids. I.e. there’s no point of searching the second index (idx_periodAt) if the first index (idx_entityId) already returns a small number of doc ids. So in this case you could either delete idx_periodAt index (if no other query needs it) or disable it just for this query using either skip-index hint or tostring() workaround:

WHERE entityId IN [‘territory::000432’]
AND periodAt /* +skip-index */ BETWEEN ‘2020-01-01T04:01:00.000Z’ AND ‘2020-07-01T13:50:33.155Z’

It is possible and advisable to create another DATASET i.e ( product-aggregates-2 ) with the same type with product-aggregates ?

Hi @oliverthom707,

It is possible, but in general not advisable. That’ why I was suggesting how to make the above queries work with a single dataset.