Help with Analytics Query - Optimize and Index

Hi,

I have the below query, which gives me the results I need. The problem is that the query runs for 6+ secs. My intention is to try and re-write the query to make it more efficient and use indexes where possible.

 WITH available_cb_docs AS ( WITH available_daily_docs AS (
        SELECT RAW td.DocNum
        FROM tranData td
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
        UNION ALL
        SELECT RAW mch.DocNum
        FROM tranData td
        UNNEST td.ActvCustOrdr AS mch
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    )
    SELECT DISTINCT raw DocNum
    FROM tranData
    WHERE DocNum IN (available_daily_docs)and `$MdfdTmstmp` between "{previous_date}" and "{report_end_date}") ,
    available_daily_report_docs AS (
    SELECT RAW td.DocNum
    FROM tranData td
    WHERE td.`$MdfdTmstmp` BETWEEN  "{report_date}" AND "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    UNION ALL
    SELECT RAW mch.DocNum
    FROM tranData td
    UNNEST td.ActvCustOrdr AS mch
    WHERE td.`$MdfdTmstmp` BETWEEN  "{report_date}" AND "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    )
    SELECT array_symdiff(available_daily_report_docs, available_cb_docs) AS cust_order_missing_docs;

One of the things I want to do is replace the Union query with a Variable so it can be executed only one. The problem I run into is with the DocNum IN (available_daily_docs) part. For whatever reason, whenever there is an IN clause, Analytics does not want to pick the index defined on DocNum. If I do DocNum = "123", the index is used and response is in milliseconds. But if I do DocNum in ["123", "456"], it takes 6 secs and does not use the index.

Any suggestion on optimizing this?

I am not analytic expert. Checkout Analytics Index with IN - #2 by dmitry.lychagin

Can you rewrite as follows without repeatedly scan of the data?

WITH available_report_docs AS (
        SELECT td.DocNum, td.`$MdfdTmstmp` AS timestamp
        FROM tranData td
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
        UNION ALL
        SELECT mch.DocNum, td.`$MdfdTmstmp` AS timestamp
        FROM tranData td
        UNNEST td.ActvCustOrdr AS mch
        WHERE td.`$MdfdTmstmp` BETWEEN "{report_date}" and "{report_end_date}"
        AND td.RteId = "{report_route}"
        AND td.`$Type`="DailyReport"
    ),
    available_cb_docs AS ( SELECT DISTINCT RAW c.DocNum
                           FROM available_report_docs AS c
                           WHERE c.timestamp BETWEEN "{previous_date}" and "{report_end_date}"),
    available_daily_report_docs AS ( SELECT DISTINCT RAW c.DocNum FROM available_report_docs AS c)
    SELECT ARRAY_SYMDIFF(available_daily_report_docs, available_cb_docs) AS cust_order_missing_docs;

adjust based on which date is earliest {previous_date} or “{report_date}”
If Analytic supports EXCEPT see if works for you

...............
SELECT RAW c1.DocNum FROM available_report_docs AS c1
 EXCEPT
 SELECT RAW c.DocNum FROM available_report_docs AS c WHERE c.timestamp BETWEEN "{previous_date}" and "{report_end_date}"

Hello Amrish:

Can you please let us know what version you are using.

Also, I tried a simple experiment in house.

select count () from dataset where col in [1];
and
select count (
) from dataset where col in [1, 2, 3];

In both cases an index plan was picked.

This is how I created the index.

create index index1 on dataset (col:int);

Please forward your create index statement.

Murali.

Hi Murali,

We are on CB 6.6.4 EE.

Index is defined as

missing_doc_ix (RteId:string)
($Type:string)
($MdfdTmstmp:string)

Below query picks up index as expected

Select `$MdfdTmstmp`,meta().id from tranData where `$Type` = "DailyReport" 
and `$MdfdTmstmp` > "2021-06-07" and RteId in ["91940"]
limit 5;

Note the IN cause for RteId but with only 1 item in the IN Array. Returned the result in ~50ms

 "expressions": ["index-search(\"missing_doc_ix\", 0, \"Default\", \"tranData\", FALSE, FALSE, 2, $$31, $$32, 2, $$33, $$34, TRUE, TRUE, TRUE)"
                                                                          ],

Now, if I run the same query again with multiple items in the IN clause. I just copied the same route again in the IN

Select `$MdfdTmstmp`,meta().id from tranData where `$Type` = "DailyReport" 
and `$MdfdTmstmp` > "2021-06-07" and RteId in ["91940","91940"]
limit 5;

Query ran for 10.15s and still got the 2 docs back. No index used. Happy to log a support ticket if that will help with the investigation. Bottom line for us is that I have never ever been able to get Analytics indexes to work with an IN clause.

Please create an additional index on only that one field – RteId.