Creating Index when 2 documents having array fields as predicates are joined

Hi I have Document A as-

{
      "_id": "TimeSlots",
      "timeSlots": [
        {
          "startTime": "00:00:00"
          "endTime": "01:00:00",
        },
        {
          "startTime": "01:00:00"
          "endTime": "02:00:00",
        },
        {
          "startTime": "02:00:00"
          "endTime": "03:00:00",
        },
        {
          "startTime": "03:00:00"
          "endTime": "04:00:00",
        }............,

        {
          "startTime": "23:00:00"
          "endTime": "23:59:59",
        }
 ] }

And Document B As -

{
"_id": "User::33243"
"userID": 33243,
"type": "User"
"userStatuses": [
        {
          "status": "ONLINE",
          "statusChangedDate": "2017-07-20T08:20:00Z",
        },
        {
          "status": "OFFLINE",
          "statusChangedDate": "2017-07-20T09:21:00Z",
        },
        {
          "status": "ONLINE",
          "statusChangedDate": "2017-07-20T16:30:00Z",
        },
       {
        "status": "OFFLINE",
        "statusChangedDate": "2017-07-20T09:21:00Z",
      }
  ]
}

What I’m trying to do is, to retrieve a list of all the timeslots having a corresponding count of the users Online at a particular day.

I designed the following query but it’s too slow and taking approx 2-3 minutes in execution.

Select  TS.startTime || ' - ' || TS.endTime as Duration, Count(U.userID) as Users_Online
                From example U
                Unnest U.userStatuses S
                Join example T on Keys 'TimeSlots'
                Unnest T.timeSlots TS 
                Where U.type = 'User'
                And T.type = 'TimeSlots'
                And DATE_FORMAT_STR(S.statusChangedDate, '1111-11-11') = '2017-07-20'             
                And (DATE_FORMAT_STR(S.statusChangedDate, '01:01:01') >= TS.startTime 
                         And DATE_FORMAT_STR(S.statusChangedDate, '01:01:01') < TS.endTime) 
               And S.status = 'ONLINE'                                  
Group by TS.startTime, TS.endTime
Order by TS.startTime;

Can anyone suggest me how to create an index for this query so that its execution speed improves?

Thanks,
Krishan

Try this query. If you have still issues post the EXPLAIN output of the query.

CREATE INDEX ix1 ON example (DISTINCT ARRAY DATE_FORMAT_STR(us.statusChangedDate, '1111-11-11') FOR us IN userStatuses END) WHERE type = "User";
SELECT TS.startTime || ' - ' || TS.endTime AS Duration, Count(U.userID) AS Users_Online
FROM example AS U
UNNEST U.userStatuses AS us
UNNEST (SELECT RAW T.timeSlots FROM example AS T USE KEYS "TimeSlots" WHERE T.type = 'TimeSlots')[0] AS TS
WHERE U.type = 'User'
      AND us.status = 'ONLINE'
      AND DATE_FORMAT_STR(us.statusChangedDate, '1111-11-11') = '2017-07-20'
      AND (DATE_FORMAT_STR(us.statusChangedDate, '01:01:01') >= TS.startTime
      AND DATE_FORMAT_STR(us.statusChangedDate, '01:01:01') < TS.endTime)
GROUP BY TS.startTime, TS.endTime
ORDER BY TS.startTime;

There is improvement but its still taking 45-50 seconds.

EXPLAIN output:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "IntersectScan",
              "scans": [
                {
                  "#operator": "IndexScan",
                  "index": "idx_example_type",
                  "index_id": "faf5d26b152a30c2",
                  "keyspace": "example",
                  "namespace": "default",
                  "spans": [
                    {
                      "Exact": true,
                      "Range": {
                        "High": [
                          "\"User\""
                        ],
                        "Inclusion": 3,
                        "Low": [
                          "\"User\""
                        ]
                      }
                    }
                  ],
                  "using": "gsi"
                },
                {
                  "#operator": "DistinctScan",
                  "scan": {
                    "#operator": "IndexScan",
                    "index": "idx_example_statusChangedDate",
                    "index_id": "d8a1e16fc0376475",
                    "keyspace": "example",
                    "namespace": "default",
                    "spans": [
                      {
                        "Exact": true,
                        "Range": {
                          "High": [
                            "\"2017-07-20\""
                          ],
                          "Inclusion": 3,
                          "Low": [
                            "\"2017-07-20\""
                          ]
                        }
                      }
                    ],
                    "using": "gsi"
                  }
                }
              ]
            },
            {
              "#operator": "Fetch",
              "as": "U",
              "keyspace": "example",
              "namespace": "default"
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Unnest",
                    "as": "us",
                    "expr": "(`U`.`userStatuses`)"
                  }
                ]
              }
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Unnest",
                    "as": "TS",
                    "expr": "((select raw (`T`.`timeSlots`) from `example` as `T` use keys \"TimeSlots\" where ((`T`.`type`) = \"TimeSlots\"))[0])"
                  }
                ]
              }
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Filter",
                    "condition": "(((((`U`.`type`) = \"User\") and ((`us`.`status`) = \"ONLINE\")) and (date_format_str((`us`.`statusChangedDate`), \"1111-11-11\") = \"2017-07-20\")) and (((`TS`.`startTime`) <= date_format_str((`us`.`statusChangedDate`), \"01:01:01\")) and (date_format_str((`us`.`statusChangedDate`), \"01:01:01\") < (`TS`.`endTime`))))"
                  },
                  {
                    "#operator": "InitialGroup",
                    "aggregates": [
                      "count((`U`.`userID`))"
                    ],
                    "group_keys": [
                      "(`TS`.`startTime`)",
                      "(`TS`.`endTime`)"
                    ]
                  }
                ]
              }
            },
            {
              "#operator": "IntermediateGroup",
              "aggregates": [
                "count((`U`.`userID`))"
              ],
              "group_keys": [
                "(`TS`.`startTime`)",
                "(`TS`.`endTime`)"
              ]
            },
            {
              "#operator": "FinalGroup",
              "aggregates": [
                "count((`U`.`userID`))"
              ],
              "group_keys": [
                "(`TS`.`startTime`)",
                "(`TS`.`endTime`)"
              ]
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "InitialProject",
                    "result_terms": [
                      {
                        "as": "Duration",
                        "expr": "(((`TS`.`startTime`) || \" - \") || (`TS`.`endTime`))"
                      },
                      {
                        "as": "Users_Online",
                        "expr": "count((`U`.`userID`))"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "#operator": "Order",
          "sort_terms": [
            {
              "expr": "(`TS`.`startTime`)"
            }
          ]
        },
        {
          "#operator": "FinalProject"
        }
      ]
    },
    "text": "SELECT TS.startTime || ' - ' || TS.endTime AS Duration, Count(U.userID) AS Users_Online\r\nFROM example AS U\r\nUNNEST U.userStatuses AS us\r\nUNNEST (SELECT RAW T.timeSlots FROM example AS T USE KEYS \"TimeSlots\" WHERE T.type = 'TimeSlots')[0] AS TS\r\nWHERE U.type = 'User'\r\n      AND us.status = 'ONLINE'\r\n      AND DATE_FORMAT_STR(us.statusChangedDate, '1111-11-11') = '2017-07-20'\r\n      AND (DATE_FORMAT_STR(us.statusChangedDate, '01:01:01') >= TS.startTime\r\n      AND DATE_FORMAT_STR(us.statusChangedDate, '01:01:01') < TS.endTime)\r\nGROUP BY TS.startTime, TS.endTime\r\nORDER BY TS.startTime;"
  }
]

Drop the index idx_example_type and try again.

all of the queries checks status = ONLINE you can move the condition to index like below

CREATE INDEX ix2 ON example (DISTINCT ARRAY DATE_FORMAT_STR(us.statusChangedDate, '1111-11-11') FOR us IN userStatuses WHEN us.status = 'ONLINE' END) WHERE type = "User";

Also try the following query and see if this gives any better timing. Please make sure it is not using IntersectScan (by dropping other index or providing USE INDEX option).

SELECT TS.startTime || ' - ' || TS.endTime AS Duration, SUM(ARRAY_LENGTH(ac)) AS Users_Online
FROM example AS U
UNNEST (SELECT RAW T.timeSlots FROM example AS T USE KEYS "TimeSlots" WHERE T.type = 'TimeSlots')[0] AS TS
LET ac = ARRAY true FOR v IN U.userStatuses WHEN (U.UserID IS VALUED AND DATE_FORMAT_STR(v.statusChangedDate, '1111-11-11') = '2017-07-20' AND v.status = 'ONLINE' 
 AND DATE_FORMAT_STR(v.statusChangedDate, '01:01:01') >= TS.startTime AND DATE_FORMAT_STR(v.statusChangedDate, '01:01:01') < TS.endTime)   END
WHERE U.type = 'User'
      AND ANY us IN U.userStatuses SATISFIES DATE_FORMAT_STR(us.statusChangedDate, '1111-11-11') = '2017-07-20' AND us.status = 'ONLINE' END
      AND (ARRAY_LENGTH(ac) > 0 OR  U.userID IS NOT VALUED)
GROUP BY TS.startTime, TS.endTime
ORDER BY TS.startTime;

Hi,
This query is showing incorrect count. Since I’m not as good at N1QL, so I couldn’t figure out the problem with it. Apart from that I dropped the ‘type’ index and ran this query but the execution time is still same (approx 50 seconds). The InsertsectScan is not being used.
Sorry, I forgot to mention that total documents in this bucket are 6,836.
Also, another query will be needed for offline users as well.

I have updated query above. If you need OFFLINE too use ix1