Efficient projection of a single nested document array to multiple output aliases

rob.desbois · March 19, 2018, 3:54pm

I’ve written a query to retrieve a parent document by its key, and nest many child documents of specific types. The child documents contain the foreign key of the parent and exist in the same bucket.

The query I’ve got is akin to the following:

SELECT
   parent.*,
   ARRAY c FOR c IN children WHEN c.type = 'category-1' END AS ones,
   ARRAY c FOR c IN children WHEN c.type = 'category-2' END AS twos,
   ARRAY c FOR c IN children WHEN c.type = 'category-3' END AS threes
FROM `bucket` parent
USE KEYS 'parent-doc-key'
LEFT NEST `bucket` children ON KEY children.parent FOR parent;

There may be tens or hundreds of thousands of child documents, and there are currently only 5 type tags for them.
Profiling the execution time of this, it obviously traverses the children array once for every category. While this is unsurprising, it’s undesirable given the number of objects we may traverse multiple times.

How can I avoid the repeated traversals?
TIA – rob

vsr1 · March 19, 2018, 5:03pm

Try this

CREATE INDEX ix1 ON `bucket` (parent);

SELECT parent.*, (OBJECT v.type:v.arr FOR v IN childrens END)
FROM  `bucket` parent
USE KEYS 'parent-doc-key'
LET childrens = (SELECT c.type, ARRAY_AGG(c) AS arr FROM `bucket` AS c WHERE c.parent IN ['parent-doc-key'] GROUP BY c.type)

rob.desbois · March 19, 2018, 5:50pm

Thanks for the reply

It looks good, but I’m finding that query takes 50% longer to execute than my original version. The primary document uses a KeyScan as expected, but as far as I can tell the sub-query isn’t using the index; plan text shows no IndexScan, even if I add USE INDEX (ix1) to the sub-query:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "KeyScan",
        "keys": "\"parent-doc-key\""
      },
      {
        "#operator": "Fetch",
        "as": "parent",
        "keyspace": "bucket",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "maxParallelism": 1,
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Let",
              "bindings": [
                {
                  "expr": "(select (`c`.`type`), array_agg((`c`.`data`)) as `arr` from `bucket` as `c` where ((`c`.`parent`) in [\"parent-doc-key\"])  group by (`c`.`type`))",
                  "var": "children"
                }
              ]
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "`parent`",
                  "star": true
                },
                {
                  "expr": "object (`v`.`type`) : (`v`.`arr`) for `v` in `children` end"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT\n   parent.*,\n   (OBJECT v.type:v.arr FOR v IN children END)\nFROM `bucket` parent\nUSE KEYS 'parent-doc-key'\nLET children = (\n   SELECT c.type, ARRAY_AGG(c) AS arr\n   FROM `bucket` AS c\n   USE INDEX (ix1)\n   WHERE c.parent IN ['parent-doc-key']\n   GROUP BY c.type\n);"
}

Is there something I’ve missed? I’m familiar with RDBMS’s primarily so there’s quite a lot that’s new to me here.

vsr1 · March 19, 2018, 5:56pm

EPXLAIN includes subquery plan when it is in FROM clause.
Try

EXPLAIN on subquery. It should have used the index

EXPLAIN SELECT c.type, ARRAY_AGG(c) AS arr 
FROM  `bucket` AS c WHERE c.parent IN ['parent-doc-key'] GROUP BY c.type

rob.desbois · March 20, 2018, 4:38pm

Ah thanks, I hadn’t realised.

Using it on the sub-query alone does indeed show it uses the index:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "KeyScan",
        "keys": "\"parent-doc-key\""
      },
      {
        "#operator": "Fetch",
        "as": "parent",
        "keyspace": "bucket",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "maxParallelism": 1,
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Let",
              "bindings": [
                {
                  "expr": "(select (`c`.`type`), array_agg((`c`.`data`)) as `arr` from `bucket` as `c` where ((`c`.`parent`) in [\"parent-doc-key\"])  group by (`c`.`type`))",
                  "var": "children"
                }
              ]
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "`parent`",
                  "star": true
                },
                {
                  "expr": "object (`v`.`type`) : (`v`.`arr`) for `v` in `children` end"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT\n   parent.*,\n   (OBJECT v.type:v.arr FOR v IN children END)\nFROM `bucket` parent\nUSE KEYS 'parent-doc-key'\nLET children = (\n   SELECT c.type, ARRAY_AGG(c) AS arr\n   FROM `bucket` AS c\n   USE INDEX (ix1)\n   WHERE c.parent IN ['parent-doc-key']\n   GROUP BY c.type\n);"
}

If this is the case then why would it take longer than the nest approach that traverses the documents once per projected type? I’ve measured the full query-plus-subquery a number of times, with 5400 child documents; projecting with a WHEN clause in the OBJECT operator that projects only 1 of these docs has 642ms mean execution time; dropping the WHEN to project with all 5400 adds over 200ms to this.

My original query with multiple ARRAY operators in the projection had a mean execution time of ~540ms.

vsr1 · March 20, 2018, 6:09pm

GROUP of operation takes time