ORDER BY extremely slow when ordering by array value

Robert_Mirabelle · May 28, 2022, 6:14pm

Given 20k documents of this shape:

{
	"type": "report",
	"id": "12345",
	"answers": [
		{
			"q": "question1", 
			"v": "123", 
			"lbl": "Yes"
		},
		{
			"q": "question2", 
			"v": "234", 
			"lbl": "No"
		}
	]
}

And given this very simple query:

SELECT id, 
ARRAY a.v FOR a IN answers WHEN a.q = 'question1' END AS myAnswer
FROM bucket
WHERE type = 'report'
ORDER BY myAnswer

As soon as I add the ORDER BY clause, performance slows to a crawl (from 50ms to over 10 seconds).

I can find no way to create ANY index which will improve performance. I can create indexes for root-level properties (like type and id) that work, but none that improve sorting on answer.v

I tried filtering the results such that myAnswer is not null, but this didn’t help:

SELECT id, 
ARRAY a.v FOR a IN answers WHEN a.q = 'question1' END AS myAnswer
FROM bucket
WHERE type = 'report'
--- doesn't help
AND ANY a IN answers SATISFIES a.q='question1' AND a.v IS NOT NULL END
ORDER BY myAnswer

How on earth can I create an index that will improve sorting on answer.v?

vsr1 · May 28, 2022, 7:14pm

What version of CB you are using. If you have CB 7.1 can suggest different solution

index array elements and Sorting different array elements as to explicit sort (i.e produce all possible values).

This works only if you have equality q= “question1”
Check EXPLAIN and see if it covers and uses index order (I.e. NO Order operator at the end)

CREATE INDEX ix1 ON bucket ( ALL ARRAY [a.q, a.v] FOR a IN answers END, id ) WHERE type = "report";
SELECT a.id,fltr[1] AS myAnswer
FROM bucket AS b
UNNEST b.answers AS a
LET fltr = [a.q, a.v]
WHERE b.type = "report" AND fltr >= ["question1"] AND fltr < [SUCCESSOR("question1")
ORDER BY fltr

Robert_Mirabelle · May 29, 2022, 5:10pm

Thanks for the info. We’re using 7.0.

I created the index and query you suggested, but it provided no performance improvement at all. The query plan shows IndexScan (using the new index) as the first entry, but still takes over 7 seconds to execute with just 15k documents.

In addition, I have no idea what is meant by:

Index array elements and Sorting different array elements as to explicit sort (i.e produce all possible values).

The above isn’t a sentence, so it’s quite challenging to decipher. Are you suggesting that array elements should be indexed and that sorting must explicitly specified in the WHERE clause of the index via the use of >= and < operators?

I also have no idea what is meant by This works only if you have equality q= “question1”. Does this mean that you must filter by question1 in the WHERE clause of the index using only equality operators?

vsr1 · May 29, 2022, 5:18pm

Please post the explain
OR profile stats (execute query in UI query work bench, goto plan text tab and post the info from there)

Robert_Mirabelle · May 29, 2022, 6:34pm

I created the index you suggested (named aaa so that I had any hope of finding it quickly in the Indexes list, which always automatically limits the number of indexes shown to just 15).

My bucket is called npky and my document type is called 'pcr'

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "DistinctScan",
              "scan": {
                "#operator": "IndexScan3",
                "as": "b",
                "index": "aaa",
                "index_id": "87d26d7c3b3d9bf8",
                "index_projection": {
                  "primary_key": true
                },
                "keyspace": "npky",
                "namespace": "default",
                "spans": [
                  {
                    "exact": true,
                    "range": [
                      {
                        "high": "[successor(\"eTimes.03\")]",
                        "inclusion": 1,
                        "low": "[\"eTimes.03\"]"
                      }
                    ]
                  }
                ],
                "using": "gsi"
              }
            },
            {
              "#operator": "Fetch",
              "as": "b",
              "keyspace": "npky",
              "namespace": "default"
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Filter",
                    "condition": "(((`b`.`type`) = \"pcr\") and is_array((`b`.`answers`)))"
                  },
                  {
                    "#operator": "Unnest",
                    "as": "a",
                    "expr": "(`b`.`answers`)",
                    "filter": "(([\"eTimes.03\"] <= [(`a`.`q`), (`a`.`v`)]) and ([(`a`.`q`), (`a`.`v`)] < [successor(\"eTimes.03\")]) and (`a` is not missing))"
                  }
                ]
              }
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Let",
                    "bindings": [
                      {
                        "expr": "[(`a`.`q`), (`a`.`v`)]",
                        "var": "fltr"
                      }
                    ]
                  },
                  {
                    "#operator": "Filter",
                    "condition": "(([\"eTimes.03\"] <= `fltr`) and (`fltr` < [successor(\"eTimes.03\")]))"
                  },
                  {
                    "#operator": "InitialProject",
                    "result_terms": [
                      {
                        "expr": "(`a`.`id`)"
                      },
                      {
                        "as": "myAnswer",
                        "expr": "(`fltr`[1])"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "#operator": "Order",
          "limit": "30",
          "sort_terms": [
            {
              "expr": "`fltr`"
            }
          ]
        },
        {
          "#operator": "Limit",
          "expr": "30"
        }
      ]
    },
    "text": "SELECT a.id,fltr[1] AS myAnswer\r\nFROM npky AS b\r\nUNNEST b.answers AS a\r\nLET fltr = [a.q, a.v]\r\nWHERE b.type = \"pcr\" AND fltr >= [\"eTimes.03\"] AND fltr < [SUCCESSOR(\"eTimes.03\")]\r\nORDER BY fltr\r\nLIMIT 30"
  }
]

vsr1 · May 29, 2022, 8:11pm

Change ORDER BY [a.q, a.v] vs LET clause, change a.id to b.id Also post the index definition of aaa

SELECT b.id,fltr[1] AS myAnswer
FROM bucket AS b
UNNEST b.answers AS a
LET fltr = [a.q, a.v]
WHERE b.type = "report" AND fltr >= ["question1"] AND fltr < [SUCCESSOR("question1")
ORDER BY [a.q, a.v]

Robert_Mirabelle · May 30, 2022, 5:11pm

Thanks again. My index was missing the id column.

So now, I can select a single question to get its answer. But how can I now select multiple questions?

The pseudo-code of the query I’m attempting is:

SELECT id,
answer.v as 'question1' WHERE answer.q = 'question1', 
answer.v AS 'question2' WHERE answer.q = 'question2',
answer.v AS 'question3' WHERE answer.q = 'question3'
ORDER BY question3

Looking for results like the following:

[
     {"id": "12345", "question1": "123", "question2": "234", "question3": 345"},
     {"id": "98765", "question1": "987", "question2": null, "question3": 765"}
]

Tried the following (creating an additional fltr):

SELECT b.id, 
fltr[1] AS `question1`,
fltr2[1] AS `question2`
FROM bucket AS b
UNNEST b.answers AS a
LET fltr = [a.q, a.v], fltr2 = [a.q, a.v]
WHERE b.type = 'record' 
AND fltr >= ['question1'] AND fltr < [SUCCESSOR('question1')]
AND fltr2 >= ['question2'] AND fltr2 > [SUCCESSOR('question2')]
ORDER BY fltr (no need to order by [a.q, a.v])
LIMIT 30

But this returns the wrong data:

{
    "question1": "123",
    "question2": "123" -- should be '234',
    "id": "3QTTK1UT3VV6"
  }

As a side note, it would also be useful to understand what fltr is even doing here, especially with regard to >= ['question1'], etc. If fltr is an array, why are we applying equality operators >= to the entire array, rather than a single element of the array, something like fltr[0] > 'question1'. And what does SUCCESSOR('question1') actually mean?

Thanks in advance.

vsr1 · May 30, 2022, 11:49pm

As i mentioned earlier this works on single equality, If you looking results in different format it may not able to optimize.

The following link has some details. (MB-32506 or links attached to this Issue.

SELECT b.id, {fltr[0]:fltr2[1]}.*
FROM bucket AS b
UNNEST b.answers AS a
LET fltr = [a.q, a.v]
WHERE b.type = 'record'
AND ((fltr >= ['question1'] AND fltr < [SUCCESSOR('question1')])
     OR (fltr >= ['question2'] AND fltr > [SUCCESSOR('question2')])
     OR (fltr >= ['question3'] AND fltr > [SUCCESSOR('question3')]))
ORDER BY fltr
LIMIT 30;

If you are looking the all questions in same document

CREATE INDEX ix10 ON bucket(DISTINCT ARRAY a.q FOR a IN answers END) WHERE type = 'record';

SELECT b.id, obj.*
FROM bucket AS b
LET obj = OBJECT a.q:a.v FOR a IN b.answers WHEN a.q IN [ "question1", "question2", "question3"] END
WHERE b.type = 'record'
      AND ANY a IN b.answers SATISFIES a.q IN [ "question1", "question2", "question3"] END
ORDER BY obj.`question3`;

fltr = [a.q, a.v]
Takes values : [“question1”,1] , [“question1”,2] , [“question1”,30], [“question2”,0]
a.q = “question1” means it needs 3 values (this is scalar value comparision)
Now fltr is ARRAY of 2 values. Index selection/push the values index predicate expression must match exactly with index key i.e [a.q, a.v]
As we don’t know second part of array value only knows first part we are converting equality to range of 2 values any value in second position return

(fltr >= [‘question1’] AND fltr < [SUCCESSOR(‘question1’)]

Use with caution. In 7.1 this makes much easier. As you looking your output in different format this all may not work. Use one suggested ix10 and query

Robert_Mirabelle · May 31, 2022, 8:46pm

Tried this query, but it’s still completely dependent on the WHERE clause pre-filtering to fewer documents (fewer than our parsley 15K).

It now seems apparent that there is NO WAY to sort documents based on the value of an array in a performant way. This is DEEPLY disappointing as I consider it a basic use case.

Currently, the ONLY way I can sort without total loss of performance is to create an additional root-level property in EVERY document containing the answer for any question I may need to sort by.

For example, I now need to update my report document schema to look like the following:

{
    "type": "report",
    "id": "12345",
    "question1" : "123", --copied from answers
    "question2": "234" -- copied from answers
    "answers": [
        {
            "q": "question1", 
            "v": "123", 
            "lbl": "Yes"
        },
        {
            "q": "question2", 
            "v": "234", 
            "lbl": "No"
        },
        {
             "q": "question3",
             "v" "456",
             "lbl": "456"
        }
    ]
}

Then I can create explicit indexes for question1 AND question2, which will yield satisfactory sort performance for a query like:

SELECT id, 
ARRAY a.v FOR a IN answers WHEN a.q = 'question1' END AS myAnswer
FROM bucket
WHERE type = 'report'
ORDER BY question1

But this muddies up our schema by effectively duplicating data for the sole purpose of making query performance reasonable.

Also, this is an ad-hoc query for a front-end search engine, so I won’t know in advance which answer the user may want to sort on. This query seemed like a perfect candidate for the Analytics Service. So I tried replicating the query there, only to discover that Analytics appears to perform just as poorly when attempting to sort based on the value of an array.