Index very slow when using result of subquery

martin.hogge · January 24, 2019, 9:20am

Giving the following index:

CREATE INDEX idx_event_type_join ON DATA (ifmissingornull(eventTypeId, configurableMetadataId)) WHERE (type = "EVENT")

I have a complexe query using this index that I will simplify here for the purpose of the example:

SELECT DISTINCT RAW event.visitId FROM DATA event 
WHERE event.`type` = 'EVENT'
AND IFMISSINGORNULL(event.eventTypeId, event.configurableMetadataId) IN ['ET_CONSULTATION_EXT', 'ET_CONSULTATION_NUTRITION', 'ET_CONSULTATION_KINE', 'ET_CONSULTATION_OPHT']

Running this query is ok but if I replace the last line by this:

AND IFMISSINGORNULL(event.eventTypeId, event.configurableMetadataId) IN
	ARRAY_FLATTEN(
		(
			SELECT RAW ARRAY_ADD(IFMISSINGORNULL(vdo.options, []), META().id)
			FROM METADATA vdo USE KEYS ['ALL_CONSULT']
		),
		2
	)

The query execution time is then ~20x slower.
The subquery is returning the exact same thing as if I was using the hardcoded version of the query:

SELECT RAW ARRAY_FLATTEN((SELECT RAW ARRAY_ADD(IFMISSINGORNULL(vdo.options, []), META().id) FROM METADATA vdo USE KEYS ['SNIS_ALL_CONSULT']), 2)

I saw that the both queries are using the index idx_event_type_join but the second one has spans with an inclusion of 0 and low to null.

Here is te comparaison between the first and the second query plan (extracting only what is relevant here, if you need more please tell me):

First query plan (without subquery)

{
  "#operator": "IndexScan2",
  "index": "idx_event_type_join",
  "index_id": "9e35a46cfe282ef2",
  "index_projection": {
    "primary_key": true
  },
  "keyspace": "DATA",
  "namespace": "default",
  "spans": [
    {
      "exact": true,
      "range": [
        {
          "high": "\"ET_CONSULTATION_EXT\"",
          "inclusion": 3,
          "low": "\"ET_CONSULTATION_EXT\""
        }
      ]
    },
    {
      "exact": true,
      "range": [
        {
          "high": "\"ET_CONSULTATION_KINE\"",
          "inclusion": 3,
          "low": "\"ET_CONSULTATION_KINE\""
        }
      ]
    },
    {
      "exact": true,
      "range": [
        {
          "high": "\"ET_CONSULTATION_NUTRITION\"",
          "inclusion": 3,
          "low": "\"ET_CONSULTATION_NUTRITION\""
        }
      ]
    },
    {
      "exact": true,
      "range": [
        {
          "high": "\"ET_CONSULTATION_OPHT\"",
          "inclusion": 3,
          "low": "\"ET_CONSULTATION_OPHT\""
        }
      ]
    }
  ],
  "using": "gsi"
}

Second query plan (with subquery)

{
  "#operator": "IndexScan2",
  "index": "idx_event_type_join",
  "index_id": "9e35a46cfe282ef2",
  "index_projection": {
    "primary_key": true
  },
  "keyspace": "DATA",
  "namespace": "default",
  "spans": [
    {
      "range": [
        {
          "inclusion": 0,
          "low": "null"
        }
      ]
    }
  ],
  "using": "gsi"
}

I’m using Couchbase Server Community Edition 5.1.1 build 5723

Thanks for the help.

vsr1 · January 24, 2019, 1:29pm

Plan is generated before execution. Subquery results are not available until parent query produced single row (i.e. execution). So parent query need to work for all possible values. So it will get all the items form the index. That is why it is slow.

If right side of IN clause is dynamic, it can’t be pushed to indexer.
Best option will be split in 2 queries.
Query 1: Subquery, execute it, get results and pass them to Query 2 as query parameters
Query 2: Parent query with query parameter as results of Query 1, execute as adhoc=True

martin.hogge · January 24, 2019, 1:59pm

Thank you for the information. It’s clearer why the index doesn’t work as expected.

But I’m not sure to understand what you meant by:

execute ad adhoc=True

And just to be sure, their is no way of achieving this without running a subquery? Maybe some tricks using join? I ask it because the query is part of a set of queries that are all executed sequentially and aims to be atomic (without any dependencies to other query results).

vsr1 · January 24, 2019, 2:07pm

@martin.hogge

If you don’t want separate queries into 2, no other option.

If using 5.5 try ANSI JOIN , https://blog.couchbase.com/ansi-join-support-n1ql/

SELECT DISTINCT RAW event.visitId
FROM (SELECT RAW vu
      FROM METADATA vdo AS v USE KEYS ['ALL_CONSULT']
      UNNEST ARRAY_ADD(IFMISSINGORNULL(v.options, []), META().id) AS vu) AS f
INNER JOIN DATA AS event  ON f = IFMISSINGORNULL(event.eventTypeId, event.configurableMetadataId)
WHERE event.`type` = 'EVENT';

OR

SELECT DISTINCT RAW event.visitId
FROM (SELECT ARRAY_ADD(IFMISSINGORNULL(vdo.options, []), META().id) AS ev
            FROM METADATA vdo USE KEYS ['ALL_CONSULT']) AS m
INNER JOIN DATA event
ON IFMISSINGORNULL(event.eventTypeId, event.configurableMetadataId) IN m.ev
WHERE event.`type` = 'EVENT';

Scans LEFT side first and uses dynamic values from left during right side index scan as described in the article Example 3

ingenthr · January 24, 2019, 6:55pm

I believe what he was referring to here is that from an application using the SDK, there is a method of setting a query option that indicates it’s not adhoc, which will have the SDK automatically setup a prepared statement on the cluster.

This is covered in the documentation in the Querying with N1QL section (Java section, for example).