Bad performance with named parameters IN set / list


#1

Trying to used named parameters in a N1QL query and that seems to work fine for equals values, however, our query has IN and a List is provided, the query plan changes and performance is awful.

our query:

SELECT count(1) FROM `bucket` WHERE `user` is not null" +
            " AND `user`.`type` = $accountType" +
            " AND `user`.`id` = $accountId" +
            " AND split(meta().id, '::')[0] in ["e1", "e2", "e3", "e4"];

or (with replacement for IN)

SELECT count(1) FROM `bucket` WHERE `user` is not null" +
            " AND `user`.`type` = $accountType" +
            " AND `user`.`id` = $accountId" +
            " AND split(meta().id, '::')[0] in $environments";

Setting up the $accountType and $accountId work just fine, but $environments is a List of String. When that comes into play, the query plan changes and performance is very bad (timeout when using a direct string instead works in 100ms).

The String based query plan is like this:

{
    "plan":
    {
        "~children": [
        {
            "scan":
            {
                "keyspace": "bucket",
                "spans": [
                {
                    "Range":
                    {
                        "High": ["\"e1\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e1\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["\"e2\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e2\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["\"e3\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e3\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["\"e4\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e4\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                }],
                "#operator": "IndexScan",
                "using": "gsi",
                "namespace": "default",
                "index": "user_idx",
                "index_id": "619388af1e312812",
                "covers": ["cover ((split((meta(`bucket`).`id`), \"::\")[0]))", "cover (((`bucket`.`user`).`type`))", "cover (((`bucket`.`user`).`id`))", "cover (((`bucket`.`user`).`ttlInSec`))", "cover ((meta(`bucket`).`id`))"],
                "filter_covers":
                {
                    "cover (((`bucket`.`user`) is not null))": true
                }
            },
            "#operator": "DistinctScan"
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "Filter",
                    "condition": "(((cover (((`bucket`.`user`) is not null)) and (cover (((`bucket`.`user`).`type`)) = $accountType)) and (cover (((`bucket`.`user`).`id`)) = $accountId)) and (cover ((split((meta(`bucket`).`id`), \"::\")[0])) in [\"e1\", \"e2\", \"e3\", \"e4\"]))"
                },
                {
                    "group_keys": [],
                    "#operator": "InitialGroup",
                    "aggregates": ["count(1)"]
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        },
        {
            "group_keys": [],
            "#operator": "IntermediateGroup",
            "aggregates": ["count(1)"]
        },
        {
            "group_keys": [],
            "#operator": "FinalGroup",
            "aggregates": ["count(1)"]
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "InitialProject",
                    "result_terms": [
                    {
                        "as": "count",
                        "expr": "count(1)"
                    }]
                },
                {
                    "#operator": "FinalProject"
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        }],
        "#operator": "Sequence"
    },
    "text": "SELECT count(1) as count FROM `bucket` WHERE `user` is not null AND `user`.`type` = $accountType AND `user`.`id` = $accountId AND split(meta().id, '::')[0] in ['e1','e2','e3','e4']"
}

and the named parameter based list returns:

{
    "plan":
    {
        "~children": [
        {
            "keyspace": "bucket",
            "spans": [
            {
                "Range":
                {
                    "Low": ["null", "$accountType", "$accountId"],
                    "Inclusion": 0
                }
            }],
            "#operator": "IndexScan",
            "using": "gsi",
            "namespace": "default",
            "index": "user_idx",
            "index_id": "619388af1e312812",
            "covers": ["cover ((split((meta(`bucket`).`id`), \"::\")[0]))", "cover (((`bucket`.`user`).`type`))", "cover (((`bucket`.`user`).`id`))", "cover (((`bucket`.`user`).`ttlInSec`))", "cover ((meta(`bucket`).`id`))"],
            "filter_covers":
            {
                "cover (((`bucket`.`user`) is not null))": true
            }
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "Filter",
                    "condition": "(((cover (((`bucket`.`user`) is not null)) and (cover (((`bucket`.`user`).`type`)) = $accountType)) and (cover (((`bucket`.`user`).`id`)) = $accountId)) and (cover ((split((meta(`bucket`).`id`), \"::\")[0])) in $environments))"
                },
                {
                    "group_keys": [],
                    "#operator": "InitialGroup",
                    "aggregates": ["count(1)"]
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        },
        {
            "group_keys": [],
            "#operator": "IntermediateGroup",
            "aggregates": ["count(1)"]
        },
        {
            "group_keys": [],
            "#operator": "FinalGroup",
            "aggregates": ["count(1)"]
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "InitialProject",
                    "result_terms": [
                    {
                        "as": "count",
                        "expr": "count(1)"
                    }]
                },
                {
                    "#operator": "FinalProject"
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        }],
        "#operator": "Sequence"
    },
    "text": "SELECT count(1) as count FROM `bucket` WHERE `user` is not null AND `user`.`type` = $accountType AND `user`.`id` = $accountId AND split(meta().id, '::')[0] in $environments"
}

#2

IndexScans are based on range scans (i.e low, high).
The right side of the IN clause requires array (i.e key IN $environments ) if it is named/query parameter the value will not know at prepare time (because query plan is generated at prepare time). So query plan can’t convert the array values into multiple range values of OR clause. It request indexScan on whole range of that key (some times BETWEEN ARRAY_MIN($enviroments) AND ARRAY_MAX($environments) ) reapply predicates in query engine. This causes lot of false positives and impacts performance.

At present you have one of the following options.

  1. If you have finite array elements try individual element as query parameters
    ex: key IN [ $qp1, $qp2, $qp3]
  2. adhoc=true
  3. In example above you can use equality and repeat query for each IN element and add the result in client may perform better because If it uses IndexCountScan
  4. Also if user.type, user.id always equality predicates move those leading , followed by split(meta().id, ‘::’)[0] in inedex keys. This makes named parameters explain will have high range

#3
  1. Unfortunately, no. It varies. Although, we could generate the query to define each sub parameter.
  2. adhoc(true) was actually our query mode. Didn’t work.
  3. Yep, that’s similar to IN and would be similar to 1.
  4. That could help, but essentially would require a scan of the results… That winds up being slow.

I think it may be worth trying 1. :slight_smile:

Thanks - H


#4

You should try with 3 & 4

CREATE INDEX ix1 ON `bucket`(`user`.type, `user`.accountId, split(meta().id, '::')[0]);

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND split(meta().id, '::')[0]  = $e1;

In the client
        cnt = 0;
        for v IN environments
             set  e1 named parameter to v
             execute query
             cnt = cnt + result from query
       end for

#5

Oh making multiple queries? Later in our code we actually use the count to do other work (querying data again)… Best to keep it all together.

I’d prefer to do one of the following:

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND split(meta().id, '::')[0]  IN [ $e1, $e2, $e3, ....];

or

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND 
(split(meta().id, '::')[0]  = $e1 or split(meta().id, '::')[0]  = $e2 or split(meta().id, '::')[0]  = $e3 ....);

#6

In that case change the index suggested and try your original query you should see some improvement.

"keyspace": "bucket",
            "spans": [
            {
                "Range":
                {
                    "Low": ["null", "$accountType", "$accountId"],
                    "Inclusion": 0
                }
            }],

As you see above there is no high value (if any key high missing, following keys high will not pushed to indexer). example accountId=10 it gets all accountIds>=10 and applied filter. That will change with new index.

FYI: 5.0+ this has been improved


#7

I changed it to :

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND split(meta().id, '::')[0]  IN [ $e1, $e2, $e3, ....];

And the existing index is fine. The query plan is very similar to the first:

{
    "plan":
    {
        "~children": [
        {
            "scan":
            {
                "keyspace": "bucket",
                "spans": [
                {
                    "Range":
                    {
                        "High": ["$e0", "$accountType", "successor($accountId)"],
                        "Low": ["$e0", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["$e1", "$accountType", "successor($accountId)"],
                        "Low": ["$e1", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["$e2", "$accountType", "successor($accountId)"],
                        "Low": ["$e2", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["$e3", "$accountType", "successor($accountId)"],
                        "Low": ["$e3", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                }],
                "#operator": "IndexScan",
                "using": "gsi",
                "namespace": "default",
                "index": "user_idx",
                "index_id": "619388af1e312812",
                "covers": ["cover ((split((meta(`bucket`).`id`), \"::\")[0]))", "cover (((`bucket`.`user`).`type`))", "cover (((`bucket`.`user`).`id`))", "cover (((`bucket`.`user`).`ttlInSec`))", "cover ((meta(`bucket`).`id`))"],
                "filter_covers":
                {
                    "cover (((`bucket`.`user`) is not null))": true
                }
            },
            "#operator": "DistinctScan"
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "Filter",
                    "condition": "(((cover (((`bucket`.`user`) is not null)) and (cover (((`bucket`.`user`).`type`)) = $accountType)) and (cover (((`bucket`.`user`).`id`)) = $accountId)) and (cover ((split((meta(`bucket`).`id`), \"::\")[0])) in [$e0, $e1, $e2, $e3]))"
                },
                {
                    "group_keys": [],
                    "#operator": "InitialGroup",
                    "aggregates": ["count(1)"]
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        },
        {
            "group_keys": [],
            "#operator": "IntermediateGroup",
            "aggregates": ["count(1)"]
        },
        {
            "group_keys": [],
            "#operator": "FinalGroup",
            "aggregates": ["count(1)"]
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "InitialProject",
                    "result_terms": [
                    {
                        "as": "count",
                        "expr": "count(1)"
                    }]
                },
                {
                    "#operator": "FinalProject"
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        }],
        "#operator": "Sequence"
    },
    "text": "SELECT count(1) as count FROM `bucket` WHERE `user` is not null AND `user`.`type` = $accountType AND `user`.`id` = $accountId AND split(meta().id, '::')[0] in [$e0,$e1,$e2,$e3]"
}

Thanks so much for the suggestion! Big help to have less bad code.


#8

@vsr1 So, following up a bit. Although the query plan looks good, sometimes we are seeing these queries time out, still.

I found that manually querying IN […] within our application, we see timeouts. So, I take the same query to the Admin UI, and query - takes a long time. Then, I queried just for = one of the items in the list (for all the items). Those were fast. Then, queried the original query, and it’s fast…

Now, we could break up our query into 4 queries (or X as the case may be) as you mentioned earlier, but this seems pretty weird…

All of this is currently anecdotal - I need to be able to find / duplicate / etc the case.

-H


#9

@unhuman, Is the number of IN elements are more when timeout, what about load, you can increase the timeout. In 5.x lot of these are improved https://developer.couchbase.com/documentation/server/current/performance/index_pushdowns.html