Why does N1QL not use my new index...?

jda · February 29, 2020, 12:45pm

I have recently added some functionality to calculate stats on the fly. So I have a couple of queries that I want to run fast - and therefore, I have tried to create an index that should be the best match - but the query does not use it, so I must have misunderstood something

Here is a small sample with two queries that I thought would use the same index. First one for a specific user:

SELECT specieskey,SUM(count) count 
FROM data
WHERE type='Catch' AND userkey='2124DEFEC111BA8FC1257ED20034B387' 
AND specieskey IN ['12','1','2','13','25','4','49','38','6','7','41'] 
AND locationtype is valued 
AND statslocation is valued 
AND statspublic is valued 
AND statstrippublic is valued 
AND statstriptarget 
AND year is valued 
AND month is valued 
GROUP BY specieskey

And then the same for all users (for comparison):

SELECT specieskey,SUM(count) count 
FROM data
WHERE type='Catch' AND userkey is valued 
AND specieskey IN ['12','1','2','13','25','4','49','38','6','7','41'] 
AND locationtype is valued 
AND statslocation is valued 
AND statspublic 
AND statstrippublic 
AND statstriptarget 
AND year is valued 
AND month is valued 
GROUP BY specieskey

It is worth mentioning that there are some minor differences in the “visibility” for the docs (for me: ALL, for everybody else: ONLY Public). There are more fields not used here (but defined for more granular searches by year, location, etc.)

I have created an index that I thought would be used for both of the above queries:

CREATE INDEX stats_catch_species_count
ON data (specieskey,count,locationtype,year)
WHERE d.type='Catch' 
AND userkey is valued 
AND specieskey is valued
AND locationtype is valued 
AND statslocation is valued 
AND statspublic is valued
AND statstrippublic is valued
AND statstriptarget is valued
AND year is valued 
AND month is valued

But when the above queries run then the “Explain” shows that they don’t use this index:

The first query uses:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan3",
        "index": "def_type_user_1",
        "index_id": "ef1ea0928c66ab90",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "data",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"Catch\"",
                "inclusion": 3,
                "low": "\"Catch\""
              },
              {
                "high": "\"2124DEFEC111BA8FC1257ED20034B387\"",
                "inclusion": 3,
                "low": "\"2124DEFEC111BA8FC1257ED20034B387\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "keyspace": "data",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(((((((((((`data`.`type`) = \"Catch\") and ((`data`.`userkey`) = \"2124DEFEC111BA8FC1257ED20034B387\")) and ((`data`.`specieskey`) in [\"12\", \"1\", \"2\", \"13\", \"25\", \"4\", \"49\", \"38\", \"6\", \"7\", \"41\"])) and ((`data`.`locationtype`) is valued)) and ((`data`.`statslocation`) is valued)) and ((`data`.`statspublic`) is valued)) and ((`data`.`statstrippublic`) is valued)) and (`data`.`statstriptarget`)) and ((`data`.`year`) is valued)) and ((`data`.`month`) is valued))"
            },
            {
              "#operator": "InitialGroup",
              "aggregates": [
                "sum((`data`.`count`))"
              ],
              "group_keys": [
                "(`data`.`specieskey`)"
              ]
            }
          ]
        }
      },
      {
        "#operator": "IntermediateGroup",
        "aggregates": [
          "sum((`data`.`count`))"
        ],
        "group_keys": [
          "(`data`.`specieskey`)"
        ]
      },
      {
        "#operator": "FinalGroup",
        "aggregates": [
          "sum((`data`.`count`))"
        ],
        "group_keys": [
          "(`data`.`specieskey`)"
        ]
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "(`data`.`specieskey`)"
                },
                {
                  "as": "count",
                  "expr": "sum((`data`.`count`))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT specieskey,SUM(count) count \n    FROM data\n    WHERE type='Catch' AND userkey='2124DEFEC111BA8FC1257ED20034B387' \n    AND specieskey IN ['12','1','2','13','25','4','49','38','6','7','41'] \n    AND locationtype is valued \n    AND statslocation is valued \n    AND statspublic is valued \n    AND statstrippublic is valued \n    AND statstriptarget \n    AND year is valued \n    AND month is valued \n    GROUP BY specieskey"
}

The def_type_user_1 index is defined as:

CREATE INDEX def_type_user_1 ON data(type,userkey) WHERE (userkey is valued)

In this case it’s not really a problem as the query runs in approx. 20 ms…

For the second query, however, this DOES become a problem. It uses the same index:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan3",
        "index": "def_type_user_1",
        "index_id": "ef1ea0928c66ab90",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "data",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"Catch\"",
                "inclusion": 3,
                "low": "\"Catch\""
              },
              {
                "inclusion": 0,
                "low": "null"
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "keyspace": "data",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(((((((((((`data`.`type`) = \"Catch\") and ((`data`.`userkey`) is valued)) and ((`data`.`specieskey`) in [\"12\", \"1\", \"2\", \"13\", \"25\", \"4\", \"49\", \"38\", \"6\", \"7\", \"41\"])) and ((`data`.`locationtype`) is valued)) and ((`data`.`statslocation`) is valued)) and (`data`.`statspublic`)) and (`data`.`statstrippublic`)) and (`data`.`statstriptarget`)) and ((`data`.`year`) is valued)) and ((`data`.`month`) is valued))"
            },
            {
              "#operator": "InitialGroup",
              "aggregates": [
                "sum((`data`.`count`))"
              ],
              "group_keys": [
                "(`data`.`specieskey`)"
              ]
            }
          ]
        }
      },
      {
        "#operator": "IntermediateGroup",
        "aggregates": [
          "sum((`data`.`count`))"
        ],
        "group_keys": [
          "(`data`.`specieskey`)"
        ]
      },
      {
        "#operator": "FinalGroup",
        "aggregates": [
          "sum((`data`.`count`))"
        ],
        "group_keys": [
          "(`data`.`specieskey`)"
        ]
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "(`data`.`specieskey`)"
                },
                {
                  "as": "count",
                  "expr": "sum((`data`.`count`))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT specieskey,SUM(count) count \n    FROM data\n    WHERE type='Catch' AND userkey is valued \n    AND specieskey IN ['12','1','2','13','25','4','49','38','6','7','41'] \n    AND locationtype is valued \n    AND statslocation is valued \n    AND statspublic \n    AND statstrippublic \n    AND statstriptarget \n    AND year is valued \n    AND month is valued \n    GROUP BY specieskey"
}

This query takes ~10 seconds to run…

So what am I doing wrong here? I have tried to include all the fields used for the condition and for the resultset. And I also try to optimize it by naming all of the fields in the condition in the same order (just as …is valued) to make a closer match to the index.

Follow up question: Once this has been “fixed” I will need to run a similar query on CB Lite (.Net) - so what would be the best index on that platform?

I’m on Couchbase Server 6.0.0 Community Edition

jda · February 29, 2020, 1:09pm

Ok, I just tried to use the USE INDEX phrase - but that just made everything slower… - so something must be wrong with the index.

Tried this query:

SELECT specieskey,SUM(count) count 
    FROM data
    USE INDEX (stats_catch_species_count USING GSI)
    WHERE type='Catch' AND userkey is valued 
    AND specieskey IN ['12','1','2','13','25','4','49','38','6','7','41'] 
    AND locationtype is valued 
    AND statslocation is valued 
    AND statspublic 
    AND statstrippublic 
    AND statstriptarget 
    AND year is valued 
    AND month is valued 
    GROUP BY specieskey

I’ve seen it take from 8 to 24 seconds!

The explain looks like this (so I guess it doesn’t use my “hint”):

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan3",
        "index": "def_type_user_1",
        "index_id": "ef1ea0928c66ab90",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "data",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"Catch\"",
                "inclusion": 3,
                "low": "\"Catch\""
              },
              {
                "inclusion": 0,
                "low": "null"
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "keyspace": "data",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(((((((((((`data`.`type`) = \"Catch\") and ((`data`.`userkey`) is valued)) and ((`data`.`specieskey`) in [\"12\", \"1\", \"2\", \"13\", \"25\", \"4\", \"49\", \"38\", \"6\", \"7\", \"41\"])) and ((`data`.`locationtype`) is valued)) and ((`data`.`statslocation`) is valued)) and (`data`.`statspublic`)) and (`data`.`statstrippublic`)) and (`data`.`statstriptarget`)) and ((`data`.`year`) is valued)) and ((`data`.`month`) is valued))"
            },
            {
              "#operator": "InitialGroup",
              "aggregates": [
                "sum((`data`.`count`))"
              ],
              "group_keys": [
                "(`data`.`specieskey`)"
              ]
            }
          ]
        }
      },
      {
        "#operator": "IntermediateGroup",
        "aggregates": [
          "sum((`data`.`count`))"
        ],
        "group_keys": [
          "(`data`.`specieskey`)"
        ]
      },
      {
        "#operator": "FinalGroup",
        "aggregates": [
          "sum((`data`.`count`))"
        ],
        "group_keys": [
          "(`data`.`specieskey`)"
        ]
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "(`data`.`specieskey`)"
                },
                {
                  "as": "count",
                  "expr": "sum((`data`.`count`))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT specieskey,SUM(count) count \n    FROM data\n    USE INDEX (stats_catch_species_count USING GSI)\n    WHERE type='Catch' AND userkey is valued \n    AND specieskey IN ['12','1','2','13','25','4','49','38','6','7','41'] \n    AND locationtype is valued \n    AND statslocation is valued \n    AND statspublic \n    AND statstrippublic \n    AND statstriptarget \n    AND year is valued \n    AND month is valued \n    GROUP BY specieskey"
}

vsr1 · February 29, 2020, 2:26pm

The problem is there is extra d. (must be type - “Catch”) in the index where .

Try this index. (If needed use https://index-advisor.couchbase.com/indexadvisor/#1) then adjust index for partial index.

CREATE INDEX stats_catch_species_count ON data (specieskey, userkey, `count`,locationtype, year)
      WHERE dtype='Catch' ;

jda · February 29, 2020, 3:09pm

Oh boy… - yes I got that out of my query (adjusted it from another query with an unnest) - but overlooked it here!

But that works more like I would expect. Now it is using the index - and the time is down to 0,5-1 sec.

I also tried to create the index as you suggested. It uses the index when a userkey is specified - but it goes back to the def_type_user_1 again if userkey is not specified. I have even tried to remove the “… is valued” fields in the WHERE clause - but it does not make any difference. And using this index it now takes 5-8 seconds…

Your suggestion is actually exactly the same as I started out with (don’t recall if I had the “d.” issue in that index though). But as it was slower than expected I then decided to add all the fields that could be used in the resultset AND in the WHERE clause of the queries - and add them with “… is valued” and in the same order as in the index. But I don’t know if this is the best way to do it - although it seems to be faster in this case…

jda · February 29, 2020, 4:16pm

Sorry, I can see that I had an example testing here (after I wrote the question) that seems to make a difference - it includes all of the referenced fields (which I see could be confusing from my last message).

So I have tried to use this index:

CREATE INDEX stats_catch_species_count
ON data (type,userkey,specieskey,locationtype,statslocation,statspublic,statstrippublic,statstriptarget,year,month,count,length,weight)
WHERE type='Catch' 
AND userkey is valued 
AND specieskey is valued
AND locationtype is valued 
AND statslocation is valued 
AND statspublic is valued
AND statstrippublic is valued
AND statstriptarget is valued
AND year is valued 
AND month is valued

And the same in a simpler form:

CREATE INDEX stats_catch_species_count
ON data (type,userkey,specieskey,locationtype,statslocation,statspublic,statstrippublic,statstriptarget,year,month,count,length,weight)
WHERE type='Catch' 
AND userkey is valued

Both of these seem to work most efficient if I specify all of the fields in the where clause - in the same order. But the response times vary - so could just be a coincidence?

vsr1 · February 29, 2020, 7:27pm

You can specify USE INDEX hint,