Trouble with index

Ok, I need someone to explain this to me :slight_smile:

I have this index: def_type:

CREATE INDEX `def_type` ON `data`(`type`)

… and then I have docs. with this structure:

{
  "areakey": "1",
  "ispublic": true,
  "name": "Kattegat Nord",
  "type": "SeaLocalArea",
  "key": "1",
  "points": [
    {
      "lon": 10.1735065284,
      "lat": 56.7035020083
    },
    {
      "lon": 10.366207197,
      "lat": 57.5101324097
    },
    {
      "lon": 10.3695737543,
      "lat": 57.5999336965
    },
    : 
    : 
    {
      "lon": 12.0185944208,
      "lat": 57.3016435524
    }
  ]
}

Now, if I just write a query to find all of the points in total (over a handful of “types”):

SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN ["Lake","Stream","CoastLocalArea","SeaLocalArea","PutTakeLake"] and points is valued

I get this result:

[
  {
    "size": 42223
  }
]

All is good - except this query takes 1.5 second. The Explain looks like this:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "index": "def_type",
        "index_id": "aac6039ac62cc592",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "data",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"CoastLocalArea\"",
                "inclusion": 3,
                "low": "\"CoastLocalArea\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"Lake\"",
                "inclusion": 3,
                "low": "\"Lake\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"PutTakeLake\"",
                "inclusion": 3,
                "low": "\"PutTakeLake\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"SeaLocalArea\"",
                "inclusion": 3,
                "low": "\"SeaLocalArea\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"Stream\"",
                "inclusion": 3,
                "low": "\"Stream\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "keyspace": "data",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(((`data`.`type`) in [\"Lake\", \"Stream\", \"CoastLocalArea\", \"SeaLocalArea\", \"PutTakeLake\"]) and ((`data`.`points`) is valued))"
            },
            {
              "#operator": "InitialGroup",
              "aggregates": [
                "sum(array_count((`data`.`points`)))"
              ],
              "group_keys": []
            }
          ]
        }
      },
      {
        "#operator": "IntermediateGroup",
        "aggregates": [
          "sum(array_count((`data`.`points`)))"
        ],
        "group_keys": []
      },
      {
        "#operator": "FinalGroup",
        "aggregates": [
          "sum(array_count((`data`.`points`)))"
        ],
        "group_keys": []
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "as": "size",
                  "expr": "sum(array_count((`data`.`points`)))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN [\"Lake\",\"Stream\",\"CoastLocalArea\",\"SeaLocalArea\",\"PutTakeLake\"] and points is valued"
}

So I decided to create an index for this:

CREATE INDEX `def_points` ON `data`(`type`,`points`) where type IN ["Lake","Stream","CoastLocalArea","SeaLocalArea","PutTakeLake"] and points is valued

When I run the same query as above I get this:

[
  {
    "size": 40101
  }
]

And the Explain looks like this (which I think looks Ok):

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "DistinctScan",
        "scan": {
          "#operator": "IndexScan2",
          "covers": [
            "cover ((`data`.`type`))",
            "cover ((`data`.`points`))",
            "cover ((meta(`data`).`id`))"
          ],
          "filter_covers": {
            "cover (((`data`.`points`) is valued))": true,
            "cover (((`data`.`type`) in [\"Lake\", \"Stream\", \"CoastLocalArea\", \"SeaLocalArea\", \"PutTakeLake\"]))": true
          },
          "index": "def_points",
          "index_id": "e11843af240694d8",
          "keyspace": "data",
          "namespace": "default",
          "spans": [
            {
              "range": [
                {
                  "high": "\"CoastLocalArea\"",
                  "inclusion": 3,
                  "low": "\"CoastLocalArea\""
                },
                {
                  "inclusion": 0,
                  "low": "null"
                }
              ]
            },
            {
              "range": [
                {
                  "high": "\"Lake\"",
                  "inclusion": 3,
                  "low": "\"Lake\""
                },
                {
                  "inclusion": 0,
                  "low": "null"
                }
              ]
            },
            {
              "range": [
                {
                  "high": "\"PutTakeLake\"",
                  "inclusion": 3,
                  "low": "\"PutTakeLake\""
                },
                {
                  "inclusion": 0,
                  "low": "null"
                }
              ]
            },
            {
              "range": [
                {
                  "high": "\"SeaLocalArea\"",
                  "inclusion": 3,
                  "low": "\"SeaLocalArea\""
                },
                {
                  "inclusion": 0,
                  "low": "null"
                }
              ]
            },
            {
              "range": [
                {
                  "high": "\"Stream\"",
                  "inclusion": 3,
                  "low": "\"Stream\""
                },
                {
                  "inclusion": 0,
                  "low": "null"
                }
              ]
            }
          ],
          "using": "gsi"
        }
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(cover (((`data`.`type`) in [\"Lake\", \"Stream\", \"CoastLocalArea\", \"SeaLocalArea\", \"PutTakeLake\"])) and cover (((`data`.`points`) is valued)))"
            },
            {
              "#operator": "InitialGroup",
              "aggregates": [
                "sum(array_count(cover ((`data`.`points`))))"
              ],
              "group_keys": []
            }
          ]
        }
      },
      {
        "#operator": "IntermediateGroup",
        "aggregates": [
          "sum(array_count(cover ((`data`.`points`))))"
        ],
        "group_keys": []
      },
      {
        "#operator": "FinalGroup",
        "aggregates": [
          "sum(array_count(cover ((`data`.`points`))))"
        ],
        "group_keys": []
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "as": "size",
                  "expr": "sum(array_count(cover ((`data`.`points`))))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN [\"Lake\",\"Stream\",\"CoastLocalArea\",\"SeaLocalArea\",\"PutTakeLake\"] and points is valued"
}

Why does the query not return the same nuber?

If I use this query instead (without specifying that the points should be valued):

SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN ["Lake","Stream","CoastLocalArea","SeaLocalArea","PutTakeLake"] 

Then it returns (as expected):

[
  {
    "size": 42223
  }
]

But this is probably due to it using the def_type index instead (and it takes 1.5-2 seconds):

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "index": "def_type",
        "index_id": "aac6039ac62cc592",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "data",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"CoastLocalArea\"",
                "inclusion": 3,
                "low": "\"CoastLocalArea\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"Lake\"",
                "inclusion": 3,
                "low": "\"Lake\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"PutTakeLake\"",
                "inclusion": 3,
                "low": "\"PutTakeLake\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"SeaLocalArea\"",
                "inclusion": 3,
                "low": "\"SeaLocalArea\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"Stream\"",
                "inclusion": 3,
                "low": "\"Stream\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "keyspace": "data",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "((`data`.`type`) in [\"Lake\", \"Stream\", \"CoastLocalArea\", \"SeaLocalArea\", \"PutTakeLake\"])"
            },
            {
              "#operator": "InitialGroup",
              "aggregates": [
                "sum(array_count((`data`.`points`)))"
              ],
              "group_keys": []
            }
          ]
        }
      },
      {
        "#operator": "IntermediateGroup",
        "aggregates": [
          "sum(array_count((`data`.`points`)))"
        ],
        "group_keys": []
      },
      {
        "#operator": "FinalGroup",
        "aggregates": [
          "sum(array_count((`data`.`points`)))"
        ],
        "group_keys": []
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "as": "size",
                  "expr": "sum(array_count((`data`.`points`)))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN [\"Lake\",\"Stream\",\"CoastLocalArea\",\"SeaLocalArea\",\"PutTakeLake\"]"
}

Note:
I have narrowed it down to 3 docs. not being included in the query with a total number of points making up the difference. But I have absolutely no clue why? There is nothing “special” about theses three docs - they are just missing

You might have lot entries in points and this makes index wide. There is limit on index keysize. If exceeds indexer skip indexing that document. Check indexer.log any skipped items.

Right index will be
CREATE INDEX def_count_points ON data(type, ARRAY_COUNT(points));

Oh yes, of course. And now we’re down to 130 ms :slight_smile:

Thanks!

if you need AND points IS VALUED and add this index WHERE

Yeah, wouldn’t that make sense? Otherwise I guess there would be entries with a count of zero, right?

Yes. results will not change but avoid them. Also make query covered and make it fast

Thanks, makes sense! :+1:

Hmmmm… This index is n0t used by the query:

CREATE INDEX def_count_points ON data(type, ARRAY_COUNT(points)) where points is valued;

Did I misunderstand you?

I’m on Community Edition 5.1.1 - if that matters…

It should.

   CREATE INDEX def_count_points ON data(type, ARRAY_COUNT(points)) WHERE points IS VALUED;
    SELECT sum(ARRAY_COUNT(points)) AS size
    FROM data
    WHERE type IN ["Lake","Stream","CoastLocalArea","SeaLocalArea","PutTakeLake"] AND points IS VALUED;

OR

CREATE INDEX def_count_points ON default(type, ARRAY_COUNT(points)) WHERE ARRAY_COUNT(points) > 0 ;
SELECT sum(ARRAY_COUNT(points)) AS size
FROM default
WHERE type IN ["Lake","Stream","CoastLocalArea","SeaLocalArea","PutTakeLake"] AND ARRAY_COUNT(points) > 0;

Ah, yes. That makes sense.

Not sure it matters though, as the response time in this case is identical - but it would give a smaller index. But for the query to use the index it would need to have the exact same extra conditions…

Thanks! I’m learning… :wink: :+1:

Query must have subset of index conditions to use so that it can answer correctly. If not it will not use that index because it result in wrong results.

Yeah, point taken! It makes sense… - but could result in more indeces if different subsets are queried - so one “just” needs to understand that :blush: