Strange issue with index

I have a datastructure like this:

{
      "ispublic": true,
      "key": "187",
      "name": "Moserne i Nikkelborg (v. Nørrekær)",
      "points": [
        {
          "lat": 56.5084931472,
          "lon": 8.9840306006
        }
      ],
      "revisioninfo": {
        "created": "2019-03-28T16:31:19+0100",
        "createdby": "John Dalsgaard/BA171123846CEBF1C1257CB2002DA330/Fangst",
        "modifiedcount": 0
      },
      "type": "Lake"
    }

I would like to count the geo points. I have 5 data types that contain these points arrays that could have one entry (like here) or more.

I use this query:

SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN ["Lake","Stream","CoastLocalArea","SeaLocalArea","PutTakeLake"]

I have created two index like this:

CREATE INDEX `def_count_points` ON `data`(`type`,array_count(`points`)) WHERE (`points` is valued)

I would expect the query to use that index… But it prefers this:

CREATE INDEX `def_type` ON `data`(`type`)

The Explain looks like this:

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan3",
        "index": "def_type",
        "index_id": "941b5fdba8f0e7f8",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "data",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"CoastLocalArea\"",
                "inclusion": 3,
                "low": "\"CoastLocalArea\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"Lake\"",
                "inclusion": 3,
                "low": "\"Lake\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"PutTakeLake\"",
                "inclusion": 3,
                "low": "\"PutTakeLake\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"SeaLocalArea\"",
                "inclusion": 3,
                "low": "\"SeaLocalArea\""
              }
            ]
          },
          {
            "exact": true,
            "range": [
              {
                "high": "\"Stream\"",
                "inclusion": 3,
                "low": "\"Stream\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "keyspace": "data",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "((`data`.`type`) in [\"Lake\", \"Stream\", \"CoastLocalArea\", \"SeaLocalArea\", \"PutTakeLake\"])"
            },
            {
              "#operator": "InitialGroup",
              "aggregates": [
                "sum(array_count((`data`.`points`)))"
              ],
              "group_keys": []
            }
          ]
        }
      },
      {
        "#operator": "IntermediateGroup",
        "aggregates": [
          "sum(array_count((`data`.`points`)))"
        ],
        "group_keys": []
      },
      {
        "#operator": "FinalGroup",
        "aggregates": [
          "sum(array_count((`data`.`points`)))"
        ],
        "group_keys": []
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "as": "size",
                  "expr": "sum(array_count((`data`.`points`)))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN [\"Lake\",\"Stream\",\"CoastLocalArea\",\"SeaLocalArea\",\"PutTakeLake\"]"
}

Now, the reason I found it is that I have optimized the def_type index to not include some activity logs that I would never query from inside the system - so I thought it was better to take them out of the index (at present there are 230K of these log documents). Therefore, I changed the def_type index to this:

CREATE INDEX `def_type` ON `data`(`type`) WHERE (not (`type` = "ActivityLog")) 

However, with the above count statement I now get:

[
  {
    "code": 4000,
    "msg": "No index available on keyspace data that matches your query. Use CREATE INDEX or CREATE PRIMARY INDEX to create an index, or check that your expected index is online.",
    "query_from_user": "SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE points is valued"
  }
]

I tried to play with a couple of variations:

SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type='Lake'

This one works - though of course it only gives me the points for lakes.

SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE type IN ['Lake']

This one fails with the above “no index…” message.

SELECT sum(ARRAY_COUNT(points)) AS size FROM data WHERE points is valued

This one also fails with same “no index…” message.
I have also tried to create the count index like this:

CREATE INDEX `def_count_points` ON `data`(`type`,`points`) WHERE (`points` is valued)

But that isn’t used either - so the above results are the exact same :frowning:

Can any one spread a little light on where my misinterpretation of the index usage lies?

I’m on a community server: Community Edition 6.0.0 build 1693

If index is partial index.

  1. Index where clause must be present in query
  2. Index leading key must be present in the query.

Use the following index and make sure queries have type predicate. One don’t have add type IS NOT MISSING

CREATE INDEX def_count_points ON data(type,array_count(points));

Ok, so just removing the WHERE clause (to help the indexer) made it abandon using the index!

I had forgotten to include my original SELECT in the post. I have added it now… - for better understanding of what I try to do.

Leaving out the points is valued from the index creation solves the issue as it now uses the index. One thing I don’t understand, though, is the problems of using ... WHERE type IN ['Lake','Stream',...] - why does that not work when using “=” for one of them does?

I have been looking in stats for the server to see if I could find queries that take long time to run. I understand that I may have to create more indexes than I already have - to ensure the right connection between the index creation statement and the actual queries. But I have not been able to find references to the queries that take long time (e.g. over 500ms, over 1000 ms, etc.). Is there a way to identify these?

Thanks for helping!

Ok, so now the “Count” statement works. However, I have another query that should use a different index. It has now started to use the count index above… :thinking:

This query finds the newest modification time:

SELECT m[0] AS modified,m[1] AS modifiedby,d.revisioninfo.modifiedcount FROM data AS d LET m = ARRAY_MAX(ARRAY [IFMISSINGORNULL(u.modified,u.created), IFMISSINGORNULL(u.modifiedby,u.createdby)] 
FOR u IN TO_ARRAY(IFMISSINGORNULL(revisioninfo.updates,revisioninfo)) END) WHERE d.type IN ['Lake','Stream','PutTakeLake','SeaLocalArea','CoastLocalArea'] ORDER BY ARRAY_MAX(ARRAY [IFMISSINGORNULL(u.modified,u.created), IFMISSINGORNULL(u.modifiedby,u.createdby)] 
FOR u IN TO_ARRAY(IFMISSINGORNULL(revisioninfo.updates,revisioninfo)) END) DESC LIMIT 1

When I use this index then it works:

CREATE INDEX `def_type_modified` ON `data`(`type`,array_max(array [ifmissingornull((`u`.`modified`), (`u`.`created`)), ifmissingornull((`u`.`modifiedby`), (`u`.`createdby`))] 
for `u` in to_array(ifmissingornull((`revisioninfo`.`updates`), `revisioninfo`)) end) DESC,(`revisioninfo`.`modifiedcount`)) 

However, if I try to “tune” this index by not including those 230K entries that I know for sure I would never query:

CREATE INDEX `def_type_modified` ON `data`(`type`,array_max(array [ifmissingornull((`u`.`modified`), (`u`.`created`)), ifmissingornull((`u`.`modifiedby`), (`u`.`createdby`))] 
for `u` in to_array(ifmissingornull((`revisioninfo`.`updates`), `revisioninfo`)) end) DESC,(`revisioninfo`.`modifiedcount`)) 
WHERE (not (`type` = "ActivityLog")) 

… then it won’t use that index any more! And I still have the type field in my query. So how can this be? Based on what you mentioned then I would have thought that should work Ok - but just have a much smaller index to check to find the result…

So I guess that perhaps it comes down to whether one should try (and can?) limit entries in an index for better performance like I try here by excluding a certain document type (the type “ActivityLog”)?

  • You specify USE INDEX clause if you need specific index to use.
  • Index Selection is based on query predicate only When there are multiple indexes can be used and query is not covered it will pick random index. provide hint through USE INDEX.
  • If you don’t want include documents in the index you must create partial index. To use that index you must specify the index condition in query predicate.

Ok, thanks for clarifying this.