Designing a multi-purpose index

n1ql

#1

Hi,

I am trying to create N1QL indexes for a couple of queries our application wants to perform. The issue is, I’m not able to design a single index which can be used efficiently for all of the queries even though I have this feeling I should be able to.

The setup we are using is this:

We are using document ID (meta().id) to encode some application specific type information. For example, a document ID “a:_role/X” would mean that there is a role X in the system. The “a:” part means the document is a reference document and only contains a reference ID for the real document.

Example:

Document 1: _a:role/X -> { ref: “c:YYY” } // reference to real data
Document 2: c:YYY -> { … } // actual data

Now, we have the following types of queries we want to perform (we have more, but this should be enough to explain the issue):

  1. Find all the roles defined in the system.
  2. Find all references to “c:YYY”.

We have currenly defined one index for this as follows:

CREATE INDEX alias_index on bucket(meta().id, ref) WHERE meta().id LIKE ‘a:%’;

Query for use-case (1):

_SELECT meta().id FROM bucket USE INDEX (alias_index USING GSI) WHERE meta().id LIKE ‘a:%’ AND meta().id LIKE 'a:role%’

Query for use-case (2):

SELECT meta().id FROM bucket USE INDEX (alias_index USING GSI) WHERE meta().id LIKE ‘a:%’ AND ref=‘c:YYY’

Both queries work, but are now as fast as they should be. I am currently testing with a bucket of ~ 200k documents, and both types of queries takes ~ 400-500 ms on my local development machine.

The issue is (I believe) the usage / non-usage of the first index key (meta().id). If I want to use the same index for both queries, meta().id needs to be part of the index definition (I get a ‘No index available on keyspace bucket that matches your query’ otherwise) because of the first query, but query (2) doesn’t really have a value for that key, so the best I can do there is a wildcard (%), which seems non-optimal.

I know I can get really fast indexes for these queries since using two different indexes (with only ref defined for index (2) and only meta().id defined for index (1)) give me timings in the range of ~ 5 ms.

I briefly tried looking at adaptive indexes as well, but don’t really know how to fit meta().id in there.

Would the better way here be to have separate indexes for these queries? I really want to avoid too many indexes since it can lead to a lot of overhead for different queries that are almost identical.

Regards,
Andreas Nilsson


#2

It should not have such huge difference. Check out EXPLAIN. Also try with CB 5.x
Do you have primary index also. Query 1 primary index should work.


#3

Hi!

I’m currently running with CB 5.0.1 CE.

I created a primary index (create primary index on cmbucket USING GSI) and removed all other indexes to be sure. Query (1) behaves as follows:

success | elapsed: 523.63ms | execution: 523.59ms | count: 12 | size: 562

{
  "plan": {
"#operator": "Sequence",
"~children": [
  {
    "#operator": "IndexScan2",
    "covers": [
      "cover ((meta(`cmbucket`).`id`))"
    ],
    "index": "#primary",
    "index_id": "16d89d8c69437dc0",
    "keyspace": "cmbucket",
    "namespace": "default",
    "spans": [
      {
        "range": [
          {
            "high": "\"a;\"",
            "inclusion": 1,
            "low": "\"a:\""
          }
        ]
      }
    ],
    "using": "gsi"
  },
  {
    "#operator": "Parallel",
    "~child": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Filter",
          "condition": "((cover ((meta(`cmbucket`).`id`)) like \"a:%\") and (cover ((meta(`cmbucket`).`id`)) like \"a:_role%\"))"
        },
        {
          "#operator": "InitialProject",
          "result_terms": [
            {
              "expr": "cover ((meta(`cmbucket`).`id`))"
            }
          ]
        },
        {
          "#operator": "FinalProject"
        }
      ]
    }
  }
]
  },
  "text": "SELECT meta().id FROM `cmbucket` WHERE meta().id LIKE 'a:%' AND meta().id LIKE 'a:_role%'"
}

Query (2) also “works” using only primary index:

success | elapsed: 885.12ms | execution: 885.09ms | count: 2 | size: 114

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "index": "#primary",
        "index_id": "16d89d8c69437dc0",
        "keyspace": "cmbucket",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"a;\"",
                "inclusion": 1,
                "low": "\"a:\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "keyspace": "cmbucket",
        "namespace": "default"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(((meta(`cmbucket`).`id`) like \"a:%\") and ((`cmbucket`.`contentId`) = \"c:YmRjMDU4YzgtZDg3Yy00\"))"
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "(meta(`cmbucket`).`id`)"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT meta().id FROM `cmbucket` WHERE meta().id LIKE 'a:%' AND contentId='c:YmRjMDU4YzgtZDg3Yy00'"
}

All of this works, of course, but performance is far from usable in these cases.

The slower (2) is explained by having to fetch data from the data service, I assume.

With my old index (create index alias_index on cmbucket(meta().id, contentId) where meta().id like ‘a:%’), the explains instead looked as follows.

Query (1):

success | elapsed: 528.31ms | execution: 528.21ms | count: 12 | size: 562

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "covers": [
          "cover ((meta(`cmbucket`).`id`))",
          "cover ((`cmbucket`.`contentId`))",
          "cover ((meta(`cmbucket`).`id`))"
        ],
        "filter_covers": {
          "cover ((\"a:\" <= (meta(`cmbucket`).`id`)))": true,
          "cover (((meta(`cmbucket`).`id`) < \"a;\"))": true,
          "cover (((meta(`cmbucket`).`id`) like \"a:%\"))": true
        },
        "index": "alias_index",
        "index_id": "308979758e93673e",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "cmbucket",
        "namespace": "default",
        "spans": [
          {
            "range": [
              {
                "high": "\"a;\"",
                "inclusion": 1,
                "low": "\"a:\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(cover (((meta(`cmbucket`).`id`) like \"a:%\")) and (cover ((meta(`cmbucket`).`id`)) like \"a:_role%\"))"
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "cover ((meta(`cmbucket`).`id`))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT meta().id FROM `cmbucket` WHERE meta().id LIKE 'a:%' AND meta().id LIKE 'a:_role%'"
}

Query (2):

success | elapsed: 474.09ms | execution: 473.99ms | count: 2 | size: 114

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "covers": [
          "cover ((meta(`cmbucket`).`id`))",
          "cover ((`cmbucket`.`contentId`))",
          "cover ((meta(`cmbucket`).`id`))"
        ],
        "filter_covers": {
          "cover ((\"a:\" <= (meta(`cmbucket`).`id`)))": true,
          "cover (((meta(`cmbucket`).`id`) < \"a;\"))": true,
          "cover (((meta(`cmbucket`).`id`) like \"a:%\"))": true
        },
        "index": "alias_index",
        "index_id": "308979758e93673e",
        "index_projection": {
          "entry_keys": [
            1
          ],
          "primary_key": true
        },
        "keyspace": "cmbucket",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"a;\"",
                "inclusion": 1,
                "low": "\"a:\""
              },
              {
                "high": "\"c:YmRjMDU4YzgtZDg3Yy00\"",
                "inclusion": 3,
                "low": "\"c:YmRjMDU4YzgtZDg3Yy00\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(cover (((meta(`cmbucket`).`id`) like \"a:%\")) and (cover ((`cmbucket`.`contentId`)) = \"c:YmRjMDU4YzgtZDg3Yy00\"))"
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "cover ((meta(`cmbucket`).`id`))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT meta().id FROM `cmbucket` WHERE meta().id LIKE 'a:%' AND contentId='c:YmRjMDU4YzgtZDg3Yy00'"
}

An optimized index for query (2) (create index alias_index oncmbucket(contentId) where meta().id like 'a:%') is explained by the query engine as follows:

success | elapsed: 3.59ms | execution: 3.56ms | count: 2 | size: 114

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "covers": [
          "cover ((`cmbucket`.`contentId`))",
          "cover ((meta(`cmbucket`).`id`))"
        ],
        "filter_covers": {
          "cover ((\"a:\" <= (meta(`cmbucket`).`id`)))": true,
          "cover (((meta(`cmbucket`).`id`) < \"a;\"))": true,
          "cover (((meta(`cmbucket`).`id`) like \"a:%\"))": true
        },
        "index": "alias_index",
        "index_id": "572f95a8c063eeaa",
        "keyspace": "cmbucket",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"c:YmRjMDU4YzgtZDg3Yy00\"",
                "inclusion": 3,
                "low": "\"c:YmRjMDU4YzgtZDg3Yy00\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(cover (((meta(`cmbucket`).`id`) like \"a:%\")) and (cover ((`cmbucket`.`contentId`)) = \"c:YmRjMDU4YzgtZDg3Yy00\"))"
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "cover ((meta(`cmbucket`).`id`))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT meta().id FROM `cmbucket` WHERE meta().id LIKE 'a:%' AND contentId='c:YmRjMDU4YzgtZDg3Yy00'"
}

Regards,
Andreas Nilsson


#4

In LIKE _ is wild card character. you need to escape.

SELECT meta().id 
FROM `cmbucket` 
WHERE meta().id LIKE 'a:%' AND meta().id LIKE 'a:\\_role%'

You should see explain some thing like below

    "spans": [
                    {
                        "range": [
                            {
                                "high": "\"a:_rolf\"",
                                "inclusion": 1,
                                "low": "\"a:_role\""
                            }
                        ]
                    }

Query 2 You should use create index alias_index on cmbucket(contentId) where meta().id like ‘a:%’)

Follow Rule #11 . Otherwise indexer needs lot of index pages (due to leading range key) and process the results.

N1QL Profiling

N1QL books


#5

Hi,

You are totally correct, escaping the ‘_’ char made the primary index query really fast. Many thanks!

Now… what would actually be the benefit of using a primary index here? I don’t really have a need for ad-hoc queries, and as long as I have to create an index, wouldn’t a specialised secondary index almost always be better? Also, as far as I understand primary indexes, they don’t really store any data (to “cover” queries), so as soon as I need to select data that isn’t meta information, it will have to retrieve it using the data service?

As an alternative, say I decide to instead use a secondary index for the query type to “find all roles”… the “_role” key prefix isn’t the only type of key prefix I’m gonna want to query on, as we also have prefixes like “_type” and “_workspace”. In order to not have to create three different secondary indexes, it seems I can create the index like this:

CREATE INDEX alias_index2 ONcmbucket(meta().id) where meta().id like 'a:\\_role%' OR meta().id like 'a:\\_type%' OR meta().id like 'a:\\_workspace%'

Example:

SELECT meta().id FROMcmbucketUSE INDEX (alias_index2 USING GSI) WHERE meta().id LIKE 'a:\\_workspace%' AND meta().id LIKE 'a:_workspace/admin-workspace-%'

success | elapsed: 4.94ms | execution: 4.90ms | count: 2 | size: 182

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "covers": [
          "cover ((meta(`cmbucket`).`id`))",
          "cover ((meta(`cmbucket`).`id`))"
        ],
        "index": "alias_index2",
        "index_id": "db7eaaef803942c1",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "cmbucket",
        "namespace": "default",
        "spans": [
          {
            "range": [
              {
                "high": "\"a:_workspacf\"",
                "inclusion": 1,
                "low": "\"a:_workspace\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "((cover ((meta(`cmbucket`).`id`)) like \"a:\\\\_workspace%\") and (cover ((meta(`cmbucket`).`id`)) like \"a:_workspace/admin-workspace-%\"))"
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "cover ((meta(`cmbucket`).`id`))"
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT meta().id FROM `cmbucket` USE INDEX (alias_index2 USING GSI) WHERE meta().id LIKE 'a:\\\\_workspace%' AND meta().id LIKE 'a:_workspace/admin-workspace-%';"
}

Except for the need to to actually OR every type of document when creating the index like this, wouldn’t this really perform the same job as a primary index?

Queries for all three key prefix types seems to be fast with this index (5-10 ms). Now, I’m just trying to understand the benefit of a secondary index versus a primary index. Won’t a primary index grow very large since it encompasses every document?

Thanks for all your help btw!

Regards,
Andreas Nilsson


#6

You can create secondary index with meta().id in this situations index may be little bigger due to key is stored twice (index key, document key).

If you are using USE INDEX it will fine otherwise non-covered query may try intersect scan which you need to avoid if it causes performance issue.