Covering index range spans

I’ve stumbled across something that I can’t find mentioned in the documentation, but apologies if I’ve just missed it.

If I create an index with definition:

CREATE INDEX `Idx_Core_User_n`
ON `core_north-europe`(`docType`,`role`,`establishmentId`)
WHERE `docType`="User" USING GSI

…and get the EXPLAIN plan for query:

SELECT META().id
FROM `core_north-europe`
WHERE `docType`="User"
AND `establishmentId`="FaS87F40jUWzPQtVFkFGaQ"

…I get:

[
  {
	"plan": {
	  "#operator": "Sequence",
	  "~children": [
		{
		  "#operator": "IndexScan",
		  "covers": [
			"cover ((`core_north-europe`.`docType`))",
			"cover ((`core_north-europe`.`role`))",
			"cover ((`core_north-europe`.`establishmentId`))",
			"cover ((meta(`core_north-europe`).`id`))"
		  ],
		  "filter_covers": {
			"cover ((`core_north-europe`.`docType`))": "User"
		  },
		  "index": "Idx_Core_User_n",
		  "index_id": "943ea07e4d48bf37",
		  "keyspace": "core_north-europe",
		  "namespace": "default",
		  "spans": [
			{
			  "Range": {
				"High": [
				  "successor(\"User\")"
				],
				"Inclusion": 1,
				"Low": [
				  "\"User\""
				]
			  }
			}
		  ],
		  "using": "gsi"
		},
		{
		  "#operator": "Parallel",
		  "~child": {
			"#operator": "Sequence",
			"~children": [
			  {
				"#operator": "Filter",
				"condition": "((cover ((`core_north-europe`.`docType`)) = \"User\") and (cover ((`core_north-europe`.`establishmentId`)) = \"FaS87F40jUWzPQtVFkFGaQ\"))"
			  },
			  {
				"#operator": "InitialProject",
				"result_terms": [
				  {
					"expr": "cover ((meta(`core_north-europe`).`id`))"
				  }
				]
			  },
			  {
				"#operator": "FinalProject"
			  }
			]
		  }
		}
	  ]
	},
	"text": "SELECT META().id\n    FROM `core_north-europe`\n    WHERE `docType`=\"User\"\n    AND `establishmentId`=\"FaS87F40jUWzPQtVFkFGaQ\""
  }
]

This executes in 150ms with a result count of 11.

If however I include a value for the ‘role’ attribute in the query:

SELECT META().id
FROM `core_north-europe`
WHERE `docType`="User"
AND `establishmentId`="FaS87F40jUWzPQtVFkFGaQ" AND `role`="TeachingStaff"

…I get:

[
  {
	"plan": {
	  "#operator": "Sequence",
	  "~children": [
		{
		  "#operator": "IndexScan",
		  "covers": [
			"cover ((`core_north-europe`.`docType`))",
			"cover ((`core_north-europe`.`role`))",
			"cover ((`core_north-europe`.`establishmentId`))",
			"cover ((meta(`core_north-europe`).`id`))"
		  ],
		  "filter_covers": {
			"cover ((`core_north-europe`.`docType`))": "User"
		  },
		  "index": "Idx_Core_User_n",
		  "index_id": "943ea07e4d48bf37",
		  "keyspace": "core_north-europe",
		  "namespace": "default",
		  "spans": [
			{
			  "Range": {
				"High": [
				  "\"User\"",
				  "\"TeachingStaff\"",
				  "\"FaS87F40jUWzPQtVFkFGaQ\""
				],
				"Inclusion": 3,
				"Low": [
				  "\"User\"",
				  "\"TeachingStaff\"",
				  "\"FaS87F40jUWzPQtVFkFGaQ\""
				]
			  }
			}
		  ],
		  "using": "gsi"
		},
		{
		  "#operator": "Parallel",
		  "~child": {
			"#operator": "Sequence",
			"~children": [
			  {
				"#operator": "Filter",
				"condition": "(((cover ((`core_north-europe`.`docType`)) = \"User\") and (cover ((`core_north-europe`.`establishmentId`)) = \"FaS87F40jUWzPQtVFkFGaQ\")) and (cover ((`core_north-europe`.`role`)) = \"TeachingStaff\"))"
			  },
			  {
				"#operator": "InitialProject",
				"result_terms": [
				  {
					"expr": "cover ((meta(`core_north-europe`).`id`))"
				  }
				]
			  },
			  {
				"#operator": "FinalProject"
			  }
			]
		  }
		}
	  ]
	},
	"text": "SELECT META().id\n    FROM `core_north-europe`\n    WHERE `docType`=\"User\"\n    AND `establishmentId`=\"FaS87F40jUWzPQtVFkFGaQ\" AND `role`=\"TeachingStaff\""
  }
]

…which executes in 18ms with a result count of 8.

The main difference here is the range scan. In the latter, all query params are included, whereas in the former only ‘User’ is included in the range. From what I can see, these query params are only included in the range based on them lining up with the index definition.

For example, if I changed the order of the attributes to include in the index definition as follows (‘establishmentId’ first):

CREATE INDEX `Idx_Core_User_n`
ON `core_north-europe`(`docType`,`establishmentId`,`role`)
WHERE `docType`="User" USING GSI

…I can now run the first query again:

SELECT META().id
FROM `core_north-europe`
WHERE `docType`="User"
AND `establishmentId`="FaS87F40jUWzPQtVFkFGaQ"

…with EXPLAIN:

[
  {
	"plan": {
	  "#operator": "Sequence",
	  "~children": [
		{
		  "#operator": "IndexScan",
		  "covers": [
			"cover ((`core_north-europe`.`docType`))",
			"cover ((`core_north-europe`.`establishmentId`))",
			"cover ((`core_north-europe`.`role`))",
			"cover ((meta(`core_north-europe`).`id`))"
		  ],
		  "filter_covers": {
			"cover ((`core_north-europe`.`docType`))": "User"
		  },
		  "index": "Idx_Core_User_n",
		  "index_id": "8df51bceebda4650",
		  "keyspace": "core_north-europe",
		  "namespace": "default",
		  "spans": [
			{
			  "Range": {
				"High": [
				  "\"User\"",
				  "successor(\"FaS87F40jUWzPQtVFkFGaQ\")"
				],
				"Inclusion": 1,
				"Low": [
				  "\"User\"",
				  "\"FaS87F40jUWzPQtVFkFGaQ\""
				]
			  }
			}
		  ],
		  "using": "gsi"
		},
		{
		  "#operator": "Parallel",
		  "~child": {
			"#operator": "Sequence",
			"~children": [
			  {
				"#operator": "Filter",
				"condition": "((cover ((`core_north-europe`.`docType`)) = \"User\") and (cover ((`core_north-europe`.`establishmentId`)) = \"FaS87F40jUWzPQtVFkFGaQ\"))"
			  },
			  {
				"#operator": "InitialProject",
				"result_terms": [
				  {
					"expr": "cover ((meta(`core_north-europe`).`id`))"
				  }
				]
			  },
			  {
				"#operator": "FinalProject"
			  }
			]
		  }
		}
	  ]
	},
	"text": "SELECT META().id\n    FROM `core_north-europe`\n    WHERE `docType`=\"User\"\n    AND `establishmentId`=\"FaS87F40jUWzPQtVFkFGaQ\""
  }
]

…and execution in 20ms.

That’s 7.5x faster just by swapping the order of the index attributes to better suit one of my most executed queries.

Is there anything documented to let people know that they need to be thinking about this?

Thanks,
Fraser

2 Likes

some useful material, see also <04_QueryOptimizationAndTuning_v5.pdf> P34:
Index stores data is pre-sorted by the index keys.

1 Like

thanks @frasdav for raising this. Will add details to CREATE INDEX & related documentation.
When defining compound indexes, the order of index keys is very important, both for selecting the index for a query, and to leverage the index order (to avoid sorting in the N1QL query engine).

Infact, prefix keys are required to use the index. For ex: In following index,
CREATE INDEX idx1 on default (idx_key1, idx_key2, idx_key3, …)

  • {idx_key1} is prefix-key for idx_key2,
  • {idx_key1, idx_key2} are prefix-keys for idx_key3, and so on…

To leverage the index, a SELECT query must have reference to idx_key1, which is a minimum prefix. It is better to have as many prefix keys as possible. In fact, for better performance, the keys with low-selectivity should be placed before the keys with higher selectivities.

Following blog explains index-key selectivities and composite indexes:

Understanding index spans:
https://developer.couchbase.com/documentation/server/4.6/performance/index-scans.html

hth,
-Prasad

1 Like