COUNT condition: INDEX Speed-up improvements

Sidenote: Please read this topic for background on this question (specifically #8 response)

Why doesn’t my index kick-in (in the first 3 queries) to give me blazing-fast results? I guess that array operations / filters are hard but the main keys are already there

cbq> select * from system:indexes;
{
    "requestID": "cf3b9ed6-18e0-475a-92f4-0cf3489dc8a2",
    "signature": {
        "*": "*"
    },
    "results": [
        {
            "indexes": {
                "datastore_id": "http://127.0.0.1:8091",
                "id": "582e2bb37cdebf18",
                "index_key": [],
                "is_primary": true,
                "keyspace_id": "sync_gateway",
                "name": "#primary",
                "namespace_id": "default",
                "state": "online",
                "using": "gsi"
            }
        },
        {
            "indexes": {
                "condition": "(`type` = \"bcn_scan\")",
                "datastore_id": "http://127.0.0.1:8091",
                "id": "3847cba3cb75f8a9",
                "index_key": [
                    "`did`",
                    "`t`"
                ],
                "keyspace_id": "sync_gateway",
                "name": "bcn_scan-did-t",
                "namespace_id": "default",
                "state": "online",
                "using": "gsi"
            }
        },
        {
            "indexes": {
                "datastore_id": "http://127.0.0.1:8091",
                "id": "a8c8c25786449f1",
                "index_key": [],
                "is_primary": true,
                "keyspace_id": "playground",
                "name": "#primary",
                "namespace_id": "default",
                "state": "online",
                "using": "gsi"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "203.155093ms",
        "executionTime": "203.108497ms",
        "resultCount": 3,
        "resultSize": 1325
    }
}
  1. I tried to read https://dzone.com/articles/understanding-index-scans-in-couchbase-n1ql-query; however I do not understand what kind of index would be necessary to make those queries faster?

  2. Between those 2 queries, which would be considered “faster” post-index (and why)?

(Code follow - parser needs an extra line to make preformatted text … preformatted! :slight_smile: )

SELECT did, {date : {
		"foull_entries" : ARRAY_SUM(ARRAY_AGG(TONUMBER(ARRAY_LENGTH(bcn)<>0))),
		"nulld_entries" : ARRAY_SUM(ARRAY_AGG(TONUMBER(ARRAY_LENGTH(bcn)==0))),
		"total_entries" : COUNT(*)
	} 
}
FROM sync_gateway
LET date = SUBSTR(t, 0, 10)
WHERE type="bcn_scan" AND date BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY did, date
ORDER BY did ASC, date ASC;

"metrics": {
	"elapsedTime": "3m19.595350541s",
	"executionTime": "3m19.595267868s",
	"resultCount": 215,
	"resultSize": 59099,
	"sortCount": 215
}

and

SELECT date, {did : {
		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
		"total_entries" : COUNT(*),
		"nulld_entries" : COUNT(*) - COUNT(NULLIF(ARRAY_LENGTH(bcn), 0))
	}
}
FROM sync_gateway
LET date = SUBSTR(t, 0, 10)
WHERE type="bcn_scan" AND date BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY date, did
ORDER BY date ASC, did ASC;

"metrics": {
    "elapsedTime": "3m5.42893805s",
    "executionTime": "3m5.428881988s",
    "resultCount": 215,
    "resultSize": 59314,
    "sortCount": 215
}

Obviously differences ~5s are negligible pre-index - maybe post-index array operations will be significant.
Maybe some syntax makes more sense to professionals.
Maybe there are serious issues with one of the two syntaxes.

Or […] idk. All comments / assistance welcome

Let’s start with the fundamentals.

Your query should have a WHERE clause. That is called a query predicate.

WHERE f(a) AND g(b) AND h(c) ...

That means, some function / condition on a, b, and c. a, b, c are fields in your documents. f, g, and h are any expressions. For example, f(a) can be a > 5.

Your index has keys. Index keys are the fields/expressions in parentheses in theCREATE INDEX statement.

CREATE INDEX idx ON mybucket( a, b, c, d, e );

In this example, a, b, c, d, e are index keys.

This is the basic rule of indexing: for a query to an index, the query predicate must be able to use the leftmost keys of the index (including possibly all the keys of the index).

The query above can use the index, because the query is able to use a, b, c from the leftmost part of the index.

Forget about index conditions for now.

If I understand correctly in my both queries, predicates are date BETWEEN "2016-11-01" AND "2016-12-15" (type="bcn_scan" is too, but it’s covered from Index Condition)

However, this is what EXPLAIN gives me:

cbq> EXPLAIN SELECT did, {date : {
   > 		"foull_entries" : ARRAY_SUM(ARRAY_AGG(TONUMBER(ARRAY_LENGTH(bcn)<>0))),
   > 		"nulld_entries" : ARRAY_SUM(ARRAY_AGG(TONUMBER(ARRAY_LENGTH(bcn)==0))),
   > 		"total_entries" : COUNT(*)
   > 	} 
   > }
   > FROM sync_gateway
   > LET date = SUBSTR(t, 0, 10)
   > WHERE type="bcn_scan" AND date BETWEEN "2016-11-01" AND "2016-12-15"
   > GROUP BY did, date
   > ORDER BY did ASC, date ASC;
{
	"requestID": "859b77cb-c367-4c17-83c2-e5e5b5795cef",
	"signature": "json",
	"results": [
		{
			"plan": {
				"#operator": "Sequence",
				"~children": [
					{
						"#operator": "Sequence",
						"~children": [
							{
								"#operator": "PrimaryScan",
								"index": "#primary",
								"keyspace": "sync_gateway",
								"namespace": "default",
								"using": "gsi"
							},
							[...]

and

cbq> EXPLAIN SELECT date, {did : {
   > 		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
   > 		"total_entries" : COUNT(*),
   > 		"nulld_entries" : COUNT(*) - COUNT(NULLIF(ARRAY_LENGTH(bcn), 0))
   > 	}
   > }
   > FROM sync_gateway
   > LET date = SUBSTR(t, 0, 10)
   > WHERE type="bcn_scan" AND date BETWEEN "2016-11-01" AND "2016-12-15"
   > GROUP BY date, did
   > ORDER BY date ASC, did ASC;
{
	"requestID": "4193b014-6dcf-4963-b19a-5d61558452af",
	"signature": "json",
	"results": [
		{
			"plan": {
				"#operator": "Sequence",
				"~children": [
					{
						"#operator": "Sequence",
						"~children": [
							{
								"#operator": "PrimaryScan",
								"index": "#primary",
								"keyspace": "sync_gateway",
								"namespace": "default",
								"using": "gsi"
							},
							[...]

Is there a problem that did is fully scanned instead of indexed?
Isn’t the “big” index enough to catch both (i.e. should I create separate indexes to scan did AND t separately?)

I guess working on the predicates (SUBSTR) and fetching / working non-predicate fields (COUNT and all the ARRAY_* operations), shouldn’t alter the result (as you can first fetch then work on them).

But something obviously is not the case

As I said earlier, I can only process one thought at a time. Tell me one single thing we can look at next.

Let’s try this one now: :slight_smile:

Ok. Can you format the query and put each section on a separate line. And can you add the CREATE INDEX statement. Just those two items are sufficient.

The created indexes (regarding sync_gateway):

CREATE PRIMARY INDEX ON `sync_gateway` USING GSI;
CREATE INDEX `bcn_scan-did-t` ON sync_gateway(did, t) WHERE type='bcn_scan';

The query (as you proposed) is:

SELECT date, {did : {
		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
		"total_entries" : COUNT(*),
		"nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
	}
}
FROM sync_gateway
LET date = SUBSTR(t, 0, 10)
WHERE type="bcn_scan" AND date BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY date, did
ORDER BY date ASC, did ASC;

I think I explained this in another post. Your query predicate must use your index keys.

Okay - so indeed I need to split up the two indexes:

CREATE INDEX `bcn_scan-did-t` ON sync_gateway(did) WHERE type='bcn_scan';
CREATE INDEX `bcn_scan-did-t` ON sync_gateway(t) WHERE type='bcn_scan';

Will these work for this case:

[...]
WHERE type="bcn_scan"
AND date BETWEEN "2016-11-01"
AND "2016-12-15"
AND did='some_value'
[...]

Or I also need this one:

CREATE INDEX `bcn_scan-did-t` ON sync_gateway(did, t) WHERE type='bcn_scan';

?

I don’t know why you need to split up the indexes, but if that works for you, then great.

If you want to use SUBSTR() in your WHERE clause, then

(1) SUBSTR() must be directly in your WHERE clause, and not aliased via LET

(2) The index key must also contain SUBSTR(), i.e. you cannot index t and then query on SUBSTR(t). You must also index SUBSTR(t).

I don’t know what to do, honestly. That’s why I have been disturbing you for all these days :joy:.
I also want to thank you for putting up with me :smile:.

Okay - Next index attempt:

Index Statement

CREATE INDEX `bcn_scan-did-t-date` ON sync_gateway(did, t, SUBSTR(t, 0, 10)) WHERE type='bcn_scan';

system:indexes

cbq> select * from system:indexes;
{
    "requestID": "1cbf9b6a-e9ae-41f5-be7a-e0d52e747ceb",
    "signature": {
        "*": "*"
    },
    "results": [
        {
            "indexes": {
                "condition": "(`type` = \"bcn_scan\")",
                "datastore_id": "http://127.0.0.1:8091",
                "id": "57b6a0fc70b6e7b0",
                "index_key": [
                    "`did`",
                    "`t`",
                    "substr(`t`, 0, 10)"
                ],
                "keyspace_id": "sync_gateway",
                "name": "bcn_scan-did-t-date",
                "namespace_id": "default",
                "state": "online",
                "using": "gsi"
            }
        },
        {
            "indexes": {
                "datastore_id": "http://127.0.0.1:8091",
                "id": "a8c8c25786449f1",
                "index_key": [],
                "is_primary": true,
                "keyspace_id": "playground",
                "name": "#primary",
                "namespace_id": "default",
                "state": "online",
                "using": "gsi"
            }
        },
        {
            "indexes": {
                "datastore_id": "http://127.0.0.1:8091",
                "id": "582e2bb37cdebf18",
                "index_key": [],
                "is_primary": true,
                "keyspace_id": "sync_gateway",
                "name": "#primary",
                "namespace_id": "default",
                "state": "online",
                "using": "gsi"
            }
        },
        {
            "indexes": {
                "condition": "(`type` = \"bcn_scan\")",
                "datastore_id": "http://127.0.0.1:8091",
                "id": "3847cba3cb75f8a9",
                "index_key": [
                    "`did`",
                    "`t`"
                ],
                "keyspace_id": "sync_gateway",
                "name": "bcn_scan-did-t",
                "namespace_id": "default",
                "state": "online",
                "using": "gsi"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "152.837705ms",
        "executionTime": "152.776964ms",
        "resultCount": 4,
        "resultSize": 1880
    }
}

cbq EXPLAIN $1

EXPLAIN SELECT SUBSTR(t, 0, 10), {did : {
		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
		"total_entries" : COUNT(*),
		"nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
	}
}
FROM sync_gateway
WHERE type="bcn_scan" AND SUBSTR(t, 0, 10) BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY SUBSTR(t, 0, 10) ASC, did ASC;


cbq> EXPLAIN SELECT SUBSTR(t, 0, 10), {did : {
                "foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
   >    >               "total_entries" : COUNT(*),
   >            "nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
   >    }
   > }
FROM sync_gateway
WHERE type="bcn_scan" AND SUBSTR(t, 0, 10) BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY SUBSTR(t, 0, 10) ASC, did ASC;
   >    >    >    > {
    "requestID": "867491c0-45b2-4713-a70e-8c558e9126da",
    "signature": "json",
    "results": [
        {
            "plan": {
                "#operator": "Sequence",
                "~children": [
                    {
                        "#operator": "Sequence",
                        "~children": [
                            {
                                "#operator": "PrimaryScan",
                                "index": "#primary",
                                "keyspace": "sync_gateway",
                                "namespace": "default",
                                "using": "gsi"
                            },
                            {
                                "#operator": "Fetch",
                                "keyspace": "sync_gateway",
                                "namespace": "default"
                            },
                            {
                                "#operator": "Parallel",
                                "~child": {
                                    "#operator": "Sequence",
                                    "~children": [
                                        {
                                            "#operator": "Filter",
                                            "condition": "(((`sync_gateway`.`type`) = \"bcn_scan\") and (substr((`sync_gateway`.`t`), 0, 10) between \"2016-11-01\" and \"2016-12-15
\"))"
                                        },
                                        {
                                            "#operator": "InitialGroup",
                                            "aggregates": [
                                                "count(*)",
                                                "count(case when (array_length((`sync_gateway`.`bcn`)) = 0) then 1 else null end)",
                                                "count(nullif(array_length((`sync_gateway`.`bcn`)), 0))"
                                            ],
                                            "group_keys": [
                                                "substr((`sync_gateway`.`t`), 0, 10)",
                                                "(`sync_gateway`.`did`)"
                                            ]
                                        }
                                    ]
                                }
                            },
                            {
                                "#operator": "IntermediateGroup",
                                "aggregates": [
                                    "count(*)",
                                    "count(case when (array_length((`sync_gateway`.`bcn`)) = 0) then 1 else null end)",
                                    "count(nullif(array_length((`sync_gateway`.`bcn`)), 0))"
                                ],
                                "group_keys": [
                                    "substr((`sync_gateway`.`t`), 0, 10)",
                                    "(`sync_gateway`.`did`)"
                                ]
                            },
                            {
                                "#operator": "FinalGroup",
                                "aggregates": [
                                    "count(*)",
                                    "count(case when (array_length((`sync_gateway`.`bcn`)) = 0) then 1 else null end)",
                                    "count(nullif(array_length((`sync_gateway`.`bcn`)), 0))"
                                ],
                                "group_keys": [
                                    "substr((`sync_gateway`.`t`), 0, 10)",
                                    "(`sync_gateway`.`did`)"
                                ]
                            },
                            {
                                "#operator": "Parallel",
                                "~child": {
                                    "#operator": "Sequence",
                                    "~children": [
                                        {
                                            "#operator": "InitialProject",
                                            "result_terms": [
                                                {
                                                    "expr": "substr((`sync_gateway`.`t`), 0, 10)"
                                                },
                                                {
                                                    "expr": "{\"(`sync_gateway`.`did`)\": {\"\\\"foull_entries\\\"\": count(nullif(array_length((`sync_gateway`.`bcn`)), 0)), \"\\\"
nulld_entries\\\"\": count(case when (array_length((`sync_gateway`.`bcn`)) = 0) then 1 else null end), \"\\\"total_entries\\\"\": count(*)}}"
                                                }
                                            ]
                                        }
                                    ]
                                }
                            }
                        ]
                    },
                    {
                        "#operator": "Order",
                        "sort_terms": [
                            {
                                "expr": "substr((`sync_gateway`.`t`), 0, 10)"
                            },
                            {
                                "expr": "(`sync_gateway`.`did`)"
                            }
                        ]
                    },
                    {
                        "#operator": "FinalProject"
                    }
                ]
            },
            "text": "SELECT SUBSTR(t, 0, 10), {did : { \"foull_entries\" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)), \"total_entries\" : COUNT(*), \"nulld_entries\" : COUNT(CASE WHEN AR
RAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END) } } FROM sync_gateway WHERE type=\"bcn_scan\" AND SUBSTR(t, 0, 10) BETWEEN \"2016-11-01\" AND \"2016-12-15\" GROUP BY SUBSTR(t, 0, 10), did
 ORDER BY SUBSTR(t, 0, 10) ASC, did ASC"
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "11.821667ms",
        "executionTime": "11.745138ms",
        "resultCount": 1,
        "resultSize": 5897
    }
}

Still wants to do a primary scan.

Is it because I am NOT querying did key?

We will figure it out :smile:

Just post the exact query you want, and we will propose an index.

The query I am ‘explaining’ so far.

SELECT SUBSTR(t, 0, 10), {did : {
		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
		"total_entries" : COUNT(*),
		"nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
	}
}
FROM sync_gateway
WHERE type="bcn_scan" AND SUBSTR(t, 0, 10) BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY SUBSTR(t, 0, 10) ASC, did ASC;

for all SUBSTR(t, 0, 10) values ofc, not only the ones used here.
Unless there is a better way of working with date ranges using the original "t": "2016-11-15T07:27:50.241Z" value

Database is a “production database”, but it’s still my playground (learning N1QL, NoSQL DB amidst other things) - so I don’t have a clear idea of what I’d like those indexes to be.

(Also, another query of the top off my head would be:

SELECT t, bcn
FROM sync_gateway
WHERE type="bcn_scan" AND bcn.id == ["0x5475726b75204f626f416b6164656d69", "0x0110", "0x0032"]
ORDER BY t DESC
LIMIT 1;

I am not sure about the validity of the bcn.id == ["0x5475726b75204f626f416b6164656d69", "0x0110", "0x0032"] expression - maybe it helps better shape the above query)

Query:

SELECT SUBSTR(t, 0, 10), {did : {
		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
		"total_entries" : COUNT(*),
		"nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
	}
}
FROM sync_gateway
WHERE
    type="bcn_scan"
    AND SUBSTR(t, 0, 10) BETWEEN $start AND $end
    AND did IS NOT NULL
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY SUBSTR(t, 0, 10) ASC, did ASC;

Index:

CREATE INDEX idx ON sync_gateway( SUBSTR(t, 0, 10), did ) WHERE type = 'bcn_scan';

So why is this different?

CREATE INDEX `bcn_scan-did-t-date` ON sync_gateway(did, t, SUBSTR(t, 0, 10)) WHERE type='bcn_scan';
  • I have the fields in different order (and they somehow matter)?
  • I have extra fields (and it messes up ‘matching’)?
  • Index is named? :joy:

Can you try it out first and confirm if it works, performs, etc.

Surpisingly enough, your query

EXPLAIN SELECT SUBSTR(t, 0, 10), {did : {
		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
		"total_entries" : COUNT(*),
		"nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
	}
}
FROM sync_gateway
WHERE
    type="bcn_scan"
    AND SUBSTR(t, 0, 10) BETWEEN $start AND $end
    AND did IS NOT NULL
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY SUBSTR(t, 0, 10) ASC, did ASC;

Is:

"#operator": "IndexScan",
"index": "bcn_scan-did-t-date"

Whereas my original one:

EXPLAIN SELECT SUBSTR(t, 0, 10), {did : {
		"foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
		"total_entries" : COUNT(*),
		"nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
	}
}
FROM sync_gateway
WHERE type="bcn_scan" AND SUBSTR(t, 0, 10) BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY SUBSTR(t, 0, 10) ASC, did ASC;

Is:

"#operator": "IndexScan",
"index": "bcn_scan-date-did"

Ok. My two questions were:

(1) Does it work? (Yes / no)

(2) Does it perform? (Yes / no)

I was waiting for the results to come through. Both queries timeout

cbq> SELECT SUBSTR(t, 0, 10), {did : {
                "foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
   >    >               "total_entries" : COUNT(*),
   >            "nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
        }
}
FROM sync_gateway
WHERE
    type="bcn_scan"
    AND SUBSTR(t, 0, 10) BETWEEN "2016-11-01" AND "2016-12-15"
    AND did IS NOT NULL
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY S   >    >    >    >    >    >    >    >    > UBSTR(t, 0, 10) ASC, did ASC;
{
    "requestID": "d059f667-4b30-41cc-a2a2-56e89ad4c0dd",
    "signature": {
        "$1": "string",
        "$2": "object"
    },
    "results": [
        {}
    ],
    "errors": [
        {
            "code": 12015,
            "msg": "Index scan timed out - cause: Index scan timed out"
        }
    ],
    "status": "errors",
    "metrics": {
        "elapsedTime": "2m0.103019842s",
        "executionTime": "2m0.102792644s",
        "resultCount": 1,
        "resultSize": 2,
        "sortCount": 1,
        "errorCount": 1
    }
}
cbq> 

SELECT SUBSTR(t, 0, 10), {did : {
        "foull_entries" : COUNT(NULLIF(ARRAY_LENGTH(bcn), 0)),
        "total_entries" : COUNT(*),
        "nulld_entries" : COUNT(CASE WHEN ARRAY_LENGTH(bcn)==0 THEN 1 ELSE NULL END)
    }
}
FROM sync_gateway
WHERE
    type="bcn_scan"
    AND SUBSTR(t, 0, 10) BETWEEN "2016-11-01" AND "2016-12-15"
GROUP BY SUBSTR(t, 0, 10), did
ORDER BY SUBSTR(t, 0, 10) ASC, did ASC;

"errors": [
    {
        "code": 12015,
        "msg": "Index scan timed out - cause: Index scan timed out"
    }
],
"status": "errors",
"metrics": {
    "elapsedTime": "2m3.604967639s",
    "executionTime": "2m3.604919887s",
    "resultCount": 96,
    "resultSize": 26328,
    "sortCount": 96,
    "errorCount": 1

The primary index queries average at about 2m57 to 4m49, so I could increase the time - but I don’t know it it’s worth it after all

Ok. Drop all your other indexes, and keep only the last index I posted. Then rerun the exact query I posted.