COUNT(*) with a simple WHERE clause is very slow

select count(*) as count from mybucket where t = 'post' and topic = 'test' 

Everything is indexed.

In my case this takes 0.5 seconds over 1M documents. It grows steadily with more documents, at roughly 150ms per 300k documents.

The good news is that the latency is proportional to the number of documents matching the unique WHERE clause. Meaning: topic = 'test2' with 300k documents returns in 150ms.

My question is: what are the suggested approaches to make this query run in under 50ms?

Right now I am thinking of writing a service that repeatedly queries the bucket and keeps updated counts in separate documents per 'topic', maybe triggered by an Eventing curl call when new posts are made.
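The idea above can be sketched roughly like this. This is a hypothetical illustration only: a plain dict stands in for the Couchbase bucket, and the key scheme (`count::<topic>`) and function names are made up. In a real deployment the increment would be an atomic KV counter operation driven by Eventing or the app's write path.

```python
# Stand-in for per-topic counter documents keyed by "count::<topic>".
# In Couchbase this would be a KV counter doc updated atomically.
counters = {}

def on_post_created(topic: str) -> None:
    """Bump the maintained count for a topic whenever a post is written."""
    key = f"count::{topic}"
    counters[key] = counters.get(key, 0) + 1

def get_topic_count(topic: str) -> int:
    """O(1) read of the maintained count; no COUNT(*) query needed."""
    return counters.get(f"count::{topic}", 0)

for _ in range(3):
    on_post_created("test")

print(get_topic_count("test"))   # 3
print(get_topic_count("other"))  # 0
```

The trade-off is eventual consistency: the maintained count can lag the real document count if an increment is missed, so it suits displays, not invariants.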

Is there a better way?
I gave up trying to solve this with indexes.

Current:

CREATE INDEX idx_count_posts ON `mybucket`(`t`,`topic`) WHERE ((`t` = 'post') and (`topic` is not missing)) 

Any suggestions welcome.

In the index definition, (topic is not missing) is unnecessary. This could be preventing the optimizer from doing the pushdown. Also, run EXPLAIN on the query and share it here if it's still slow after tweaking the index.

In your case, you can simply create the index on topic, since t is always 'post'.

CREATE INDEX idx_count_posts ON mybucket(topic) WHERE t = 'post';


@keshav_m thank you that shaved off 100ms. Down to ~400ms now on 1M docs.

Definitely helps.

Here is explain:

{
    "#operator": "Sequence",
    "~children": [
        {
            "#operator": "IndexScan3",
            "covers": [
                "cover ((`mybucket`.`topic`))",
                "cover ((meta(`mybucket`).`id`))",
                "cover (count(*))"
            ],
            "filter_covers": {
                "cover ((`mybucket`.`t`))": "post"
            },
            "index": "idx_count_posts",
            "index_group_aggs": {
                "aggregates": [
                    {
                        "aggregate": "COUNT",
                        "expr": "1",
                        "id": 2,
                        "keypos": -1
                    }
                ]
            },
            "index_id": "5f4a498090958cbb",
            "index_projection": {
                "entry_keys": [
                    2
                ]
            },
            "keyspace": "mybucket",
            "namespace": "default",
            "spans": [
                {
                    "exact": true,
                    "range": [
                        {
                            "high": "\"test\"",
                            "inclusion": 3,
                            "low": "\"test\""
                        }
                    ]
                }
            ],
            "using": "gsi"
        },
        {
            "#operator": "Parallel",
            "~child": {
                "#operator": "Sequence",
                "~children": [
                    {
                        "#operator": "InitialProject",
                        "result_terms": [
                            {
                                "as": "count",
                                "expr": "cover (count(*))"
                            }
                        ]
                    }
                ]
            }
        }
    ]
}

The plan seems to be right. It is also using index aggregation. Maybe check the index residence ratio. cc @deepkaran.salooja

You can try creating a partitioned index. This can reduce the latency by performing the aggregation in parallel across partitions. Also, check the resident ratio of the index as @vsr1 suggested.

As you have equality predicates, partition on the document key:

CREATE INDEX idx_count_posts ON mybucket(topic) PARTITION BY HASH(META().id) 
WHERE t = 'post';

Down to 160ms, amazing!
Thank you @vsr1 , @deepkaran.salooja

select count(*) as count 
from mybucket
use index (idx_count_posts)
where t = 'post' and topic = 'test'

====

Follow up on this question, to share my eventual resolution if someone has similar issue.

As fast and as awesome as Couchbase is, a 160ms query absolutely kills it if this request (via REST) is spammed by holding F5 in a browser: CPU usage hits 100% and requests eventually time out as they exceed 30 seconds. For comparison, an average query against Couchbase on my janky dev box takes 4-18ms.

So in combination with the above optimization I’ve also used Couchbase Expiration to temporarily cache the query result in a document.

Using Couchbase doc with expiration is handy because it centralizes this operation to Couchbase when used by multiple REST clients.

The client app simply checks whether a cached count is present; if not, it writes a temporary CountCache doc while running the full count(*) query.

{
  "count": 0,
  "counting": true,
  "meta": "",
  "t": "countcache",
  "topic": "test"
}

with an expiry of less than a second.

Since many more requests may arrive before the count(*) … where … query is done, the other requests check whether a CountCache doc is present with counting == true, and if so sleep themselves (non-blocking) for a few ms before re-checking.

When the query finishes, the request that initiated the count(*) … where … upserts the same document, but with the correct count and counting = false.

{
  "count": 1000000,
  "counting": false,
  "meta": "",
  "t": "countcache",
  "topic": "test"
}

Eventually all requests get the cached count.

After a few seconds the CountCache doc expires, prompting a new query on the next request.

So far in testing this, I have not had a duplicate query run on a single Couchbase node (duplicate queries across a cluster are acceptable). Performance is the same as stress testing any other query.

It's one solution; I hope it helps someone. If you have a better idea, please let me know!
