DISTINCT processing is done by the index. However, you do need to specify the following in the WHERE clause, as GSI does not index MISSING or NULL values:
select distinct type from default where type is not missing and type is not null;
We are adding covering indexes to address this issue. As @cihangirb indicated, here is the correct query, which will automatically benefit when we add covering indexes:
select distinct type from default where type is not null;
A slight correction to @cihangirb's note: GSI does not index MISSING values, but it does index NULL values.
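To illustrate (the index name here is hypothetical, and this assumes a bucket named default with a type field, as in the queries above), a secondary index like the following is what would back these queries. Because GSI never indexes MISSING entries, the IS NOT MISSING / IS NOT NULL predicate is what allows the scan to be answered from the index:

```sql
-- Hypothetical secondary index on the type field of the default bucket
CREATE INDEX idx_type ON default(type) USING GSI;

-- The predicate excludes MISSING (never indexed) and NULL,
-- matching what the index can actually serve
SELECT DISTINCT type FROM default WHERE type IS NOT NULL;
```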
If most documents in the bucket have a type field and only a few are missing it, the index adds an additional hop, so scanning the bucket seems more efficient.
What is the total number of documents being returned by the query? If the number of documents is large, then the fetch between the query engine and the Couchbase server could be the bottleneck. If that is the case, we would try increasing some of the system parameters to raise the fetch throughput.
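As a sketch of what tuning those parameters might look like (the specific setting names and values here are assumptions about the query service's throughput knobs, not something stated in this thread), the query engine exposes settings such as pipeline-batch and pipeline-cap that control how many items are batched and buffered between operators:

```json
{
  "pipeline-batch": 64,
  "pipeline-cap": 1024
}
```

Such settings can typically be posted to the query service's admin settings endpoint; check the Couchbase documentation for the exact names and defaults in your version before changing them.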
The result set is currently only about 20-30 rows; not much at all.
We have renamed nickel to nyquill (nyql) until we get proper indexes from Couchbase. Joking aside, N1QL is very useful and we would love to see Couchbase push to massively improve performance. We'll keep using it in the meantime, but under the nyquill name, since it does put a few devs to sleep.
We ran a test where we created an index for every single query used in the mentioned script, used EXPLAIN to double-check that the indexes were actually being used, and guess what: script execution time went from 45 seconds to 66 seconds. We then deleted all the indexes.
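For anyone reproducing this kind of check, EXPLAIN is the way to confirm whether a query hits an index or falls back to a primary scan (the query below reuses the earlier example; your own queries would go in its place):

```sql
-- The plan output shows an IndexScan operator (with the index name and
-- scan spans) when a secondary index is used, or a PrimaryScan when the
-- query falls back to scanning the whole bucket
EXPLAIN SELECT DISTINCT type FROM default WHERE type IS NOT NULL;
```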
Well, that nyquill for you is like Red Bull for us. What is the total number of documents in the bucket? Even if the final result set is small, the query could still be fetching a large number of documents.
You could also try using view indexes and see if that improves things. I doubt that it would, but it's worth a try. To create an index backed by a view, replace USING GSI with USING VIEW in your CREATE INDEX statement.
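Concretely, the two variants differ only in the USING clause (index names here are hypothetical, reusing the type field from the earlier queries):

```sql
-- GSI-backed secondary index
CREATE INDEX idx_type_gsi ON default(type) USING GSI;

-- The same index definition backed by a view instead
CREATE INDEX idx_type_view ON default(type) USING VIEW;
```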