N1QL query is slow


#1

Hi, I have a bucket with 2M documents (each document is less than 2KB). When I run a range scan query with N1QL, it takes more than 1 minute to finish (the query should scan 1.2M documents). This is slower than I expected. A similar query on a MongoDB cluster takes 4s to scan 1M out of 3M documents.

Any tips to make the query faster? I suspect I'm using Couchbase the wrong way :(

More context:

  • My query is "SELECT COUNT(*), AVG(edgeLatency) FROM test WHERE timestampUtc BETWEEN '2016-11-22T01:00:00' AND '2016-11-23T06:00:00'". Here is one response:
{
  "requestID": "d37772c4-10cf-4095-a156-8c739ac81c2d",
  "clientContextID": "59::60",
  "signature": {
    "$1": "number",
    "$2": "number"
  },
  "results": [
    {
      "$1": 1266665,
      "$2": 0.49988867718860125
    }
  ],
  "status": 0,
  "errors": [],
  "warnings": [],
  "metrics": {
    "elapsedTime": "1m22.4896596s",
    "executionTime": "1m22.48869s",
    "resultCount": 1,
    "resultSize": 76,
    "mutationCount": 0,
    "errorCount": 0,
    "warningCount": 0,
    "sortCount": 0
  }
}
  • Field "timestampUtc" is a datetime and I have created a GSI index on it.
Definition: CREATE INDEX timestampUtc_index ON test(`timestampUtc`) USING GSI
  • This is the EXPLAIN output for my query:
{
  "requestID": "a9628394-8d00-42b6-9084-248b4ddc1f75",
  "clientContextID": "61::62",
  "signature": "json",
  "results": [
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan",
            "index": "timestamputc_index",
            "index_id": "93504482a482e79c",
            "keyspace": "test",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "\"2016-11-23T06:00:00\""
                  ],
                  "Inclusion": 3,
                  "Low": [
                    "\"2016-11-22T01:00:00\""
                  ]
                }
              }
            ],
            "using": "gsi"
          },
          {
            "#operator": "Fetch",
            "keyspace": "test",
            "namespace": "default"
          },
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Filter",
                  "condition": "((`test`.`timestampUtc`) between \"2016-11-22T01:00:00\" and \"2016-11-23T06:00:00\")"
                },
                {
                  "#operator": "InitialGroup",
                  "aggregates": [
                    "avg((`test`.`edgeLatency`))",
                    "count(*)"
                  ],
                  "group_keys": []
                }
              ]
            }
          },
          {
            "#operator": "IntermediateGroup",
            "aggregates": [
              "avg((`test`.`edgeLatency`))",
              "count(*)"
            ],
            "group_keys": []
          },
          {
            "#operator": "FinalGroup",
            "aggregates": [
              "avg((`test`.`edgeLatency`))",
              "count(*)"
            ],
            "group_keys": []
          },
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "InitialProject",
                  "result_terms": [
                    {
                      "expr": "count(*)"
                    },
                    {
                      "expr": "avg((`test`.`edgeLatency`))"
                    }
                  ]
                },
                {
                  "#operator": "FinalProject"
                }
              ]
            }
          }
        ]
      },
      "text": "SELECT COUNT(*), AVG(edgeLatency) FROM test WHERE timestampUtc BETWEEN '2016-11-22T01:00:00' AND '2016-11-23T06:00:00'"
    }
  ],
  "status": 0,
  "errors": [],
  "warnings": [],
  "metrics": {
    "elapsedTime": "5.9998ms",
    "executionTime": "4.9983ms",
    "resultCount": 1,
    "resultSize": 3861,
    "mutationCount": 0,
    "errorCount": 0,
    "warningCount": 0,
    "sortCount": 0
  }
}
  • I have set max_parallelism=8 on my query (see the sketch after this list). This makes the query faster, but I get no further improvement when max_parallelism is greater than 8.
  • My Couchbase cluster has 12 nodes; each node is a DS13_v2 Azure VM. My bucket is the only bucket on the cluster apart from an empty default bucket, and it has a 60GB RAM quota.
    |Size|CPU cores|Memory: GiB|Local SSD: GiB|Max data disks|Max cached disk throughput: IOPS / MBps (cache size in GiB)|Max uncached disk throughput: IOPS / MBps|Max NICs / Network bandwidth|
    |------|------|------|------|------|------|------|------|
    |Standard_DS13_v2|8|56|112|16|32,000 / 256 (288)|25,600 / 384|8 / high|
  • My cluster runs Couchbase Server 4.0.0-4051 Community Edition (build-4051). I have tried 4.1 CE and 4.5.1 EE on a single machine with the same performance. With memory-optimized indexes (MOI) in 4.5.1 EE the query is faster, but it still takes more than 1 minute.
  • A covering index doesn't fit my scenario.
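
For reference, here is how I apply the parallelism setting; a sketch in cbq shell syntax from newer releases (the equivalent REST query parameter is max_parallelism):

\SET -max_parallelism 8;
SELECT COUNT(*), AVG(edgeLatency) FROM test WHERE timestampUtc BETWEEN '2016-11-22T01:00:00' AND '2016-11-23T06:00:00';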

#2

Why is it slow? I think the document fetch takes most of the time; you can improve the query with a covering index.
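
A quick way to confirm where the time goes, as a sketch reusing your existing timestampUtc_index: compare a statement the index alone can answer with one that must fetch documents; the difference is mostly Fetch cost (covering also requires a server version that supports it).

-- Covered by timestampUtc_index alone: no document fetch.
SELECT COUNT(timestampUtc) FROM test WHERE timestampUtc BETWEEN '2016-11-22T01:00:00' AND '2016-11-23T06:00:00';

-- Needs edgeLatency from the document body: IndexScan + Fetch.
SELECT AVG(edgeLatency) FROM test WHERE timestampUtc BETWEEN '2016-11-22T01:00:00' AND '2016-11-23T06:00:00';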


#3

Thanks for looking at this topic.

I need to aggregate over many fields of my documents, and it seems odd to put most fields into an index. I have actually tried a covering index by modifying my query, but it is still slower than MongoDB:

  • SELECT MIN(timestampUtc) takes 13ms
  • SELECT MAX(timestampUtc) takes 45s
  • SELECT COUNT(timestampUtc) takes 1.4s
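
(The statements were shaped roughly like this sketch; the IS NOT MISSING predicate is assumed here so the index qualifies. The MIN/MAX gap may be because an ascending GSI can answer MIN from its first entry but has to walk the whole index for MAX.)

SELECT MIN(timestampUtc) FROM test WHERE timestampUtc IS NOT MISSING;   -- ~13ms
SELECT MAX(timestampUtc) FROM test WHERE timestampUtc IS NOT MISSING;   -- ~45s
SELECT COUNT(timestampUtc) FROM test WHERE timestampUtc IS NOT MISSING; -- ~1.4s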

May I know what query performance you see when you run N1QL queries?


#4

Remove your index and create this definition instead:

CREATE INDEX timestampUtc_edgeLatency_index ON test(`timestampUtc`,`edgeLatency`) USING GSI

and try:

SELECT COUNT(1), AVG(edgeLatency) FROM test WHERE timestampUtc BETWEEN '2016-11-22T01:00:00' AND '2016-11-23T06:00:00'
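
If the new index is picked up, EXPLAIN should show the IndexScan with a "covers" section and no Fetch operator, meaning the query is answered from the index alone (a sketch; covering support depends on the server version):

EXPLAIN SELECT COUNT(1), AVG(edgeLatency) FROM test WHERE timestampUtc BETWEEN '2016-11-22T01:00:00' AND '2016-11-23T06:00:00';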

#5

Much appreciated for the advice! Besides a covering index, is there any other way to improve performance?


#6

You should analyze the query log to find which part of the query takes the most time, then improve that part: for example, use the latest Couchbase Server, add more query service nodes, add more index service nodes, or redesign the document structure.
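
For example, a sketch assuming Couchbase Server 4.5+, where the query service records statements slower than a threshold (1 second by default) in a system keyspace:

-- Slowest recorded requests first.
SELECT statement, elapsedTime FROM system:completed_requests ORDER BY STR_TO_DURATION(elapsedTime) DESC LIMIT 10;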


#7

Have you thought about splitting the index between nodes, using whatever field makes sense? Say a-k on node A and l-z on node B.
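
A sketch of that idea using partial indexes pinned to specific index nodes (the node names and the split date are placeholders):

-- First half of the key range on one index node.
CREATE INDEX ts_idx_part1 ON test(`timestampUtc`,`edgeLatency`) WHERE timestampUtc < '2016-11-22T18:00:00' USING GSI WITH {"nodes": ["index-node-a:8091"]};

-- Second half on another node. Note that N1QL uses a partial index only
-- when the query predicate implies the index's WHERE clause, so a range
-- spanning both halves would need a UNION of two queries.
CREATE INDEX ts_idx_part2 ON test(`timestampUtc`,`edgeLatency`) WHERE timestampUtc >= '2016-11-22T18:00:00' USING GSI WITH {"nodes": ["index-node-b:8091"]};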


#8

How can we achieve this? Could you please let me know? I am facing the same issue. Thanks.


#9

What is the issue? Please post the query and index as a separate post.