Query Optimization

java
n1ql
query

#1

Hi guys,

I am new to the Couchbase and (N1ql) world, and I have a query which I don’t really know how to optimize:

SELECT DISTINCT `subject` AS distinct_subject, MIN(`lastModification`) AS modified, `subject` AS subject FROM block WHERE 1 = 1
-- and subject IN [...]
-- and type IN [...]
-- and lastModification >= <A-DATE>
GROUP BY `subject` ORDER BY `modified` ASC LIMIT 10 OFFSET 0

This query is generated dynamically based on a set of filters the user provides. These parameters are “subject”, “type” and “lastModification”.

I have some indexes for subject, type and lastModification which of course improves the query response when these parameters are present. The problem is when there are no filters present.

An explain for this query (without filters) returns the following response:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "PrimaryScan",
              "index": "block_idx",
              "keyspace": "block",
              "namespace": "default",
              "using": "gsi"
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Fetch",
                    "keyspace": "block",
                    "namespace": "default"
                  },
                  {
                    "#operator": "Filter",
                    "condition": "(1 = 1)"
                  },
                  {
                    "#operator": "InitialGroup",
                    "aggregates": [
                      "min((`block`.`lastModification`))"
                    ],
                    "group_keys": [
                      "(`block`.`subject`)"
                    ]
                  }
                ]
              }
            },
            {
              "#operator": "IntermediateGroup",
              "aggregates": [
                "min((`block`.`lastModification`))"
              ],
              "group_keys": [
                "(`block`.`subject`)"
              ]
            },
            {
              "#operator": "FinalGroup",
              "aggregates": [
                "min((`block`.`lastModification`))"
              ],
              "group_keys": [
                "(`block`.`subject`)"
              ]
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "InitialProject",
                    "distinct": true,
                    "result_terms": [
                      {
                        "as": "distinct_subject",
                        "expr": "(`block`.`subject`)"
                      },
                      {
                        "as": "modified",
                        "expr": "min((`block`.`lastModification`))"
                      },
                      {
                        "as": "subject",
                        "expr": "(`block`.`subject`)"
                      }
                    ]
                  },
                  {
                    "#operator": "Distinct"
                  }
                ]
              }
            },
            {
              "#operator": "Distinct"
            }
          ]
        },
        {
          "#operator": "Order",
          "limit": "10",
          "offset": "0",
          "sort_terms": [
            {
              "expr": "`modified`"
            }
          ]
        },
        {
          "#operator": "Offset",
          "expr": "0"
        },
        {
          "#operator": "Limit",
          "expr": "10"
        },
        {
          "#operator": "FinalProject"
        }
      ]
    },
    "text": "SELECT DISTINCT `subject` AS distinct_subject, MIN(`lastModification`) AS modified, `subject` AS subject FROM block WHERE 1 = 1\nGROUP BY `subject` ORDER BY `modified` ASC LIMIT 10 OFFSET 0"
  }
]

My question is, is there any way to re-write the query in order to avoid a PrimaryScan? Any hint that I can use?

Thanks!.
M.-


#2

Hi!

Can you clarify what you mean by “…when there are no filters present”? Do you mean that you don’t include “and subject in […]” or other clauses in the query at all? So that the query just looks like:

SELECT DISTINCT subject AS distinct_subject, MIN(lastModification) AS modified, subject AS subject FROM block WHERE 1 = 1 GROUP BY subject ORDER BY modified ASC LIMIT 10 OFFSET 0

A related question: describe the results that you want in this case. If there is no filter, then are you asking for the entire data set, grouped by ‘subject’? If so, then perhaps an index on ‘subject’ would help.


#3

Which index would you like to use on this query?

Assuming you have an index on lastModification and that field exists in all of the documents you want to process for the query, you can do this:

SELECT DISTINCT subject AS distinct_subject, MIN(lastModification) AS modified, subject AS subject FROM block WHERE lastModification IS NOT MISSING
– and subject IN […]
– and type IN […]
– and lastModification >=
GROUP BY subject ORDER BY modified ASC LIMIT 10 OFFSET 0;

To force a particular index, add USE INDEX clause.


#4

Hi @eben/@keshav_m thanks for your replies.

@eben Yes! that is exactly what I meant. The query seems to be slow cause I don’t have filters and I tried the index by subject but doesn’t seems to change the response time:

SELECT DISTINCT subject AS distinct_subject, MIN(lastModification) AS modified, subject AS subject FROM block WHERE 1 = 1
GROUP BY subject ORDER BY modified ASC LIMIT 10 OFFSET 0

That query takes around ~21sec and we have 16K documents in the bucket. I am expecting to migrate around 350k documents to couchbase.

The result is something like:

[
  {
    "distinct_subject": "http://some-uri",
    "modified": 1371361298031,
    "subject": "http://some-uri"
  },
  ...
]

The structure of the document is:

{
  "creator": "system",
  "subject": "http://some-uri",
  "lastModification": 1373287406103,
  "_class": "com.mmonti.Block",
  "type": "http://some-uri",
  "body": {
     ...
  },
  "creationDate": 1471645336960
}

where “body” field can hold any JSON Structure.

I have created an index by subject, other for type, and other for lastModification.

  • CREATE INDEX i_type ON block(type)
  • CREATE INDEX i_subject ON block(subject)
  • CREATE INDEX i_lastModification ON block(lastModification)

@keshav_m

I tried what you recommended and even though the explain for that query changed, I am not seeing any improvement in the results. It still takes ~21sec to resolve that query.

EXPLAIN SELECT DISTINCT subject AS distinct_subject, MIN(lastModification) AS modified, subject AS subject FROM block USE INDEX (i_lastModification USING GSI)
WHERE 1 = 1
and lastModification IS NOT MISSING
GROUP BY subject ORDER BY modified ASC LIMIT 10 OFFSET 10

The explain shows:

[
  {
"plan": {
  "#operator": "Sequence",
  "~children": [
    {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "IndexScan",
          "index": "i_lastModification",
          "index_id": "26b3afd85df36450",
          "keyspace": "block",
          "namespace": "default",
          "spans": [
            {
              "Range": {
                "Inclusion": 1,
                "Low": [
                  "null"
                ]
              }
            }
          ],
          "using": "gsi"
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Fetch",
                "keyspace": "block",
                "namespace": "default"
              },
              {
                "#operator": "Filter",
                "condition": "((1 = 1) and ((`block`.`lastModification`) is not missing))"
              },
              {
                "#operator": "InitialGroup",
                "aggregates": [
                  "min((`block`.`lastModification`))"
                ],
                "group_keys": [
                  "(`block`.`subject`)"
                ]
              }
            ]
          }
        },
        {
          "#operator": "IntermediateGroup",
          "aggregates": [
            "min((`block`.`lastModification`))"
          ],
          "group_keys": [
            "(`block`.`subject`)"
          ]
        },
        {
          "#operator": "FinalGroup",
          "aggregates": [
            "min((`block`.`lastModification`))"
          ],
          "group_keys": [
            "(`block`.`subject`)"
          ]
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "InitialProject",
                "distinct": true,
                "result_terms": [
                  {
                    "as": "distinct_subject",
                    "expr": "(`block`.`subject`)"
                  },
                  {
                    "as": "modified",
                    "expr": "min((`block`.`lastModification`))"
                  },
                  {
                    "as": "subject",
                    "expr": "(`block`.`subject`)"
                  }
                ]
              },
              {
                "#operator": "Distinct"
              }
            ]
          }
        },
        {
          "#operator": "Distinct"
        }
      ]
    },
    {
      "#operator": "Order",
      "limit": "10",
      "offset": "10",
      "sort_terms": [
        {
          "expr": "`modified`"
        }
      ]
    },
    {
      "#operator": "Offset",
      "expr": "10"
    },
    {
      "#operator": "Limit",
      "expr": "10"
    },
    {
      "#operator": "FinalProject"
    }
  ]
},
"text": "SELECT DISTINCT subject AS distinct_subject, MIN(lastModification) AS modified, subject AS subject FROM block USE INDEX (i_lastModification USING GSI)\nWHERE 1 = 1\nand lastModification IS NOT MISSING\nGROUP BY subject ORDER BY modified ASC LIMIT 10 OFFSET 10"
  }
]

Any other tips that I can try?
Thanks guys!
M.-


#5
  1. CREATE INDEX i1 on block(subject, lastModification);
  2. Query: (remove distinct. You’re already doing a GROUP BY subject. You’ll get distinct rows anyway.
    Remove the second subject as subject. It’s the something as distinct_subject.
    SELECT subject AS distinct_subject, MIN(lastModification) AS modified
    FROM block USE INDEX (i1 USING GSI)
    WHERE subject IS NOT MISSING
    GROUP BY subject ORDER BY modified ASC LIMIT 10 OFFSET 10
  3. This should give your covering index plan.
  4. You’re essentially doing the complete document scan.
    What’s the total number of documents in block (your bucket)?

#6

Thanks @keshav_m! That improved the response time a lot!. From 21sec to 2.5sec
I am testing with 16K docs in the bucket. Our plan is to migrate first 300K docs and finally test with around 3M docs
M.-