Slow index join

foobar123 · December 18, 2016, 12:46pm

Hello,

I am trying to speed up an index join, but I’m unfortunately unable to achieve acceptable speed

My situation is as follows:
Every user has two documents that both are in the same bucket (“data”):

a data doc (id=‘ud/THEUSERID/data’) that has a year and a lastLogin property
a user data doc (id=‘ud/THEUSERID/uData’) that has a sex property

Now I want to do a join on those two documents for each user and filter by their properties:

SELECT META(d).id

FROM  `data` d
JOIN `data` ud ON KEY REPLACE(META(ud).id, 'uData', 'data')  FOR d

WHERE d.type = "data"
AND d.year BETWEEN 0 AND 2000

AND ud.sex = 1

ORDER BY d.lastLogin
LIMIT 100

I created two indices for speeding this query up:

CREATE INDEX `type_year_lastLogin` ON `data`(`type`,`year`,`lastLogin`) using GSI
CREATE INDEX `id_and_sex` ON `data`(replace((meta(self).`id`), "uData", "data"),`sex`) using GSI

However, the query is still quite slow (more than a second for a few thousand documents whereas the query without JOIN runs in about 50ms).
This is what EXPLAIN looks like:

 [
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "IndexScan",
              "covers": [
                "cover ((`d`.`type`))",
                "cover ((`d`.`year`))",
                "cover ((`d`.`lastLogin`))",
                "cover ((meta(`d`).`id`))"
              ],
              "index": "type_year_lastLogin",
              "index_id": "cfa9e2f340ecde40",
              "keyspace": "data",
              "namespace": "default",
              "spans": [
                {
                  "Range": {
                    "High": [
                      "\"data\"",
                      "successor(2000)"
                    ],
                    "Inclusion": 1,
                    "Low": [
                      "\"data\"",
                      "0"
                    ]
                  }
                }
              ],
              "using": "gsi"
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "IndexJoin",
                    "as": "ud",
                    "for": "d",
                    "keyspace": "data",
                    "namespace": "default",
                    "on_key": "replace((meta(`ud`).`id`), \"uData\", \"data\")",
                    "scan": {
                      "covers": [
                        "cover (replace((meta(`ud`).`id`), \"uData\", \"data\"))",
                        "cover ((`ud`.`sex`))"
                      ],
                      "index": "id_and_sex",
                      "index_id": "d8bf9bddb553643c",
                      "using": "gsi"
                    }
                  },
                  {
                    "#operator": "Filter",
                    "condition": "(((cover ((`d`.`type`)) = \"data\") and (cover ((`d`.`year`)) between 0 and 2000)) and ((`ud`.`sex`) = 1))"
                  },
                  {
                    "#operator": "InitialProject",
                    "result_terms": [
                      {
                        "expr": "cover ((meta(`d`).`id`))"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "#operator": "Order",
          "limit": "100",
          "sort_terms": [
            {
              "expr": "cover ((`d`.`lastLogin`))"
            }
          ]
        },
        {
          "#operator": "Limit",
          "expr": "100"
        },
        {
          "#operator": "FinalProject"
        }
      ]
    },
    "text": "\nSELECT META(d).id\n\nFROM  `data` d\nJOIN `data` ud ON KEY REPLACE(META(ud).id, 'uData', 'data')  FOR d\n\nWHERE d.type = \"data\"\nAND d.year BETWEEN 0 AND 2000\n\nAND ud.sex = 1\n\nORDER BY d.lastLogin\nLIMIT 100"
  }
]

Apparently, both indices are used, but the query is slow nonetheless. Apparently I’m doing something wrong, but I can’t find a way to make it fast. Do you have an idea how I could achieve acceptable speed for this query?

Thank you in advance!

atom_yang · December 18, 2016, 3:30pm

try this:
change this index

to

CREATE INDEX `id_and_sex` ON `data`(replace((meta(self).`id`), "uData", "data"),`sex`) WHERE sex IS NOT MISSING using GSI

foobar123 · December 18, 2016, 5:44pm

thank you for your answer! I tried it out and it is a little bit faster now (900ms), but still slower by a factor of about 20 compared to the query without JOIN, which seems unusual to me for a few thousand docs. Is there another way of improving this query?

Explain looks like this:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "IndexScan",
              "covers": [
                "cover ((`d`.`year`))",
                "cover ((`d`.`lastLogin`))",
                "cover ((meta(`d`).`id`))"
              ],
              "filter_covers": {
                "cover ((`d`.`type`))": "data"
              },
              "index": "dataSearch",
              "index_id": "909b8d17e5123752",
              "keyspace": "data",
              "namespace": "default",
              "spans": [
                {
                  "Range": {
                    "High": [
                      "successor(2000)"
                    ],
                    "Inclusion": 1,
                    "Low": [
                      "0"
                    ]
                  }
                }
              ],
              "using": "gsi"
            },
            {
              "#operator": "Parallel",
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "IndexJoin",
                    "as": "ud",
                    "for": "d",
                    "keyspace": "data",
                    "namespace": "default",
                    "on_key": "replace((meta(`ud`).`id`), \"uData\", \"data\")",
                    "scan": {
                      "covers": [
                        "cover (replace((meta(`ud`).`id`), \"uData\", \"data\"))",
                        "cover ((`ud`.`sex`))"
                      ],
                      "index": "id_and_sex",
                      "index_id": "f74e90c241533dd5",
                      "using": "gsi"
                    }
                  },
                  {
                    "#operator": "Filter",
                    "condition": "(((cover ((`d`.`type`)) = \"data\") and (cover ((`d`.`year`)) between 0 and 2000)) and ((`ud`.`sex`) = 1))"
                  },
                  {
                    "#operator": "InitialProject",
                    "result_terms": [
                      {
                        "expr": "cover ((meta(`d`).`id`))"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "#operator": "Order",
          "limit": "100",
          "sort_terms": [
            {
              "expr": "cover ((`d`.`lastLogin`))"
            }
          ]
        },
        {
          "#operator": "Limit",
          "expr": "100"
        },
        {
          "#operator": "FinalProject"
        }
      ]
    },
    "text": "\nSELECT META(d).id\n\nFROM  `data` d\nJOIN `data` ud ON KEY REPLACE(META(ud).id, 'uData', 'data')  FOR d\n\nWHERE d.type = \"data\"\nAND d.year BETWEEN 0 AND 2000\n\nAND ud.sex = 1\n\nORDER BY d.lastLogin\nLIMIT 100"
  }
]

atom_yang · December 19, 2016, 3:13am

how about this one:

CREATE INDEX `year_lastLogin` ON `data`(`year`,`lastLogin`) WHERE type ="data" using GSI
CREATE INDEX `id_and_sex` ON `data`(replace((meta(self).`id`), "uData", "data"),`sex`) WHERE sex IS NOT MISSING using GSI

foobar123 · December 19, 2016, 9:06pm

thanks again, but still, the query time is not improved, close to 1 second I noticed that even doing LIMIT 1 does not improve the query speed. Is there some way to improve it further, or am I naive expecting much better performance? Basically, the join should be the same performing two single queries and finding the intersection afterwards, which should be quite easy with a “LIMIT 1”, or am I mistaken?

geraldss · December 19, 2016, 11:04pm

Hi @vsr1, any suggestions?

vsr1 · December 19, 2016, 11:49pm

Due to ORDER BY clause query needs to sort all qualified items irrespective of LIMIT value.

It looks like you have similar document keys. You can try regular JOIN instead of INDEX JOIN and see if that helps.

CREATE INDEX `year_lastLogin` ON `data`(`year`,`lastLogin`) WHERE type ="data" using GSI;
SELECT META(d).id FROM  `data` d JOIN `data` ud ON KEYS REPLACE(META(d).id, 'data', 'uDdata')  WHERE d.type = "data" AND d.year BETWEEN 0 AND 2000 AND ud.sex = 1 ORDER BY d.lastLogin LIMIT 100;