Optimization of N1QL query with LEFT JOIN

I have a data structure where I have “split” a data entity in two and moved the private information to a sub document. This is primarily to control access to the private information in an easier fashion. I have e.g. this “Catch” document

Key: Catch:05A0DC4A6D807B74C12583B600509432

{
  "assockey": "2",
  "baitkey": "10",
  "count": 1,
  "date": "2019-03-07T12:00:00+0100",
  "fishingtripkey": "594BFAE2E65CAB72C12583B600509433",
  "injury": "0",
  "ispublic": false,
  "key": "05A0DC4A6D807B74C12583B600509432",
  "length": 45,
  "markednow": false,
  "maturity": "1",
  "revisioninfo": {
    "created": "2019-03-07T15:40:08+0100",
    "createdby": "Lars Larsen/9FC75FEBD076ACD1C1257FFD0065D338/Fangst",
    "modifiedcount": 0
  },
  "sex": "1",
  "specieskey": "13",
  "takenhome": 0,
  "time": "14:00",
  "type": "Catch",
  "undersized": 0,
  "userkey": "9FC75FEBD076ACD1C1257FFD0065D338"
}

And this private document
Key: Catch:Private:05A0DC4A6D807B74C12583B600509432

{
  "key": "05A0DC4A6D807B74C12583B600509432",
  "parenttype": "Catch",
  "personemail": "lars.larsen@xyz.com",
  "personname": "Lars Larsen ",
  "type": "Private",
  "userkey": "9FC75FEBD076ACD1C1257FFD0065D338"
}

I have this N1QL query to return a combined result where all of the catch info is in the same result set:

SELECT t2.*, t1.* FROM data AS t1 
LEFT JOIN data AS t2 ON 'Catch:Private:' || t1.`key` = META(t2).id AND t2.type='Private' 
WHERE t1.type='Catch' AND (t1.assockey='2')

This query finds 12 documents (from a total of 33-35,000 docs of that type) with the specified assockey. The challenge is that this takes 5-6 seconds - which is quite a lot more than I would expect.
I have this output from “Explain” - but I’m not really good at understanding how to transform that into specific optimizations…

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan3",
        "as": "t1",
        "index": "def_type_2",
        "index_id": "a95f91171288dce7",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "data",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"Catch\"",
                "inclusion": 3,
                "low": "\"Catch\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Fetch",
        "as": "t1",
        "keyspace": "data",
        "namespace": "default"
      },
      {
        "#operator": "Join",
        "as": "t2",
        "keyspace": "data",
        "namespace": "default",
        "on_keys": "(\"Catch:Private:\" || (`t1`.`key`))",
        "outer": true
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "(((`t1`.`type`) = \"Catch\") and ((`t1`.`assockey`) = \"2\"))"
            },
            {
              "#operator": "InitialProject",
              "result_terms": [
                {
                  "expr": "`t2`",
                  "star": true
                },
                {
                  "expr": "`t1`",
                  "star": true
                }
              ]
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      }
    ]
  },
  "text": "SELECT t2.*, t1.* FROM data AS t1 \nLEFT JOIN data AS t2 ON 'Catch:Private:' || t1.`key` = META(t2).id AND t2.type='Private' \nWHERE t1.type='Catch' AND (t1.assockey='2')"
}

Any ideas and suggestions are highly appriciated!

Ok, this is what I call the “radiator model” - try to explain your problem (to a radiator) then you’ll most likely find the solution…

So for completeness let me explain what I did:

First I tried to see what the response time was for a simple query without the join:

SELECT t1.* FROM data AS t1 
WHERE t1.type='Catch' AND (t1.assockey='2')

And surprise… that took around 3 seconds (~half of the time of the query I want to optimize). But this is a simple query…

So that got me started on thinking about the index. And of course I miss the assockey in an appropriate index. So I tried to create this index:

CREATE INDEX `def_type_assoc_1` ON `data`(`type`,`assockey`) WHERE (`assockey` is valued)

And voila! Now the simple query takes 5-7ms and the combined query 7-8ms!

Thank you for being my radiator! :smile: :blush:

FYI: You no need WHERE (assockey is valued) unless you have lot of values that are not valued.
Best way to do this is

CREATE INDEX `ix_assoc_1` ON `data`(`assockey`) WHERE  type = "Catch";
SELECT t1.* FROM data AS t1 
WHERE t1.type='Catch' AND (t1.assockey='2');

SELECT t1.* FROM data AS t1 
WHERE t1.type='Catch' AND t1.assockey IS VALUED;

Check out Designing Index For Query In Couchbase N1QL for details

Hmmm… I’m a little unsure about your recommendation @vsr1

Should I use

CREATE INDEX `ix_assoc_1` ON `data`(`assockey`) WHERE  type = "Catch";

instead of:

CREATE INDEX `def_type_assoc_1` ON `data`(`type`,`assockey`) WHERE (`assockey` is valued)

In my query I need to ask for the type as well as the assockey - as I have more types with the assockey field.

I would therefore have thought that creating the index on (type,assockey) and only include those docs. where assockey was valued would be optimal?

So when you suggest using assockey is valued in the select I guess it’s only meant for situations where I would want to find all docs. with an assockey set? Otherwise, I’m not sure I understand…

In that case leave the index that you have, that will be optimal

Ok, thanks!

I’m just trying to learn as much as I can on this… So I was kind of proud of my own achievement :smile: - but whatever I can learn in addition is even better :+1:

1 Like