N1QL Join to array within a document


#1

I have (what I thought) would be a relatively simple N1QL query that is turning out to be quite difficult. I have a single document that contains a relatively small list of codes with multiple attributes as a nested array:

{
  "version": 9,
  "docId": "codes-version-9",
  "codes": [
    {
      "code": "001",
      "type": "P",
      "title": "SYSTEM W MCC",
      "weight": 26.2466
    },
    {
      "code": "166",
      "type": "P",
      "title": "SYSTEM W/O MCC",
      "weight": 14.6448
    }
  ]
 }

Then I have a list of “accounts” with associated codes. The idea being that when we update the type, title, weight of the codes, we want these updates to be reflected across all accounts:

{
  "accountNumber": 123456789,
  "docId": "account-customerXYZ-123456789",
  "code": "166"
}

So we want to JOIN the “accounts” document against the “codes” document, such that the result would be:

{
  "accountNumber": 123456789,
  "docId": "account-customerXYZ-123456789",
  "code": "166"
  "type": "P",
  "title": "SYSTEM W/O MCC",
  "weight": 14.6448
}

This would be pretty easy if the codes were stored as keys (e.g. codes-001, codes-166) but unfortunately that is not the way this particular piece of data was created. Is it possible to JOIN across the elements of the array in this fashion? I did a ton of reading on the CB 4.5 documentation and can’t find any examples where they’re joining to individual elements within a nested array.


#2

You can do the following.

SELECT accounts.*, codes.*
FROM accounts USE KEYS "account-customerXYZ-123456789"
JOIN codes_doc ON KEYS "codes-version-9"
UNNEST codes_doc.codes
WHERE accounts.code = codes.code;

#3

Thank you geraldss. However, I should have provided more info on the account side. While there’s just one “codes” document, there are literally millions of account-customer-###### documents (these represent an individual account within each of our tenants). Is there a way to write the join such that it matches on any document with a matching “code”: property?

  • Jeff

----- Reply message -----


#4

You can use any filter on the accounts. Here is an example that filters on accounts.code. If you want to visit all accounts, you can remove the WHERE clause. Of course, the more selective the query, the faster it will run.

CREATE INDEX idx_acct_code ON accounts( code );

SELECT accounts.*, code.type, code.title, code.weight
FROM accounts
JOIN codes_doc ON KEYS "codes-version-9"
LET code = FIRST c FOR c IN codes_doc WHEN c.code = accounts.code END
WHERE accounts.code = "166";

#5

Thanks Geraldss, this is great stuff! I’m working to apply this new syntax to your Production data, I should have some results soon.

One other question for you: I’ve taken the paid-for CD210 SQL for Documents class, and done significant reading of your online documentation, and the details are thin (at best) for more advanced topics like JOIN to data within nested arrays. I also had a difficult time sourcing details on the LET / FIRST / FOR / IN / WHEN structure you used.

Do you have links with more detailed documentation on how to use this syntax?


#6

Thank you @jeffhoward001.

I am adding our product manager @prasad to this thread, to discuss documentation.


#7

I’m seeing some really slow behavior out of my final query. The following query appears to take about 2 mins for every day I include in the date range, which doesn’t make sense considering we have at most 150 account documents per day.

My EXPLAIN plan says it’s a index scan on account.transDate, so we assumed this would be a relatively quick query. All it has to do is find the appropriate dates using the index, then join on the codes to find & calc the average weight.

I included a copy of the EXPLAIN output below the query. Any ideas why this is so slow?

SELECT SUBSTR(account.transDate,0,10) As transDate, AVG(Codes.weight) As averageWeight
FROM acctbucket account
JOIN acctbucket codesDoc ON KEYS "codes-version-9"
LET Codes = FIRST c for c IN codesDoc.codes WHEN c.code = account.code END
WHERE account.code != "" AND meta().id LIKE "account-customer_xyz%"
AND SUBSTR(account.transDate,0,10) >= "2016-07-01" AND SUBSTR(account.transDate,0,10) < "2016-07-16"
GROUP BY SUBSTR(account.transDate,0,10);

Here’s the EXPLAIN output:

#operator Sequence
~children
#operator IndexScan
index idx_account_transDate
index_id 8b6c096ee88f89b3
keyspace acctbucket
namespace default
spans
Range
High
"2016-07-17"
Inclusion 1
Low
"2016-07-01"
using gsi
#operator Parallel
~child
#operator Sequence
~children
#operator Fetch
as account
keyspace acctbucket
namespace default
#operator Join
as codesDoc
keyspace acctbucket
namespace default
on_keys “codes-version-9”
#operator Let
bindings
expr first c for c in (codesDoc.codes) when ((c.code) = (account.code)) end
var codes
#operator Filter
condition ((((not ((account.code) = “”)) and ((meta().id) like “account-customer_xyz%”)) and (“2016-07-01” <= substr((account.transDate), 0, 10))) and (substr((account.transDate), 0, 10) < “2016-07-17”))
#operator InitialGroup
aggregates
count((account.transDate))
group_keys
substr((account.transDate), 0, 10)
#operator IntermediateGroup
aggregates
count((account.transDate))
group_keys
substr((account.transDate), 0, 10)
#operator FinalGroup
aggregates
count((account.transDate))
group_keys
substr((account.transDate), 0, 10)
#operator Parallel
~child
#operator Sequence
~children
#operator InitialProject
result_terms
as transDate
expr substr((account.transDate), 0, 10)
as accountsPerDay
expr count((account.transDate))
#operator FinalProject
text SELECT SUBSTR(account.transDate,0,10) As transDate, COUNT(account.transDate) As accountsPerDay FROM acctbucket account JOIN acctbucket codesDoc ON KEYS “codes-version-9” LET codes = FIRST c for c IN codesDoc.codes WHEN c.code = account.code END WHERE account.code != “” AND meta().id LIKE “account-customer_xyz%” AND SUBSTR(account.transDate,0,10) >= “2016-07-01” AND SUBSTR(account.transDate,0,10) < “2016-07-17” GROUP BY SUBSTR(account.transDate,0,10)


#8

Here is a covering indexing and the corresponding query. Note the argument to META() in the query.

CREATE INDEX idx_covering ON acctbucket( SUBSTR(transDate,0,10), code )
WHERE code != "" AND META().id LIKE "account-customer_xyz%";

SELECT SUBSTR(account.transDate,0,10) As transDate, AVG(Codes.weight) As averageWeight
FROM acctbucket account
JOIN acctbucket codesDoc ON KEYS "codes-version-9"
LET Codes = FIRST c for c IN codesDoc.codes WHEN c.code = account.code END
WHERE account.code != "" AND META(account).id LIKE "account-customer_xyz%"
AND SUBSTR(account.transDate,0,10) >= "2016-07-01" AND SUBSTR(account.transDate,0,10) < "2016-07-16"
GROUP BY SUBSTR(account.transDate,0,10);

#9

Thanks again geraldss, that appears to be helping with the speed, and EXPLAIN is reporting a proper covering index, however somehow that broke the JOIN to the code-version-9 document.

It’s properly reporting transDate, but it appears that no data coming back from the Codes.weight reference. I get NULL for the AVG(Codes.weight) output. This was previously working with the old index that only hit on code and transDate.


#10

Can you post your exact index and query? Could be a typo somewhere.


#11

Ok, here’s the index and query syntax. I did some additional testing, and it’s almost as if the LET statement is being ignored when the covering index is used:

CREATE INDEX idx_account_customer_xyz_transDate
ON acctbucket(SUBSTR(transDate,0,10),code)
WHERE code != "" AND meta().id LIKE "account-customer_xyz%"

SELECT SUBSTR(account.transDate,0,10) As transDate, AVG(codes.weight) As avgWeight
FROM acctbucket account
JOIN acctbucket codesDoc ON KEYS "codes-version-9"
LET codes = FIRST c for c IN codesDoc.codes WHEN c.msDrg = account.code END
WHERE account.code != "" AND meta(account).id LIKE "account-customer_xyz%"
AND SUBSTR(account.transDate,0,10) >= "2016-07-01" AND SUBSTR(account.transDate,0,10) < "2016-07-03"
GROUP BY SUBSTR(account.transDate,0,10)

======= RESULT =======
[
{
“avgWeight”: null,
“transDate”: “2016-07-01”
},
{
“avgWeight”: null,
“transDate”: “2016-07-02”
}
]


#12

Thanks for the pointer. We are checking to see if there is a bug or issue with covering indexes and LET. Here is the query without the LET. Can you try this out.

CREATE INDEX idx_account_customer_xyz_transDate
ON acctbucket(SUBSTR(transDate,0,10),code)
WHERE code != "" AND meta().id LIKE "account-customer_xyz%";

SELECT SUBSTR(account.transDate,0,10) As transDate, AVG(FIRST c.weight for c IN codesDoc.codes WHEN c.msDrg = account.code END) As avgWeight
FROM acctbucket account
JOIN acctbucket codesDoc ON KEYS "codes-version-9"
WHERE account.code != "" AND meta(account).id LIKE "account-customer_xyz%"
AND SUBSTR(account.transDate,0,10) >= "2016-07-01" AND SUBSTR(account.transDate,0,10) < "2016-07-03"
GROUP BY SUBSTR(account.transDate,0,10);

#13

It turns out you found a bug. Not with LET, but with FIRST and similar operators.

JIRA ticket at https://issues.couchbase.com/browse/MB-20492

Thanks,
Gerald


#14

Wow, that makes me feel better. We thought we were losing out marbles over here :slight_smile: Is the new query you wrote with the FOR statement in the SELECT block still susceptible the bug?

Also, the new query is a little confusing in that the FIRST / FOR / IN / WHEN clause doesn’t make reference to the codesDoc.weight property, so how will it know what properties in the result set to average?


#15

Oops, I fixed my post to include weight.

The new query is susceptible to the bug, because the bug is not in LET. It is fixed now, will ship in 4.5.1.


#16

Ok, that change makes sense (and also sheds a little more light on the FIRST / FOR / IN / WHEN syntax). Do you have a rough idea of when 4.5.1 will ship?


#17

4.5.1 should GA around end of September, but there should be a Beta build available before that.


#18

Thanks geraldss, one last question on this. After reviewing our document models, we make heavy use of nested documents in arrays that pointed out the bug you found, example:

Document Key: financial-customer_xyz-19800001
{
  "acctFinancials": [
{
  "acctNumber": "19800001",
  "transactionId": "164126601",
  "transactionType": "Credit Adjustment",
  "financialCode": "3010",
  "extended": {
    "service_date": "09/15/2015 00:00:00"
  }
},
{
  "acctNumber": "19800001",
  "transactionId": "162555860",
  "transactionType": "Credit Adjustment",
  "financialCode": "3010",
  "extended": {
    "service_date": "09/15/2015 00:00:00"
  }
}
  ],
  "tenantName": "customer_xyz",
  "acctNumber": "19800001",
  "schemaVer": "1.1.0.0"
}

As you can see, “acctNumber” is our unique ID, and our “main” documents are of course the “account” document, e.g. keys named account-customer_xyz-########.

If our end goal is to JOIN from the account- documents, to other nested documents (like the example above) what’s the best way to index these operations? Is there a way to join on a partial portion of the key since we’ve embedded the account number in all documents, or will we need to index the actual “acctNumber” attribute?


#19

Hi @jeffhoward001,

If you post an example of the two documents you want to JOIN, along with their respective keys, and an example of the desired result, we can take a look. You can JOIN or NEST using computed expressions, so using a portion of a key or other string should be fine. But an example would be helpful to confirm.


#20

If the account number is unique and other portions of the key is the document type string (“account_customer_xyz” or something else), you can construct that at runtime.

Simple example:

SELECT *
FROM child c  INNER JOIN parent p
             ON KEYS ("parent-doc::" || substr(c.id, 8, 5) || "::subset(c.pid, 10,5));