Join on keys and select by key


#1

Hi!

I'm trying to switch to N1QL for joins and wonder if it is possible to join by key and, at the same time, return the join result filtered by a condition on the key.

I have two kinds of documents in the same bucket:

  1. user, with keys like "user:123" (key includes userId)
    {
      "userId": 123,
      "name": "name1",
      "docType": "user"
    }

  2. device, with keys like "device:123:8888" (key includes userId and deviceId)
    {
      "userId": 123,
      "deviceId": "8888",
      "model": "M1",
      "manufacturer": "MNF1",
      "docType": "device"
    }

Total bucket size: 100 mln users, 200 mln devices
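In other words, both kinds of keys are computable from the IDs alone. A quick sketch of the scheme (the helper names are made up, not part of any API):

```python
# Sketch of the key scheme above; helper names are hypothetical.
def user_key(user_id: int) -> str:
    # e.g. user_key(123) -> "user:123"
    return f"user:{user_id}"

def device_key(user_id: int, device_id: str) -> str:
    # e.g. device_key(123, "8888") -> "device:123:8888"
    return f"device:{user_id}:{device_id}"
```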

I want to:

  1. get a user and the list of the user's devices, filtered by userId (the keys can be built from it inside N1QL for the joins)
  2. get a user and a single device, filtered by userId + deviceId (user/device keys again built inside N1QL for the joins)

Is it possible via N1QL at all? Could you please suggest a query?
Is it possible to do without any indexes, using only keys? If it's only possible with indexes, I'd be grateful if you could suggest a query too :slight_smile: However, indexes are something I'd like to avoid; at the very least, an index on device.userId can be added.

Thanks in advance.
Cho


#2

For the first use case you need an index.
For the second use case, if I understood correctly, you can avoid N1QL and fetch the document directly from KV using your SDK instead.
Replace $userid and $deviceid with actual values.

CREATE INDEX ix1 ON default (userId) WHERE docType = "device";
SELECT  u AS user, devices
FROM default AS u USE KEYS "user:" || TOSTRING($userid)
LET devices = (SELECT RAW d
               FROM default AS d
               WHERE d.userId = $userid AND d.docType = "device");

SELECT  u AS user, device
FROM default AS u USE KEYS "user:" || TOSTRING($userid)
LET device = (SELECT RAW d
              FROM default AS d  USE KEYS "device:" || TOSTRING($userid) || ":" || TOSTRING($deviceid)
             )[0];
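To make the KV-only route for the second use case concrete, here is a sketch where a plain dict stands in for the bucket (your SDK's get-by-key call plays the same role):

```python
# Use case 2 without N1QL: the key "device:<userId>:<deviceId>" is fully
# determined by the inputs, so one KV get per document is enough.
bucket = {  # stand-in for the bucket's KV interface
    "user:123": {"userId": 123, "name": "name1", "docType": "user"},
    "device:123:8888": {"userId": 123, "deviceId": "8888", "model": "M1",
                        "manufacturer": "MNF1", "docType": "device"},
}

def get_user_and_device(user_id, device_id):
    user = bucket.get(f"user:{user_id}")
    device = bucket.get(f"device:{user_id}:{device_id}")
    return user, device

user, device = get_user_and_device(123, "8888")
```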

#3

On Couchbase 5.5 (just released), you can join on keys, on any portion of the keys, on attributes, etc.
Here's an example you can start to work with.

See the ANSI join article at: https://blog.couchbase.com/a-guide-to-n1ql-features-in-couchbase-5-5-special-edition/

SELECT user, device
FROM default AS user USE KEYS ["user:123", "user:124"]
INNER JOIN default AS device
    ON META(device).id LIKE "device:" || SUBSTR(META(user).id, 5) || ":%";
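A note on the offsets: N1QL's SUBSTR is 0-based here, and the prefix "user:" is five characters, so SUBSTR(META(user).id, 5) yields the userId portion of the key. The same arithmetic, mirrored in Python slicing:

```python
# The key arithmetic from the join condition, mirrored with 0-based slicing.
user_doc_key = "user:123"
user_id = user_doc_key[5:]                 # len("user:") == 5 -> "123"
device_prefix = "device:" + user_id + ":"  # matches keys like "device:123:8888"
```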

#4

Thanks a lot! With a couple of tweaks I was able to run the query you suggested.
However, performance is not very impressive on my dataset. Maybe I need to play with the cluster settings.


#5

keshav_m, thanks for your suggestion! I've seen the announcement, but I have to wait for our admins to set up a cluster with the new version…
Is there any performance benchmark for N1QL joins on big data sets (like 300 mln documents)?
We are trying to migrate from SQL, and I'm afraid that joins could be slow…


#6

N1QL is mainly designed for operational queries. What matters is not the number of documents in the bucket, but the number of documents that would qualify for the join.

In 5.5, we use nested-loop joins by default. They are suitable for a small number of documents that qualify via index lookup.

In 5.5 EE, we also have hash joins, which can handle a much larger working set. Read the ANSI join article at the link I mentioned.

If you need to join all or most of the 300 million documents, the Analytics service may be the right way. That's in beta now.
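To illustrate the point about qualifying documents (a toy model in Python, not Couchbase internals): a nested-loop join does one inner-side lookup per qualifying outer document, so its cost tracks the qualifying set, not the bucket size.

```python
# Toy nested-loop join over a dict "bucket": work is proportional to the
# documents that qualify, not to everything stored in the bucket.
bucket = {
    "user:123": {"userId": 123, "docType": "user"},
    "user:124": {"userId": 124, "docType": "user"},
    "device:123:8888": {"userId": 123, "deviceId": "8888", "docType": "device"},
    "device:124:9999": {"userId": 124, "deviceId": "9999", "docType": "device"},
}

def nested_loop_join(user_keys):
    rows = []
    for ukey in user_keys:  # outer side: only the requested users
        user = bucket[ukey]
        # inner side: in a real cluster this scan is an index lookup
        devices = [d for d in bucket.values()
                   if d.get("docType") == "device"
                   and d["userId"] == user["userId"]]
        rows.append({"user": user, "devices": devices})
    return rows

rows = nested_loop_join(["user:123", "user:124"])
```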


#7

Thanks again for the quick reply!
No, I don't need to join all 300 mln docs - that's the total number of docs in the bucket (and it's expected to be 500 mln next year).

Per request/query (from a front-end API gateway server, not from background analytics jobs) I want to extract around 10-20 user documents - so it's a batch query, where the keys for the user docs are provided in the query args - each joined with 1-10 other (device) documents from the same bucket. On average that's about 100 documents extracted per query. I know I didn't mention it in the original question, sorry; that's another case I forgot about.

I wonder what kind of performance can be achieved on my 300 mln dataset. It would be good if it were under 30 ms, but I'm afraid it will be more like 200 ms, which is not great for the front end. So I thought maybe there were measurements for some similar case.

Well, anyway, I guess it's too much to ask :slight_smile:
Probably I'll have to measure it on my own once the cluster is up.
Thanks for your help again!


#8

For these kinds of queries, you should have indexes covering the predicates on both sides of the join. Use indexes that cover the whole query; that will give you the best latency and performance.
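For example (a sketch only - the index name and field list are assumptions based on the sample documents and should be adapted to the actual query), the device side of the join could be covered like this, where "covered" means every field the query touches is available from the index keys alone, so no document fetch is needed:

```python
# Hypothetical covering index for the device side (adapt fields to your query).
covering_index_ddl = (
    'CREATE INDEX ix_device_cover '
    'ON default(userId, deviceId, model, manufacturer) '
    'WHERE docType = "device"'
)

# An index covers a query when every field the query references
# is present among the index keys.
index_keys = {"userId", "deviceId", "model", "manufacturer"}
query_fields = {"userId", "deviceId", "model"}  # example query footprint
is_covered = query_fields <= index_keys
```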