Sub-queries vs. KV

I’m trying to wrap my head around how Couchbase behave internally in the following situation:

I currently have a sub-document (an array of IDs) in each document of a collection that helps me perform some queries.
This sub-document can get quite big (a few hundreds of entries in some cases), so in order save some storage I’m interested in extracting this sub-document into another standalone document.

When I make a query I have a WHERE clause on the sub-document, and it returns the matching document. Nothing fancy.

If I were to externalize this sub-document, I would have to make a first request to the standalone document and then another query to get the main documents.
In this case, would it be better to fetch the standalone documents and then use KV or use a sub-query in conjonction with a USE KEYS clause ?

I’ve come to understand that KV is so fast it’s always preferable but it feels like many network trips so I would love some information about the different behaviors : single query, sub-queries and KV.

If you are not filtering, Joining, aggregation and have document keys use KV (you can do in parallel/asynchronous) to get the document to avoid two hops via N1QL.

You can mix and match as you described.

What is the behavior of a N1QL query ?

Why a N1QL query takes two hops ? Because it goes from one service to another ? Query then Index then Data ?

So you are saying that between a sub-document, a standalone document + sub-query and a standalone document + kv, the last one is the fastest ?

I would have thought the first one is the fastest since you have only one query.
In the case of the standalone document I still have to fetch that document and I’m not sure I’ll be able to use KV for that one (that would make a big document), maybe I’ll have to use N1QL to fetch the right standalone documnt. In that case, is it still worth it ?

SELECT with USE KEYS. Two hops are ( Data Node ====> Query Service ===> Client)
SDK direct KV ( Data Node ====> Client)

See based on how many documents etc which one performs better

{"a":1 , "ids":  ["k1", "k2", "k3"] }

N1QL:

SELECT t.*, j
FROM mybucket t 
JOIN mybucket AS j ON KEYS t.ids
WHERE  t.a = 1;

SDK:

SELECT t.*
FROM mybucket t 
WHERE  t.a = 1;

Once u get the document with ids, Use SDK kv ops get each document 
(If required use subdoc API, parallel, asychronous)

The current query (with sub-document) looks like that :

SELECT b.*
FROM mybucket b
WHERE b.condition1 = true
AND ANY a IN b.arr SATISFIES a.myprop = condition2 END

So there is no join with the current data model (solution 1).
Your queries (let’s call the JOIN the solution 2 and the use of KV solution 3) match what I had in mind for the standalone data model, but if I understand you correctly, according to you there is no clear winner, it depends on the volume of documents ?

I’m trying to understand how it works and what are the parameters of the performance here.
For something like 200 documents on average, what would be your guess on the performance ranking ? 3-1-2, 1-3-2 ?

The data model you had inline as nested document.
Your array can be big 200, how many array elements match the condition ? Do you need whole document etc. Also the document size can max 20MB.
In this model if you need whole document see by projecting document key, if query covers. If yes use SDK fetch the document to avoid two hops of big documents.

If going solution 2,3 It needs to JOIN or subquery. Each condition need to match and combined condition needs to applied after fetch the document.

May be you try experiment and see which does better.

Only one value in the array will match the condition, but a lot of documents can match.
I know the document size is 20MB max, I’m no near that value, it’s just that there is a lot of redundancy in the data so overall it will consume a lot of space only for convenience & performance.

I’ll experiment and see. Thanks for the explanations.

Also consider mutations/atomicity if separated

If use separate document , you need to perform JOIN, If that result in lot of matches performance may impact due to Nested Loop JOIN , HASH JOIN (EE Only).

If array condition is predictable you can store condition field and reference document in array and when matched then fetch. (consistently maintaining both needs to be considered)

1 Like