Query with join very slow

edwin.van.der.elst · January 11, 2017, 4:37pm

I have 2 buckets (dossier: 280.000 records and execution: 898.000 records).
When I perform a query with a join, the performance is very slow (17s).
dossier is indexed on primary
execution has a property dossierId that contains the reference to dossier.
execution has an index on ‘procedure.id’

Example queries (just to show the issue, the queries are different in the app):
select count()
from execution execution
where execution.procedure.id=1
==> 350 ms, this is fast
select count()
from execution execution
left outer join dossier d on keys execution.dossierId
where execution.procedure.id=1
==> 17s, very slow
In this case, the count is exactly the same (it should be).

Another example:
select count()
from execution execution
inner join dossier dossier on keys execution.dossierId
where execution.procedure.id=1
and dossier.naam=‘Hilay’
=> 17s
And there is an index on dossier.naam, but that index is not used.
Explain:
[
{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “IndexScan”,
“index”: “execution_procedureId”,
“index_id”: “a5ee4ba1318c93ec”,
“keyspace”: “execution”,
“namespace”: “default”,
“spans”: [
{
“Range”: {
“High”: [
“1”
],
“Inclusion”: 3,
“Low”: [
“1”
]
}
}
],
“using”: “gsi”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Fetch”,
“as”: “execution”,
“keyspace”: “execution”,
“namespace”: “default”
},
{
"#operator": “Join”,
“as”: “dossier”,
“keyspace”: “dossier”,
“namespace”: “default”,
“on_keys”: “(execution.dossierId)”
},
{
"#operator": “Filter”,
“condition”: “((((execution.procedure).id) = 1) and ((dossier.naam) = “Hilay”))”
},
{
"#operator": “InitialGroup”,
“aggregates”: [
"count()"
],
“group_keys”: []
}
]
}
},
{
"#operator": “IntermediateGroup”,
“aggregates”: [
“count()"
],
“group_keys”: []
},
{
"#operator": “FinalGroup”,
“aggregates”: [
"count()”
],
“group_keys”: []
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “InitialProject”,
“result_terms”: [
{
“expr”: “count()"
}
]
},
{
"#operator": “FinalProject”
}
]
}
}
]
},
“text”: "\nselect count()\nfrom execution execution\ninner join dossier dossier on keys execution.dossierId\nwhere execution.procedure.id=1\nand dossier.naam=‘Hilay’”
}
]

geraldss · January 12, 2017, 4:04am

For the third query, try:

select count(*)
from dossier
where dossier.naam='Hilay'

edwin.van.der.elst · January 12, 2017, 8:22am

It is not the count that I’m worried about (an your query gives me different results, since it doesn’t check that the dossier has an execution instance).

Problem is that whenever I add a join, any indexes on the joined document (dossier) is not used to limit the results.
So, when there are (example) 20.000 executions with ‘procedure.id=1’, and 4 dossiers with name ‘Hilay’, the join should give me 4 rows (at most). Now, this is just an example, other queries are more complex.

Is it correct that indexes on the joined documents are never used to limit the results? (I think that this could be done by the queryplanner when an inner join is used)

geraldss · January 12, 2017, 2:59pm

How about this:

CREATE INDEX idx_naam ON dossier( naam );
CREATE INDEX idx_dossier_id ON execution( dossier_id );

SELECT COUNT(*)
FROM dossier d
JOIN execution e ON KEY e.dossier_id FOR d
WHERE d.naam = 'Hilay';

edwin.van.der.elst · January 12, 2017, 3:19pm

That is fast.
It is only slow when the join is the other way around:
select count(*)
from execution execution
join dossier dossier on keys execution.dossierId
where dossier.naam=‘Hilay’

Not so pretty, but this is quicker:
1 query to get the dossier Ids:
select meta().id from dossier where naam='Hilay’
1 query to get the executions:
select count(*) from execution where dossierId in $1
Total time within 200ms

This is also a bit quicker (6s):
select count(*)
from execution execution
where execution.dossierId in (
select raw meta().id from dossier
where dossier.naam=‘Hilay’)

Note: the subselect has no reference to the outer query, so it can be evaluated once. But 6s is still a long time, given that the subselect as a standlone query takes just a couple of ms.

geraldss · January 12, 2017, 4:13pm

If the result is the same, why not use the fast approach.

edwin.van.der.elst · January 12, 2017, 4:25pm

I am using that approach now/
But I think it is a workaround. And because of the large number of documents, the list of dossier-ids can be as large as 200.000. I hope the query-engine can handle such in-parameters efficiently.

The natural way to create such a query (in SQL) is using an inner-join and a condition on dossier.
Now, I need to execute multiple queries so the index is used for each query.
It looks like indexes are not used when a condition on the joined document is applied (and also not for a subselect that has no references to the outer query).

geraldss · January 12, 2017, 4:32pm

What is the final query you want, in standard SQL? If you tell us that, we will suggest an approach for N1QL. “Workaround” is subjective

edwin.van.der.elst · January 12, 2017, 4:44pm

The query is actually generated by options the user makes, so it is not a fixed query.
But it usually looks like this (example, the actual properties and criteria differ, based on the users choices):

select execution.start_date as ex_start_date, execution.step1.name as ex_name, dossier.name as dos_name
from execution execution
inner join dossier dossier on keys execution.dossierId
where execution.procedure.id=$1
and execution.organisation.id in $2
and dossier.arrival_date=$3
and dossier.naam=$4

If a user doesn’t need dossier criteria or results, the join is omitted:

select execution.start_date as ex_start_date, execution.step1.name as ex_name
from execution execution
where execution.procedure.id=$1
and execution.organisation.id in $2

I guess I could always start with ‘from dossier’ if the join is needed, so I can use the join from dossier to execution which seems faster (don’t know if this will slow down once I add criteria on ‘execution’)

I will expirement with that next week Thank you for all the suggestions so far!

geraldss · January 12, 2017, 4:51pm

@edwin.van.der.elst what is the application? Would you consider this a data platform? cc @keshav_m

edwin.van.der.elst · January 12, 2017, 6:49pm

It is the administration of persons (‘dossier’) and the (legal) procedures they are going through (execution is ‘an instance’ of a procedure-workflow).
The schema is flexible (main reason to use couchbase).
We build a small reporting module in which users can pick the properties they want to see in the report, and set constraints.
With that information, we generate the query using the DSL (and generate a join to ‘dossier’ if properties from the related dossier are requested).

Not sure wether this qualifies as a ‘data platform’.

geraldss · January 12, 2017, 7:20pm

Interesting application, thanks. Unless the dossier constraints have very low selectivity, I suspect you will want to start with dossier if the user specifies dossier constraints.

It is a data platform if this serves many needs or use cases. The flexible querying is one indicator of what we consider data platform. @keshav_m