Performance issue with N1QL self join

Any solutions, guys? I will really have to drop N1QL as a solution if it is not going to give me a sub-second response.

Hi Siddhartha,

I’m not good with N1QL, so I might be wrong, but it looks like you could also use a map/reduce view to solve your problem. I also don’t know what “at Date” does, so I don’t know whether map/reduce views can handle that. But for the rest, it would look like this:

The map function:

function(doc) {
    // 1 == Debit Amount
    // 2 == Credit Amount
    // 3 == Net Change
    // 4 == Balance
    // 5 == Additional-Currency Balance
    // 6 == Additional-Currency Net Change
    if (doc['Debit Amount'] !== undefined) {
        emit([doc.No_, 1], doc['Debit Amount']);
    }
    if (doc['Credit Amount'] !== undefined) {
        emit([doc.No_, 2], doc['Credit Amount']);
    }
    // ... and so on for the remaining fields (3 through 6) listed above
}

Your reduce function would be _sum.

Your query parameters would then be:

?startkey=["122537",null]&endkey=["122537",{}]&group=true

Cheers,
Volker

Hi @singh_siddhu,

Can you tell us which document has No_ as a primary key, and which one has No_ as a foreign key? I’m asking about N and X. That is, do the N documents have No_ as the primary key, or do the X documents have No_ as the primary key?

Gerald

In N it is the primary key, and in X it is a foreign key. That is why the aggregate functions are applied on X’s columns: there are multiple rows in X for each key of N, i.e., a 1:N ratio between N and X documents.

Ok, thank you. We are taking a look.

Hi Siddhu,

You’re joining NAV X JOIN NAV N.
You have a filter on “NAV N”: WHERE N.No_ = ‘122537’
You have a primary index created on the bucket.

That means the engine (query plan) will scan the full bucket as NAV X and, for each document, look at the field X.No_ and try to match it with NAV N.
The filter N.No_ = '122537' is applied ONLY AFTER THE JOIN.
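
To make that concrete, here is a rough, hypothetical sketch of the kind of self-join described in this thread (the exact original query is not shown in these posts; field and bucket names are taken from the discussion above). With only a primary index, the whole bucket is scanned as X and the N.No_ filter kicks in after the join:

SELECT N.No_ AS No_,
       IFNULL(SUM(X.Balance), 0) AS Balance /* many X rows roll up per N key */
FROM NAV X                                  /* full primary scan of the bucket */
JOIN NAV N ON KEYS [X.No_]                  /* lookup join: N is fetched by document key X.No_ */
WHERE N.No_ = '122537'                      /* evaluated only after the join */
GROUP BY N.No_;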

To improve the performance, you have to filter the documents early (during the scan of NAV X).

If X.No_ is the foreign key for N.No_ and the document ID has the same value:
1. You can simply apply the predicate to X.No_.
2. Create a secondary index on NAV(No_):

CREATE INDEX NAVNo_ on NAV(No_);

If you have a type field defined within each doc to differentiate between parent & child, you can use that in the query and add a WHERE clause to the index as well.

CREATE INDEX NAVNo_ on NAV(No_) WHERE type = 'X';
SELECT blah…
FROM NAV X JOIN NAV N ON KEYS [X.No_]
WHERE X.type = 'X' AND X.No_ = '122537'
GROUP BY X.No_
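
To check which index the plan actually picks (a general suggestion; adjust names to your own query), you can prefix the statement with EXPLAIN and look for an IndexScan on NAVNo_ rather than a PrimaryScan:

EXPLAIN
SELECT X.No_ AS No_, IFNULL(SUM(X.Balance), 0) AS Balance
FROM NAV X JOIN NAV N ON KEYS [X.No_]
WHERE X.type = 'X' AND X.No_ = '122537'
GROUP BY X.No_;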

I will try this as well. But I didn’t understand “and the document ID has the same value”: how can two different documents have the same document ID? I already described earlier that keys are unique for each document.

Also, how do I make sure that out of the 3 nodes in the cluster, only 2 are data nodes and 1 is for query?

The suggestions above didn’t work out. It is taking 34 seconds for just 50K documents.

When you add a node to the cluster, you can choose which services you want to enable on it. There you can specify Data and/or Index and/or Query. For the rest of the performance-related work I leave you in the capable hands of @keshav_m and @geraldss.

Thanks @daschl, this is a great help. Now I am going to do that first. Cheers

So now I have 1 data node, 1 query node and 1 index node. Still, the result set I am getting is too slow.
Total of 50K documents; total time it took: 12.267031445s

There is no join now. I am querying only one type of document.

Query :
select
X.No_ AS No_,
IFNULL(Sum(X.Balance),0) as Balance
from NAV X
Where X.Type = "GLEntry"
Group by X.No_

Indexes are:
a) CREATE PRIMARY INDEX NAV_No_ ON NAV USING GSI WITH {"nodes":"172.16.47.62:8091"};
b) CREATE INDEX idx_GLE_Balance ON NAV(Balance) USING GSI WITH {"nodes":"172.16.47.62:8091"};
c) CREATE INDEX idx_GLE_Type ON NAV(Type) WHERE (Type = "GLEntry") USING GSI WITH {"nodes":"172.16.47.62:8091"};

I am running the query using the workbench:
./launch-cbq-gui.sh
Running: ./cbq-gui -datastore=http://172.16.47.60:8091 -localPort=:8094 -user= -pass=
Launching query web service.
Using CB Server at: http://172.16.47.60:8091
Using N1QL query service on: 172.16.47.60:8093
Using memcached service on: 172.16.47.60:11210
Using web content at: ./static
Launching UI server, to use, point browser at http://localhost:8094
Hit enter to stop server:

=====
Just wanted to know: is this the best I can get from N1QL? Please let me know your input, @geraldss @keshav_m.

@singh_siddhu, did you get on Skype with @keshav_m?

Sorry, I don’t know his Skype ID. I have sent him mine in a personal message, @geraldss @keshav_m.

Hi @singh_siddhu,

I sent you another message yesterday with the info.
My Skype ID is: rkeshavmurthy

Here’s the advice for you to get the best performance.

  1. CREATE INDEX idx_gle_type_balance ON NAV(No_, Balance) WHERE (Type = 'GLEntry') USING GSI;

Then you issue:
select
X.No_ AS No_,
IFNULL(Sum(X.Balance),0) as Balance
from NAV X
Where X.Type = "GLEntry"
AND X.Balance IS NOT MISSING
AND X.No_ IS NOT MISSING
Group by X.No_;

Remember, in this case you’re doing a full index scan followed by grouping & aggregation.
The cost of executing this query is proportional to the number of entries in the index.

Please post the EXPLAIN when you post the results next.

Sure @keshav_m, will do that and let you know.

Same result.

Query execution time: 11.83 s
Explain query:

[{
    "#operator": "Sequence",
    "~children": [{
        "#operator": "IntersectScan",
        "scans": [{
            "#operator": "IndexScan",
            "index": "idx_gle_type_balance",
            "keyspace": "NAV",
            "namespace": "default",
            "spans": [{
                "Range": {
                    "Inclusion": 1,
                    "Low": ["null", "null"]
                }
            }],
            "using": "gsi"
        }, {
            "#operator": "IndexScan",
            "index": "idx_GLE_Type",
            "keyspace": "NAV",
            "namespace": "default",
            "spans": [{
                "Range": {
                    "High": ["\"GLEntry\""],
                    "Inclusion": 3,
                    "Low": ["\"GLEntry\""]
                }
            }],
            "using": "gsi"
        }]
    }, {
        "#operator": "Parallel",
        "~child": {
            "#operator": "Sequence",
            "~children": [{
                "#operator": "Fetch",
                "as": "X",
                "keyspace": "NAV",
                "namespace": "default"
            }, {
                "#operator": "Filter",
                "condition": "((((X.Type) = \"GLEntry\") and ((X.Balance) is not missing)) and ((X.No_) is not missing))"
            }, {
                "#operator": "InitialGroup",
                "aggregates": ["sum((X.Balance))"],
                "group_keys": ["(X.No_)"]
            }]
        }
    }, {
        "#operator": "IntermediateGroup",
        "aggregates": ["sum((X.Balance))"],
        "group_keys": ["(X.No_)"]
    }, {
        "#operator": "FinalGroup",
        "aggregates": ["sum((X.Balance))"],
        "group_keys": ["(X.No_)"]
    }, {
        "#operator": "Parallel",
        "~child": {
            "#operator": "Sequence",
            "~children": [{
                "#operator": "InitialProject",
                "result_terms": [{
                    "as": "No_",
                    "expr": "(X.No_)"
                }, {
                    "as": "Balance",
                    "expr": "ifnull(sum((X.Balance)), 0)"
                }]
            }, {
                "#operator": "FinalProject"
            }]
        }
    }]
}]

Any thoughts here, guys, @keshav_m and @geraldss? The same query on SQL Server with ~3 million records runs in under 1 second.

Please execute the following query and you should see better performance.

select
X.No_ AS No_,
IFNULL(Sum(X.Balance),0) as Balance
from NAV X USE INDEX (idx_gle_type_balance)
Where X.Type = "GLEntry"
AND X.Balance IS NOT MISSING
AND X.No_ IS NOT MISSING
Group by X.No_;

Please create the following index and try the query below it.

CREATE INDEX idx_gle_type_balance2 ON NAV(No_, Balance, Type) WHERE (Type = 'GLEntry') USING GSI;

select
X.No_ AS No_,
IFNULL(Sum(X.Balance),0) as Balance
from NAV X USE INDEX (idx_gle_type_balance2)
Where X.Type = "GLEntry"
and X.Balance IS NOT MISSING
AND X.No_ IS NOT MISSING
Group by X.No_
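
One plausible reason this second index can help (my reading; it is not spelled out above): with No_, Balance, and Type all in the index keys, the query can potentially be answered from the index alone on server versions that support covering indexes, avoiding the per-document Fetch seen in the earlier plan. You can check by running EXPLAIN and comparing with the plan posted above; ideally there is a single IndexScan on idx_gle_type_balance2 and no IntersectScan:

EXPLAIN
SELECT X.No_ AS No_, IFNULL(SUM(X.Balance), 0) AS Balance
FROM NAV X USE INDEX (idx_gle_type_balance2)
WHERE X.Type = "GLEntry"
AND X.Balance IS NOT MISSING
AND X.No_ IS NOT MISSING
GROUP BY X.No_;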

----- Update from Siddhu: with the new index in place, the query runs in about 1.7 seconds.
