Correlated subquery with two buckets

I want to translate query 22 of the TPC-H benchmark to N1QL. The query involves two buckets named “customer” and “orders”; the former containing 150,000 documents and the latter 1,500,000. My first approach to write it is shown below:

select
        cntrycode,
        count(*) as numcust,
        sum(c_acctbal) as totacctbal
from
        (
                select
                        substr(c1.c_phone, 0, 2) as cntrycode,
                        c1.c_acctbal
                from
                        customer c1
                where
                        substr(c1.c_phone, 0, 2) in
                                ['30', '17', '25', '10', '22', '15', '21']
                        and c1.c_acctbal > (
                                select raw
                                        avg(c2.c_acctbal)
                                from
                                        customer c2
                                where
                                        c2.c_acctbal > 0.00
                                        and substr(c2.c_phone, 0, 2) in
                                                ['30', '17', '25', '10', '22', '15', '21']
                        )[0]
                        and not exists (
                                select
                                        *
                                from
                                        orders o
                                use keys meta(c1).id
                                where
                                         o.o_custkey = c1.id
                        )
        ) as custsale
group by
        cntrycode
order by
        cntrycode;

The problem is that the correlated sub-query in “exists” always returns an empty set. I have tried to isolate the problem by:

  1. Trying to count the number of customers without orders with the query below, but just like in the main query it results in an empty set.
select count(*)
from customer c1
where not exists (
    select o.o_custkey
    from orders o
    use keys meta(c1).id
    where c1.id = o.o_custkey
)
  1. Writing the query above in a different way , but the length is always 0, which (again) causes the query to count all customers
select count(*)
from customer c1
let corders = (
    select o.o_custkey
    from orders o
    use keys meta(c1).id
    where o.o_custkey = c1.id
)
where ARRAY_LENGTH(corders) = 0
  1. I checked to see if something about the data was off by finding the customers with at least 1 order (see the query below), but the result is 99996; which is correct as I know beforehand there are 50004 customers without orders: 150,000 - 99,996 = 50,004
select count(distinct c.id)
    from customer c join orders o on c.id = o.o_custkey
    where c.id = o.o_custkey

I have seen the “id” field in couchbase to be usually defined as a string, but mine is an integer (in TPC-H, the primary key of “customer” and “orders” are the fields “c_custkey” and “o_orderkey” respectively, but I simply reference them as “id” in their respective couchbase bucket ). May this be affecting USE KEYS somehow?
Could you help me understand what the problem is?
(NOTE: I am using Couchbase Server 6.0.0 Community)

N1QL Correlated subquery requires USE KEYS. When used USE KEYS, it only check that documents only vs whole table.
Only option you have is materialize as non-correlated subquery and match through ARRAY construct. When you have such large documents it will difficult to do.

I found a different way to write it without materializing and matching through ARRAY. Since the NOT EXISTS matches customers without orders, an alternative is to find such customers with:

select raw c.id
from customer c left outer join orders o on c.id = o.o_custkey
group by c.id
having count(o.id) = 0

and then look for customer ids within the set using IN. So, the final version would be:

select
        cntrycode,
        count(*) as numcust,
        sum(c_acctbal) as totacctbal
from
        (
                select
                        substr(c1.c_phone, 0, 2) as cntrycode,
                        c1.c_acctbal
                from
                        customer c1
                let
                        customersWithoutOrders = (
                                select raw c.id
                                from customer c left outer join orders o on c.id = o.o_custkey
                                group by c.id
                                having count(o.id) = 0
                        )

                where
                        substr(c1.c_phone, 0, 2) in
                                ['30', '17', '25', '10', '22', '15', '21']
                        and c1.c_acctbal > (
                                select raw
                                        avg(c2.c_acctbal)
                                from
                                        customer c2
                                where
                                        c2.c_acctbal > 0.00
                                        and substr(c2.c_phone, 0, 2) in
                                                ['30', '17', '25', '10', '22', '15', '21']
                        )[0]
                        and c1.id in customersWithoutOrders
        ) as custsale
group by
        cntrycode
order by
        cntrycode;

It takes a long time, however. Is there a way to make it faster?

You can try removing the group by clause since you are only interested in customers that do not have matching orders, and in that case each customer should appear in the result set once. Try:
select raw c.id from customer c left outer join orders o on c.id = o.o_custkey where o.id is missing

try following

SELECT
        cntrycode,
        COUNT(*) AS numcust,
        SUM(c_acctbal) AS totacctbal
FROM ( SELECT SUBSTR(c1.c_phone, 0, 2) AS cntrycode,
              c1.c_acctbal
       FROM customer c1
        WHERE SUBSTR(c1.c_phone, 0, 2) IN ['30', '17', '25', '10', '22', '15', '21']
              AND c1.c_acctbal > ( SELECT RAW AVG(c2.c_acctbal)
                                FROM customer c2
                                WHERE c2.c_acctbal > 0.00
                                      AND substr(c2.c_phone, 0, 2) IN ['30', '17', '25', '10', '22', '15', '21']
                        )[0]
             AND c1.id IN ( SELECT DISTINCT RAW c.id
                            FROM customer c
                            LEFT OUTER JOIN orders o ON c.id = o.o_custkey
                            WHERE SUBSTR(c.c_phone, 0, 2) IN ['30', '17', '25', '10', '22', '15', '21']
                                  AND c.c_acctbal > 0.00 AND o.id IS MISSING
                          )
        ) AS custsale
GROUP BY cntrycode
ORDER BY cntrycode;

CREATE INDEX cix1 ON customer(SUBSTR(c_phone, 0, 2), c_acctbal, id);
CRATE INDEX oix1 ON orders(o_custkey, id);

OR

SELECT cntrycode, COUNT(*) AS numcust, SUM(c_acctbal) AS totacctbal
FROM ( SELECT MAX(c_acctbal) AS c_acctbal, MAX(cntrycode) AS cntrycode
       FROM (SELECT SUBSTR(c1.c_phone, 0, 2) AS cntrycode, c1.c_acctbal, c1.id, META(c1).id AS mid
             FROM customer c1
             WHERE SUBSTR(c1.c_phone, 0, 2) IN ['30', '17', '25', '10', '22', '15', '21']
                   AND c1.c_acctbal > ( SELECT RAW AVG(c2.c_acctbal)
                                         FROM customer c2
                                         WHERE c2.c_acctbal > 0.00
                                               AND substr(c2.c_phone, 0, 2) IN ['30', '17', '25', '10', '22', '15', '21']
                                       )[0]
            ) AS d
       LEFT JOIN orders o ON d.id = o.o_custkey
       WHERE o.id IS MISSNG
       GROUP BY d.mid
       ) AS custsale
GROUP BY cntrycode
ORDER BY cntrycode;

GROUP BY d.mid produces one per document from JOIN i.e. same IN clause AS mid is unique MAX on d always same. This allows to use NL index scan vs 50K IN clause search
If you are looking optimization run AVG query separately pass the value as named/positional parameters.

I am sorry for the late reply. Thank you @bingjie.miao and @vsr1 for these improvements.
Adding an additional selection to the suggestion made by @bingjie.miao makes a lot of sense because only orders that match the cntrycodes listed and for which c_acctbal is greater than the average are needed. Using the first approach suggested by @vsr1, running time is decreased by ~57% , whereas the last one decreases it by ~96%. However, although the integer parts are the same, the decimals are ceil-ed. For instance, one of the results from the original query is

        {

            "cntrycode": "17",

            "numcust": 863,

            "totacctbal": 6470565.700000001

        }

but @vsr1 final suggestion’s result is:

        {

            "cntrycode": "17",

            "numcust": 863,

            "totacctbal": 6470565.699999996

        }

I am curious as to why this is the case. Maybe some kind of numerical error propagation?

float64 rounding in the addition (order of addition can differ intermediate float64 value due to loss of precision). You should use ROUND function on the result.

Hello @vsr1,
I updated Couchbase Community edition to version 6.5 and have been experimenting using the recommended single bucket setup. That is, I created a single bucket called “tpch” with all documents within it and modified your final suggestion to use the WHERE <bucket>.type = <type> pattern. In my use case, types are “c” for customer and “ol” for orders. However, I am having problems with the left outer join. The query is:

SELECT cntrycode, COUNT(*) AS numcust, SUM(c_acctbal) AS totacctbal
FROM ( SELECT MAX(d.c_acctbal) AS c_acctbal, MAX(d.cntrycode) AS cntrycode
       FROM (
		SELECT SUBSTR(c1.c_phone, 0, 2) AS cntrycode, c1.c_acctbal, c1.c_custkey, META(c1).id AS mid
                      FROM tpch c1
                      WHERE c1.type = "c" and SUBSTR(c1.c_phone, 0, 2) IN ['30', '17', '25', '10', '22', '15', '21']
                            AND c1.c_acctbal > ( SELECT RAW AVG(c2.c_acctbal)
                                                  FROM tpch c2
                                                  WHERE c2.type = "c" and c2.c_acctbal > 0.00
                                                        AND substr(c2.c_phone, 0, 2) IN ['30', '17', '25', '10', '22', '15', '21']
                                                )[0]
	    ) as d
       LEFT JOIN tpch o ON o.type = "ol" and d.c_custkey = o.o_custkey
       WHERE o.o_orderkey IS MISSING
       GROUP BY d.c_custkey
) AS custsale
GROUP BY cntrycode
ORDER BY cntrycode;

the problem, just like at my initial question, is that all customers produced at the sub-query from the second FROM clause are being returned. When executing the left outer join by itself without the o.o_orderkey IS MISSING condition I see the join being done correctly. After adding the condition all elements from the left side are returned, but none from the right side are.
I suspect (please correct me if I am wrong) that the optimizer translates something like this:

select c.c_custkey
from bucket c left join bucket o on c.c_custkey = o.o_custkey and o.type = "ol"
where c.type = "c" and o.o_orderkey is missing

into something like this (note the following query cannot be executed in couchbase; it is just shown for explanatory purposes):

select c.c_custkey
from (

    select * from bucket c where c.type = "c"

) a

left join (

    select * from bucket o where o.type="ol" and o.o_orderkey is missing

) b on a.c_custkey = b.o_custkey

which would explain why all elements at the left side are returned as there would be no matches. One solution may be to evaluate the condition “o.o_orderkey is missing” in an outer query, thus pushing the original left join into a sub-query, but then the whole left join would need to be computed first and degrade performance.

Can you please help me with this?

Thank you!

If this not right explain what exactly you want (Example 7: https://blog.couchbase.com/ansi-join-support-n1ql/)
As you are not projecting any thing from o and you can only see difference when JOIN is true and number of rows expands.

select c.c_custkey
from bucket c left join bucket o on c.c_custkey = o.o_custkey and o.type = "ol"
where c.type = "c" and o.o_orderkey is missing

Even if I project something from o, the result does not change.
Take the following query for example:

select distinct o.o_custkey
from tpch o where o.type="ol"
order by o.o_custkey
limit 2

it produces

[
  {
    "o_custkey": 1
  },
  {
    "o_custkey": 2
  }
]

so I know these two customers should not appear in the result of the following query:

SELECT c.c_custkey, o.o_orderkey
FROM `tpch` c LEFT JOIN `tpch` o
     ON c.c_custkey = o.o_custkey AND o.type = "ol"
WHERE c.type = "c" AND o.o_orderkey IS MISSING;

but I get:

[
  {
    "c_custkey": 1
  },
  {
    "c_custkey": 2
  },
  ... 
]

I still get ALL customers up to 150000, but there are 50004 without orders.
Just to check document integrity for this example (and the fact that inner join works):

select c.c_custkey, count(o.o_orderkey) as num_orders
from tpch c join tpch o on c.c_custkey = o.o_custkey
where c.type="c" and o.type="ol" and (c.c_custkey = 1 or c.c_custkey = 2) 
group by c.c_custkey

results in:

[
  {
    "c_custkey": 2,
    "num_orders": 7
  },
  {
    "c_custkey": 1,
    "num_orders": 6
  }
]

What are the indexes. Is this works.

SELECT c.c_custkey, o.o_orderkey
FROM `tpch` c LEFT JOIN `tpch` o
     ON c.c_custkey = o.o_custkey AND o.type = "ol"
WHERE c.type = "c" AND o IS MISSING;

Indexes:

CREATE INDEX `ol_type` ON tpch (
  type,o_orderkey,o_custkey,o_comment,o_orderdate,
   o_orderpriority,o_orderstatus,o_shippriority,o_totalprice,o_clerk
) 
WHERE type = "ol"
CREATE INDEX c_type ON tpch(
  type,c_custkey,c_acctbal,c_mktsegment,c_phone
) 
WHERE type = "c"
CREATE INDEX o_custkey_ol_ix ON tpch(o_custkey) 
WHERE type = "ol"

As for the query you posted. Asking if “o is missing” takes a lot of time because it needs to do a fetch? (why is that by the way?). I will let you know when the result is done

Opened MB-38564

Options

  1. try this. May have adjust index to move o_orderkey

    CREATE INDEX ol_type ON tpch (o_custkey, o_orderkey, o_comment, o_orderdate, o_orderpriority, o_orderstatus, o_shippriority, o_totalprice, o_clerk )
    WHERE type = “ol”;

     SELECT c.c_custkey, o.o_orderkey
     FROM `tpch` c LEFT JOIN `tpch` o
          ON c.c_custkey = o.o_custkey AND o.type = "ol"
     WHERE c.type = "c" AND o.o_custkey IS MISSING;
    
  2. Process o_orderkey in parent query

  3. Use group/having if possible