N1QL query time out for large data array from two buckets

yogesh_0586 · April 19, 2018, 5:10am

I already posted question here but I have some issues. I have two buckets “temp” and “temp1” in “temp1” buckets I have thousands of documents and in each documents I have structure as like below

{
   "data":["ID","ID1","ID2",...."ID10000"]
},
{
   "data":["ID","ID1","ID2",...."ID10000"]
},
{
   "data":["ID","ID1","ID2",...."ID10000"]
}

This “data” array contains ten thousand of ids stored and that ids in string format. And same way I have data in “temp” buckets as below

{
  "user_ID":"A1",
   "data":["ID","ID1","ID2",...."ID10000"]
},
{
  "user_ID":"A2",
   "data":["ID","ID1","ID2",...."ID10000"]
},
{
  "user_ID":"A3",
   "data":["ID","ID1","ID2",...."ID10000"]
}

and I want to check if any of “ID” from “temp1.data” present in “temp.data” then return matched documents “user_ID” from “temp” buckets so my expected output as below

["A1","A2",...] etc.

This two buckets don’t have any relation ship between them. If I run query mentioned in previous question like this

SELECT ARRAY_INTERSECT(t.data, t1.data) FROM temp t UNNEST (SELECT RAW temp1 FROM temp1) AS t1;

but it takes too much time to execute query and after some minutes it show query time out exception.

vsr1 · April 19, 2018, 5:26am

Hi @yogesh_0586, Try in cbq shell

SELECT t.user_ID FROM temp1 t UNNEST (SELECT RAW temp FROM temp) AS t1
WHERE ANY v IN t1.data SATISFIES v IN t.data END;

yogesh_0586 · April 19, 2018, 7:52am

I used given query on cbq shell and still it takes too much time to execute ( more than 20 minutes ) and no output it still it in execution mode I mannually stop couch base server and my cbq shell look like this

cbq> SELECT t.user_ID FROM temp1 t UNNEST (SELECT RAW temp FROM temp) AS t1 WHERE ANY v IN t1.data SATISFIES v IN t.data END;

here on both buckets temp and temp1 having only primary index and both buckets RAM Quota is 300MB

vsr1 · April 19, 2018, 12:13pm

This query has lot of loops and takes time. You mentioned you have 10,000 on each bucket and 10,000 each array. This result in 10,000 *10,000 Cartesian Join and it for matching each document it needs to process 10,000 *10,000 Array iterations.

May be you should split query into multiple and try in the loop.
SELECT META().id FROM temp1;
Repeat the following query each META().id

SELECT t.user_ID FROM temp1 t USE KEYS UNNEST (SELECT RAW temp FROM temp) AS t1
WHERE ANY v IN t1.data SATISFIES v IN t.data END;

If you run one meta()id you know approximate time then you will have idea how long it will take.

yogesh_0586 · April 19, 2018, 1:05pm

Instead of pushing data in “temp1.data” if document inserted like this

{
"data": "id1"
},
{
"data": "id2"
},
{
"data": "id3"
}

and create index on data in temp1.data and use following query

SELECT t.user_ID FROM temp t WHERE ANY v IN (select raw temp1.data from temp1) SATISFIES v IN t.data END;

is this query performance better than as compare to previous buckets structure, I am open to modified “temp1” bucket structure.

vsr1 · April 19, 2018, 1:21pm

In 5.5.0 (currently beta) we have ANSI join https://blog.couchbase.com/ansi-join-support-n1ql/. It should perform better. You can try out