Need help for the array query

hartono.mochtar · January 9, 2021, 5:46pm

I have around 10 millions of JSON with array having sample as below:

{
  "name": "AAA",
  "type": "Book",
  "status": [
    {
      "value": "A",
      "update_date": "20101206"
    },
    {
      "value": "B",
      "update_date": "20150910"
    },
    {
      "value": "A",
      "update_date": "20150911"
    },
    {
      "value": "D",
      "update_date": "20151016"
    }
  ]
}

From above JSON, I would like to query the all of documents where update_date <= “20150930” and based on the first filter, having last value of status in [‘A’,‘D’].

My current query is like this:

SELECT name
FROM BucketName b
WHERE b.type_ = ‘Book’
AND ARRAY_MAX(ARRAY_SORT(ARRAY v FOR v IN b.status WHEN v.update_date <= ‘20150930’ END)).value IN [‘A’, ‘D’];

with above query, it took around 37 mins to complete.

Please help for the optimized query and also with the index needed.

Thanks,
Hartono

vsr1 · January 9, 2021, 8:07pm

  CREATE INDEX ix1 ON BucketName (DISTINCT FOR v.update_date IN status END, name ) 
    WHERE type = "Book";

SELECT name
FROM BucketName AS b
WHERE b.type = "Book"
AND ANY v IN b.status SATISFIES  v.`value` IN ["A", "B"] AND  v.update_date <= "20150930" END;

OR

The following is complex but performs better

CREATE INDEX ix2 ON BucketName (DISTINCT FOR [v.`value`, v.update_date]  IN status END, name ) 
WHERE type = "Book";

SELECT name
FROM BucketName AS b
WHERE b.type = "Book"
AND ANY v IN b.status SATISFIES  ([v.`value`, v.update_date]  > = ["A", false] AND   [v.`value`, v.update_date]  <= ["A", "20150930"]) OR  ([v.`value`, v.update_date]  > = ["B", false] AND   [v.`value`, v.update_date]  <= ["B", "20150930"]) END;

hartono.mochtar · January 11, 2021, 3:20am

Hi @vsr1,

thank you for your response.
I have tried both given query, and it doesn’t really meet my case.
For the given query, if I changed the update_date to be 20150909, it still return the sample document, where the expectation is the document shouldn’t match the criteria since the last value having update_date <= 20150909 is “B”.

Is there any other way to cater my case?

Thanks,
Hartono

vsr1 · January 11, 2021, 3:33am

  {
      "value": "A",
      "update_date": "20101206"
    }

If you see your filter is ANY and A value match

Are you looking for the ARRAY must contain A, B and both must be < value.

SELECT name
FROM BucketName AS b
WHERE b.type = "Book"
AND ANY v IN b.status SATISFIES  v.`value`  = "A" AND  v.update_date <= "20150930" END
AND ANY v IN b.status SATISFIES  v.`value`  = "B" AND  v.update_date <= "20150930" END;

hartono.mochtar · January 11, 2021, 3:54am

Hi @vsr1

the logic of the case is:

filter the array based on the update_date <= “20150909”. Based on the sample doc, it should have the value as “A” and “B” which are matching the update_date criteria.
based on the filter #1, then filter the last value sorted by update_date in [“A”, “D”]

Based on the above logic, then the sample document will not match the criteria since the last value of the array that matching the criteria is B.

Thanks,
Hartono

vsr1 · January 11, 2021, 4:13am

CREATE INDEX ix1 ON BucketName (DISTINCT ARRAY v.update_date FOR v  IN status END, name ) 
    WHERE type = "Book";

SELECT name
FROM BucketName AS b
LET vv = (SELECT RAW MAX([s.update_date,s.`value`])[1] FROM  b.status AS s WHERE s.update_date <= "20150930" )[0] 
WHERE b.type = "Book"
AND ANY v IN b.status SATISFIES v.update_date <=  "20150930" END
AND vv IN ["A", "D"];

hartono.mochtar · January 14, 2021, 8:24am

Hi @vsr1,

Thank you for your help.
I’ve tried with the index and your last query, and it worked.

When I tried to test the performance by selecting the count of that criteria within 10 millions records, the query completed within 55 mins with the record counts of 2 millions out of 10 mil match the criteria.

SELECT count(1)
FROM BucketName AS b
LET vv = (SELECT RAW MAX([s.update_date,s.`value`])[1] FROM b.status AS s WHERE s.update_date <= “20191130” )[0]
WHERE b.type_ = “Book”
AND ANY v IN b.status SATISFIES v.update_date <= “20191130” END
AND vv IN [“A”, “D”];

I feel that 55 mins is a bit slow. Does it look normal for such query to take place up to 55 mins? Is there any way to speed up the performance of the query?

Regards,
Hartono

vsr1 · January 14, 2021, 2:15pm

Normally it will not take that long.
But CE query service is limited 4 cores.
Also it is array index and array entries might be much higher (array length) than 2M.
It also need to Fetch the document and do subquery on each document(aggregate query)
enable query profile and see where it taking time.

hartono.mochtar · January 15, 2021, 8:42am

Hi @vsr1

Below is the query profile that i retrieved from the system:completed_requests. This result is newly executed query, since the last execution already removed from system:completed_requests.

[
{
“completed_requests”: {
“clientContextID”: “4f10fc3d-e5ec-4bb6-825d-cd4a53d8dad6”,
“elapsedTime”: “1h23m14.652918266s”,
“errorCount”: 0,
“mutations”: 1,
“node”: “localhost:8091”,
“phaseCounts”: {
“fetch”: 8741059,
“filter”: 16441842,
“indexScan”: 10590817
},
“phaseOperators”: {
“authorize”: 1,
“fetch”: 1,
“filter”: 8741047,
“indexScan”: 1,
“insert”: 1
},
“phaseTimes”: {
“authorize”: “11.801377ms”,
“fetch”: “1h12m11.635001049s”,
“filter”: “20m8.289706668s”,
“indexScan”: “1m23.645637015s”,
“insert”: “2.252214ms”,
“instantiate”: “69.911µs”,
“parse”: “6.522948ms”,
“plan”: “59.775311ms”,
“run”: “1h23m14.586347185s”
},
“remoteAddr”: “127.0.0.1:59471”,
“requestId”: “5abf9bd8-083f-4b85-a9ec-f79e25efac72”,
“requestTime”: “2021-01-15T08:32:15.662Z”,
“resultCount”: 1,
“resultSize”: 10,
“scanConsistency”: “unbounded”,
“serviceTime”: “1h23m14.65275854s”,
“state”: “completed”
}
}
]

also, below is the current setting of the query node:

{“auto-prepare”:false,“completed”:{“aborted”:null,“threshold”:1000},“completed-limit”:4000,“completed-threshold”:1000,“controls”:false,“cpuprofile”:“”,“debug”:false,“functions-limit”:16384,“keep-alive-length”:16384,“loglevel”:“INFO”,“max-index-api”:4,“max-parallelism”:1,“memprofile”:“”,“mutexprofile”:false,“n1ql-feat-ctrl”:12,“pipeline-batch”:16,“pipeline-cap”:512,“prepared-limit”:16384,“pretty”:false,“profile”:“off”,“request-size-cap”:67108864,“scan-cap”:512,“servicers”:56,“timeout”:0}

Does setting the max-parallelism before running the query might improve to the query performance? Or are there anything I can change to improve the query performance?

Regards,
Hartono

hartono.mochtar · January 19, 2021, 9:33am

hi @vsr1, any update on this?

vsr1 · January 19, 2021, 2:13pm

You can set ma_prallelism at request level https://docs.couchbase.com/server/current/settings/query-settings.html#max_parallelism_req