Distinct values across large number of documents

srinivas · February 3, 2021, 8:29am

{
“Contract” : “1”
“CustomerID”:“778358358”
“AccountID”: “37537853875”
“Products”:[“A”,“B”,“C”,“D”]
}

Here is a sample structure of our couchbase document
There are around 10M documents we have in couchbase with different values of customer id and account id
Also products that customer take are not alike and varies

For a given contract and customer combination we would like to obtain distinct list of account id’s and distinct list of products

At the moment we have indexes on all of the 4 fields individually and obtaining this distinct set is not really performant.

Any suggestions on this kind of pattern please?

vsr1 · February 9, 2021, 7:11pm

CREATE INDEX ix1 ON default(Contract, CustomerID, AccountID,  Products);
SELECT AccountID,  ARRAY_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(Products),1)) AS Products
FROM default
WHERE  Contract = "1" AND  CustomerID = "778358358"
GROUP BY AccountID;

SELECT ARRAY_AGG(DISTINCT AccountID) AS AccountIDS,  
  ARRAY_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(Products),1)) AS Products
FROM default
WHERE  Contract = "1" AND  CustomerID = "778358358";

srinivas · February 9, 2021, 10:26pm

Thanks for your prompt response , we did try this approach already and given large number of documents that we have i.e. around 1M , this is still taking around 15 to 16 secs. Is there any way to further improve performance

Was reading this article mentioned by Sitaram Index Grouping and Aggregation Overview | The Couchbase Blog where he explains that Grouping and aggregation is performed at index scan itself post couchbase 5.5 version

At the moment we are on Community edition 6.0

With index that you have asked us to create( which is same that we already tried) we don’t see this behavior and we see both grouping and Filter as separate steps in explain plan, would this impact performance , are we missing something?

We also tried to check explain plan by loading default travel-bucket and same there as well, grouping and filter are shown as separate steps still and not covered on index scan

Also our query block contains array aggregation, will we be still covered by Index scan even if array aggregation is present?

Regards
Srinivas

vsr1 · February 9, 2021, 10:44pm

ARRAY_AGG() can’t use Index Aggregation.

Set max_parallelism = 4 for this query and see.

Also try to separate queries and see

SELECT DISTINCT  RAW AccountID
FROM default
WHERE  Contract = "1" AND  CustomerID = "778358358";

SELECT DISTINCT  RAW  p
FROM default
UNNEST Products AS p
WHERE  Contract = "1" AND  CustomerID = "778358358";

srinivas · February 9, 2021, 10:50pm

Thanks for suggestion, will try this

srinivas · February 9, 2021, 11:22pm

We are executing this query on a couchbase cluster which has 6 nodes
Ealier max-parallelism was 1 , with your suggestion i have increased to 4 and we dont see any difference in response times with this change.

Also whether we do an array aggregation or DISTINCT RAW , both of them are more or less taking same time
One thing that did work is separation Separation is resulting in break down of time , may be we need to execute as separate queries parallelly and then join result set.

srinivas · February 10, 2021, 9:26am

For this max-parallelism parameter where and how can we verify that this parameter change is making a difference. Please could you let us know. Should we see an increased CPU usage when value is increased from 1 to 4?

vsr1 · February 10, 2021, 2:48pm

https://docs.couchbase.com/server/current/settings/query-settings.html

Check request_level settings