Array with DISTINCT is slow


#1

Hello,

Document:
{
id: “123”
type: “report”
data: [
{
“id”: 0,
“required”: true,
“value”: “10000”
},
{
“id”: 1,
“required”: true,
“value”: “florida”
},
{
“group”: “”,
“id”: 2,
“required”: true,
“value”: “07/01/2017”
},

{
“group”: “”,
“id”: 25,
“required”: true,
“value”: “45”
},
]

Index:
CREATE INDEX doc_idx ON default((distinct (array (s.value) for s in data when ((s.id) = 17) end)))

Query:
SELECT DISTINCT (ARRAY i.value FOR i IN data WHEN i.id=17 END) FROM default Use index(doc_idx) WHERE type= ‘report’

Size: 100K documents of type=‘report’

Output:
[
{
“$1”: [
“45”
]
},
{
“$1”: [
“50”
]
}
]
When I try to run this query it almost takes 15 to 18 seconds to give 2 distinct values. I don’t see any difference with and without array index. This takes really long time. I don’t know whether I am missing something.


#2

If id is constant all the time you can use following query and index. NOTE: variable in the index and UNNEST alias must be same (i.e. s)

CREATE INDEX doc_idx ON default(ALL ARRAY s.value FOR s IN data WHEN s.id = 17 END) WHERE type = "report";
SELECT s.value FROM default d UNNEST d.data AS s WHERE d.type = "report" AND s.id = 17 AND s.value IS NOT NULL;

If id is not consant you can try the following.

CREATE INDEX doc_idx ON default(DISTINCT ARRAY s.id FOR s IN data END) WHERE type = “report”;
SELECT s.value FROM default AS d UNNEST d.data AS s WHERE d.type = “report” AND s.id = 17 ;

SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;


#3

Yes id is constant. Data array size will vary based on type. Say (type=‘report’, then data array size is 25 , type=‘xxx’ data size is 18)

FYI: I have 100,000 documents of type ‘report’ (each document will have data array of size 25)

I tried both possibilities, still taking time with distinct.

Id is constant
CREATE INDEX doc_idx ON default(ALL ARRAY s.value FOR s IN data WHEN s.id = 17 END) WHERE type = “report”;

23 seconds
SELECT distinct s.value FROM default d UNNEST d.data AS s WHERE
d.type= ‘report’ AND s.id = 17 AND s.value IS NOT NULL;

If id is not constant
CREATE INDEX doc_idx ON default(distinct array (s.value) for s in data end)
WHERE type=‘report’

25 seconds
SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;

59 milli seconds
SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END LIMIT 10;

21 seconds
SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;


#4

What version of Couchbase and can you post some EXPLAIN results.


#5

LIMIT query returns past. That means your query result size might be too high and that is taking time.
DISTINCT on projection should not take time and it showing your output.

25 seconds
SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;

21 seconds
SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;


#6

Version 5.1.0

Indexes:
CREATE INDEX doc_id_idx ON testdoc(type,id,data)
CREATE INDEX doc_idx ON testdoc((distinct (array (s.value) for s in data end))) WHERE (type = “report”)

Query:
SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM testdoc AS d USE INDEX(doc_idx)
WHERE type= ‘report’ AND ANY v IN d.data SATISFIES v.id = 17 END;

Explain:
{
“plan”: {
#operator”: “Sequence”,
“~children”: [
{
#operator”: “IndexScan2”,
“covers”: [
“cover ((d.type))”,
“cover ((d.id))”,
“cover ((d.data))”,
“cover ((meta(d).id))”
],
“index”: “doc_id_idx”,
“index_id”: “c3c4309d1cd1859d”,
“index_projection”: {
“entry_keys”: [
0,
2
]
},
“keyspace”: “testdoc”,
“namespace”: “default”,
“spans”: [
{
“exact”: true,
“range”: [
{
“high”: ““report””,
“inclusion”: 3,
“low”: ““report””
}
]
}
],
“using”: “gsi”
},
{
#operator”: “Parallel”,
“~child”: {
#operator”: “Sequence”,
“~children”: [
{
#operator”: “Filter”,
“condition”: “((cover ((d.type)) = “report”) and any v in cover ((d.data)) satisfies ((v.id) = 17) end)”
},
{
#operator”: “InitialProject”,
“distinct”: true,
“result_terms”: [
{
“expr”: “array (s.value) for s in cover ((d.data)) when ((s.id) = 17) end”
}
]
},
{
#operator”: “Distinct”
},
{
#operator”: “FinalProject”
}
]
}
},
{
#operator”: “Distinct”
}
]
},
“text”: “SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM testdoc AS d USE INDEX(doc_idx)\nWHERE type= ‘report’ AND ANY v IN d.data SATISFIES v.id = 17 END;”
}


#7

The plan looks good. number of qualified items might be too high. That is why it taking time.


#8

oh ok. Thanks for the response.

Yes for SUM() and DISTINCT functions with high volume of document, its taking more time. For just 100,000 doucments itself its taking around 25 seconds. We are having atmost 1 million documents.

So you are saying is couchbase db won’t be good for high volume data(SUM, DISTINCT) operations. Do you recommend any other solutions?


#9

Depending on data accuracy you might be able to get away with using the older views to get the data for you.


#10

Sorry for asking this again, we are new to Couchbase we are working for larger dataset which can store more than 1 million rows per document and we are using Excel like filter to show the data based on the user filtered column(s), please help me to understand what is the “older views” and how i can implement for my data set?


#11

Do you mean more than million in array for each document? You should think about data model https://developer.couchbase.com/documentation/server/5.1/data-modeling/intro-data-modeling.html


#12

No. more than 1 million document in a bucket and each document contains array size of 25, which contains the following data.

{
“documentId”:“1234”
“data”: [{
“group”: “”,
“id”: 0,
“required”: true,
“value”: “07/01/2017”
},
{
“group”: “”,
“id”: 1,
“required”: true,
“value”: “07/01/2017”
},

]
}


#13

Please post your SUM, DISTINCt query.
array has 25 then the multiplication factor will be 25.
What is total qualification items fro your query.


#14

Sample Doc:
{
“_key”: “b6dce680-7543-4de4-a719-ba97bc025cf50”,
“data”: [
{
“group”: “”,
“id”: 0,
“originalValue”: “10000”,
“required”: true,
“type”: “alphanumeric”,
“value”: “10000”
},
{
“group”: “”,
“id”: 1,
“originalValue”: “Option”,
“required”: true,
“type”: “alphabet”,
“value”: “Option”
},
{
“group”: “”,
“id”: 2,
“originalValue”: “2017-07-01”,
“required”: true,
“type”: “date(MM/dd/yyyy)”,
“value”: “07/01/2017”
},
{
“group”: “”,
“id”: 3,
“originalValue”: “2020-07-01”,
“required”: true,
“type”: “date(MM/dd/yyyy)”,
“value”: “07/01/2020”
},
{
“group”: “”,
“id”: 4,
“originalValue”: “3042”,
“required”: true,
“type”: “int”,
“value”: “3042”
},
{
“group”: “”,
“id”: 5,
“originalValue”: “20.00”,
“required”: true,
“type”: “currency”,
“value”: “$20.00”
},
{
“group”: “”,
“id”: 6,
“originalValue”: “20”,
“required”: true,
“type”: “int”,
“value”: “20”
},
{
“group”: “rsu”,
“id”: 7,
“originalValue”: “1.29”,
“required”: true,
“type”: “percent”,
“value”: “1.29%”
},
{
“group”: “rsu”,
“id”: 8,
“originalValue”: “17.30”,
“required”: true,
“type”: “percent”,
“value”: “17.30%”
},
{
“group”: “rsu”,
“id”: 9,
“originalValue”: “1.10”,
“required”: true,
“type”: “percent”,
“value”: “1.10%”
},
{
“group”: “rsu”,
“id”: 10,
“originalValue”: “3.00”,
“required”: true,
“type”: “double”,
“value”: “3.00”
},
{
“group”: “”,
“id”: 11,
“originalValue”: “6.00”,
“required”: true,
“type”: “percent”,
“value”: “6.00%”
},
{
“group”: “”,
“id”: 12,
“originalValue”: “47.07”,
“required”: true,
“type”: “currency”,
“value”: “$47.07”
},
{
“group”: “”,
“id”: 13,
“originalValue”: “44.24”,
“required”: true,
“type”: “currency”,
“value”: “$44.24”
},
{
“group”: “”,
“id”: 14,
“originalValue”: “415”,
“required”: true,
“type”: “int”,
“value”: “415”
},
{
“group”: “”,
“id”: 15,
“originalValue”: “7.39”,
“required”: true,
“type”: “currency”,
“value”: “$7.39”
},
{
“group”: “”,
“id”: 16,
“originalValue”: “54.41”,
“required”: true,
“type”: “currency”,
“value”: “$54.41”
},
{
“group”: “”,
“id”: 17,
“originalValue”: “47.06882441775697”,
“required”: false,
“type”: “currency”,
“value”: “$47.07”
},
{
“group”: “”,
“id”: 18,
“originalValue”: “-0.0011755822430288276”,
“required”: false,
“type”: “currency”,
“value”: “$0.00”
},
{
“group”: “”,
“id”: 19,
“originalValue”: “44.24469495269155”,
“required”: false,
“type”: “currency”,
“value”: “$44.24”
},
{
“group”: “”,
“id”: 20,
“originalValue”: “0.00469495269155118”,
“required”: false,
“type”: “currency”,
“value”: “$0.00”
},
{
“group”: “”,
“id”: 21,
“originalValue”: “7.387572241188462”,
“required”: false,
“type”: “currency”,
“value”: “$7.39”
},
{
“group”: “”,
“id”: 22,
“originalValue”: “-0.0024277588115380055”,
“required”: false,
“type”: “currency”,
“value”: “$0.00”
},
{
“group”: “”,
“id”: 23,
“originalValue”: “36.857122711503095”,
“required”: false,
“type”: “currency”,
“value”: “$36.86”
},
{
“group”: “”,
“id”: 24,
“originalValue”: “-17.5528772884969”,
“required”: false,
“type”: “currency”,
“value”: “($17.55)”
}
],
“id”: 0,
“issue”: false,
“procedureId”: “b6dce680-7543-4de4-a719-ba97bc025cf5”
}

Query:
SELECT distinct ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM procedurecontent AS d
WHERE procedureId= ‘b6dce680-7543-4de4-a719-ba97bc025cf5’ AND ANY v IN d.data SATISFIES v.id = 17 END

Index:
CREATE INDEX proc_value_idx ON procedurecontent((distinct (array (s.value) for s in data end))) WHERE (procedureId = “b6dce680-7543-4de4-a719-ba97bc025cf5”)


#15

Please check explain the query may not using the index
Is procedureId , v.id are constant across all queries or it varies. For given document v.id is unique

Assuming procuderId, v.Id can be changed from query to query and v.id is unique in the array
The following is best option

CREATE INDEX proc_value_idx ON procedurecontent(procedureId, DISTINCT ARRAY s.id FOR s IN data END);

SELECT DISTINCT FIRST s.value FOR s IN d.data WHEN s.id = 17 END FROM procedurecontent AS d
WHERE d.procedureId = "b6dce680-7543-4de4-a719-ba97bc025cf5" AND ANY v IN d.data SATISFIES v.id = 17 END;

#16

Sorry for the late reply. I tried this query too, but no much difference in query execution time. As discussed before I think since we have large amount of data, its taking some time. Thanks for the response