Create Secondary Index on Array

rkumar · May 25, 2017, 1:12pm

Thank you for support,now i’m getting the proper results.
one think i need to clarify about array of indexing.
below is my JSON document in couchbase;

{
“active”: true,
“avatar”: [
{
“id”: 89000
},
{
“id”: 872457
},
{
“id”: 872460
}
],
“category”: [
{
“id”: 1111
},
{
“id”: 2222
},
{
“id”: 3333
}],
“client”: “importtest”,
“content”: “New data with Physician data only”,
“engagement_status”: null,
“event_time”: “2017-05-17T00:00:00.000Z”,
“feedback_score”: null,
“feedback_score_sentiment”: “unscored”,
“id”: 20611555,
“location_id”: 0,
“location_name”: null,
“location_profile_picture_url”: null,
“provider_gender”: null,
“provider_id”: 568905,
“provider_name”: “John Doe”,
“provider_profile_picture_url”: null,
“source”: [
{
“id”: 100140
},
{
“id”: 100032
},
{
“id”: 100001
},
{
“id”: 100000
}
],
“source_id”: 100140,
“source_name”: “Amazon”,
“source_score”: null,
“survey_type”: “FEEDBACK”,
“task_escalated”: false,
“task_id”: null,
“task_open_for_days”: null,
“task_owner_id”: null,
“task_owner_name”: "admin@bfountain.com",
“task_response_type”: null,
“task_status_code”: null
}
and i’m create a secondary index as below:

CREATE INDEX idx2 ON default (client,feedback_score_sentiment,task_owner_name,ALL DISTINCT ARRAY v.id FOR v IN
[avatar,source,category] END)
where event_time between “1990-01-01” and “3000-01-01”;

can you please help me that is the right way to create index using ‘IN’, or any other efficient way.

and below is my select and count query look like.

select count(*) from default AS d
where d.event_time between “2017-05-17T00:00:00.000Z” and "2017-05-20T00:00:00.000Z"
and d.client= "importtest"
AND
ANY a IN d.avatar SATISFIES a.id IN [89000] END AND
ANY s IN d.source SATISFIES s.id IN [100140] END and
d.feedback_score_sentiment in [“unscored”]
and d.task_owner_name is null
order by d.provider_name desc limit 10 offset 10;

when i check it with EXPLAIN ,it consider the INDEX(idx2) for scanning.

vsr1 · May 25, 2017, 1:31pm

The following is right index and query.

CREATE INDEX idx3 ON default (client,feedback_score_sentiment,task_owner_name,DISTINCT ARRAY v.id FOR v IN avatar END, event_time);

SELECT COUNT(1) FROM default AS d
WHERE d.event_time between "2017-05-17T00:00:00.000Z" and "2017-05-20T00:00:00.000Z" AND
d.client= "importtest" AND
ANY v IN d.avatar SATISFIES v.id IN [89000] END AND
ANY s IN d.source SATISFIES s.id IN [100140] END AND
d.feedback_score_sentiment IN ["unscored"] AND
d.task_owner_name IS NULL;

Index can have single array index key. You can choose which one is more selective and use that (avatar or source )
You don’t want event_time in Index WHERE condition because your query has more restricted filter and would like to have this as index key
Array index key variable in this case v need to match with corresponding any clause. (removed this restriction in 4.6.2)
It is AGGREGATE query with out GROUP BY that means results will be always 1. So you no need ORDER BY, LIMIT, OFFSET.

NOTE: If you are planning to use the query parameters like … IN $avatar , build query text dynamically and do adhoc query as desribed in N1QL query with IN Clause - not working perform better.

FYI: You should also check Spans of IndexScan section of EXPLAIN. The more values of predicates present more better.

rkumar · May 25, 2017, 1:48pm

Hi @vsr1,

Ok point 1,2 and 3 are clear to me,kindly explain more on point no.4,
if my requirement to do pagination with particular value to be desc/asc,in that case how could i do that.

On your NOTE–build query text dynamically and do adhoc ,can you explain more on this,and what we achieved this adhoc.

thanks

vsr1 · May 25, 2017, 1:59pm

The sample query you provided has count(*) in projection there is no group by. That means the results will be 1 document. Why do you need “order by d.provider_name desc limit 10 offset 10” can be removed.

If you replace count(*) with * then it is fine to have all those.

CREATE INDEX ix1 ON default(k0);
SELECT * FROM default WHERE k0 IN $inlist
"$inlist": [10]

The above query can use ix1 but not able to push 10 to indexer because values can change each execution, prepared plan need to work for all inlist values. It does complete indexScan and later applies predicate. This is not optimal.

Instead Client will form statement dynamically after query parameters available .
I.e. SELECT * FROM default WHERE k0 IN [10]
and prepare statement and execute it. In this case value 10 will be pushed to indexer.

rkumar · May 25, 2017, 7:15pm

@vsr1,

Ok my bad,you are right, i paste wrongly count query instead of original with without count ,sorry for that.

for next point ,i build my query dynamically on run time so what would be the value of $inlist ,even i don’t predict,
thats why i need to wrote SELECT * FROM default WHERE k0 IN $inlist and pass the value ,instead of
SELECT * FROM default WHERE k0 IN [10].

in future i will try to improve this,still i’m on reading and understanding mode of CB.

vsr1 · May 25, 2017, 7:21pm

As I mentioned once parameter available write a query from the parameter and do adhoc query. The only over head is preparation of query but that gives much better performance. Please note this is only for ko IN $inslist.

rkumar · May 25, 2017, 7:44pm

@vsr1

Kindly can you provide me some sample java code ,because i tried with java client but not able to success to build them
at run time,that why i use StringBuilder to create the N1QL . even any reference will also work for me

rkumar · May 26, 2017, 6:12am

Hey @vsr1,

Let me give you all information like structure of JSON in CB,INDEX,QUERY and NOs of records

N0. of record in CB= 300 millions

JSON Structure is below:

{
“default”: {
“active”: true,
“avatar_hierarchy”: [
{
“id”: 41730
},
{
“id”: 41738
},
{
“id”: 41768
},
{
“id”: 41824
},
{
“id”: 1010001
},
{
“id”: 41988
},
{
“id”: 42706
},
{
“id”: 215851
}
],
“category_hierarchy”: [
{
“id”: 4032
},
{
“id”: 4003
},
{
“id”: 4004
},
{
“id”: 4038
}
],
“client”: “hca”,
“content”: “Well if you can ever get an answer you might get what you need in times of need.”,
“engagement_status”: null,
“event_time”: “2017-01-06T00:00:00.000Z”,
“feedback_score”: -1,
“feedback_score_sentiment”: “negative”,
“id”: 100000001,
“location_id”: 0,
“location_name”: null,
“location_profile_picture_url”: null,
“provider_gender”: “M”,
“provider_id”: 1010001,
“provider_name”: “Randy, MD”,
“provider_profile_picture_url”: “051b23f9539b4a1d7.jpg”,
“source_hierarchy”: [
{
“id”: 100000
},
{
“id”: 100001
},
{
“id”: 100032
},
{
“id”: 100171
}
],
“source_id”: 100171,
“source_name”: “Google Plus”,
“source_score”: 0,
“survey_type”: “FEEDBACK”,
“task_escalated”: false,
“task_id”: 150224,
“task_open_for_days”: 7,
“task_owner_id”: "keith@abct.in",
“task_owner_name”: “Rigdon, Keith”,
“task_response_type”: “O”,
“task_status_code”: “C”
}
}

secondary index:
CREATE INDEX filter ON default (client,feedback_score_sentiment,task_owner_name,DISTINCT ARRAY v.id FOR v IN avatar_hierarchy END, event_time);

Query
select d.* from default AS d
where d.event_time between “2017-01-01T00:00:00.000Z” and "2017-05-20T00:00:00.000Z"
and d.client= "hca"
AND
ANY a IN d.avatar_hierarchy SATISFIES a.id IN [41730] END AND
ANY s IN d.source_hierarchy SATISFIES s.id IN [100000,100001] END and
d.feedback_score_sentiment in [“positive”,“negative”,“unscored”]
and d.task_owner_name is null
order by d.provider_name desc limit 10 offset 10;

Explain is look like:;

[
{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Sequence”,
"~children": [
{
"#operator": “DistinctScan”,
“scan”: {
"#operator": “IndexScan”,
“index”: “filter”,
“index_id”: “9a45a6e6abd8d4a9”,
“keyspace”: “default”,
“namespace”: “default”,
“spans”: [
{
“Range”: {
“High”: [
"“hca”",
"“negative”",
“null”
],
“Inclusion”: 3,
“Low”: [
"“hca”",
"“negative”",
“null”
]
}
},
{
“Range”: {
“High”: [
"“hca”",
"“positive”",
“null”
],
“Inclusion”: 3,
“Low”: [
"“hca”",
"“positive”",
“null”
]
}
},
{
“Range”: {
“High”: [
"“hca”",
"“unscored”",
“null”
],
“Inclusion”: 3,
“Low”: [
"“hca”",
"“unscored”",
“null”
]
}
}
],
“using”: “gsi”
}
},
{
"#operator": “Fetch”,
“as”: “d”,
“keyspace”: “default”,
“namespace”: “default”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Filter”,
“condition”: “(((((((d.event_time) between “2017-01-01T00:00:00.000Z” and “2017-05-20T00:00:00.000Z”) and ((d.client) = “hca”)) and any a in (d.avatar_hierarchy) satisfies ((a.id) in [41730]) end) and any s in (d.source_hierarchy) satisfies ((s.id) in [100000, 100001]) end) and ((d.feedback_score_sentiment) in [“positive”, “negative”, “unscored”])) and ((d.task_owner_name) is null))”
},
{
"#operator": “InitialProject”,
“result_terms”: [
{
“expr”: “d”,
“star”: true
}
]
}
]
}
}
]
},
{
"#operator": “Order”,
“limit”: “10”,
“offset”: “10”,
“sort_terms”: [
{
“desc”: true,
“expr”: “(d.provider_name)”
}
]
},
{
"#operator": “Offset”,
“expr”: “10”
},
{
"#operator": “Limit”,
“expr”: “10”
},
{
"#operator": “FinalProject”
}
]
},
“text”: “select d.* from default AS d \nwhere d.event_time between “2017-01-01T00:00:00.000Z” and “2017-05-20T00:00:00.000Z”\nand d.client= “hca”\n AND\nANY a IN d.avatar_hierarchy SATISFIES a.id IN [41730] END AND\nANY s IN d.source_hierarchy SATISFIES s.id IN [100000,100001] END and\nd.feedback_score_sentiment in [“positive”,“negative”,“unscored”]\nand d.task_owner_name is null\norder by d.provider_name desc limit 10 offset 10;”
}
]
------------------------------------problem------------------

It taking hell out of time to return the Result and some time i need to interrupter to stop them.

Kindly guide me to make them optimized in either way,we need to test them so that we can proceed further to production or actual development.

Thanks

vsr1 · May 26, 2017, 1:00pm

Given situation the plan is optimal. Tell me how big is avatar_hierarchy, source_hierarchy.

You are not followed item 3 in post 2 (Array index key variable in this case v need to match with corresponding any clause. (removed this restriction in 4.6.2)). Please change the a to v for predicate ANY a IN d.avatar_hierarchy SATISFIES a.id IN [41730] END
Due to two arrays all the predicates can’t be pushed to indexer.
You required whole document
Biggest issue will be ORDER BY d.provider_name DESC. Even though You are interested ONLY 10 results, ORDER BY needs to get all the qualified records and do sort. This query can’t use Index order because ORDER BY expression has DESC on strings and which is supported in CB 5.0.0

What version of CB you are using. Do you want to try CB 5.0.0 or 4.6.2

cc @keshav_m

rkumar · May 29, 2017, 5:15am

@vsr1,

Ok i will change the in the query 'from a to v’
for 3: yes i want whole document need to return

Its totally depends how big could be avatar_hierarchy or source_hiearachy,but could be 10-15 id’s
i’m using 4.5.1 and if i tried to execute the index over 300 millions of records it also taking 2 days and still buuilding

vsr1 · May 29, 2017, 3:06pm

The index might end up with 4,500 million items and take time depends on cluster configuration and may required proper cluster sizing.

liesting · October 27, 2017, 12:52pm

Hi,
Hope you can help met out. I try to make an array index in N1QL but it doesn’t seem to work:
I used

CREATE INDEX assto1 ON theo_data_new (distinct array a.teamUuid for a in assignedTo end) WHERE docType = “EXERCISE” and (not meta(exercise).id like “%sync%”)

and the query I used
SELECT COUNT(*) as count FROM theo_data_new as exercise USE INDEX(assto1) WHERE
exercise.docType=“EXERCISE” AND “authorUuid” IS NOT NULL AND
META(exercise).id NOT LIKE “%sync%” AND
(((exercise.authorUuid=“0A317AFE-D9F1-C346-0D5D-068B8E48D5B9”) OR (exercise.authorType=“club” AND
(exercise.authorUuid=“D73C8569-0AB8-20C7-8B9C-A6523D1A152C”)) OR (ANY assigned IN exercise.assignedTo SATISFIES
(assigned.teamUuid=“B64102AA-55A8-A4BB-C0BE-911524FB38E3”) AND assigned.begin>= 20171027 AND assigned.end <= 20171027 END)))

In explain it says
[
{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “PrimaryScan”,
“index”: “#primary”,
“keyspace”: “theo_data_new”,
“namespace”: “default”,
“using”: “gsi”
},
{
"#operator": “Fetch”,
“as”: “exercise”,
“keyspace”: “theo_data_new”,
“namespace”: “default”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Filter”,
“condition”: “(((((exercise.docType) = “EXERCISE”) and (“authorUuid” is not null)) and (not ((meta(exercise).id) like “%sync%”))) and ((((exercise.authorUuid) = “0A317AFE-D9F1-C346-0D5D-068B8E48D5B9”) or (((exercise.authorType) = “club”) and ((exercise.authorUuid) = “D73C8569-0AB8-20C7-8B9C-A6523D1A152C”))) or any assigned in (exercise.assignedTo) satisfies ((((assigned.teamUuid) = “B64102AA-55A8-A4BB-C0BE-911524FB38E3”) and (20171027 <= (assigned.begin))) and ((assigned.end) <= 20171027)) end))”
},
{
"#operator": “InitialGroup”,
“aggregates”: [
“count()"
],
“group_keys”: []
}
]
}
},
{
"#operator": “IntermediateGroup”,
“aggregates”: [
"count()”
],
“group_keys”: []
},
{
"#operator": “FinalGroup”,
“aggregates”: [
“count()"
],
“group_keys”: []
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “InitialProject”,
“result_terms”: [
{
“as”: “count”,
“expr”: "count()”
}
]
},
{
"#operator": “FinalProject”
}
]
}
}
]
},
“text”: “\r\nSELECT COUNT(*) as count FROM theo_data_new as exercise USE INDEX(assto1) WHERE \r\nexercise.docType=“EXERCISE” AND “authorUuid” IS NOT NULL AND \r\nMETA(exercise).id NOT LIKE “%sync%” AND \r\n(((exercise.authorUuid=“0A317AFE-D9F1-C346-0D5D-068B8E48D5B9”) OR (exercise.authorType=“club” AND\r\n (exercise.authorUuid=“D73C8569-0AB8-20C7-8B9C-A6523D1A152C”)) OR (ANY assigned IN exercise.assignedTo SATISFIES \r\n(assigned.teamUuid=“B64102AA-55A8-A4BB-C0BE-911524FB38E3”) AND assigned.begin>= 20171027 AND assigned.end <= 20171027 END)))”
}
]

vsr1 · October 27, 2017, 1:26pm

In CREATE index META() argument exercise doesn’t exist. Correct one is

CREATE INDEX assto1 ON theo_data_new (distinct array a.teamUuid for a in assignedTo end) 
WHERE docType = "EXERCISE" and (not meta().id like "%sync%");

The above index doesn’t qualify for the following query that is reason it uses primary index.
The query has 3 OR clauses, To uses the index each OR clause needs to have leading index key.

SELECT COUNT(*) as count
FROM theo_data_new as exercise USE INDEX(assto1)
WHERE exercise.docType="EXERCISE" AND "authorUuid" IS NOT NULL AND META(exercise).id NOT LIKE "%sync%”"AND
  (((exercise.authorUuid="0A317AFE-D9F1-C346-0D5D-068B8E48D5B9")
    OR (exercise.authorType="club" AND (exercise.authorUuid="D73C8569-0AB8-20C7-8B9C-A6523D1A152C"))
    OR (ANY assigned IN exercise.assignedTo SATISFIES (assigned.teamUuid="B64102AA-55A8-A4BB-C0BE-911524FB38E3")
             AND assigned.begin>= 20171027 AND assigned.end <= 20171027 END)));

Also check this “authorUuid” IS NOT NULL . This is constant do u want this is field? If it is field try this.

 CREATE INDEX assto1 ON theo_data_new (authorUuid) 
    WHERE docType = "EXERCISE" and (not meta().id like "%sync%");

liesting · October 27, 2017, 1:50pm

Hi
Thanks for your help. I made a mistake for the field authorUuid. I have created an index on this field also

CREATE INDEX exercise_authorUuid ON theo_data_new(authorUuid) WHERE ((docType = “EXERCISE”) and (not ((meta().id) like “%sync%”)))

CREATE INDEX assto3 ON theo_data_new (distinct array a.teamUuid for a in assignedTo end)
WHERE docType = “EXERCISE” and (not meta().id like “%sync%”)

The query is however still not using the Index
SELECT COUNT(*) as count FROM theo_data_new as exercise USE INDEX(assto3) WHERE
exercise.docType=“EXERCISE” AND authorUuid IS NOT NULL AND
META(exercise).id NOT LIKE “%sync%” AND
(((exercise.authorUuid=“0A317AFE-D9F1-C346-0D5D-068B8E48D5B9”) OR (exercise.authorType=“club” AND
(exercise.authorUuid=“D73C8569-0AB8-20C7-8B9C-A6523D1A152C”)) OR (ANY assigned IN exercise.assignedTo SATISFIES
(assigned.teamUuid=“B64102AA-55A8-A4BB-C0BE-911524FB38E3”) AND assigned.begin>= 20171027 AND assigned.end <= 20171027 END)))

Also if I left out the authorUuid is null

[
{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “PrimaryScan”,
“index”: “#primary”,
“keyspace”: “theo_data_new”,
“namespace”: “default”,
“using”: “gsi”
},
{
"#operator": “Fetch”,
“as”: “exercise”,
“keyspace”: “theo_data_new”,
“namespace”: “default”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Filter”,
“condition”: “(((((exercise.docType) = “EXERCISE”) and ((exercise.authorUuid) is not null)) and (not ((meta(exercise).id) like “%sync%”))) and ((((exercise.authorUuid) = “0A317AFE-D9F1-C346-0D5D-068B8E48D5B9”) or (((exercise.authorType) = “club”) and ((exercise.authorUuid) = “D73C8569-0AB8-20C7-8B9C-A6523D1A152C”))) or any assigned in (exercise.assignedTo) satisfies ((((assigned.teamUuid) = “B64102AA-55A8-A4BB-C0BE-911524FB38E3”) and (20171027 <= (assigned.begin))) and ((assigned.end) <= 20171027)) end))”
},
{
"#operator": “InitialGroup”,
“aggregates”: [
“count()"
],
“group_keys”: []
}
]
}
},
{
"#operator": “IntermediateGroup”,
“aggregates”: [
"count()”
],
“group_keys”: []
},
{
"#operator": “FinalGroup”,
“aggregates”: [
“count()"
],
“group_keys”: []
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “InitialProject”,
“result_terms”: [
{
“as”: “count”,
“expr”: "count()”
}
]
},
{
"#operator": “FinalProject”
}
]
}
}
]
},
“text”: “\r\nSELECT COUNT(*) as count FROM theo_data_new as exercise USE INDEX(assto3) WHERE \r\nexercise.docType=“EXERCISE” AND authorUuid IS NOT NULL AND \r\nMETA(exercise).id NOT LIKE “%sync%” AND \r\n(((exercise.authorUuid=“0A317AFE-D9F1-C346-0D5D-068B8E48D5B9”) OR (exercise.authorType=“club” AND\r\n (exercise.authorUuid=“D73C8569-0AB8-20C7-8B9C-A6523D1A152C”)) OR (ANY assigned IN exercise.assignedTo SATISFIES \r\n(assigned.teamUuid=“B64102AA-55A8-A4BB-C0BE-911524FB38E3”) AND assigned.begin>= 20171027 AND assigned.end <= 20171027 END)))”
}
]

vsr1 · October 27, 2017, 3:59pm

By using assto3 query can’t give right results so that is why it is not qualified. Try one of the two options.

Remove USE INDEX option
USE INDEX (exercise_authorUuid)