I have a problem with ARRAY INDEXES


#1

Hello , I have some problems when create and use an Array index. I try to count the distinct colors ids are in the XXX documents, and the query is very slow when has a lot of documents.

Here my documents structure:
[
{
“xxx”: {
“aaaa”: “car”,
"colors": [
{
“class”: “1234”,
“id”: “1234”
},
{
“class”: “2222”,
“id”: “2222”
}
],
“documentType”: “CARX”
},
{
“xxx”: {
“aaaa”: “car”,
“colors”: [
{
“class”: “3333”,
“id”: “1234”
},
{
“class”: “4444”,
“id”: “4444”
}
],
“documentType”: “CARX”
}
}
]

INDEX
CREATE INDEX test2
ON APP_SPEC_DATA(documentType, (DISTINCT ARRAY c.id FOR c IN colors END))
WHERE documentType = ‘CARX’

and the Query
SELECT distinct(c.id)
FROM APP_SPEC_DATA AS cars
USE INDEX (test2)
UNNEST cars.colors as c
WHERE cars.documentType = 'CARX

So, when execute the query with the INDEX, returns empty results, and when execute without index, returns some results
WHY? Where I have the problem?


#2

Try this

 SELECT DISTINCT(color.id)
   FROM APP_SPEC_DATA AS cars
 UNNEST cars.colors as color 
  WHERE cars.documentType == "CARX"
    AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING)  END

#3

No, I still receive empty results.


#4

How explain shows?

my env works:

CREATE INDEX test4c
ON default(documentType, (DISTINCT ARRAY c.id FOR c IN colors END))
WHERE documentType == "CARX"
SELECT DISTINCT(color.id)
FROM default AS cars
USE INDEX (test4c)
UNNEST cars.colors as color 
WHERE cars.documentType == "CARX"
  AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING)  END

my env shows looks like as following:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "test4c",
            "index_id": "9b669f2ffd67b93b",
            "keyspace": "default",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "successor(\"CARX\")"
                  ],
                  "Inclusion": 1,
                  "Low": [
                    "\"CARX\"",
                    "null"
                  ]
                }
              }
            ],
            "using": "gsi"
          }
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Fetch",
                "as": "cars",
                "keyspace": "default",
                "namespace": "default"
              },
              {
                "#operator": "Unnest",
                "as": "color",
                "expr": "(`cars`.`colors`)"
              },
              {
                "#operator": "Filter",
                "condition": "(((`cars`.`documentType`) = \"CARX\") and any `c` in (`cars`.`colors`) satisfies ((`c`.`id`) is not missing) end)"
              },
              {
                "#operator": "InitialProject",
                "distinct": true,
                "result_terms": [
                  {
                    "expr": "(`color`.`id`)"
                  }
                ]
              },
              {
                "#operator": "Distinct"
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        },
        {
          "#operator": "Distinct"
        }
      ]
    },
    "text": "SELECT DISTINCT(color.id)\nFROM default AS cars\nUSE INDEX (test4c)\nUNNEST cars.colors as color \nWHERE cars.documentType == \"CARX\"\n  AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING)  END"
  }
]

and the N1QL return:

[
  {
    "id": "1234"
  },
  {
    "id": "4444"
  },
  {
    "id": "2222"
  }
]

#5

Thanks, It’s worked,
But now I have a other problems when in the query I have a filter with 5000 options in an IN.

SELECT DISTINCT(color.id)
FROM default AS cars
USE INDEX (test4c)
UNNEST cars.colors as color
WHERE cars.documentType == “CARX” AND options IN [‘x1’,‘b21’,…]
AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING) END

The options is a string field in a XXX Documents

“xxx”: {
“options”: “x1”,
“aaaa”: “car”,
“colors”: [
{
“class”: “1234”,
“id”: “1234”
},

When I add this filter crash the couchbase.
Can I optimize the index or create a new one? and How?
thanks!!


#6

try to create index

CREATE INDEX test4co
ON default(documentType,options, (DISTINCT ARRAY c.id FOR c IN colors END))
WHERE documentType == "CARX"

And then

SELECT DISTINCT(color.id)
FROM default AS cars
USE INDEX (test4co)
UNNEST cars.colors as color 
WHERE cars.documentType == "CARX" AND cars.options IN ['x1','b21'] 
AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING) END

The explain show

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "test4co",
            "index_id": "134104f09296eeda",
            "keyspace": "default",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "\"CARX\"",
                    "successor(\"b21\")"
                  ],
                  "Inclusion": 1,
                  "Low": [
                    "\"CARX\"",
                    "\"b21\"",
                    "null"
                  ]
                }
              },
              {
                "Range": {
                  "High": [
                    "\"CARX\"",
                    "successor(\"x1\")"
                  ],
                  "Inclusion": 1,
                  "Low": [
                    "\"CARX\"",
                    "\"x1\"",
                    "null"
                  ]
                }
              }
            ],
            "using": "gsi"
          }
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Fetch",
                "as": "cars",
                "keyspace": "default",
                "namespace": "default"
              },
              {
                "#operator": "Unnest",
                "as": "color",
                "expr": "(`cars`.`colors`)"
              },
              {
                "#operator": "Filter",
                "condition": "((((`cars`.`documentType`) = \"CARX\") and ((`cars`.`options`) in [\"x1\", \"b21\"])) and any `c` in (`cars`.`colors`) satisfies ((`c`.`id`) is not missing) end)"
              },
              {
                "#operator": "InitialProject",
                "distinct": true,
                "result_terms": [
                  {
                    "expr": "(`color`.`id`)"
                  }
                ]
              },
              {
                "#operator": "Distinct"
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        },
        {
          "#operator": "Distinct"
        }
      ]
    },
    "text": "SELECT DISTINCT(color.id)\nFROM default AS cars\nUSE INDEX (test4co)\nUNNEST cars.colors as color \nWHERE cars.documentType == \"CARX\" AND cars.options IN ['x1','b21'] \nAND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING) END"
  }
]

but may be there will be a issue on index with in condition, @vsr1 might help to confirm .


#7

@rodrigo.rio, @atom_yang, As you are not pushing any array predicates to indexer, array index is not needed.

You can try the following covered index and query.

CREATE INDEX test4co1
ON default(options, colors) WHERE documentType = “CARX”;

SELECT DISTINCT(color.id) FROM default AS cars
USE INDEX (test4co1)
UNNEST cars.colors as color
WHERE cars.documentType = “CARX” AND cars.options IN [‘x1’,‘b21’];


#8

Did cbq-engine panic? Can you attach stack trace. @geraldss


#9

And also tell us what version of Couchbase you are using.


#10

Hello @geraldss, the version is 4.5.1-2841 Enterprise Edition (build-2841).


#11

With this index, It works but the performance I think that is not good, I have an average execution time of 9 o 10 seconds,
This query It should take two seconds at most.


#12

How many document do you have ?
And how many document match the condition?
And what the explain show?


#13

I have 700,000 documents aprox, and matched 1604. And the time to execution was 25 secs with 6000 elements in the IN.
INDEX:
CREATE INDEX test4 ON APP_SPEC_DATA(documentType,(DISTINCT ARRAY c.id FOR c IN colors END))) WHERE (((documentType = “CARX”)
the explain show

[
{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “DistinctScan”,
“scan”: {
"#operator": “IndexScan”,
“index”: “test4”,
“index_id”: “8d57d5eea3eda838”,
“keyspace”: “APP_SPEC_DATA”,
“namespace”: “default”,
“spans”: [
{
“Range”: {
“High”: [
"“CARX”"
],
“Inclusion”: 0,
“Low”: [
"“CARX”",
“null”
]
}
}
],
“using”: “gsi”
}
},
{
"#operator": “Fetch”,
“as”: “cars”,
“keyspace”: “APP_SPEC_DATA”,
“namespace”: “default”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Unnest”,
“as”: “c”,
“expr”: “(cars.colors)”
},
{
"#operator": “Filter”,
“condition”: “((((((cars.documentType) = “CARX”) and ((cars.colors) in [“GDC”, “YMX”, “YMW”, “YMT”, “YND”, “GDL”, “OGG”, “YNB”, “OGD”, “YNA”, “OGB”,…,…,…,…,… etc]))”
},
{
"#operator": “InitialGroup”,
“aggregates”: [
“count(distinct (c.id))”
],
“group_keys”: []
}
]
}
},
{
"#operator": “IntermediateGroup”,
“aggregates”: [
“count(distinct (c.id))”
],
“group_keys”: []
},
{
"#operator": “FinalGroup”,
“aggregates”: [
“count(distinct (c.id))”
],
“group_keys”: []
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “InitialProject”,
“result_terms”: [
{
“as”: “colors”,
“expr”: “count(distinct (c.id))”
}
]
},
{
"#operator": “FinalProject”
}
]
}
}
]
},
“text”: “SELECT count(distinct(c.id)) as colors\nFROM APP_SPEC_DATA AS cars \nUNNEST cars.colors c\nWHERE cars.documentType = ‘CARX’ \n AND cars.colors IN\n[‘GDC’,‘YMX’,‘YMW’,‘YMT’,‘YND’,‘GDL’,‘OGG’,‘YNB’,‘OGD’,‘YNA’,‘OGB’, ,…,…,…,…,… etc]”
}
]


#14

Could you tell me how much this query returns.
what is avarage entires in the colors.

SELECT count(1) FROM default AS cars
USE INDEX (test4co1)
WHERE cars.documentType = “CARX” AND cars.options IN [‘x1’,‘b21’];


#15

Also try the following and see it improves.
CREATE INDEX test4co1 ON default(options, colors) WHERE documentType = “CARX”;

SELECT DISTINCT(color.id) FROM (SELECT colors FROM default USE INDEX (test4co1) WHERE documentType = “CARX” AND options IN [‘x1’,‘b21’]) as d UNNEST d.colors as color;


#16

returns: [
{
“colors”: 3382
}
]

and there is avarage 2.3 colors.


#17

How much this takes.

SELECT DISTINCT(color.id) FROM (SELECT colors FROM default USE INDEX (test4co1) WHERE documentType = “CARX” AND options IN [‘x1’,‘b21’]) as d UNNEST d.colors as color;

Also can you post explain.


#18

Great!, is better. Takes two seconds. Now, Options can or cant’t appear in the Where Clause. in Case that don’t appear the query takes a lot of time.
How do I cover all the cases? Must I create each of the indexes for each combination in the Where clause?


#19

@rodrigo.rio, If query is same options not appear, If queries are different you can analyze them and create appropriate indexes.

CREATE INDEX test4co1 ON default(documentType,options, colors) WHERE documentType = “CARX”;


#20

Hello again! , I tried to resolve all cases but I found one that I can’t resolve.

In this case the query take too much time and I don’t find the correct index, Can you help me?

I think that the problem is when add ANY equip IN cars.equipments SATISFIES equip IN [“xx1”,"xx2…

SELECT DISTINCT(color.id)
FROM
(
SELECT colors FROM default USE INDEX
(test4co1)
WHERE documentType = "CARX"
AND cars.status = 'Open’
AND options IN [‘x1’,‘b21’]
AND ANY equip IN cars.equipments SATISFIES equip IN [“xx1”,“xx2”, “xx3”, “xx4”, “xx5”, “xx6”, “xx7”,…etc]
) as d
UNNEST d.colors as color;

Thanks