I have a problem with ARRAY INDEXES

rodrigo.rio · September 21, 2016, 12:02pm

Hello , I have some problems when create and use an Array index. I try to count the distinct colors ids are in the XXX documents, and the query is very slow when has a lot of documents.

Here my documents structure:
[
{
“xxx”: {
“aaaa”: “car”,
"colors": [
{
“class”: “1234”,
“id”: “1234”
},
{
“class”: “2222”,
“id”: “2222”
}
],
“documentType”: “CARX”
},
{
“xxx”: {
“aaaa”: “car”,
“colors”: [
{
“class”: “3333”,
“id”: “1234”
},
{
“class”: “4444”,
“id”: “4444”
}
],
“documentType”: “CARX”
}
}
]

INDEX
CREATE INDEX test2
ON APP_SPEC_DATA(documentType, (DISTINCT ARRAY c.id FOR c IN colors END))
WHERE documentType = ‘CARX’

and the Query
SELECT distinct(c.id)
FROM APP_SPEC_DATA AS cars
USE INDEX (test2)
UNNEST cars.colors as c
WHERE cars.documentType = 'CARX

So, when execute the query with the INDEX, returns empty results, and when execute without index, returns some results
WHY? Where I have the problem?

atom_yang · September 21, 2016, 2:01pm

Try this

 SELECT DISTINCT(color.id)
   FROM APP_SPEC_DATA AS cars
 UNNEST cars.colors as color 
  WHERE cars.documentType == "CARX"
    AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING)  END

rodrigo.rio · September 21, 2016, 2:20pm

No, I still receive empty results.

atom_yang · September 21, 2016, 2:34pm

How explain shows?

my env works:

CREATE INDEX test4c
ON default(documentType, (DISTINCT ARRAY c.id FOR c IN colors END))
WHERE documentType == "CARX"

SELECT DISTINCT(color.id)
FROM default AS cars
USE INDEX (test4c)
UNNEST cars.colors as color 
WHERE cars.documentType == "CARX"
  AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING)  END

my env shows looks like as following:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "test4c",
            "index_id": "9b669f2ffd67b93b",
            "keyspace": "default",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "successor(\"CARX\")"
                  ],
                  "Inclusion": 1,
                  "Low": [
                    "\"CARX\"",
                    "null"
                  ]
                }
              }
            ],
            "using": "gsi"
          }
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Fetch",
                "as": "cars",
                "keyspace": "default",
                "namespace": "default"
              },
              {
                "#operator": "Unnest",
                "as": "color",
                "expr": "(`cars`.`colors`)"
              },
              {
                "#operator": "Filter",
                "condition": "(((`cars`.`documentType`) = \"CARX\") and any `c` in (`cars`.`colors`) satisfies ((`c`.`id`) is not missing) end)"
              },
              {
                "#operator": "InitialProject",
                "distinct": true,
                "result_terms": [
                  {
                    "expr": "(`color`.`id`)"
                  }
                ]
              },
              {
                "#operator": "Distinct"
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        },
        {
          "#operator": "Distinct"
        }
      ]
    },
    "text": "SELECT DISTINCT(color.id)\nFROM default AS cars\nUSE INDEX (test4c)\nUNNEST cars.colors as color \nWHERE cars.documentType == \"CARX\"\n  AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING)  END"
  }
]

and the N1QL return:

[
  {
    "id": "1234"
  },
  {
    "id": "4444"
  },
  {
    "id": "2222"
  }
]

rodrigo.rio · September 22, 2016, 7:13pm

Thanks, It’s worked,
But now I have a other problems when in the query I have a filter with 5000 options in an IN.

SELECT DISTINCT(color.id)
FROM default AS cars
USE INDEX (test4c)
UNNEST cars.colors as color
WHERE cars.documentType == “CARX” AND options IN [‘x1’,‘b21’,…]
AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING) END

The options is a string field in a XXX Documents

“xxx”: {
“options”: “x1”,
“aaaa”: “car”,
“colors”: [
{
“class”: “1234”,
“id”: “1234”
},

When I add this filter crash the couchbase.
Can I optimize the index or create a new one? and How?
thanks!!

atom_yang · September 23, 2016, 12:11am

try to create index

CREATE INDEX test4co
ON default(documentType,options, (DISTINCT ARRAY c.id FOR c IN colors END))
WHERE documentType == "CARX"

And then

SELECT DISTINCT(color.id)
FROM default AS cars
USE INDEX (test4co)
UNNEST cars.colors as color 
WHERE cars.documentType == "CARX" AND cars.options IN ['x1','b21'] 
AND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING) END

The explain show

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "test4co",
            "index_id": "134104f09296eeda",
            "keyspace": "default",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "\"CARX\"",
                    "successor(\"b21\")"
                  ],
                  "Inclusion": 1,
                  "Low": [
                    "\"CARX\"",
                    "\"b21\"",
                    "null"
                  ]
                }
              },
              {
                "Range": {
                  "High": [
                    "\"CARX\"",
                    "successor(\"x1\")"
                  ],
                  "Inclusion": 1,
                  "Low": [
                    "\"CARX\"",
                    "\"x1\"",
                    "null"
                  ]
                }
              }
            ],
            "using": "gsi"
          }
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Fetch",
                "as": "cars",
                "keyspace": "default",
                "namespace": "default"
              },
              {
                "#operator": "Unnest",
                "as": "color",
                "expr": "(`cars`.`colors`)"
              },
              {
                "#operator": "Filter",
                "condition": "((((`cars`.`documentType`) = \"CARX\") and ((`cars`.`options`) in [\"x1\", \"b21\"])) and any `c` in (`cars`.`colors`) satisfies ((`c`.`id`) is not missing) end)"
              },
              {
                "#operator": "InitialProject",
                "distinct": true,
                "result_terms": [
                  {
                    "expr": "(`color`.`id`)"
                  }
                ]
              },
              {
                "#operator": "Distinct"
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        },
        {
          "#operator": "Distinct"
        }
      ]
    },
    "text": "SELECT DISTINCT(color.id)\nFROM default AS cars\nUSE INDEX (test4co)\nUNNEST cars.colors as color \nWHERE cars.documentType == \"CARX\" AND cars.options IN ['x1','b21'] \nAND ANY c IN cars.colors SATISFIES (c.id IS NOT MISSING) END"
  }
]

but may be there will be a issue on index with in condition, @vsr1 might help to confirm .

vsr1 · September 23, 2016, 4:36am

@rodrigo.rio, @atom_yang, As you are not pushing any array predicates to indexer, array index is not needed.

You can try the following covered index and query.

CREATE INDEX test4co1
ON default(options, colors) WHERE documentType = “CARX”;

SELECT DISTINCT(color.id) FROM default AS cars
USE INDEX (test4co1)
UNNEST cars.colors as color
WHERE cars.documentType = “CARX” AND cars.options IN [‘x1’,‘b21’];

vsr1 · September 23, 2016, 4:40am

Did cbq-engine panic? Can you attach stack trace. @geraldss

geraldss · September 23, 2016, 5:09am

And also tell us what version of Couchbase you are using.

rodrigo.rio · September 23, 2016, 11:05am

Hello @geraldss, the version is 4.5.1-2841 Enterprise Edition (build-2841).

rodrigo.rio · September 23, 2016, 11:20am

With this index, It works but the performance I think that is not good, I have an average execution time of 9 o 10 seconds,
This query It should take two seconds at most.

atom_yang · September 23, 2016, 11:35am

How many document do you have ?
And how many document match the condition?
And what the explain show?

rodrigo.rio · September 23, 2016, 1:30pm

I have 700,000 documents aprox, and matched 1604. And the time to execution was 25 secs with 6000 elements in the IN.
INDEX:
CREATE INDEX test4 ON APP_SPEC_DATA(documentType,(DISTINCT ARRAY c.id FOR c IN colors END))) WHERE (((documentType = “CARX”)
the explain show

[
{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “DistinctScan”,
“scan”: {
"#operator": “IndexScan”,
“index”: “test4”,
“index_id”: “8d57d5eea3eda838”,
“keyspace”: “APP_SPEC_DATA”,
“namespace”: “default”,
“spans”: [
{
“Range”: {
“High”: [
"“CARX”"
],
“Inclusion”: 0,
“Low”: [
"“CARX”",
“null”
]
}
}
],
“using”: “gsi”
}
},
{
"#operator": “Fetch”,
“as”: “cars”,
“keyspace”: “APP_SPEC_DATA”,
“namespace”: “default”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Unnest”,
“as”: “c”,
“expr”: “(cars.colors)”
},
{
"#operator": “Filter”,
“condition”: “((((((cars.documentType) = “CARX”) and ((cars.colors) in [“GDC”, “YMX”, “YMW”, “YMT”, “YND”, “GDL”, “OGG”, “YNB”, “OGD”, “YNA”, “OGB”,…,…,…,…,… etc]))”
},
{
"#operator": “InitialGroup”,
“aggregates”: [
“count(distinct (c.id))”
],
“group_keys”: []
}
]
}
},
{
"#operator": “IntermediateGroup”,
“aggregates”: [
“count(distinct (c.id))”
],
“group_keys”: []
},
{
"#operator": “FinalGroup”,
“aggregates”: [
“count(distinct (c.id))”
],
“group_keys”: []
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “InitialProject”,
“result_terms”: [
{
“as”: “colors”,
“expr”: “count(distinct (c.id))”
}
]
},
{
"#operator": “FinalProject”
}
]
}
}
]
},
“text”: “SELECT count(distinct(c.id)) as colors\nFROM APP_SPEC_DATA AS cars \nUNNEST cars.colors c\nWHERE cars.documentType = ‘CARX’ \n AND cars.colors IN\n[‘GDC’,‘YMX’,‘YMW’,‘YMT’,‘YND’,‘GDL’,‘OGG’,‘YNB’,‘OGD’,‘YNA’,‘OGB’, ,…,…,…,…,… etc]”
}
]

vsr1 · September 23, 2016, 2:22pm

Could you tell me how much this query returns.
what is avarage entires in the colors.

SELECT count(1) FROM default AS cars
USE INDEX (test4co1)
WHERE cars.documentType = “CARX” AND cars.options IN [‘x1’,‘b21’];

vsr1 · September 23, 2016, 2:34pm

Also try the following and see it improves.
CREATE INDEX test4co1 ON default(options, colors) WHERE documentType = “CARX”;

SELECT DISTINCT(color.id) FROM (SELECT colors FROM default USE INDEX (test4co1) WHERE documentType = “CARX” AND options IN [‘x1’,‘b21’]) as d UNNEST d.colors as color;

rodrigo.rio · September 23, 2016, 2:42pm

returns: [
{
“colors”: 3382
}
]

and there is avarage 2.3 colors.

vsr1 · September 23, 2016, 2:51pm

How much this takes.

SELECT DISTINCT(color.id) FROM (SELECT colors FROM default USE INDEX (test4co1) WHERE documentType = “CARX” AND options IN [‘x1’,‘b21’]) as d UNNEST d.colors as color;

Also can you post explain.

rodrigo.rio · September 23, 2016, 4:36pm

Great!, is better. Takes two seconds. Now, Options can or cant’t appear in the Where Clause. in Case that don’t appear the query takes a lot of time.
How do I cover all the cases? Must I create each of the indexes for each combination in the Where clause?

vsr1 · September 23, 2016, 4:50pm

@rodrigo.rio, If query is same options not appear, If queries are different you can analyze them and create appropriate indexes.

CREATE INDEX test4co1 ON default(documentType,options, colors) WHERE documentType = “CARX”;

rodrigo.rio · September 28, 2016, 12:28pm

Hello again! , I tried to resolve all cases but I found one that I can’t resolve.

In this case the query take too much time and I don’t find the correct index, Can you help me?

I think that the problem is when add ANY equip IN cars.equipments SATISFIES equip IN [“xx1”,"xx2…

SELECT DISTINCT(color.id)
FROM
(
SELECT colors FROM default USE INDEX
(test4co1)
WHERE documentType = "CARX"
AND cars.status = 'Open’
AND options IN [‘x1’,‘b21’]
AND ANY equip IN cars.equipments SATISFIES equip IN [“xx1”,“xx2”, “xx3”, “xx4”, “xx5”, “xx6”, “xx7”,…etc]
) as d
UNNEST d.colors as color;

Thanks