Complex index for a complex structure

cecca_92 · July 6, 2017, 1:26pm

Hi, I’ve these two types of structures:

PROF_
{
“prof_id”: “6603”,
“branch”: “Science Department”,
“first_name”: “Paolo”,
“publications”: {
“0”: {
“authors”: [
{
“af_country”: “ita”,
“auth_id”: “55910”,
“first_name”: “Pietro”,
},
…
}
…
}
}

CITY_
{
“authors”: [
{
“auth_id”: “57193309664”,
“prof_ids”: [
“70041”,
“55910”
]
}
],
“cityname”: “Kabul”,
“countryname”: “Afghanistan”,
“lat”: “34.5553494”,
“lon”: “69.207486”
}

and I need to extract the number of cities in every country, containing at least an author with at least a prof_id of a certain branch inside it.
So in the application, I start selecting a branch name and I expect the number of cities for every country, counting only the cities where the prof_id is present inside.
The query I built is the following, and it works. The only problem is that is really slow (from 9 to 20 seconds). How I could use an index on the first FOR, to speed up the execution time?

SELECT cities.countryname, cities.state_long_name, COUNT (cities.cityname) AS count_auths_countries
FROM bucket AS cities
WHERE ANY au IN cities.authors SATISFIES (
ANY pid IN au.prof_ids SATISFIES pid WITHIN (
SELECT DISTINCT (prof_id) FROM bucket USE INDEX (idx_branch_auth_id USING GSI)
WHERE branch LIKE ‘Science Department’ )
END )
END
GROUP BY cities.countryname, cities.state_long_name
ORDER BY cities.countryname, cities.state_long_name

CREATE INDEX idx_branches ON bucket(branch)

Thanks.

vsr1 · July 6, 2017, 1:36pm

what is index definition idx_branch_auth_id? Is this PROF_ document? I don’t see auth_id at top level. Is this giving any results.

cecca_92 · July 6, 2017, 1:46pm

Sorry, I made a mistake with one field name while writing it here.
I edited the question, also with the INDEX.

vsr1 · July 6, 2017, 1:54pm

It looks like PROF_, CITY_ , may be other documents using same bucket. Is there way you can differentiate different documents through inside document field or META().id prefix? Is CTY_ document will always will have countryname, cityname fields.

As you don’t have any other join relation ship through META().id the following is best query and indexes.

CREATE INDEX idx_branch_profid ON bucket(branch,prof_id);
CREATE INDEX idx_countryname_cityname ON bucket(countryname,cityname);

    SELECT cities.countryname, cities.state_long_name, COUNT (cities.cityname) AS count_auths_countries
    FROM bucket AS cities
         WHERE countryname IS NOT MISSING AND cityname IS NOT MISSING AND
               ANY au IN cities.authors SATISFIES (
                    ANY pid IN au.prof_ids SATISFIES pid IN (
                         SELECT RAW DISTINCT (prof_id) FROM bucket USE INDEX (idx_branch_profid USING GSI)
                                    WHERE branch LIKE 'Science Department' ) END )
               END
    GROUP BY cities.countryname, cities.state_long_name
    ORDER BY cities.countryname, cities.state_long_name;

You can also try this.

CREATE INDEX idx_branch_profid ON bucket(branch,prof_id);
CREATE INDEX idx_lenauthors_countryname_cityname ON bucket(ARRAY_LENGTH(authors), countryname,cityname);

SELECT cities.countryname, cities.state_long_name, COUNT (cities.cityname) AS count_auths_countries
FROM bucket AS cities
     WHERE countryname IS NOT MISSING AND cityname IS NOT MISSING AND 
                 ARRAY_LENGTH(authors) > 0 AND
           ANY au IN cities.authors SATISFIES (
                ANY pid IN au.prof_ids SATISFIES pid IN (
                     SELECT RAW DISTINCT (prof_id) FROM bucket USE INDEX (idx_branch_profid USING GSI)
                                WHERE branch LIKE 'Science Department' ) END )
           END
GROUP BY cities.countryname, cities.state_long_name
ORDER BY cities.countryname, cities.state_long_name;

cecca_92 · July 6, 2017, 6:51pm

Thanks, I’ve tried the second solution and that one reduced the execution time and now it’s only 3.50 in the worst case!
I didn’t know the “RAW” clause before.
Can I ask you why you created the index “idx_lenauthors_countryname_cityname” if than is never used? I thought that the reason is that the DMBS will choose that one automatically because it’s the better, but I’m not sure.

vsr1 · July 6, 2017, 7:12pm

Couchbase indexes are b-tree indexes and index selection is rule based on query predicate.

idx_lenauthors_countryname_cityname is used by outer query and idx_branch_profid used by subquery(which also makes covered that means no document fetch). You can omit USE INDEX clause in subquery it will pick the right index.

You have multiple different type of documents in same bucket but query want only cities in outer query (cities.authors in ANY clause. If it is MISSING,NULL or 0 length, ANY will be false).
If use primary index it will need to get all META().ids fetch the documents and discard non matching ones.
To limit the work I have added additional predicate to query and created index on those fields so that IndexScan will eliminate those early.

countryname IS NOT MISSING AND cityname IS NOT MISSING AND ARRAY_LENGTH(authors) > 0

You can do EXPLAIN on query you can see which indexes used.

Following N1QL: A PRACTICAL GUIDE has more details