Aggregation performance while using GSI

I have created a sample bucket called users where each document has first_name property.
I have created a primary index + indexed first_name using the following indices:
CREATE PRIMARY INDEX PX_USERS ON users-test USING GSI
CREATE INDEX SX_USERS_FN ON users-test(first_name) USING GSI

The bucket contains 1M records. There are 20 different first names (50K documents per first name). There are 50 different last names (20K documents per last name)

Now i am trying to query this bucket (via cbq) and the following scenarios were tested:

  1. running the query: SELECT COUNT(*) as cnt, first_name from users-test WHERE first_name IS NOT MISSING GROUP BY first_name
    The result returned after 2m45s but got error 12015 (Index scan timed out)
  2. running the query: SELECT COUNT(*) as cnt, first_name from users-test WHERE first_name = “Gretchen” GROUP BY first_name
    The result returned after 7s
  3. running the query: SELECT COUNT(*) as cnt, last_name from users-test WHERE last_name IS NOT MISSING GROUP BY last_name
    The result returned after 2m30s without any timeout error (last_name is NOT indexed)
  4. running the query: SELECT COUNT(*) as cnt, last_name from users-test WHERE last_name = “Smith” GROUP BY last_name
    The result returned after 2m22s (last_name is NOT indexed)

Conclusions and questions:
Queries 1 and 3 performs the same even though the first is using the GSI and the first is using the primary index (saw that using EXPLAIN).
In relational DBs, when having an index over a column and the following SQL queries run, the results can return much faster. Since it takes so long to get the result and after reading about the high performance provided by N1QL, I am thinking that maybe i am doing something wrong.

The node with Index Service (contains data and query services as well) is running with 50% CPU when querying.

Hi @eyalrubichi,
one quick recommendation is to utilize the parallelized execution. you can do this as long as you have CPU capacity and in your case seems like you may not have much left (50% utilized already).
you can tune the parallelization using the max_parallelism option on the REST API. We are adding this to cbq in the coming coming release but until then, with 4.1 you need to use the REST API directly. Here is an example;

> http://10.0.0.45:8093/query/service?statement=select count(*) from default where a2=4703626 group by a2&max_parallelism=4

it will help parallelize your fetches which I suspect is the most expensive piece in your queries above.

We are also working on a few improvements in performance in this area in the coming release. Especially, upcoming improvements on count will likely boost the results. Please let me know if you’d be interested in previewing the new version.
thanks
-cihan

In a select query I have 25 attributes to select and the content size is around 500-1000Kb.
In bucket around 50 documents are present. It is taking around 10 s to fetch the all data on basis of one attribute condition.
How can make the performance more faster like 500ms ?

Post the query and index definition

You can set pretty=false and adjust the pipeline-batch

curl -u Administrator:pass http://localhost:8093/admin/settings -XPOST -d '{"pretty":false,"pipeline-batch":1024,"pipeline-cap":1024}'

This I the query:
select meta©.id as id,
attribute1,
,attribute2
,attribute3
,attribute4
,attribute5
,attribute7
,attribute8
,attribute9
,attribute10
,attribute11
,attribute12
,attribute13
,attribute14
,attribute15
,attribute16
,attribute17
,attribute18
,attribute19
,attribute20
,attribute21
,attribute22
,attribute23
,attribute24
,attribute25 from Bucket c where c.attribute8="<>";
Index is built on attribute8 and one covered index with all attributes(1…25).

Please help how we can increase the performance ? Is there any hints concept like RDBMS which can be used .

You have too many attributes best option will be not use covered index.
CREATE index ix1 on bucket(attribute8);