How to improve query for 19W+ documents

atom_yang · September 20, 2016, 5:57am

I am using Couchbase Server 4.5.0-2601 Enterprise Edition (build-2601).
And I created a index by

CREATE INDEX `idx-GYEAR-COUNTRY` ON `default`(`GYEAR`,`COUNTRY`) WHERE (_type == "PTD") USING GSI;

And I am using MOI to improve query.

The following N1QL

SELECT count(1)
  FROM default g 
 WHERE g._type == "PTD"

shows in the index I have 198646 documents.
And my query service configure as

{"completed-limit":4000,"completed-threshold":1000,"cpuprofile":"","debug":true,"keep-alive-length":16384,"loglevel":"DEBUG","max-parallelism":2,"memprofile":"","pipeline-batch":512,"pipeline-cap":512,"request-size-cap":67108864,"scan-cap":0,"servicers":8,"timeout":0}

And the following N1QL

SELECT "1965" AS GYEAR,g.COUNTRY,count(1) AS `total`
  FROM default g 
 WHERE g._type == "PTD" AND g.GYEAR == "1965"
 GROUP BY g.COUNTRY
 ORDER BY `total` DESC LIMIT 10;

took 461.77ms
(GYEAR will be a variable)
The logs shows:

_time=2016-09-20T13:51:13.518+08:00 _level=TRACE _msg=Phase aggregates authorize=101.725Âµs _id=8d77f958-5075-4a1d-9f58-34fb39efa56e scan=403.540153ms sort=13.668Âµs run=455.619599ms plan=1.306225ms parse=4.330656ms instantiate=72.104Âµs

Scan took 403.540153ms

I want all the N1QL took time less in 100ms.
How should I improve query?
Add more index service/data service?
is there any suggest critical point for us to improve query,
such as if the records in the index is more than 1W, you should add more data service to keep the records in the index less than 1W.

atom_yang · September 20, 2016, 9:37am

Following are the result of my test

atom_yang · September 20, 2016, 9:57am

Here is a blog for performance of different data structures.

Following is the equivalent results for reads are as follows:

And as the blog shows that:

MOI can provide over 10x better latency and throughput under faster mutations to data compared to standard GSI and here is how it does that:

Lock-free processing for indexing simply allows massive concurrency when maintaining the incoming mutation to the index.
Skiplist structure optimize in-memory storage, as opposed to B+Tree indexes.
Forcing the index into memory storage, means MOI does not run at disk speeds for storing the index - instead it take regular snapshots to disk for recovery only.

And the Max Index RAM used % is only 6.53% for 20000 document, 8.3% for 80000 document.

atom_yang · September 20, 2016, 10:03am

As wiki shows that a skiplist is capable of fast {\displaystyle {\mathcal {O}}(\log n)} {\mathcal {O}}(\log n) insertion and removal of values from a sorted sequence, but it has only slow {\displaystyle {\mathcal {O}}(n)} {\mathcal {O}}(n) lookups of values at a given position in the sequence (i.e. return the 500th value); however, with a minor modification the speed of random access indexed lookups can be improved to {\displaystyle {\mathcal {O}}(\log n)} {\mathcal {O}}(\log n).

atom_yang · September 21, 2016, 4:31am

@geraldss @vsr1 Can you give me some advise on how to improve single index service/query service, and point to is there a inflection point for query service,such as 10^4 might be the inflection point?

keshav_m · September 21, 2016, 5:23am

Hi @atom_yang

Please try this:

To do the grouping of the entire data set, we need to get the whole index data and then do the grouping.
There are couple of optimizations in CB 4.5 for you to exploit this by writing a small script:

SELECT phylum, COUNT(1) FROM bucket group by phylumn;

Step 1. SELECT min(phylum) from bucket where type = ‘blah’;
step 2: select count(1) from bucket where < full condition> AND phylum =
step 3: SELECT min(phylum) from bucket where type = ‘blah’ where phylum > “the previous value
step 4. repeat 2, 3 till you no longer get the min(phylum) value.
Now you have full result with least processing.

–

CB 4.5 optimizes some of the COUNT() and does a quick look up of MIN() using the index order.

atom_yang · September 21, 2016, 5:39am

I am not quite understand the script.
for step 1.

Should I create a index for phylum by

CREATE INDEX `idx-phylum` ON `default`(`phylum`) WHERE (_type == "blah") USING GSI;

because when I run the N1QL,I found the N1QL just use PrimaryScan.

And at the same time,group by is only a example to find out is there a inflection point for query service?
And How should I optimize query service/index service?
Thank you.