Misconception about scaling (query speed) with several nodes

Setup: Couchbase Server 6.5 Beta EE on two nodes, connected via a 1 Gbps switch.

This post is about my misconception. It isn’t about “this query is slow” or “how can I create an index to speed things up”.

I intentionally executed a slow, long-running query. The setup is two nodes, both running the query service. Below are screenshots of how many gets per second are executed. As I understand it, gets per second is a good measure of how fast this particular query runs. The query is the last one run in the screenshot and took several minutes. Both nodes average around 12.5k gets per second, so together around 25k gets per second.


The setup is a single node. Average gets per second is ~30k.

The setup is two nodes, with only one node running the query service. Average gets per second and the graph spikes are very similar to the first setup.


My conclusion is that a single node is faster than two nodes. My assumption was that several nodes share the query execution; I also hoped that it scales nearly linearly. Clearly I’m wrong and I’m looking for some insight.

Thanks!

I did some more testing. For the following graphs I went back to Couchbase Server 6.0 CE on Ubuntu 16.04. The hardware setup is:

  • 2 nodes
  • same hard drive
  • same RAM
  • 1x 6 core / 12 thread CPU
  • 1x 4 core / 8 thread CPU

In all three tests below, 100% of documents were resident (in RAM). There were no cache misses, so the hard drive is not a factor. This time I ran a query which only took a few seconds. The query was similar to:

SELECT COUNT(*) 
FROM my_bucket
WHERE type = 'myType'
AND myInteger = 0;

The query used this index:

CREATE INDEX `idx_type` ON `my_bucket`(`type`) 
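A side note on why the data-service gets dominate here: `idx_type` indexes only `type`, so the `myInteger` predicate cannot be evaluated from the index, and the query service has to fetch every candidate document from the data service before counting. A covering index that includes both filter fields would avoid those fetches entirely. A minimal sketch (the index name `idx_type_int` is made up for illustration):

```sql
-- Hypothetical covering index: both predicates are index keys, so the
-- COUNT(*) query can be answered from the index scan alone, with no
-- document fetches from the data service.
CREATE INDEX `idx_type_int` ON `my_bucket`(`type`, `myInteger`);
```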

Test A

  • with 6 core CPU
  • single node cluster
  • gets per second: 60k - 70k

Test B

  • with 4 core CPU
  • single node cluster
  • gets per second: 30k

Test C

  • with both CPUs (6 core and 4 core)
  • 2 node cluster

Some notes:

  • Nodes should have equal hardware, especially the CPU
  • Multidimensional scaling is not supported in CE
  • Gets per second from Test A plus gets per second from Test B do not add up to the gets per second from Test C
  • 2 nodes are slower than 1 node, even though that 1 node is part of the 2 node cluster

Question

How many gets per second can be expected if both nodes had the same 6 core CPU? Would it be 2× 60-70k, i.e. 120k-140k?

You can always change the query service settings to get more keys per batch. Currently, since your index doesn’t fully cover your query, it must fetch the documents and then count them.

I would recommend you create a better index, for example:
CREATE INDEX idx_type_1 ON my_bucket(type) WHERE type = "myType" and myInteger = 0

Re-run your query and check the index service time (row 3, column 4 below) and the index cache miss ratio (row 4, column 2). Since you’re running all the services on the same machines, it’s better to make the services do less work and move less data around when you query them.
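To verify that an index actually covers a query, one generic check (not from the original posts) is to run EXPLAIN and inspect the plan:

```sql
-- If the scan is covering, the IndexScan section of the plan contains
-- a "covers" array and there is no separate Fetch step.
EXPLAIN SELECT COUNT(*)
FROM my_bucket
WHERE type = 'myType'
AND myInteger = 0;
```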

You’ll find that you probably have to scale to bigger boxes. You’ll see it in your query profile as “kernTime”: it will grow and grow as your cluster gets bigger and the services fight for resources from the kernel. The Couchbase services will need to be isolated on different machines for better performance.

@househippo I understand that fully covering the query with an index is going to speed it up by a lot. My current dev setup was a single node, and I assumed that by adding another node the query time would be cut in half. I was wrong about that: 2 nodes are slower than a single node. My findings are in my previous post in this thread.

I believe this option does not exist in CE. In EE there are:

  • query pipeline batch
  • query pipeline cap
  • query scan cap
  • query timeout
  • query prepared limit
  • query completed threshold
  • query log level
  • query max parallelism

What is the setting to get more keys per batch, please?

I created the index. Querying for type and myInteger = 0 shows in the statistics that there was 1 request/sec; all other graphs don’t change.
Querying for type and myInteger = 1 shows 60M bytes returned/sec for 1 second and a scan latency of 4000 ns for 1 second. I will need to read some documentation or other people’s experiences, as I wouldn’t know whether this is good performance for my particular hardware.

I was unable to find kernTime in CE. “Bigger boxes” can mean many different things. My goal for development purposes is to try out many queries: I might not get a query right the first time and need to edit it and run it again, and again. So I won’t have everything indexed; this is all to save time while developing. Once I have the right query, I can create an index for production. So, for (document) gets per second, should I invest in:

  • several nodes vs single node
  • high core count vs high base/boost frequency (e.g. 5 GHz)
  • fast RAM vs tighter timings

Since I import a backup of the production bucket into my dev cluster, I could also back up only a small part of the bucket. That might be the best option. I’m still interested in the other points, though. And in case I didn’t mention it before: I’m not complaining about any slowness, but rather interested in how to speed things up (without creating an index).

Thanks!

Some more info provided by @vsr1:

This answers at least this bullet point:

  • high core count vs high base/boost frequency (e.g. 5 GHz)