Performance drop when system is under load

Hi Couchbase Support,

I am using couchbase server 4.5. We are facing huge performance degradation after application is under load. Couchbase N1QL query (using GSI) degrades with time under constant load.

System Configuration:
Server : AWS C4.X.large
Quad Core 2.9 GHz processor, RAM 8 GB.

Couchbase Server configuration
Data Ram Quota: 5000 MB
Index Ram Quota : 1000MB
Full Text RAM : 256 ( we are not using full text search)

We have around 1.3 million documents.
cbq> explain SELECT meta().id, * FROM cms WHERE _type = “Product” AND _active = TRUE AND status = “ONLINE” AND -_createdAt < 0 LIMIT 1000 OFFSET 92000;
{
“requestID”: “c96069c1-d603-4664-9a0d-da851950e693”,
“signature”: “json”,
“results”: [
{
“plan”: {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Sequence”,
"~children": [
{
"#operator": “IndexScan”,
“index”: “prod_status_createdAt_idx”,
“index_id”: “717499adf6dda570”,
“keyspace”: “cms”,
“limit”: “(92000 + 1000)”,
“namespace”: “default”,
“spans”: [
{
“Range”: {
“High”: [
"“Product”",
“true”,
"“ONLINE”",
“0”
],
“Inclusion”: 0,
“Low”: [
"“Product”",
“true”,
"“ONLINE”",
“null”
]
}
}
],
“using”: “gsi”
},
{
"#operator": “Parallel”,
"~child": {
"#operator": “Sequence”,
"~children": [
{
"#operator": “Fetch”,
“keyspace”: “cms”,
“namespace”: “default”
},
{
"#operator": “Filter”,
“condition”: “(((((cms._type) = “Product”) and ((cms._active) = true)) and ((cms.status) = “ONLINE”)) and ((-(cms._createdAt)) \u003c 0))”
},
{
"#operator": “InitialProject”,
“result_terms”: [
{
“expr”: “(meta(cms).id)”
},
{
“expr”: “self”,
“star”: true
}
]
},
{
"#operator": “FinalProject”
}
]
}
}
]
},
{
"#operator": “Offset”,
“expr”: “92000”
},
{
"#operator": “Limit”,
“expr”: “1000”
}
]
},
“text”: “SELECT meta().id, * FROM cms WHERE _type = “Product” AND _active = TRUE AND status = “ONLINE” AND -_createdAt \u003c 0 LIMIT 1000 OFFSET 92000”
}
],
“status”: “success”,
“metrics”: {
“elapsedTime”: “41.561253ms”,
“executionTime”: “41.525795ms”,
“resultCount”: 1,
“resultSize”: 3991
}
}

index : CREATE INDEX prod_status_createdAt_idx ON cms(_type,_active,status,(-_createdAt)) WHERE ((_type = “Product”) and (_active = true));

Scenario


Applying load on a server with 30 threads, all calling an API that consequently hits the above query.
But post 60 odd requests, all APIs start throwing timeout exception.
Error in logs below:

19:53:01 [http-nio-8110-exec-56] INFO ib.cms.dao.CouchbaseDao - Executing Query: SELECT meta().id, * FROM cms WHERE _type = “Product” AND _active = TRUE AND status = “ONLINE” AND -_createdAt < 0 LIMIT 1000 OFFSET 164000
19:53:08 [http-nio-8110-exec-87] ERROR ib.cms.api.controller.ApiExceptionHandler - java.util.concurrent.TimeoutException
java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:75)
at com.couchbase.client.java.CouchbaseBucket.query(CouchbaseBucket.java:656)
at com.couchbase.client.java.CouchbaseBucket.query(CouchbaseBucket.java:647)
at com.couchbase.client.java.CouchbaseBucket.query(CouchbaseBucket.java:577)
at ib.cms.dao.CouchbaseDao.execute(CouchbaseDao.java:413)

This was produced while doing PT of our application. Please ask if any other detail you need.
We are planning to go live this month. Please help us how we can upgrade the performance.
If we fail to meet 5-10 through (with an avg of 100ms response time) put for such query we wouldn’t be able to go live.

Hi @geraldss , @keshav_m , @johan_larson.

Please help us here. We are waiting for your reply.
Meanwhile we tried N1QL parameterized query. Following is query

{“statement”:“SELECT meta().id, * FROM cms WHERE _type = “Product” AND _active = TRUE AND status = $status AND -_createdAt < $cAt LIMIT 100 OFFSET 0”,"$status":“ONLINE”,"$brandId":null,"$catId":null,"$cAt":0}with params N1qlParams{serverSideTimeout=‘null’, consistency=null, scanWait=‘null’, clientContextId=‘null’, maxParallelism=null, adhoc=false}.


You seem to be using a very high OFFSET value. Are that typical of the queries you are issuing?

OFFSET processing isn’t free. The query engine has to churn past all of those earlier results without returning them.

On the client side you can also try to increase the number of queryEndpoints on the environment to open more sockets against the server. Maybe you get better throughput for the cost of additional latency

The query is part of pagination, hence is the offset.
I am little confused about getting no performance gain after using prepared statements in the query. The CPU usage was same in both the cases.

Earlier I was thinking perhaps it was cbq-engine which was consuming 60-70% full CPU due to query processing, plan creation, etc. But there was no improvement by caching queries statements.

All the indexes were scanned from memory as there were no disk IOps.

If each such query going to consume 50-60% CPU and takes > 50 seconds under very minimal load (2-4 concurrent requests constantly) Its going to be very costly to put multiple boxes to achieve 5-10 through put.

Can you try these queries using cbq shell, so that we can isolate away client-side issues.

Tell us the results from using cbq shell, and we will help you from there. Also make sure you are using 4.5 GA.

I am not sure to fire 4-6 concurrent request constantly using cbq shell automatically.
Yes we are using 4.5 GA. I am not seeing any bottleneck from client side here. The sever load factor (as you see in the top output image) goes to 5-6 due to cbq-engine.

You can write a simple shell script or python script and use curl to post the queries.

Sure. Will do it in few hours and update here.

We are using "4.5.0-2601 Enterprise Edition (build-2601) " as @geraldss suggested, we hit request directly on cbq as follow

http://52.66.140.239:8093/query/service?statement=SELECT+meta().id%2C+*+FROM+cms+WHERE+_type+%3D+'Product'+AND+_active+%3D+TRUE+AND+status+%3D+'ONLINE'+AND+-_createdAt+<+0+LIMIT+1000+OFFSET+1000

Offset is being increased in loop by 1000.

Samples : 100 using 8 threadsAverage response time : 30083Max. Response Time : 50210Throughtput : 16.0/min
CPU load average went upto 6.10 and CPU being taken was close to 90-100% all the time.

Still we are facing same issue.

@vsr1, any thoughts or suggestions?

Could you please let us know how many CPU’s you have on the machine where query engine is running. The number of concurrent can process depends on the number of CPU’s.

Please drop your index and create the following index (index is LEAN and perform better). Verify through explain. Try your original query. If the offset keep increasing the following query should perform better.

CREATE INDEX prod_status_createdAt_idx ON cms(-_createdAt) WHERE ((_type= “Product”) AND (_active = true) AND (_status = “ONLINE”));

SELECT meta(c2).id, c2 FROM (SELECT raw meta().id FROM cms WHERE type = “Product” AND _active = TRUE AND status = “ONLINE” AND -createdAt < 0 LIMIT 1000 OFFSET 92000) mlist INNER JOIN cm2 c2 ON KEYS mlist;