Performance with large volume of calls


#1

I understand that the GA for N1QL is currently in the works. I just wanted to provide a case of the query engine currently works for us in the manner that we intend to eventually use it.

I have the couchbase server (2.2), the query engine (dp4) and the client code (using the 2.1.0 Java SDK) all operating on the same machine (windows 7)

The code I use for this test:

package main.java.dp4Test;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.Query;
import com.couchbase.client.java.query.QueryResult;
import com.couchbase.client.java.query.QueryRow;
import com.couchbase.client.java.env.;
import com.couchbase.client.java.query.Statement;
import static com.couchbase.client.java.query.Select.
;
import static com.couchbase.client.java.query.dsl.Expression.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.FileHandler;
import java.util.logging.SimpleFormatter;

public class SomexpLoscouwMain {

public static void main(String[] args) {
    Logger logger = Logger.getLogger("MyLog");
    FileHandler fh;
    try {
        // This block configure the logger with handler and formatter
        fh = new FileHandler("../../theLog.log");
        logger.addHandler(fh);
        SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(formatter);
        // the following statement is used to log any messages
        logger.info("My first log");
    } catch (SecurityException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                                                          .queryEnabled(true).queryTimeout(999999999)
                                                          .build();
    Cluster cluster = CouchbaseCluster.create(env, "127.0.0.1");
    Bucket bucket = cluster.openBucket("beer-sample", "");
    int stop = 0;
    long iteration = 1;
    long totalTime = 0;
    long over400 = 0;
    while (stop==0)
    {
        try
        {
            //String query = "SELECT * FROM `beer-sample`";
            String query = "SELECT * FROM beer-sample LIMIT 1";
            //String query = "SELECT * FROM `beer-sample` LIMIT 1";
            Statement statement = select("*").from("`beer-sample`").limit(1);
            Query cbquery = Query.simple(query);
            long start = System.currentTimeMillis();
            QueryResult result = bucket.query(statement);
            long finish = System.currentTimeMillis();
        if (result.finalSuccess())
        {
            Iterator<QueryRow> rows = result.rows();
            int rowCount = 0;
            while (rows.hasNext())
            {
                QueryRow row = rows.next();
                //System.out.println(row.value());
                rowCount++;
            }
            //System.err.println("Rows printed: " + rowCount);
        }
        else
        {
            for (JsonObject error : result.errors())
            {
                System.err.println(error);
            }
        }
        long timeToComplete = finish-start;
        logger.log(Level.FINE,timeToComplete + "Ms");
        }
        catch(Throwable t)
        {
            logger.log(Level.WARNING, t.getMessage());
            continue;
        }
        iteration++;
    }
}

}

So, the program logs the time it takes to receive a response from the query command and any throwables that occur during the process.

Initially it works well:

Feb 11, 2015 7:01:03 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: My first log
Feb 11, 2015 7:01:07 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 1435Ms
Feb 11, 2015 7:01:07 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 460Ms
Feb 11, 2015 7:01:08 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 376Ms
Feb 11, 2015 7:01:08 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 469Ms
Feb 11, 2015 7:01:09 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 425Ms
Feb 11, 2015 7:01:09 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 384Ms
Feb 11, 2015 7:01:10 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 457Ms
Feb 11, 2015 7:01:10 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 411Ms

On other occasions the time has been in the lower 200ms range on this same machine. I remember better results on the CentOS VMs, though those were with a different test than the one used here. Regardless of the usual results, however, in all cases there are occasional hiccups:

Feb 11, 2015 7:11:34 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 44824Ms
Feb 11, 2015 7:11:50 PM main.java.dp4Test.SomexpLoscouwMain main
WARNING: java.lang.IllegalStateException: Error parsing query response (in TRANSITION) at
Feb 11, 2015 7:11:51 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 309Ms

or

Feb 11, 2015 7:13:01 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 518Ms
Feb 11, 2015 7:13:02 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 832Ms
Feb 11, 2015 7:13:21 PM main.java.dp4Test.SomexpLoscouwMain main
FINE: 19066Ms

These become more frequent the longer the process runs. Eventually, I am able to notice the effects while I perform other tasks as this runs in the background.

Any thoughts on this? Should this type of volume be possible come GA, beta or alpha? Also, would there be a more appropriate way to go about making repetitive calls over a long period?

As an aside, I noticed that the SDK uses observables, which I am rather unfamiliar with. I was wondering if I need to take this into account to prevent calls from conflicting with each other.

Thank you for your time.


#2

The increasing reponse times seems to be a bug with N1QL DP4, @geraldss will know more about that.

Using Observables from the Java SDK is going to help you if you want to analyze the rows that get returned in “real time” and then perform operations on those. It won’t help you with the current issue because it seems to be a DP4 bug (the same behaviour can be seen when using the .NET SDK as @jmorris identified).

In your case you are just printing stuff, what do you eventually want to achieve? We can then see if an asynchronous solution will give you better performance.


#3

Hi somexp,

We are troubleshooting the slowdown, thanks to your posting. As designed, you may see cbq-engine acquire and retain more available memory, so that it is available when needed. However, you should not see the degradation in latency that you are seeing.

For repetitive tests, you have a few options. If you are testing against a cluster of cbq-engine services (coming in Beta), you can use one of the client SDKs. If you are testing against a single cbq-engine, you can use curl to post directly to the cbq-engine REST API. This will allow you to isolate your observations to cbq-engine alone.

Thanks,
Gerald


#4

The cbq-engine was what we believed the culprit was, also. Though, now that I am running other tests with curl with it running fine, it’s hard to tell. And this is with four looped calls coming from two separate VMs. However, this is on our CentOS VMs, so that might also be accounting for some difference. I should have more information to send on this in a little bit.

I’d include the original code that we were using when we encountered this issue, however, it’s quite sizable with many interdependent parts. So I try to reproduce these issues with smaller test examples.


#5

Ok. If you can share the curl setup, that would be great – queries, sample document, request throughput, and response latencies that you are getting.

Thanks.


#6

Here is an abridged copy of the document:

{
“id”: “1ffaa508-b5ac-462a-9ff8-90f15501238a”,
“TCID”: “b748c9cf-1242-4cd5-b628-d7d2825901db”,
“description”: “Network Topology”,
“name”: “topology”,
“linkInfoList”: [
{
“updateTimeMs”: “1516258106942”,
“sourceNode”: {
“id”: “b748c9cf-1242-4cd5-b628-d7d2825901db”,
“name”: “TC-09”,
“serviceDiscoveryRestURL”: “http://0.0.0.0:0000/tc-rest-services-1.0/rest/tc/service-discovery-service”,
},
“destinationNode”: {
“id”: “f2757544-db6a-4e18-a5ce-5d841a63d22e”,
“name”: “TC-04”,
“serviceDiscoveryRestURL”: “http://0.0.0.0:0000/tc-rest-services-1.0/rest/tc/service-discovery-service”,
},
“uniqueLinkId”: “f85ff3e1-9527-4b74-81da-6fd79cdc82a0”
},
{
“updateTimeMs”: “1420060039733”,
“sourceNode”: {
“id”: “b748c9cf-1242-4cd5-b628-d7d2825901db”,
“name”: “TC-09”,
“serviceDiscoveryRestURL”: “http://0.0.0.0:0000/tc-rest-services-1.0/rest/tc/service-discovery-service”,
},
“destinationNode”: {
“id”: “6ee4aeb5-34d4-474c-bb64-713389b53620”,
“name”: “TC-08”,
“serviceDiscoveryRestURL”: “http://0.0.0.0:0000/tc-rest-services-1.0/rest/tc/service-discovery-service”,
},
“uniqueLinkId”: “6c0cbe32-a9d8-4065-9a08-8b81c35a1036”
}
]
}

This is the curl script I was running:

while true; do

curl -i http://[IP Address]:8093/query/service -d ‘statement=SELECT linkInfoList.destinationNode.id FROM default top USE KEYS “topology” UNNEST top.linkInfoList WHERE linkInfoList.sourceNode.id=“b748c9cf-1242-4cd5-b628-d7d2825901db”’

done

Typical results:

{
“requestID”: “c3b5b8fd-159e-45eb-8be9-293f4f679691”,
“signature”: {
“id”: “json”
},
“results”: [
{
“id”: “f2757544-db6a-4e18-a5ce-5d841a63d22e”
},
{
“id”: “6ee4aeb5-34d4-474c-bb64-713389b53620”
}
],
“status”: “success”,
“metrics”: {
“elapsedTime”: “6.45256ms”,
“executionTime”: “6.214949ms”,
“resultCount”: 2,
“resultSize”: 136
}
}

Having four of this scripts running (two each on two separate VMs) and calling the same cbq-engine (remotely) did not seem to cause much of an issue. Issues did arise, however. Particularly when using the following script:

while true; do

curl -i http://[IP Address]:8093/query/service -d ‘statement=SELECT linkInfoList FROM default top USE KEYS “topology”’

done

All of the scripts resulted in high CPU usage (particularly on the side of the VM making the call), but the response time did not appear to change. When using the latter script, none of the metrics appeared too much the worse off from what I saw, but the putty window did start giving me issues. The response times were no different with the latter script, but the display would fail to respond, so I’m not confident that that would have a whole lot to do with couchbase.


#7

Thanks again. Do keep us posted if you run into other issues / questions.

-Gerald


#8

This may be a bit redundant, but I tried something similar to the first test on the CentOS VMs.

The client is running on one of the VMs I used for the curl commands. The cbq-engine and Couchbase Server are also running together on the VM I used for that test.

As before, I am using the Java SDK 2.1, cbq-engine dp4 and the Couchbase Server 2.2.0. I have the results of another test using the 3.0 server, which yield the same results. I can produce those results too.

I changed the code slightly from before. The main difference is that I take the average latency of every 100 queries rather than the latency of every individual query:

package main.java.dp4Test;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.Query;
import com.couchbase.client.java.query.QueryResult;
import com.couchbase.client.java.query.QueryRow;
import com.couchbase.client.java.env.;
import com.couchbase.client.java.query.Statement;
import static com.couchbase.client.java.query.Select.
;
import static com.couchbase.client.java.query.dsl.Expression.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.FileHandler;
import java.util.logging.SimpleFormatter;

public class SomexpLoscouwMain {

public static void main(String[] args) {
    Logger logger = Logger.getLogger("MyLog");
    FileHandler fh;
    try {
        fh = new FileHandler("/opt/dp4Test/logs/theLog.log");
        logger.addHandler(fh);
        SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(formatter);
    } catch (SecurityException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                                                          .queryEnabled(true)
                                                          .build();
    Cluster cluster = CouchbaseCluster.create(env, [IP Address]);
    Bucket bucket = cluster.openBucket("default", "");
    int stop = 0;
    long iteration = 1;
    long totalTime = 0;
    while (stop==0)
    {
        try
        {
            //String query = "SELECT * FROM `beer-sample`";
            String query = "SELECT * FROM default LIMIT 1";
            //String query = "SELECT * FROM `beer-sample` LIMIT 1";
            Query cbquery = Query.simple(query);
            long start = System.currentTimeMillis();
            QueryResult result = bucket.query(cbquery);
            long finish = System.currentTimeMillis();
        if (result.finalSuccess())
        {
            Iterator<QueryRow> rows = result.rows();
            int rowCount = 0;
            while (rows.hasNext())
            {
                QueryRow row = rows.next();
                //System.out.println(row.value());
                rowCount++;
            }
            //System.err.println("Rows printed: " + rowCount);
        }
        else
        {
            for (JsonObject error : result.errors())
            {
                System.err.println(error);
            }
        }
        long timeToComplete = finish-start;
            if (iteration%100==0)
            {
                double average = (double)totalTime/100.0;
                totalTime = 0;
                logger.info("Average query time: " + average);
            }
            else
            {
                totalTime += timeToComplete;
            }
        }
        catch(Throwable t)
        {
            logger.info(t.getMessage());
            continue;
        }
        iteration++;
    }
}

}

The results show initially good latency times, but they quickly deteriorate and do not recover. Eventually, it times out to the point that I can wait 10 minutes without receiving and average latency back. This is with the default timeout.

This is what is logged initially:

Feb 13, 2015 10:07:24 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 25.83
Feb 13, 2015 10:07:25 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 15.58
Feb 13, 2015 10:07:27 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 13.97
Feb 13, 2015 10:07:28 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 12.88
Feb 13, 2015 10:07:30 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 14.95
Feb 13, 2015 10:07:31 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 10.57
Feb 13, 2015 10:07:32 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 12.08
Feb 13, 2015 10:07:33 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 10.85
Feb 13, 2015 10:07:34 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 10.8
Feb 13, 2015 10:07:35 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 12.78
Feb 13, 2015 10:07:36 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 9.48
Feb 13, 2015 10:07:37 PM main.java.dp4Test.SomexpLoscouwMain main

After about a minute it begins to deteriorate:

INFO: Average query time: 11.17
Feb 13, 2015 10:08:51 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 100.68
Feb 13, 2015 10:09:01 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 95.3
Feb 13, 2015 10:09:06 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 50.94
Feb 13, 2015 10:09:11 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 47.6
Feb 13, 2015 10:09:18 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 72.88
Feb 13, 2015 10:09:25 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 69.7
Feb 13, 2015 10:09:28 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 29.82
Feb 13, 2015 10:09:46 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 183.18
Feb 13, 2015 10:10:13 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 255.25
Feb 13, 2015 10:10:56 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: Average query time: 426.1
Feb 13, 2015 10:12:23 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:13:38 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:14:53 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:16:08 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:17:23 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:18:38 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:19:53 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:21:08 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:22:23 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:23:38 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:24:31 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.lang.IllegalStateException: Error parsing query response (in TRANSITION) at
Feb 13, 2015 10:25:46 PM main.java.dp4Test.SomexpLoscouwMain main
INFO: java.util.concurrent.TimeoutException
Feb 13, 2015 10:27:01 PM main.java.dp4Test.SomexpLoscouwMain main

Not sure if this shows anything new, but it’s another angle in an environment more closely resembling what we were trying to do. When we initially encountered this it happened when we were querying only every second or two, but eventually the latencies became so long and the timeouts so frequent that they would interfere with the queries still waiting in line. It helped to set a shorter queryTimeout in the couchbaseEnvironment and adding a timeout to the query engine, but that did not seem to prevent the issue of performance degrading over time.

Thanks for everything.


#9

Hi,

And how do these new tests and environment perform if you only use curl and bypass the client SDK?

Thanks,
Gerald


#10

Seems better. The earlier curl tests were done in the same environment as this last one, though the command was different. Here’s another test done today using the same command:

while true; do

curl -i http://[IP Address]:8093/query/service -d ‘statement=SELECT * FROM default LIMIT 1’

done

I left it running for about 15 minutes in the same environment as used by the last sdk test, and it did not deteriorate. Though I still did see 100+ms latencies return occasionally, it would recover immediately. Most of the latencies were in the single digits. I did not experience any permanent changes in performance.

Here’s a snapshot of some of the returns I got after about 10 minutes:

“metrics”: {
“elapsedTime”: “5.2501ms”,
“executionTime”: “5.092687ms”,
“resultCount”: 1,
“resultSize”: 1346

If, however, I were to use multiple curl commands coming from multiple clients I would experience issues, but these would not be seen in latency but instead just make the putty window unresponsive and produce high cpu usage on the machines running the commands.


#11

Thanks again for all the helpful data. So to your original question, the answer is yes, N1QL is aimed squarely at usage like yours. It should not degrade (except for the garbage-collection pauses you are seeing); it should support multiple clients (we will troubleshoot the multiple curl problem); and the client SDKs should work well as they mature towards GA.

Please lmk here if you run into other questions or observations.

Thanks,
Gerald