I am evaluating Couchbase 4.5 on CentOS and, earlier, on Ubuntu. On both operating systems I am observing a weird N1QL query latency distribution using CB 4.5 Enterprise and the latest Python SDK, which I understand uses the C SDK (libcouchbase) underneath. For the following graph I ran six CentOS 6.4 servers in a CB 4.5 Enterprise cluster, and a client on a PC running Windows 7 with the CB Python client.
The response sample distribution, with latency measured at the client, looks
like this:
The horizontal axis is latency in milliseconds. The vertical axis is the number of samples observed in about 1/100 of the full range of latencies (here about 47 to 435 ms). Between the blips in the image, there are mostly
zero observations. The total number of samples is 10000. There is about a 42 ms
ping time between the client and server. Query response size ranges between
about 100 bytes and 100,000 bytes.
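For reference, the plotted distribution is built from the raw client-side samples with simple fixed-width binning, roughly like this (an illustrative sketch only; the placeholder values stand in for the 10,000 measured latencies):

```python
# Illustrative sketch of the binning behind the plot (not the exact analysis code).
# 'samples' stands in for the 10,000 client-measured latencies in milliseconds.
samples = [47.2, 89.5, 131.8, 47.9]            # placeholder values
lo, hi = min(samples), max(samples)            # about 47 ms and 435 ms in the run shown
width = (hi - lo) / 100.0                      # each bin covers ~1/100 of the full range
bins = [0] * 100
for s in samples:
    bins[min(int((s - lo) / width), 99)] += 1  # bin counts = the vertical axis of the graph
```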
Notice the distinct separation of completion times apparently
correlated with complete round-trip times between client and server. This would
be expected e.g. if there were a single large buffer that fills at the client and
then is acknowledged back to the server before any more information is sent
from the server. Nagle TCP issues alone would likely not be able to make so many
blips, but I am not ruling Nagle out entirely—perhaps in conjunction with
something else. However, I found documentation that Nagle is turned off by
default in CB 4.5, and I did not turn it on.
For clients having lower ping times to the server cluster,
this shows up as a long thick tail with blurring between the blips.
Do you have a suggestion for how to eliminate this?
Could I perhaps size some buffers differently, or what else should I do to get rid of these response delays?
If possible, please be specific about exactly what I might do or try in order to stop this.
This is indeed an interesting observation! On the face of it, it doesn't look like Nagle's effect.
The response sample distribution, with latency measured at the client…
…
Query response size ranges between about 100 bytes and 100,000 bytes.
I guess the response times measured include the latency to receive the complete response (in the given range of 100 bytes to 100 KB). If so, couldn't this mostly reflect the distribution of response sizes? That would explain the multiple blips and their decreasing heights. I am not discounting factors such as buffering, which would explain the flat intervals between the blips. For example, perhaps ~60% of the requests (those closer to 47 ms latency) generate small responses, while a small percentage of the samples generate responses around 100 KB (around 300-400 ms).
Can you share more details about the queries that generate the different response sizes? What is that distribution? Meanwhile, I will run this by the SDK team for their insights.
I can think of a couple of things. The one I think is most likely is that we have seen the effect of garbage collection in the query service under a steady-state workload. That will cause the occasional pause.
The second thing, and this one comes with a question, is that we manage to execute our IO in parallel in Python by carefully working with the global interpreter lock. Does your workload have multiple threads?
One good test to see if the occasionally higher latencies are happening at cluster side or the client side would be to look at a “slow” one and see what the metrics returned by the query show. If the metrics there are close to what the client application experienced, then there may be a way to correlate it to memory management/GC on the query service. If, on the other hand, it doesn’t correlate, then we’ll want to look more closely at what’s happening at the app level or the network level.
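If it helps, something along these lines would let you put the two numbers side by side (a rough Python sketch; I'm assuming the request object exposes the raw response metadata, including the 'metrics' section, via .meta once the rows have been consumed, and the connection string and statement here are placeholders):

```python
# Sketch: compare the client-observed latency with the query service's own metrics.
# Assumption: the N1QL request object exposes the raw response metadata (with the
# 'metrics' section containing elapsedTime/executionTime/resultSize) via .meta
# after the rows are fully consumed; attribute names may differ by SDK version.
import time
from couchbase.bucket import Bucket
from couchbase.n1ql import N1QLQuery

bucket = Bucket('couchbase://cb-node-1/bucketA')   # placeholder connection string
req = bucket.n1ql_query(N1QLQuery('SELECT 1'))     # substitute the actual test statement

start = time.time()
rows = list(req)                                   # exhaust the iterator first
client_ms = (time.time() - start) * 1000.0

metrics = req.meta.get('metrics', {})
print('client: %.1f ms, server elapsedTime: %s, resultSize: %s'
      % (client_ms, metrics.get('elapsedTime'), metrics.get('resultSize')))
```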
@mnunberg is the Python/libcouchbase expert in this case. Mark: Is what I’ve said above plausible? Any other recommendations?
Do note that depending on how you use the query results at the client, it may be possible that the last bytes of data are only received once you have fully finished handling the query results at the application level. This is because the client parses the JSON response as a stream; so to get an accurate measure of the raw response time, measure the RTT as the time between issuing the query and receiving the first result, rather than the last, but do exhaust the returned iterator!
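In other words, something like this (a rough sketch; the connection string and statement are placeholders, and the key point is timing the first row separately from the full drain):

```python
# Sketch: time-to-first-row vs. time-to-last-row, while still exhausting the iterator.
import time
from couchbase.bucket import Bucket
from couchbase.n1ql import N1QLQuery

bucket = Bucket('couchbase://cb-node-1/bucketA')     # placeholder connection string
start = time.time()
it = iter(bucket.n1ql_query(N1QLQuery('SELECT 1')))  # substitute the actual test statement

first_row = next(it)
first_ms = (time.time() - start) * 1000.0            # raw response time (first bytes received)

remaining = list(it)                                 # always drain the rest of the results
total_ms = (time.time() - start) * 1000.0            # includes client-side handling of all rows
```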
Nagle is disabled at both the client and server levels on Linux platforms; however, Nagle is indeed enabled at the client level in the IOCP I/O system the C client uses by default on Windows. You can try switching I/O plugins to the select(2) implementation (rather than the default IOCP one), which does have Nagle disabled, by setting LCB_IOPS_NAME (http://docs.couchbase.com/sdk-api/couchbase-c-client-2.6.2/lcb-env-vars-page.html), just to rule it out.
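For example (a sketch; the variable has to be set before the client instance is created, so setting it in the Windows shell with `set LCB_IOPS_NAME=select` before launching Python is the safest way, but this also works):

```python
# Sketch: force the select(2) I/O plugin instead of the default IOCP one on Windows,
# just to rule Nagle out. LCB_IOPS_NAME is read when the client instance is created,
# so it must be set before the Bucket is constructed.
import os
os.environ['LCB_IOPS_NAME'] = 'select'

from couchbase.bucket import Bucket
bucket = Bucket('couchbase://cb-node-1/bucketA')   # placeholder connection string
```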
Thank you for your help. After the query completes in Couchbase, in principle the response should arrive at the client in a time determined by its size and the available bandwidth. In the test shown, each individual test query completes before another one is launched, so that one query and its transmission are the only load on the servers. At a size of 100 KB the communication delay should be at most on the order of 16 ms (assuming an effective bandwidth of 50 Mb/s or more on a 100 Mb/s link to the client machine), which is much less than even one of the gaps between the blips.
The test query looks like this:
SELECT nme FROM bucketA USE KEYS ["123456789abcdef00123456789abcdef"] UNNEST nme.keyarray skeys JOIN bucketA nme ON KEYS skeys
There is no index required or used in this test database, and there is only the one bucket. Only one key (the one between the brackets) is used in each of these test queries. So the query looks up that first key, obtains a set of keys from the value of that document (the number of keys in the set ranges from 1 to 30), and then combines and sends back the values associated with those looked-up keys.
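In key-value terms, the access pattern of each test query is roughly the following (a sketch assuming the document shape described above, i.e. a keyarray field holding 1 to 30 document keys; the connection string is a placeholder):

```python
# Rough key-value equivalent of the test query's access pattern (illustrative sketch).
from couchbase.bucket import Bucket

bucket = Bucket('couchbase://cb-node-1/bucketA')          # placeholder connection string

anchor = bucket.get('123456789abcdef00123456789abcdef')   # the single key in USE KEYS
child_keys = anchor.value['keyarray']                      # 1 to 30 keys stored in that document
children = bucket.get_multi(child_keys)                    # the documents the query joins and returns
docs = [rv.value for rv in children.values()]
```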
The response sizes are not a perfect predictor of where in the distribution a response will arrive, but in general the larger responses take more time. The following is from a different run than the one that produced the graph, but the data is very similar; this table shows the minimum and maximum size of the responses observed in each row. A remarkable thing about this graph is that the blips do not flatten out as the delay increases; they just get smaller.
I have only one thread working for that test. I have run tests with multiple threads and get similarly spread-out results. There are nevertheless two Python threads in this process: only one thread is working, and the other is the main thread, which is just waiting on the working thread. That is, it is the same code as for multiple workers, but set to only one worker thread for this particular test.
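The thread structure of the test is just this (a simplified sketch):

```python
# Simplified sketch of the test's thread structure: the main thread only starts
# and joins a single worker; all queries run sequentially on the worker thread.
import threading

def run_queries():
    # the sequential query/measurement loop described earlier
    pass

worker = threading.Thread(target=run_queries)
worker.start()
worker.join()   # the main thread just waits on the working thread
```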
As to the possibility of garbage collection at the server, wouldn't that be independent of ping times, so that its effect would not produce such perfectly spaced and well-defined blips?