Odd long-tail latency distribution with N1QL queries

I am evaluating Couchbase 4.5 on CentOS (and, earlier, on Ubuntu). On
both operating systems I am observing an odd N1QL query latency distribution
using CB 4.5 Enterprise and the latest Python SDK, which I understand uses the
C SDK (libcouchbase) underneath. For the following graph I ran a CB 4.5 Enterprise
cluster of six CentOS 6.4 servers, with a client on a Windows 7 PC running the CB Python client.
The response sample distribution, with latency measured at the client, looks
like this:

The horizontal axis is latency in milliseconds. The vertical
axis is the number of samples observed in each bin of about 1/100 of the full range of
latencies (here about 47 to 435 ms). Between the blips in the image, there are mostly
zero observations. The total number of samples is 10,000. There is about a 42 ms
ping time between the client and server. Query response size ranges between
about 100 bytes and 100,000 bytes.

Notice the distinct separation of completion times, apparently
correlated with whole round-trip times between client and server. This is what you
would expect if, for example, there were a single large buffer that fills at the client
and is then acknowledged back to the server before any more data is sent
from the server. Nagle's algorithm alone would likely not produce so many
blips, but I am not ruling it out entirely, perhaps in conjunction with
something else. However, I found documentation saying that Nagle is turned off by
default in CB 4.5, and I did not turn it on.

For clients with lower ping times to the server cluster,
this shows up as a long, thick tail with blurring between the blips.

Do you have a suggestion for how to eliminate this?

Could I size some buffers differently, or what else should I do to eliminate these response delays?

If possible, please be specific about exactly what I might do or try in order to stop this.

Thanks,
Karl

Hi Karl,

This is indeed an interesting observation! On the face of it, it doesn't look like Nagle's algorithm at work.

> The response sample distribution, with latency measured at the client…

> Query response size ranges between about 100 bytes and 100,000 bytes.

I assume the measured response times include the latency to receive the complete response (in the given range of 100 bytes to 100 KB). If so, couldn't this mostly reflect the distribution of response sizes? That would explain the multiple blips and their decreasing heights. I am not discounting factors such as buffering, which would explain the flat intervals between the blips. For example, roughly 60% of the requests (those closer to 47 ms latency) generate small responses, while a small percentage of the samples generate responses of around 100 KB (around 300-400 ms).

Can you share more details of the queries that generate the different response sizes? What does that distribution look like? Meanwhile, I will run this by the SDK team for their insights.

regards,
-Prasad

Hi @kdenninghoff and @prasad,

I can think of a couple of things. The one I think is most likely: we have seen garbage collection in the query service affect an otherwise steady workload, and that will cause the occasional pause.

The second thing, and this one comes with a question: we manage to execute our I/O in parallel in Python by carefully working with the global interpreter lock. Does your workload have multiple threads?

One good test to see whether the occasionally higher latencies happen on the cluster side or the client side is to look at a "slow" query and check the metrics it returns. If those metrics are close to what the client application experienced, then there may be a way to correlate it with memory management/GC in the query service. If, on the other hand, they don't correlate, then we'll want to look more closely at what's happening at the app level or the network level.

@mnunberg is the Python/libcouchbase expert in this case. Mark: Is what I’ve said above plausible? Any other recommendations?

By the way, this code sample shows how you'd access the meta that contains the metrics for a query, and the N1QL REST API documentation defines what is returned in those metrics.
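In case it's useful, a minimal sketch of that in the 2.x Python SDK might look like the following. It assumes the request object exposes the returned metadata through a meta attribute once the rows have been consumed (attribute names can differ slightly between SDK versions), and the connection string is only illustrative:

    from couchbase.bucket import Bucket
    from couchbase.n1ql import N1QLQuery

    # Illustrative connection string -- point this at your own cluster/bucket.
    bucket = Bucket('couchbase://10.0.0.1/bucketA')

    query = N1QLQuery('SELECT nme FROM bucketA USE KEYS ["123456789abcdef00123456789abcdef"] '
                      'UNNEST nme.keyarray skeys JOIN bucketA nme ON KEYS skeys')

    req = bucket.n1ql_query(query)
    rows = list(req)  # exhaust the iterator so the trailing metadata is parsed

    # Once the rows are consumed, the response's "metrics" block should be available.
    # Field names follow the N1QL REST API: elapsedTime, executionTime, resultCount, resultSize.
    metrics = req.meta.get('metrics', {})
    print('server elapsedTime:   %s' % metrics.get('elapsedTime'))
    print('server executionTime: %s' % metrics.get('executionTime'))
    print('resultSize (bytes):   %s' % metrics.get('resultSize'))

Comparing elapsedTime/executionTime against the latency measured at the client should tell you whether the extra time is spent on the cluster or between the cluster and the client.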

Do note that, depending on how you consume the query results at the client, the last bytes of data may only be received once you have fully finished handling the results at the application level. This is because the client parses the response as a JSON stream; so to get an accurate measure of raw response time, measure the RTT as the time between issuing the query and receiving the first result rather than the last, but do exhaust the returned iterator!
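As a rough illustration of that measurement, something along these lines separates time-to-first-row from time-to-last-row (the connection string and statement here are just placeholders):

    import time
    from couchbase.bucket import Bucket
    from couchbase.n1ql import N1QLQuery

    bucket = Bucket('couchbase://10.0.0.1/bucketA')  # placeholder connection string
    query = N1QLQuery('SELECT nme FROM bucketA USE KEYS ["123456789abcdef00123456789abcdef"] '
                      'UNNEST nme.keyarray skeys JOIN bucketA nme ON KEYS skeys')

    start = time.time()
    req = bucket.n1ql_query(query)

    first_row_at = None
    rows = []
    for row in req:                      # always exhaust the iterator
        if first_row_at is None:
            first_row_at = time.time()   # "raw" response time: issue query -> first row
        rows.append(row)
    drained_at = time.time()             # includes streaming and parsing the whole result

    print('time to first row: %.1f ms' % ((first_row_at - start) * 1000))
    print('time to last row:  %.1f ms' % ((drained_at - start) * 1000))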

Nagle is disabled at both the client and server level on Linux platforms; however, Nagle is indeed enabled at the client level in the IOCP I/O system the C client uses by default on Windows. Just to rule it out, you can try switching I/O plugins to the select(2) implementation (rather than the default IOCP one), which does have Nagle disabled, using the LCB_IOPS_NAME environment variable (http://docs.couchbase.com/sdk-api/couchbase-c-client-2.6.2/lcb-env-vars-page.html).
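If it helps, one way to try that from Python is to set the variable before the client creates its I/O instance. This is a sketch under the assumption that 'select' is the plugin name your libcouchbase build accepts (see the environment-variables page linked above), with a placeholder connection string:

    import os

    # Must be set before the first connection is created, since libcouchbase
    # reads LCB_IOPS_NAME when it instantiates its I/O plugin.
    os.environ['LCB_IOPS_NAME'] = 'select'   # instead of the default 'iocp' on Windows

    from couchbase.bucket import Bucket
    bucket = Bucket('couchbase://10.0.0.1/bucketA')  # placeholder connection string

Setting LCB_IOPS_NAME in the Windows shell before launching the test script should work just as well.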

Hello Prasad,

Thank you for your help. Once the query completes in Couchbase, the response should in principle arrive at the client in a time determined by its size and the available bandwidth. In the test shown, each individual test query completes before the next one is launched, so one query and its transmission are the only load on the servers. At a size of 100 KB the communication delay should be at most on the order of 16 ms (assuming an effective bandwidth of 50 Mb/s or more on the 100 Mb/s link to the client machine), which is much less than even one of the gaps between the blips.

The test query looks like this:

SELECT nme FROM bucketA USE KEYS ["123456789abcdef00123456789abcdef"] UNNEST nme.keyarray skeys JOIN bucketA nme ON KEYS skeys

No index is required or used in this test database, and there is only the one bucket. Only one key (between the brackets) is used in each test query. So the query looks up that first key, obtains a set of keys from that key's value (the set contains between 1 and 30 keys), and then combines and returns the values associated with those looked-up keys.
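In case it helps, the measurement loop is essentially of this form (a simplified sketch; the connection string and key list here are only illustrative, and the real test draws from many keys):

    import json
    import time
    from couchbase.bucket import Bucket
    from couchbase.n1ql import N1QLQuery

    bucket = Bucket('couchbase://10.0.0.1/bucketA')   # illustrative connection string

    TEST_KEYS = ['123456789abcdef00123456789abcdef']  # illustrative; the real test cycles many keys

    samples = []
    for i in range(10000):
        key = TEST_KEYS[i % len(TEST_KEYS)]
        stmt = ('SELECT nme FROM bucketA USE KEYS ["%s"] '
                'UNNEST nme.keyarray skeys JOIN bucketA nme ON KEYS skeys' % key)
        start = time.time()
        rows = list(bucket.n1ql_query(N1QLQuery(stmt)))     # one query at a time, fully drained
        elapsed_ms = (time.time() - start) * 1000.0
        size_bytes = sum(len(json.dumps(r)) for r in rows)  # rough proxy for response size
        samples.append((elapsed_ms, size_bytes))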

Response size is not a perfect predictor of where in the distribution a response will land, but in general the larger responses take more time. The following is from a different run than the one that produced the graph, but the data is very similar; the table shows the minimum and maximum response size observed in each latency bin. A remarkable thing about this data is that the blips do not flatten out as the delay increases; they just get smaller.

Latency (ms)  Count  Min size (bytes)  Max size (bytes)
50 6002 98 7784
54 1592 99 9652
58 223 135 6960
62 25 604 3960
66 16 783 3705
70 12 244 8098
73 3 458 2979
77 3 906 2584
81 0 0 0
85 2 3663 3677
89 2 3661 4691
93 329 2331 20112
97 246 760 18482
101 63 192 19908
105 3 10666 14711
109 1 7409 7409
113 1 7340 7340
117 2 4552 11195
121 1 9892 9892
125 0 0 0
129 1 9693 9693
133 0 0 0
137 234 8670 29772
141 238 8990 40661
145 56 9286 36678
149 11 10653 23472
153 1 17508 17508
156 0 0 0
160 2 10397 25643
164 0 0 0
168 0 0 0
172 0 0 0
176 2 22038 29409
180 197 13920 45828
184 148 13456 49102
188 39 10988 47721
192 3 18095 33239
196 1 20100 20100
200 1 25931 25931
204 0 0 0
208 1 32979 32979
212 0 0 0
216 0 0 0
220 14 23323 43193
224 158 18358 58623
228 94 23865 62410
232 16 19496 59209
236 4 36688 47184
239 2 36539 42656
243 0 0 0
247 0 0 0
251 0 0 0
255 0 0 0
259 0 0 0
263 17 31219 51293
267 106 22918 62367
271 27 29029 77060
275 7 29321 64005
279 0 0 0
283 0 0 0
287 0 0 0
291 0 0 0
295 0 0 0
299 0 0 0
303 0 0 0
307 12 33358 51213
311 39 29345 77329
315 9 44177 73280
319 2 33067 73280
322 1 41517 41517
326 0 0 0
330 0 0 0
334 0 0 0
338 0 0 0
342 0 0 0
346 0 0 0
350 5 29667 66072
354 13 47531 77098
358 2 51368 62366
362 1 95322 95322
366 0 0 0
370 0 0 0
374 0 0 0
378 0 0 0
382 0 0 0
386 0 0 0
390 0 0 0
394 3 58775 102664
398 3 52699 84398
402 1 102664 102664
405 0 0 0
409 0 0 0
413 0 0 0
417 0 0 0
421 0 0 0
425 0 0 0
429 0 0 0
433 0 0 0
437 1 77180 77180
441 2 95322 98986

Thanks,
Karl


Hello ingenthr,

I have only one worker thread in that test. I have run tests with multiple threads and I get similarly spread-out results. There are, nevertheless, two Python threads in the process even though only one is working: the other is the main thread, which just waits on the worker. In other words, it is the same code as for multiple workers, but configured for only one worker thread in this particular test.

As for the possibility of garbage collection at the server: wouldn't that be independent of ping times, so that the effect would not produce such evenly spaced and well-defined blips?

Thanks,
Karl

Hello mnunberg,

I will try those suggestions and get back to you by Tuesday with results.

Thank you,
Karl