Python client - drastic slowdown after timeout

I would have loved to post this to the GitHub or JIRA bug tracker, but unfortunately the first one is closed, and the second one doesn’t offer a straightforward way to get an account … anyways.

I made my first steps with Couchbase today and tested inserting data chunks using upsert_multi. This quickly triggered a timeout. So far nothing special, but what I noticed is that once a timeout had been triggered, any subsequent upsert_multi ran more than an order of magnitude slower. This is the code to reproduce it:

import sys
import random
import gevent
from time import time
import couchbase
from gcouchbase.bucket import Bucket

print("python " + sys.version.replace('\n', ' '))
print("gevent " + gevent.__version__)
print("couchbase " + couchbase.__version__)
print("libcouchbase {} {}".format(*Bucket.lcb_version()))

c = Bucket('http://localhost/default')
d1 = dict(("a_" + str(x).rjust(12, '0'), dict(type="zzzz", val=random.random())) for x in range(200000))  # 200,000 small test documents

for timeout in [10, 2.5, 100]:  # 2.5 s should run into the timeout
	print("Running with {} s timeout".format(timeout))
	c.timeout = timeout
	try:
		t = time(); c.upsert_multi(d1); print(time() - t)
	except Exception as e:
		print(e)

Which gives:

python 2.7.10 (default, Oct 14 2015, 16:09:02)  [GCC 5.2.1 20151010]
gevent 1.0.2
couchbase 2.0.6.dev5+gfc1ddc8
libcouchbase 2.5.4 132356
Running with 10 s timeout
9.90698695183
Running with 2.5 s timeout
<Key=u'a_000000134899', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=300000, C Source=(src/multiresult.c,309)>
Running with 100 s timeout
<Key=u'a_000000134899', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=300000, C Source=(src/multiresult.c,309)>

As seen here, the statement that ran in less than 10 s at first even hits the 100 s timeout once a timeout has occurred. I’m not sure whether it’s a deadlock or just horribly slow. I think I had one case where it eventually finished, after half an eternity and with less data.

Timed-out requests just mean that the operations could not be sent over the network in a timely manner; they are, however, still being sent. If you wish to purge all previous operations which were timed out, you will need to create a new Bucket object, destroying the old one (perhaps via Bucket._close()).

In general you should limit _multi operations to several hundred KB or at most a few MB.
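
A minimal chunking sketch along those lines (the helper name and the chunk size of 1000 are illustrative assumptions; tune the size for your documents and network):

def upsert_in_chunks(bucket, docs, chunk_size=1000):
    # Upsert a large dict in smaller batches so no single _multi call grows too big.
    items = list(docs.items())
    for i in range(0, len(items), chunk_size):
        bucket.upsert_multi(dict(items[i:i + chunk_size]))

upsert_in_chunks(c, d1)  # c and d1 as defined in the snippet above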

So this means that after every timeout I have to completely recreate the Bucket instance? That would be quite ugly. And even then, if it resends the first request plus the second one, it should be finished after about 20 seconds. That can’t be the solution.

There is a significant CPU hit per server response: the client (libcouchbase) must traverse the entire list of pending commands to see which command the response corresponds to. In the event of timeouts, because the pending command has since been removed from the internal list, the entire list must be traversed each time a response arrives, and only at the end of that traversal does the client know the command was cancelled (which appears as a timeout to the user).

When the 2.5 s timeout batch expires, the server has likely not finished responding and still needs to send back 200k responses (the timeout occurs because those responses did not arrive in time). When you issue another 200k commands (placing them inside the list of pending commands), each response from the server for the previous batch must traverse all those 200k pending commands, causing a massive CPU slowdown; in effect this requires roughly 200k × 200k lookups!

It is possible that in the future an alternate data structure (such as a dictionary) will be used to map responses to requests; however, that is not the case today.
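
To illustrate the difference, here is a purely conceptual Python sketch, not the actual libcouchbase internals: matching a response by scanning a list is O(n) per response, while a dictionary keyed by an opaque request id makes each lookup O(1).

# Conceptual sketch only; libcouchbase is C code and its internals differ.
def match_response_linear(pending_commands, opaque):
    for cmd in pending_commands:          # O(n) scan per response
        if cmd.opaque == opaque:
            return cmd
    return None                           # command already timed out: full scan for nothing

def match_response_dict(pending_by_opaque, opaque):
    return pending_by_opaque.get(opaque)  # O(1); None if the command already timed out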

Ok, thanks for detailing this! Nevertheless, this means I have to implement some kind of upsert_real_multi (and friends) that takes care of chunking the input down to something Couchbase can reasonably handle? In other words, I have to guess a chunk size that is as large as possible yet still works (to reduce overhead), feed the data in chunk by chunk, and never hit the magic timeout for any single chunk, even though the processing time of a request varies with server load? Well, this is surely anything but elegant.

I’m aware I’m quite new to Couchbase, and maybe I’m missing something here. But it is promoted as something for handling big data, and here I am throwing tiny chunks of data at it and already running into issues. Isn’t sane handling of larger chunks something the library should take care of itself? Or at least fail gracefully, since the problem didn’t seem to recover in a reasonable amount of time (several minutes of waiting)? I hope my grief with this is somewhat understandable.

The 200k × 200k lookups that you mentioned: if that’s really the reason for the slowdown, it seems like a really bad implementation to me. Whatever the original reason for it was, there must be something better. In terms of sending something back to the client, I’d also be totally fine with receiving a simple short message that the job failed, no data was entered into the DB and the state is preserved as before, with all pending items in the queue for that request dumped instantly.

That said, I have one more question. I’m running all of this locally, so network delays are negligible. If there were something wrong with the connection, it would show up almost immediately, so a network timeout of a second or even less would be appropriate. But now I have to run these longer-lasting inserts on the server, and for that I’d have to raise the timeout to a quite high value, maybe a minute or more, even for chunks. This also means an actual network failure remains undiscovered for the full extended timeout.

Is there some way to distinguish between processing time and network delay for the timeout, so that the client knows the server is busy handling the request and can rest peacefully while waiting for the job to finish?

  • You should make sure to limit the size of the chunks you pass to any of the _multi() functions. This is difficult to genericize, as different applications have different requirements (high responsiveness? maximum network efficiency? memory usage?). As far as Couchbase Server itself is concerned, _multi functions do not exist at all. They are a client-side construct that pipelines n operations on the network, reducing network latency. Functionally, upsert_multi() on 200k items is exactly the same as calling upsert() on each of the 200k items individually; some may fail, some may succeed, etc.: it is only a network optimization and provides no guarantee of atomicity or transactions - that is, some items in a _multi call can fail while others succeed (see the sketch after this list).

    The server understands only simple “upsert”, “get”, etc. operations and (at least in the KV API) has no concept of batching.

    Finally, consider that a 200k-item list of anything in Python will likely by itself result in significant CPU and memory consumption.

  • To be fair, the lookups are caused by the client-side implementation, specifically the fact that we don’t use a hash for lookups, just a simple linked list. This could of course be changed; the server has nothing to do with it. The library was really not designed to accept batches of more than several thousand items; while there is technically no reason a batch of 200K or even 2M items would not work, it is not the common case for our users, and therefore the library is not optimized for this scenario.

  • Regarding distinguishing network timeouts from processing timeouts: you typically cannot do so. The processing overhead on the client is minimal (in your case things were indeed slowed down because of what might technically be a bug, but there is no “good” way to handle this situation).
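
To make the per-item failure semantics from the first point concrete, here is a minimal sketch. It assumes the exception raised by a failed multi operation exposes an all_results mapping of keys to individual results, as the 2.x Python SDK documents; adjust to your SDK version.

from couchbase.exceptions import CouchbaseError

def upsert_and_report(bucket, docs):
    # Upsert a batch and separate the keys that succeeded from those that failed.
    try:
        return bucket.upsert_multi(docs), {}
    except CouchbaseError as e:
        ok = {k: r for k, r in e.all_results.items() if r.success}
        failed = {k: r for k, r in e.all_results.items() if not r.success}
        return ok, failed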

Ok, I guess I can live with that and write some workaround. One final note on the frequency of my use case: I doubt it’s actually that rare. Think about anyone migrating from a previous system that has grown for a while. How does the existing data initially get into Couchbase? In MySQL you can just throw some gigabytes of CSV at the DB in a single SQL statement, no questions asked. Couchbase itself is supposed to handle this amount of data easily, and so should the Python client. Some improvements here would be much appreciated!

Sorry again for using the board as a bugtracker …

I ran into some more issues with this. After wrapping upsert and friends, I inserted around 3 million small documents into the DB, which took about 75 s. Quite a nice speed so far. Then I created a query that has to iterate over all documents, either just counting them or selecting none of them. The query simulates a simple analytics job without returning much data. It looks like this:

select count(*) from default where type="abc"

or:

select * from default where type="abc"

Again, the DB doesn’t hold any document of that type, so there is no big data going back and forth between client and server. The problem: the query seems to take considerably longer than inserting the documents did.

I raised the timeout to 300 seconds, but the client still reports a timeout: not after 300 seconds, but already after 75. Looks like there is some built-in maximum timeout? How am I supposed to run a job that simply takes somewhat longer? And why does it take that long at all to iterate over the items? I understand the overhead of perhaps decompressing documents, extracting the type value from the JSON, etc. Still, all of that shouldn’t take longer than writing the documents in the first place, rather the opposite. Looks like I’m back where I started: how do I deal with longer-running jobs that can’t easily be chunked?

I also stumbled upon another issue with the N1QLQuery object while experimenting with a naive __len__() method for the Bucket. While this works as intended:

list(c.n1ql_query(N1QLQuery('select count(*) from default')))[0]

this line throws the exception syntax error - at $1:

list(c.n1ql_query(N1QLQuery('select count(*) from $1', 'default')))[0]

I get the same result when I wrap the $1 in backticks or use keyword arguments. Is it intended that the bucket after the from keyword can’t be templated? If so, why? It just forces users to go back to naive string replacement.

Templating the bucket name doesn’t work; perhaps it will work in the future, but for now the bucket name must be part of the query string itself so that it can be determined which indexes will be scanned when the query runs.
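
Placeholders do work for values, though. A minimal sketch, assuming the positional and named parameter support of the 2.x Python SDK’s N1QLQuery (c is the Bucket from the earlier snippets):

from couchbase.n1ql import N1QLQuery

# Positional placeholder for a value; the bucket name stays literal in the query:
q1 = N1QLQuery('select count(*) from default where type=$1', 'abc')

# Named placeholder, equivalent query:
q2 = N1QLQuery('select count(*) from default where type=$type', type='abc')

count = list(c.n1ql_query(q1))[0]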

As far as the query timeout is concerned, you can use the n1ql_timeout property (http://pythonhosted.org/couchbase/api/couchbase.html#couchbase.bucket.Bucket.n1ql_timeout) to adjust it. Indeed, the default is 75 seconds.
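
For example (a sketch using the property documented at the link above; 300 s is an arbitrary value):

c.n1ql_timeout = 300  # allow N1QL queries to run for up to 300 s (default: 75 s)
rows = list(c.n1ql_query(N1QLQuery('select count(*) from default where type="abc"')))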


The problem with handling large amounts of data in the client is the timeout value. The typical use case is a small timeout and small operations (retrieving single items), for which the timeout defaults to 2.5 seconds. The client can certainly handle larger amounts of data, but the user needs to take care of setting the timeout appropriately. MySQL, on the other hand, ships with very high timeout settings.
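
For example, a bulk load might raise the KV timeout up front; a minimal sketch (the 30 s value is an arbitrary illustration):

from gcouchbase.bucket import Bucket

c = Bucket('http://localhost/default')
c.timeout = 30  # KV operation timeout in seconds; the default is 2.5
# ...then issue reasonably sized upsert_multi batches, as discussed above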