Query View with include_docs option


#1

Hi.

The Python SDK (v1.2.4) has a kind of “memory leak” when querying a view with “include_docs” set to True.

I have about 20M docs in my Couchbase bucket (one server, with 30GB of RAM for the bucket) and a view that contains about 15M docs.
I’m trying to iterate over the view index to do some aggregation, using the “include_docs” option to get the docs. But the Python script’s memory usage grows constantly (1G, 2G, 3G, etc.).

Code:

import couchbase

def run(server, bucket):
    cb = couchbase.Couchbase.connect(host=server, bucket=bucket, timeout=100)
    # Stream the view rows and have the SDK fetch each row's document
    for item in cb.query('myhouse', 'houses', streaming=True, include_docs=True):
        pass

def main():
    # .... parsing arguments
    run(args.server, args.bucket)

(Couchbase Server version: 2.5)


#2

I’ve tried switching off the include_docs option and getting the docs with cb.get() calls instead; the problem remains.
Then I tried using a separate connection object for getting the docs, and everything is OK.

So there is some interference between the view query and the get requests when they use the same connection object.


#3

There is some interference there, but nothing I can think of that would affect memory usage in particular. One thing to note, however, especially if you have many documents to fetch: adding more pending operations (like include_docs, or fetching documents while iterating) prolongs the “network wait” and allows more data from the server to be buffered, which counters, to some extent, the effect of streaming=True. To see why, consider what happens when the library does I/O:

  1. In a normal view request, the library issues an HTTP request, and the HTTP results stream in from the server. Once some rows have arrived, the Python extension asks the library to “break” from performing I/O.
  2. When you use include_docs, you actually tell the library to resume all I/O; this includes fetching pending data in the HTTP stream (which you aren’t yet prepared to handle anyway).
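
To make the two steps above concrete, here is a toy model of the effect. It is not the library's actual I/O loop: the function name and its parameters (rows_per_io, extra_io_per_row) are made up for illustration, and the numbers are arbitrary. It only shows how extra I/O cycles per consumed row can let the client-side buffer grow instead of staying bounded.

```python
def peak_buffered_rows(total_rows, rows_per_io, extra_io_per_row):
    """Toy model: return the peak number of view rows buffered on the client.

    Each I/O cycle pulls up to rows_per_io rows from a simulated HTTP stream.
    The consumer runs I/O when its buffer is empty, plus extra_io_per_row
    additional cycles per consumed row (standing in for a document fetch
    on the same connection, which also resumes the HTTP stream).
    """
    remaining = total_rows  # rows the simulated server has not sent yet
    buffered = 0            # rows received but not yet consumed
    peak = 0

    def run_io():
        nonlocal remaining, buffered, peak
        pulled = min(rows_per_io, remaining)
        remaining -= pulled
        buffered += pulled
        peak = max(peak, buffered)

    while remaining or buffered:
        if buffered == 0:
            run_io()            # plain streaming: fetch only when needed
        buffered -= 1           # the application consumes one row
        for _ in range(extra_io_per_row):
            run_io()            # a fetch on the shared connection pulls more rows


    return peak


if __name__ == "__main__":
    print("streaming only:      ", peak_buffered_rows(1000, 10, 0))
    print("with per-row fetches:", peak_buffered_rows(1000, 10, 1))
```

In this model, plain streaming never holds more than one I/O cycle's worth of rows, while per-row fetches on the same connection let most of the result set pile up before it is consumed.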

#4

Thanks for the reply.

Could you tell me what strategy I should use to fetch many documents via a view query? Separate connections for querying and fetching?


#5

I guess that would make sense for your particular use case. Normally I’d say to just use include_docs, but maintaining two connections would indeed reduce the cross-chatter between the view query and the document fetches.
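
A minimal sketch of the two-connection approach, assuming the same 1.2.x API used in the first post. The helper name iter_view_docs is made up for this example, and quiet=True is an assumption about how you want documents that have disappeared since indexing to be handled (the value comes back as None rather than raising). One connection streams the view rows without include_docs; the other does the get() calls.

```python
def iter_view_docs(query_conn, fetch_conn, design, view):
    """Yield (row, document) pairs.

    query_conn streams the view rows; fetch_conn fetches each document,
    so the key-value traffic never resumes I/O on the view's HTTP stream.
    """
    for row in query_conn.query(design, view, streaming=True):
        # quiet=True: a missing document yields value None instead of an error
        result = fetch_conn.get(row.docid, quiet=True)
        yield row, result.value


def run(server, bucket):
    # Imported here so the helper above can be exercised without the SDK installed
    import couchbase

    # Two independent connections to the same bucket
    query_conn = couchbase.Couchbase.connect(host=server, bucket=bucket, timeout=100)
    fetch_conn = couchbase.Couchbase.connect(host=server, bucket=bucket, timeout=100)

    for row, doc in iter_view_docs(query_conn, fetch_conn, 'myhouse', 'houses'):
        pass  # do the aggregation here
```

Fetching one document per row keeps the sketch simple; batching the row IDs and fetching them together on the second connection would cut round trips, at the cost of holding a batch of rows in memory.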