N1QLRequest fast processing options in Python 2.7

I am querying a basic document in my Couchbase bucket using the following query:

"SELECT item_id, seller_id, date, price FROM " + connectObj.bucket + " WHERE date >= '" + content['from_time'] + "' AND date <= '" + content['to_time'] + "'"
I am receiving the response pretty fast, in about 0.02 sec (0.0165578197854 sec to be exact).
The number of documents received is 72236.

I want to work with the documents in Python.

If I transform the N1QL JSON response list to a Python array using an iterator, it takes 4-5 seconds.

The processing after converting to an array is again very fast, 0.1 sec at most (I am doing a decision tree implementation here).

I am trying to speed up the conversion process.
A few options I could work on are:

  1. Have N1QL return me an array instead of a list of JSON documents
  2. Process the JSON directly in Python instead of converting it to a Python array

Option 1: As per the Couchbase documentation, there is no attribute I can pass while querying that will get me the response as an array instead of a list of JSON documents. (It seems PHP has that option, as told by one of my friends.) So this seems to be out of the question.

Option 2:

If I try to import the response data into a pandas DataFrame:

pandaDataFrame = json_normalize(responseData)

it gives me an error: 'N1QLRequest' object does not support indexing

If I dump the response data to a JSON file:

json.dump(responseData, outfile)

it gives me an error: <couchbase.n1ql.N1QLRequest object at 0x077DEE70> is not JSON serializable
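Since N1QLRequest is an iterator rather than a list, one workaround is to materialise it with list() before serialising; the individual rows are plain dicts, which the json module handles fine. A sketch with hypothetical stand-in rows (not the real query result):

```python
import json

# Hypothetical stand-in for the N1QL iterator; in the real code this
# would be connectionBucket.n1ql_query(queryString).
response_data = iter([{"item_id": 1, "price": 9.99},
                      {"item_id": 2, "price": 4.50}])

# list() drains the iterator into plain dicts, which are serialisable.
serialised = json.dumps(list(response_data))
```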

If I iteratively append rows to a Python array and then feed it to a pandas DataFrame:

for responseRow in connectionBucket.n1ql_query(queryString):
    processData.append(responseRow)
pandaDataFrame = json_normalize(processData)

it works, but takes 4-5 seconds to load the data into the DataFrame.
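Since the query projects only flat fields (item_id, seller_id, date, price), one thing worth trying is skipping json_normalize entirely: pd.DataFrame accepts a list of dicts directly and avoids the per-record nesting walk that json_normalize performs. A sketch with hypothetical stand-in rows:

```python
import pandas as pd

# Hypothetical stand-in rows; in the real code this would be
# list(connectionBucket.n1ql_query(queryString)).
response_rows = iter([
    {"item_id": 1, "seller_id": 10, "date": "2017-01-01", "price": 9.99},
    {"item_id": 2, "seller_id": 11, "date": "2017-01-02", "price": 4.50},
])

# For flat rows, building the frame in one shot is cheaper than
# json_normalize, which inspects every record for nested structure.
df = pd.DataFrame(list(response_rows))
```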

Can you help me find a way out of this?
Is there a way to process the response faster?
Can we in any way make the N1QLRequest JSON serializable, or make it support indexing?

I am unable to find any documentation around it. Please help.

Hi @rashmigandhe,

Excited to hear you are using the Couchbase Python Client with pandas. With the arrival of the Couchbase Analytics DP we are hoping to cater more to Python data scientists, so it's great to see an early adopter in this arena. We are also hoping to integrate better with frameworks like pandas in the future, so we can provide better performance and match their syntax and idioms.

The N1QLRequest object only presents an iterator via its interface, hence the requirement to convert it to a list before using indexed access. Looking at the code, there is an underlying Python list that is populated and fed through to the iterator as results arrive. I'm surprised it's taking 4-5 seconds to copy all of these results to an array, but perhaps it's a complicated query, or further optimisation is required on the client side. How soon do you know you have 72236 documents in the response? Immediately, or only after you have iterated through the entire N1QLRequest object?

One setting you can experiment with is the pipeline_batch setting on the N1QLQuery object - larger values may be more efficient at high throughput, although the latency will be slightly higher.
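For reference, pipeline_batch is a standard query-service request parameter that the SDK forwards in the JSON body of the query request. A sketch of that body (parameter names per the N1QL REST API; the statement and bucket name here are placeholders, and the exact SDK setter may vary by version):

```python
# pipeline_batch controls how many items the execution pipeline
# batches per operator; larger batches favour throughput over latency.
query_body = {
    "statement": "SELECT item_id, seller_id, date, price FROM my_bucket",
    "pipeline_batch": 1024,
}
```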

Many thanks,

Ellis

Iterating through large pandas DataFrame objects is generally slow. Row-wise iteration defeats the whole purpose of using a DataFrame; it is an anti-pattern and something you should only do once you have exhausted every other option. It is better to look for a list comprehension, a vectorised solution, or the DataFrame.apply() method.

Pandas DataFrame loop using a list comprehension, for example:

result = [(x, y, z) for x, y, z in zip(df['column_1'], df['column_2'], df['column_3'])]
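For comparison, here is a small sketch (with a made-up DataFrame) of the vectorised and apply() styles mentioned above; the vectorised form operates on whole columns in a single call and is usually the fastest option:

```python
import pandas as pd

df = pd.DataFrame({"price": [9.99, 4.50, 12.00],
                   "quantity": [2, 5, 1]})

# Vectorised: one operation over entire columns (fast).
df["total"] = df["price"] * df["quantity"]

# Row-wise apply(): still better than a manual Python loop over rows,
# but slower than the vectorised form above.
df["total_apply"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)
```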