N1QL - pagination via SDK

n1ql

#1

I am using java SDK and N1ql to access couchbase cluster and extract data to a text file, the select query can return millions of documents…

N1qlQueryResult that I use is not “cursor” based so it can cause memory problems.

What are my options? Does SDK allow querying with pagination?

I know about LIMIT and OFFSET but it’s not a good solution for me


#2

LIMIT and OFFSET are what is recommended by query for pagination. You can implement that with the SDK of course.

Are you processing any of the results before writing to a text file? If so, the most memory efficient way to do this is probably to use the async API to react to rows as they come in.

You can see one (slightly complicated) example of that here:

The basic idea is to query and chain in a doOnNext() to handle rows as they come back in. Your code is likely to be much simpler than what is here, but the concept is approximately the same. No need to paginate, and items can be processed and written to a file as needed.

Also, note that if all you need to do is execute the query and write it to a file without processing, there may be an easier way to handle the raw results. Let me know if you want that.


#3

Just curious: Can you explain why LIMIT and OFFSET aren’t good for you?

Couchbase 5.0 exploits suitable index for both predicate pushdown and ORDER BY.
With a suitable index, performance is reasonable.
https://blog.couchbase.com/optimizing-database-pagination-using-couchbase-n1ql/

Alternate approach is keyset pagination, by remembering the last value fetched and setting the right parameters for the NEXT fetch.

See the articles below.




#4

Yes, I am processing data before dumping to file.

The AsyncBucket is interesting, using this method will I be getting similar functionality to “cursor” based databases?
Meaning I wouldn’t need to pull the whole data to memory before starting writing to file?

Another setback that I am having with this, that I still need it to act as synchronous process (eg calling “next()” explicitly to get next result from the database).

So to achieve that I will have to “mutilate” the above async(), and implement some nasty blocking mechanism that will block until a notification arrives…


#5

The reason that I don’t want to use it,
I don’t want to execute query N-times, until all data is retrieved.


#6

Not exactly. A cursor on a traditional RDBMS is additional state handled at the server side that then the client can move about, iterate forward, etc. For your use case here, unless there is an ORDER BY clause, this solution is better than a cursor.

In this solution, items are processed as they come in and the memory is freed as you’re done processing the item. The reason it’s better than a cursor, is that the query processor doesn’t need to manage additional state either. It can more efficiently just pass items along.

In this case, you would not need to call next(). That’d be a java Iterable approach. In observables, you chain in code to be called when data arrives. It’s generally more efficient and expressive than working with iteration.

The RxJava example there actually does the same thing. It’s not really a mutilate kind of thing though. It’s natural to need to block until everything is complete, handle errors, etc. The async API we expose built on the streaming parser of responses will let you chain in methods for blocking, error handling and so on. Have a look at the introduction to reactive extensions and then look at our docs and blogs for more info.


#7

This is exactly the problem.

My “client” that uses the SDK, should “call” it to get next data item and not to be “called” when data item arrives…

SO the solution that I can think about right now, is to add Observable .Subscriber, and inside onNext that is triggered when data arrives, to push the data into some blockingqueue .

Then my “client” code, will be able to iterate the queue like cursor-based database.


#8

Yes, if you’re forced into that kind of API, that would be a solution. Note that you can possibly lazily fetch the actual document itself so you’d be able to hold the whole queue of IDs of documents in memory. That way you don’t need to hold the actual document and do all of the processing up front.

Do be aware, there can be some consistency issues with this approach, but you may have to control for concurrent modifications anyway. Consider using a query that returns the meta().id and CAS value perhaps.

Hope that helps!