Java SDK / N1QL: Recommendation on how to do RYOW (scan_consistency at_plus) with 100k+ documents?

synesty · September 13, 2017, 9:02pm

We are building a long running ETL process which works roughly like this:
CB Version: 5.0 Beta2

T1. Insert / Update 300k documents (e.g. 100k updates, 200k inserts) using the asynch bucket api.
T2. Do some other data crunching
T3. Do a N1QL query which needs to return documents from T1 (Read your own writes)

As described in https://blog.couchbase.com/high-performance-consistency/ at_plus consistency allows to pass the Documents or mutations_sequence_ids of the documents in T1 to the Query in T3.

But the example is mainly talking about a use case where you are dealing with very few documents like updating a single user.
What would be an approach in our case where we are dealing with huge mass-import scenarios?
We need to guarantee that query in T3 sees all documents of T1. At the moment we do REQUEST_PLUS, but we are worried that this could lead to slow downs, where AT_PLUS could be more efficient in theory. We have lots of other processes writing and updating to, so the indexer could be quite busy permanently.

Should we really pass 300k document Ids using N1qlParams.consistentWith(…) ?
Or should we try a compromise approach to just store the 1st and 300kth docId?
Or is REQUEST_PLUS our best option here?

Any advice would be appreciated.

Siri · September 14, 2017, 5:50am

Hi - you can use the N1QL at_plus consistency with any number of documents. It works by keeping track of high sequence number generated per vBuckets at the time of SET/GET operations and so is independent of how many documents are modified. In other words, the cost of using at_plus depends on the details of your operations, it’s probably easiest for you to try both options and use whatever works better. Both are regularly optimized for best performance.

On the server side, the way N1QL consistency works is the GSI indexer parks the incoming index scan request till the indexer has reached a point where the consistency constraints of the query are met, and it is ready for execution. Having a lot of these queries will have additional impact on server, but it is still an optimized code path, and it should work well given proper system resources.

Finally, consistency performance is vastly improved with Memory Optimized Indexes. If you are evaluating EE, then you should consider using storage mode as Memory Optimized.

synesty · September 14, 2017, 6:39am

Thanks @siri for your answer.
I am a little confused as you are not mentioning “at_plus” but only request or statement consistency.
On this page there are 3 consistency levels and at_plus is mentioned in context of RYOW

not_bounded Default value for single-statement requests.
No timestamp vector is used in the index scan. This is also the fastest mode as we eliminate the cost of obtaining the vector and any wait time for the index to catch up with the vector.

at_plus This implements bounded consistency. The request includes a scan_vector parameter and a value, which is used as a lower bound. This can be used to implement read-your-own-writes (RYOW).

request_plus This implements strong consistency per request. Before processing the request, a current vector is obtained. The vector is used as a lower bound for the statements in the request. If there are DML statements in the request, RYOW is also applied within the request.
If request_plus is specified in a query that runs during a failover of an index node, the query waits until the rebalance operation completes and the index data has rebalanced before returning a result.

statement_plus This implements strong consistency per statement. Before processing each statement, a current vector is obtained and used as a lower bound for that statement.

Are you suggesting that Statement consistency is appropriate for the use case I described?
I attended Couchbase day in Berlin 2 days ago and there the presentation by Tom was stating the at_plus should be used for RYOW. Statement_plus in my opinion is too strict. But maybe I misunderstand.

Thanks for the hint on MOI but this is not an option in our use case, as we need the Disk, because not all documents need to be in Memory all the time.

Siri · September 14, 2017, 7:13am

not mentioning “at_plus”

@synesty Sorry, I’ve reworded my response using the N1QL terminology. Perhaps @ingenthr can provide us a link to details of mapping between SDK options and the N1QL consistency levels.

not all documents need to be in Memory all the time

There is a new index storage engine, Plasma in EE, which provides significantly better index performance without requiring entire index to be resident in memory coming up, please see Introducing Plasma to try out the beta. While it is not as fast as Memory Optimized Indexes, it is still significantly higher performance than 4.x standard GSI index storage, while not require all parts of index to be resident in memory, unlike Memory Optimized Indexes.

subhashni · September 14, 2017, 5:55pm

Hi @synesty,

Yes you would need to provide the 300k documents to consistentWith for at_plus as they may belong to different vbuckets. at_plus needs the mutation state that is tracked with each document which is passed to the indexer. The compromise approach wouldn’t work here. at_plus and request_plus are both suitable in this case, it would be better to try both options and choose which has better performance.