How can an Intersect Scan return more items than MIN(inputs from respective indices). It's returning the MAX

naftali · November 14, 2019, 4:57pm

Need some insight into how intersectScans work. All I have to go on is this article:https://dzone.com/articles/performance-ingredients-for-nosql-intersect-scans

I’m having an issue where I have an intersect scan that involves two index scans:
the first returns x items and the second returns y=(x-z) items.

the intersect scan takes in (x+y) items and returns x items (the max of its inputs)

I’m missing something big, how is that possible. I’m guessing it’s a performance optimization where beyond some threshold of something, the engine just gives up and returns the larger results set?

(not an academic problem…after the 50,000 doc fetch and re-application of predicates, the result set = 163, really need to get this right.)

Please help. (CB v6 Enterprise)

vsr1 · November 14, 2019, 9:34pm

IntersectScan must do Fetch. And it re-applies the predicates. So the output of IntersectScan can have false positives.
#itemsIn in IntersectScan is sum of the all the input sources.
N1QL optimizer uses rule based. Don’t know anything about how many items each input produces.
Due to that IntersectScan on the N1QL is little different than pure Intersect.
When ever any input is done it uses all items received from that input as possible candidate and terminate other inputs early vs wait for all the inputs do pure intersect. Some times this performs well (input produces 10, input 2 produces 1M - in this case it terminates input 2 early) and some times not ( input 1 produces 50,000 input 2 produces 40,000 - intersect is only 100).

If really need pure Intersect use SET INTERSECT ( Q1 INTERSECT Q2) for each query as covered and projecting document ids as USE KEYS.

If intersectScan performing poorly avoid it by using USE INDEX.

naftali · November 14, 2019, 9:45pm

thank you, absolutely amazing explanation

naftali · March 12, 2020, 8:14pm

@vsr1

My understanding is that query node will make data requests from the indexer nodes for all qualifying indexes in an intersect scan, but will only use the first returning result set.

So these requests are done in parallel. (Correct?) My question is about load: even though the query node ignores all but the first returning results, the index node(s) either have to do all the work or else they have to do as much of the work as they get done before being told to stop if they are being told to stop.

How does this work (does the index node(s) do all the scan work, or are they told to stop?)

Are these scans done in parallel in a one index node deployment by default or not (version 6.0)

Thank you for your help as always.

vsr1 · March 12, 2020, 8:28pm

In the explain underneath IntersectScan there is IndexScan. All Index scans are done in parallel irrespective of indexer nodes. As explained previous post N1QL uses early termination (vs pure intersect) i.e when one indexscan completes it terminates all other indexscans under intersectscan and uses all the results from the indexscan as possible candidates and does fetch .

Some times this works better if one of the scan completes quickly. If not there might be lot work might be wasted. Avoid intersectScan specifying single index in USE INDEX hint.

naftali · March 12, 2020, 8:37pm

Right, I’m trying to understand more about " it terminates all other indexscans under intersectscan"

If I’m understanding correctly, it’s possible that for a query requiring an intersectScan over, say, 3 secondary indexes, the indexer node(s) end up doing 3 times the work, depending on how early or late into the respective scan operations, they get terminated, and the more distributed the index nodes, the more likely that they do more work for a intersectScan.

If so this give us an additional incentive (in addition to unpredictable query response time) to avoid intersectScan.

vsr1 · March 12, 2020, 8:43pm

It is possible. Some times it is much worse because it needs to fetch the documents.
As rule based decision can’t decide which index is better it does InterSectScan.
Based on data/patterns do experiment with one by one index for all index(s) in intersectscan and select best one, provide as USE INDEX hint.

Also you can use https://index-advisor.couchbase.com/indexadvisor/ see if you get any better recommendation