Top N over aggregated (MapReduced)

Hi,
I have a scenario where I need to find the top N result of a map reduce (aggregation). How do I do that without retrieving the entire result of the map reduce and then sort it?
I will give an SQL example:

Select top 10 departmentID, count(*) departmentCount
from employees
group by departmentID
order by departmentCount desc

I know how to write a map-reduce view to count per deprtmentID but the reduce will set the result to “value” and as far as I can tell - couchbase.query cannot sort by value (only by key) of the view.
So the only option I see now (without using N1QL) is to retrieve the entire view, sort the result based on value and return the top N resutls. That produces very bad performance if the number of departments is millions or billions (like in my case).

2 Likes

Hi @alonspiegel,

I would think that N1QL would be the right fit, since you are already comfortable with SQL and you’ve already used it to describe your problem declaratively. If you’re having a performance issue with N1QL, you may want to discuss it on the N1QL forums (https://www.couchbase.com/forums/c/n1ql).

I’ve tried to figure out how to do this with Map/Reduce, and I think the source of the problem is the inability to sort the results by the values. So, using group_level=1, I can properly aggregate the counts (very simple with _count), but you’re right that you’d have to then sort them on the client side. If that count could somehow be added to the key, then sorting would be possible (you might have to add leading zeros, since I believe it sorts the keys lexically, and not numerically). There might be a trick I’m missing, but I don’t see a way to do that.

Thank you Matthew for your response.
However, I don’t see how N1SQL will really be effective.
It will not save network traffic as the data have to be retrieve to the query server (using the index server of course) from all other nodes in the cluster instead of being retrieved to my own backend server.
It will not save calculation resources (sort and topN) as it will just do it on the query server instead of on my own backend server.
Also, one of my main concerns is to reduce the amount of reads (physical/logical) per query since this query needs to run hundreds of times per second - so N1QL will exhaust my cluster’s resources only for this query.

In my mind I can imagine a MapReduce that can accept input from other view (Map/MapReduce) and not only from documents (like today). This would certainly do the trick.
I know it is not possible today - so I would hope it will be road-mapped in the coming Couchbase release. Just imagine the level of complexity one can achieve with such a feature…
Respectfully and humbly,
Alon

@alonspiegel,

I understand, and I will pass along your feedback.

One more suggestion, have you looked at this blog post about using Map and Reduce View for Ranking?

Hi @alonspiegel,

Can you write the results of the first view into a bucket, and then apply a different MapReduce to get the top N.

Hi Geraldss,
Of course I can, but doing so defies the purpose which is to eliminate the massive reads of the first phase in your suggestion.
Since I cannot tap into the incremental map-reducer mechanism I would need to query the entire first map-reduce in order to repopulate the 2nd collection of documents. So maybe instead of scanning the first map-reduce 100 times a second I can do it once a second from my code - but even that operation will take more than 10 seconds end-to-end: if we are talking about million of documents (deleting all of the old document, and the new results will have to be retrieved, transferred to my backend and transferred back to Couchbase - and reapplying the 2nd map-reduce on them)

Regards,
Alon

I read the blog post - it uses a value from the original document to sort - it is a result from the map phase.
This, unfortunately, is not my case.
My case involves sorting by the result of a reduce phase.

Hi @till @yingyi,

Is this a good use case for Analytics DP? Either current DP or next one?

You can also test this with N1QL. @yingyi added an optimization for Top N, so that only a few items are sorted. cc @vsr1

I am not a N1QL expert but I am guessing the optimization is to reduce memory footprint during the sorting process. It does not reduce the amount of reads.

I am also not a Couchbase Analytics expert but I am guessing that the only improvement will come from using parallel working threads - still needing to read all the data all the time. It might reduce the time to respond but it will not reduce the amount of resources consumed to achieve the response.

What I am trying to say is that all the solutions involving massive scans - ultimately will translate to the need to pay for more IOPS. I wish to scale my system using intelligent methodology, not by purchasing stronger and more expensive hardware (when I don’t have to of course).

I chose Couchbase thanks to the incremental map-reduce. That is a fantastic feature which makes Couchbase (and CouchDB) kind of unique. This is the kind of smart methodology I am looking for.

Regards,
Alon

@geraldss Yes, that’s a good use case for Analytics DP.
The current DP can setup the user experience while the next DP will be more reliable.

Download (click the “extensions” tab):
https://www.couchbase.com/downloads

Documentation:
http://docs.couchbase.com/prerelease/analytics/introduction.html

Blog:
https://blog.couchbase.com/analytics-dp-1/

Talk video:

@alonspiegel,

I have created a case to add the ability for MapReduce views to accept input from other views. You can track the case here: https://issues.couchbase.com/browse/MB-23066