FTS: slow sort if sorted by indexed field


#1

Hi,

I have a full-text index on a bucket with about 250k documents in the following JSON format:

{
  "title": "foo bar",
  "timestamp": 1234,
  "lead": "foo bar",
  "type": "story",
  "text": "foo and bar"
}

A query like

{"size": 100, "sort": ["_score"], "query": {"query": "and"}}

returns in < 20 ms with about 75k results (yes, I know “and” is a bad example, but it is a good test to see what happens with a large result set).

If I change to a custom sort field like

{"size": 100, "sort": ["timestamp"], "query": {"query": "and"}}

the query takes > 1 s, so at least 50 times slower.

Is there a way to make this faster? Or at least a reason why a custom sort is so much slower than a sort with _score or _id?


#2

Hi,

The first thing to note is that “and” is a stop word, removed by both the “en” and “standard” analyzers. So I’m assuming you’ve already adjusted the mapping to use a different analyzer.

Why is custom sort so much slower than _score or _id? The reason is that for every hit matching the search (i.e., all 75k) we have to load (often from disk) additional information we don’t have (in this case the timestamp). Even though you only want the top 100, we have to do these loads for all 75k matches to find the top 100 timestamp values. In contrast, for both _score and _id, we already have this information readily available, and no additional loads are required.
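To make the cost difference concrete, here is a minimal Python sketch of the two paths. It is only a toy model of the explanation above, not Bleve's actual code: `load_field` is a hypothetical stand-in for the per-document lookup, and the data is made up.

```python
import heapq

def top_n_by_score(hits, n):
    """Rank by a value each hit already carries: no extra lookups."""
    return heapq.nlargest(n, hits, key=lambda hit: hit["score"])

def top_n_by_field(hits, n, load_field):
    """Rank by a stored field: one lookup per matching hit, not per returned hit."""
    loads = 0
    keyed = []
    for hit in hits:
        value = load_field(hit["id"])  # hypothetical per-document load, possibly from disk
        loads += 1
        keyed.append((value, hit["id"]))
    return heapq.nsmallest(n, keyed), loads

# Toy data: 75,000 matches, of which only 100 are returned.
timestamps = {f"doc{i}": (i * 7919) % 75000 for i in range(75000)}
hits = [{"id": doc_id, "score": 1.0} for doc_id in timestamps]

top, loads = top_n_by_field(hits, 100, timestamps.__getitem__)
print(len(top), loads)  # prints: 100 75000
```

The number of loads follows the size of the match set (75k), not the requested page size (100), which is why shrinking the match set is what helps.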

The only tip to make it faster is to try to reduce the number of documents matching the search (i.e., the 75k, not the 100). In your example, if you already knew that you would get more than 100 hits from “today”, you could add an additional search clause to filter the matches by timestamp as well. This would reduce the number of hits for which we have to load the timestamp.
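A rough sketch of what such a request body could look like, built in Python. The assumptions here are mine: the timestamp is treated as a plain number (for example epoch seconds), the cutoff value is made up, the helper name `recent_sorted_query` is hypothetical, and "-timestamp" is used to get newest first (the original query sorted ascending).

```python
import json

def recent_sorted_query(term, since_ts, size=100):
    """Build a search body that only matches documents at or after since_ts."""
    return {
        "size": size,
        "sort": ["-timestamp"],  # assumption: newest first
        "query": {
            "conjuncts": [
                {"query": term},  # the original query string query
                {"field": "timestamp", "min": since_ts, "inclusive_min": True},
            ]
        },
    }

# The cutoff is an illustrative value, not taken from the thread.
body = recent_sorted_query("and", 1700000000)
print(json.dumps(body, indent=2))
```

The numeric range clause shrinks the set of matches before sorting, so fewer timestamp loads are needed. It only gives the right answer if the narrowed window still contains at least `size` hits.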

marty


#3

Hi Marty,

I did it in German and just translated everything for the forum post here. So the German “und” was not a stop word using the standard analyzer :slight_smile:

Thanks for your answer, very interesting. Do you know why Elasticsearch (yes, I know, Bleve is always compared to it ;-)) has similar response times no matter which sort option I choose?

Actually, I am just playing with Couchbase FTS, and except for this issue I’m quite happy with it. Maybe it can replace ES for us at a later time; reducing complexity and the number of separate servers is always good :wink:

Thanks, Pascal


#4

Lucene has two capabilities, FieldCache and IndexDocValues, that are used to make this faster. I’m not exactly sure how ES uses these without some explicit configuration (automatically building them for all fields would seem to waste RAM and not be what you want).

Here are two articles explaining how they work:

marty


#5

Do you plan to implement something like this in Bleve? A manual configuration of the fields used for sorting would be cool.
Thanks for the link, now I have something to read on this rainy weekend :wink:


#6

Yes, we’d like to do something similar, probably initially with something in the mapping to give us a hint that you plan to sort on a field.

marty


#7

A year ago we decided to stay with Elasticsearch because fast sorting by date was too important for us. Now I have been testing FTS again. First with moss storage, with the same result as a year ago. After a hint from @sreeks I tested the same with scorch, and… the results are amazing. More than 10 times faster than moss for these sort queries, and comparable with ES for my test set of now 200k results. So for everyone who needs custom sort, scorch is definitely worth trying.

@mschoch, do you have any information on how much overhead custom sort now has compared to native _id or _score? There is still the idea of a workaround: using a timestamp as the _id, which would give the correct sort order, if that’s faster and more CPU friendly.
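A minimal sketch of that workaround idea, assuming the _id is compared as a string: the timestamp has to be zero-padded to a fixed width, otherwise “100000” would sort before “99”. The helper name, the width of 12 digits, and the “::” separator plus suffix (to keep ids unique when two documents share a timestamp) are all hypothetical choices, not something from Couchbase.

```python
def make_sortable_id(timestamp, suffix, width=12):
    """Build a document id whose string order matches numeric timestamp order."""
    if timestamp < 0 or timestamp >= 10 ** width:
        raise ValueError("timestamp does not fit into the fixed width")
    return f"{timestamp:0{width}d}::{suffix}"

docs = [(1234, "a"), (99, "b"), (100000, "c")]
ids = [make_sortable_id(ts, name) for ts, name in docs]

# Plain string sort of the padded ids follows the timestamps: 99, 1234, 100000.
print(sorted(ids))
```

With ids built like this, a sort on _id would follow the timestamp order. Whether that is actually faster than sorting on the field with scorch would still need to be measured.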


#8

Hi Pascal, as my time zone overlaps more with yours, let me offer some insights.

Yes, scorch has a newer implementation (its own versions of docValues/FieldCaches). But the Bleve implementation is not as elaborate and optimised as Lucene’s.

The overall custom sort flow described earlier in this thread remains the same.
Hence the overhead is still higher for custom fields than for native _score/_id.

To me, the workaround you suggested seems plausible in theory and may bring some improvement; it may be worth giving it a try.

Cheers,
Sreekanth