Any way to use standard analyzer & still include stop words?


#1

Is there any way to force the Standard analyzer to include stop words?

As documented here (https://issues.couchbase.com/browse/MB-18631), stop words are removed from the index when using the standard analyzer.
You can demonstrate this with Bleve’s online Text Analysis @ http://analysis.blevesearch.com/analysis.

We have an FTS index on a string field, and have a requirement to include stop words in the queries & results.

We are using a PhraseQuery for the search, separating the search string into terms at spaces.

This worked fine, until we tested finding a stop word, which failed.

Removing stop words from the query terms also failed (I’m guessing this happens because the index still counts stop words when calculating the “position” of each indexed term).

We changed the analyzer of the FTS index to “simple”, which does not exclude stop words.
After that, searching for a stop word worked.

Given a document containing “THIS GEORGE OF ENGLAND” in the indexed field:

  PhraseQuery
  Search Term(s)           Analyzer   Match Found?
  -------------------      ---------  ------------
  "george"                 standard        Y
  "george"/"of"            standard        N
  "george"/"england"       standard        N
  "george"/"of"/"england"  standard        N

  "george"                 simple          Y
  "george"/"of"            simple          Y
  "george"/"england"       simple          N
  "george"/"of"/"england"  simple          Y

However, searching for a word that contains an apostrophe always fails with the “simple” analyzer.

Given a document containing “THIS G'EORGE OF ENGLAND” in the indexed field:

  PhraseQuery
  Search Term(s)            Analyzer   Match Found?
  -------------------       ---------  ------------
  "g'eorge"                 standard        Y
  "g'eorge"/"of"            standard        N
  "g'eorge"/"england"       standard        N
  "g'eorge"/"of"/"england"  standard        N

  "g'eorge"                 simple          N
  "g'eorge"/"of"            simple          N
  "g'eorge"/"england"       simple          N
  "g'eorge"/"of"/"england"  simple          N

Is there any way to customize/hack the query or index to use the standard analyzer and still include stop words?

Or can we create our own analyzer that behaves like “simple” but handles apostrophes, etc. properly?

Also, we have a RegexpQuery that works with the standard analyzer, but fails with the simple analyzer.
The document’s indexed field contains 578206327, the regex is “^\d{5}6327”.
Using the standard analyzer finds a match.
Using the simple analyzer does not.
Is that proper behavior?

Thanks very much for your time!
Jeff


#2

Small update:

I dug into the Bleve source and found that the Simple analyzer has the “letter” tokenizer & the “lowercase” token filter.

The Standard has the “unicode” tokenizer, and “lowercase” & “english stop words” token filters.

We are trying with a custom analyzer using the “unicode” tokenizer & “lowercase” token filter.
Initial results are promising…
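
To make the difference between the three pipelines concrete, here is a rough Python model of them. This is only a sketch: the regular expressions approximate the “letter” and “unicode” tokenizers, the stop list is a tiny made-up stand-in for the real English list, and none of the function names come from Bleve.

```python
import re

# Assumption: a tiny illustrative stop list, NOT Bleve's actual English list.
STOP_WORDS = {"this", "of", "the", "and", "a", "an"}

def letter_tokenize(text):
    # Approximates the "letter" tokenizer: a token is a run of letters only,
    # so digits and apostrophes act as separators.
    return re.findall(r"[^\W\d_]+", text)

def unicode_tokenize(text):
    # Rough approximation of the "unicode" tokenizer: digits are kept and an
    # apostrophe between word characters does not split the word.
    return re.findall(r"\w+(?:'\w+)*", text)

def simple_analyze(text):
    # "letter" tokenizer + lowercase filter
    return [t.lower() for t in letter_tokenize(text)]

def standard_analyze(text):
    # "unicode" tokenizer + lowercase filter + stop-word filter
    return [t.lower() for t in unicode_tokenize(text) if t.lower() not in STOP_WORDS]

def custom_analyze(text):
    # "unicode" tokenizer + lowercase filter, no stop-word filter
    return [t.lower() for t in unicode_tokenize(text)]

if __name__ == "__main__":
    for name, fn in [("simple", simple_analyze),
                     ("standard", standard_analyze),
                     ("custom", custom_analyze)]:
        print(name, fn("THIS G'EORGE OF ENGLAND"), fn("578206327"))
```

In this model the simple pipeline splits “G'EORGE” into two terms and produces no terms at all for a purely numeric value, which would be consistent with the failures in the tables above.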


#3

There are a lot of questions here, I’ll try to address them but if I miss one let me know.

First, regarding the standard analyzer removing stop words, you are correct. Many modern search solutions no longer choose to remove stop words, so why does FTS still do it? Right now, the FTS index has a size issue, and it can get quite large. Because stop words are very frequent, indexing them would make the FTS index even larger. So, for now we’ve decided to continue removing stop words. In the future, as we address the index size issues, we will revisit this decision.

Second, you reported problems with PhraseQuery. This too is expected when using it on a field that is analyzed, because PhraseQuery does not analyze its terms. In your case you should use MatchPhraseQuery, and you’ll find that it works correctly when you include the stop words in the search phrase. The reason is as you figured out subsequently: the stop words need to be removed, but the position offsets still have to account for them. This works correctly when the search phrase is “this george of england”, because we remove this/of but keep the correct relative offsets of george/england. Please try MatchPhraseQuery and see if it is a simpler solution to your problem.
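
The position behavior described above can be modeled in a few lines of Python. This is a simplified sketch, not Bleve’s implementation: the stop list is a two-word assumption, and `phrase_query` / `match_phrase_query` are hypothetical helper names that only mimic the two query types.

```python
import re

STOP_WORDS = {"this", "of"}  # assumption: tiny illustrative stop list

def analyze(text):
    """Return (term, position) pairs. Stop words are dropped from the
    output but still consume a position, so gaps are preserved."""
    pairs = []
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    for pos, tok in enumerate(tokens, start=1):
        if tok not in STOP_WORDS:
            pairs.append((tok, pos))
    return pairs

def build_index(text):
    """Map each indexed term to the list of positions where it occurs."""
    idx = {}
    for term, pos in analyze(text):
        idx.setdefault(term, []).append(pos)
    return idx

def phrase_match(idx, terms_with_pos):
    """True if every term occurs at the same relative offsets as in the query."""
    if not terms_with_pos:
        return False
    first_term, first_pos = terms_with_pos[0]
    for start in idx.get(first_term, []):
        if all(start + (pos - first_pos) in idx.get(term, [])
               for term, pos in terms_with_pos):
            return True
    return False

def phrase_query(idx, terms):
    # PhraseQuery-style: terms are used verbatim and assumed to be adjacent.
    return phrase_match(idx, [(t, i) for i, t in enumerate(terms)])

def match_phrase_query(idx, text):
    # MatchPhraseQuery-style: the query text goes through the same analysis
    # as the field, so stop words vanish but their gaps remain.
    return phrase_match(idx, analyze(text))

idx = build_index("THIS GEORGE OF ENGLAND")
print(phrase_query(idx, ["george", "of", "england"]))  # "of" was never indexed
print(match_phrase_query(idx, "george of england"))    # gap of one position kept
```

Under this model, "george"/"england" as a PhraseQuery fails because the index records a one-position gap between the two terms, while an analyzed phrase that includes “of” reproduces that gap.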

Third, there is no way to change the behavior of the standard analyzer, but FTS includes the ability to define your own analyzer. So, what you can do is define your own, following the pattern of the standard analyzer, but simply omit the stop-token removal filter. Here are the rough steps:

  1. Inside the mapping, expand the Analyzers section
  2. Press the + Add Analyzer button
  3. Choose a name like standard-no-stop
  4. Skip the character filters section, we don’t need any
  5. In the tokenizer section, make sure unicode is selected (should be the default already)
  6. In the token filters section select “to_lower”, it should be the last item in the list.
  7. Be sure to press the +Add button
  8. Now press Save
  9. Finally, to make this new analyzer the default for your index, expand the Advanced section and in the field Default Analyzer, change it from “standard” to the new one you defined, “standard-no-stop”.
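
For reference, the steps above should correspond to a fragment of the JSON index definition roughly like the one this Python snippet prints. Treat it as a sketch: the key names (`analysis`, `analyzers`, `type: custom`, `char_filters`, `tokenizer`, `token_filters`, `default_analyzer`) reflect my understanding of the index definition format and should be checked against the JSON that the UI shows for your own index.

```python
import json

# Name chosen in step 3 above.
analyzer_name = "standard-no-stop"

# Assumed layout of the "mapping" part of an FTS index definition.
mapping_fragment = {
    "analysis": {
        "analyzers": {
            analyzer_name: {
                "type": "custom",
                "char_filters": [],             # step 4: no character filters
                "tokenizer": "unicode",         # step 5
                "token_filters": ["to_lower"],  # step 6: lowercase only, no stop filter
            }
        }
    },
    # step 9: make the new analyzer the index default
    "default_analyzer": analyzer_name,
}

# Only the relevant part of the definition is shown; a real index definition
# has more fields (name, source bucket, type mappings, and so on).
print(json.dumps({"params": {"mapping": mapping_fragment}}, indent=2))
```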

Finally, you reported a number of problems with the simple analyzer. I do not recommend you use it. It is included because it is very fast, but unfortunately it accomplishes this by doing a very poor job of tokenizing the text. As you observed, it completely skips over numbers and does not handle punctuation very well either.
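
The RegexpQuery result from the first post follows from the same tokenizing behavior. Here is a small Python sketch, assuming the regular expression is tested against each indexed term; the two tokenizer functions are regex approximations with made-up names, not Bleve code.

```python
import re

def letter_terms(text):
    # Approximation of the simple analyzer: runs of letters, lowercased.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def unicode_terms(text):
    # Approximation of a unicode-tokenizer analyzer: digits are kept.
    return [t.lower() for t in re.findall(r"\w+(?:'\w+)*", text)]

def regexp_hits(terms, pattern):
    # Assumption: the regexp is applied to each indexed term in turn.
    rx = re.compile(pattern)
    return [t for t in terms if rx.match(t)]

field_value = "578206327"
pattern = r"^\d{5}6327"

print(regexp_hits(unicode_terms(field_value), pattern))  # the number is one term
print(regexp_hits(letter_terms(field_value), pattern))   # no terms were indexed
```

With a letters-only tokenizer the numeric field produces no terms, so there is nothing for the regular expression to match.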

Hopefully one of these other options, either using MatchPhraseQuery or your own custom analyzer standard-no-stop, works for you.

marty


#4

Marty,

Thanks very much for all the good info!

I think we had tried MatchPhraseQuery in the past and ruled it out because we only want exact matches (this search field is for proper names). My memory is a bit hazy :frowning:.

If I recall correctly, we settled on PhraseQuery because we want exact matches, and also want whitespace around & between the search words to be ignored. So searching for “van gough” should find “vincent van<space><space>gough, III”, etc.

It sounds like a custom analyzer should get us almost all of what we want… At least until we get the inevitable requirement that searching for “george england” should find “george of england”. But we’ll deal with that when/if it happens.

Thanks again for your quick response!
Jeff


#5

On step 7, even though I press the + Add button and then Save, the “to_lower” selection is not saved. When I open the edit-analyzer dialog, I don’t see the “to_lower” option selected. Is there a programmatic way to use the “to_lower” option?


#6

I’d like to explore this with you directly to see if it is a bug with the UI.
But yes, everything can be defined programmatically via the JSON index definition. I show an example in this short video: as you edit in the UI, it updates the JSON sample so you can copy/paste easily.


#7

@mschoch, I understand that performance is a primary objective for FTS. But I’d like to make a case for exposing the stop words behavior as a configuration toggle (on the FTS index, or wherever).

A Common Use Case
We chose to implement Couchbase Lite for our enterprise Xamarin app, specifically because FTS comes out of the box, and our searchable corporate directory is a very popular feature. There are about 9,000 records right now, with a likely maximum around 25,000.

When I search our corporate directory for names using an FTS wildcard search (across 3 properties: firstName, lastName, departmentName), valid, real-world search criteria produce inconsistent results. For example, searching for “And*” should give me people records for “Andy”, “Andrew”, “Anderson”, etc. But of course I get zero results because “and” is a stop word. Conversely, if I search for “Tho*” then I do get records for “Thomas”, “Thompson”, etc. So, from the end user’s perspective, our search has a noticeable consistency problem.

Unintended Consequences
Returning zero results instantly feels like a bug. This creates a UX problem to solve, because we have to educate users about why their valid search criteria run into technical limitations, which words are invalid, etc. Although there are ways to do this, we shouldn’t have to take on a user education burden. As you probably know, well-designed mobile apps impose minimal cognitive load on their users. So, we feel painted into a corner…

Possible Solution
Is it feasible for Couchbase Lite’s internal implementation to use a boolean flag which determines whether or not stop words are included? If so, the flag would be false by default, so that stop words are not included. But then expose this flag as an optional parameter when an FTS index is created in code, by developers.

  1. If performance suffers, it’s easy enough to switch the toggle back to its default value.
  2. Small databases (mobile) should be less affected, correct? In other words, does the performance penalty correlate directly to the FTS index size (rows * indexed columns)?

Thanks for considering… :slight_smile:

Andy


#8

@ajhuntsman

Everything you’ve described is possible today, it just isn’t as easy to configure as a checkbox. As has been explored in the rest of this thread, it’s possible to build your own version of the standard analyzer that keeps the stop words in the index.

When we made the decision to ship analyzers which removed stop words, it wasn’t just about performance in a vacuum. It was the specific fact that it makes the index larger, in the context that FTS indexes are already quite large to begin with. Now, since that time, we’ve made some great improvements to reduce the size of the index (upcoming scorch index format), so we are having conversations internally about revisiting this decision. That being said, index size is just one factor in the decision to include stop words. If we were to promote building indexes that include stop words, we’d probably also need to ship something like the “common terms query” that Elasticsearch offers. This type of query can cope with very high frequency terms at query time. See this link for more information on it: https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-query

Finally, you mentioned Couchbase Lite near the end of this post, and I just want to clarify that the full-text offering in Couchbase Lite has a completely different implementation, and so mixing these two in the same conversation is likely to lead to confusion.