FTS partial phrase search


#1

Hi All,
I want to search a document field that holds large paragraphs of text.
For example, let's say the field value is “Today is a rainy day. Umbrella is required. I came yesterday. I will come for five weekdays.”

Does FTS create an inverted index by tokenizing on whitespace, so that I can search by exact word match?
For example, a search for “rainy” would return 1 result and a search for “required” would return 1 result, but a search for “Tod” would return 0, since it is only a partial match.

I want to search for the partial phrase “day”. What additional index-level configuration is needed to perform partial matches in a text field, regardless of whether the match is a prefix or a suffix?

Please send your replies.
Thanks
Isuru


#2

FTS creates an index by tokenising (text analysis) the document field contents. But this depends on the type of analysers configured for the field in the index definition/mapping.
Useful information can be found here:
https://docs.couchbase.com/server/6.0/fts/fts-creating-indexes.html
https://docs.couchbase.com/server/6.0/fts/fts-using-analyzers.html

For partial searches, you may need to use different query types like prefix, wildcard, or regex.
https://docs.couchbase.com/server/6.0/fts/fts-queries.html
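
As a rough sketch of what those three query types look like as JSON request bodies: the field name `content` and the search terms are assumptions made up for illustration, and the exact shapes should be checked against the query documentation linked above.

```python
import json

# Hypothetical field name; replace with a field that is actually indexed.
FIELD = "content"

def search_body(query, size=10):
    """Wrap a query object in a minimal search request body."""
    return {"query": query, "size": size}

# Matches terms that start with "tod" (e.g. "today").
prefix_query = {"prefix": "tod", "field": FIELD}

# Matches terms containing "day" anywhere; * stands for any run of characters.
wildcard_query = {"wildcard": "*day*", "field": FIELD}

# Same intent expressed as a regular expression over indexed terms.
regexp_query = {"regexp": ".*day.*", "field": FIELD}

for q in (prefix_query, wildcard_query, regexp_query):
    print(json.dumps(search_body(q)))
```

Note that the search terms are written in lower case here on the assumption that the analyser lowercases terms at index time.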

Depending on the analysers used, the token “rainy” could become “rain” in the index.
Another important thing to be aware of is that, during the query processing phase, the query text also goes through the same text analysis process. It uses the same analyser that is configured for the respective fields in the index definition you are querying.
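
To make the idea concrete, here is a toy inverted index in plain Python. It is only an illustration, not how FTS is implemented: the `analyze` function is a made-up stand-in for a real analyser (it lowercases and splits on non-alphanumeric characters, with no stemming), and the document id is hypothetical.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyser: lowercase, then split on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def match(index, query_text):
    """Exact term lookup; the query text goes through the same analyser."""
    hits = set()
    for term in analyze(query_text):
        hits |= index.get(term, set())
    return hits

docs = {
    "doc1": "Today is a rainy day. Umbrella is required. "
            "I came yesterday. I will come for five weekdays."
}
index = build_index(docs)

print(match(index, "Rainy"))  # {'doc1'}: the query is lowercased like the indexed text
print(match(index, "Tod"))    # set(): "tod" is not a whole term in the index
```

In this toy index a search for “day” only hits the standalone term “day”; “today”, “yesterday” and “weekdays” are separate terms, which is why partial matching needs a different query type or a different analyser.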

regards,
Sreekanth


#3

Hi Sreekanth,
Thanks for the reply.
Just a little more clarification needed.

Let's say I want to do a traditional LIKE search, %searchText%. This lists all the matches regardless of whether the search text appears as a prefix or a suffix.

Do you mean that in FTS you need to combine a prefix query, a suffix query and another regex query for the other scenarios? Meaning you need at least 3 queries altogether to do a partial search?


#4

I guess you might have figured this out already by going over the query types.
You just need to use one of those query types, depending on the exact query requirements.

In case you need to perform exact partial/substring searches as you mentioned before, you may need to use n-gram analysers during indexing. We don’t support certain regular expression features (in regex queries) for performance/scalability reasons. For example: “word boundaries are not allowed”.

https://docs.couchbase.com/server/6.0/fts/fts-using-analyzers.html
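
As a small illustration of what n-gram analysis produces, the sketch below generates the n-grams of a single token in plain Python. The `ngrams` helper is hypothetical and is not the FTS token filter itself; the min/max lengths are example values.

```python
def ngrams(token, min_len, max_len):
    """Return every substring of token whose length is between min_len and max_len."""
    out = []
    for n in range(min_len, max_len + 1):
        for start in range(len(token) - n + 1):
            out.append(token[start:start + n])
    return out

print(ngrams("weekdays", 3, 3))  # ['wee', 'eek', 'ekd', 'kda', 'day', 'ays']
print(ngrams("today", 3, 3))     # ['tod', 'oda', 'day']
```

With 3-grams indexed, an exact term lookup for “day” would hit both tokens, whether “day” sits in the middle or at the end. The catch is that the search text has to fall within the configured min/max lengths.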


#5

Another option is to use FTS to narrow the data down to a candidate result set and then use N1QL to apply the LIKE predicate on top of it.

See: https://blog.couchbase.com/curl-comes-n1ql-querying-external-json-data/

Doing things like this will become easier in the upcoming release.


#6

Sreeks,
Is there any way to set the minimum and maximum of the n-gram analyser depending on the token length?
i.e. n-gram min 1 and max length(token)

This would ensure that all the required token combinations are indexed.

Thanks


#7

I don’t think there is a way to do this with dynamically varying token sizes.
In any case, this approach certainly won’t scale. It would have a huge space amplification factor for the inverted index, and it won’t work on a reasonably heavily loaded system.
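
A back-of-the-envelope count shows where the amplification comes from. The sketch below is plain Python arithmetic, not a measurement of a real index; it counts the n-grams emitted per token (duplicates included) for the “min 1, max length(token)” idea.

```python
def ngram_count(length, min_len, max_len):
    """Number of n-grams emitted for one token of the given length."""
    return sum(length - n + 1 for n in range(min_len, min(max_len, length) + 1))

# Every substring of an 8-character token such as "weekdays": 8 + 7 + ... + 1
print(ngram_count(8, 1, 8))   # 36

# Fixed-size 3-grams of the same token
print(ngram_count(8, 3, 3))   # 6

# Growth is quadratic in token length: L * (L + 1) / 2
for length in (5, 10, 20):
    print(length, ngram_count(length, 1, length))
```

So a single 8-character word turns into 36 index entries instead of 1, and longer tokens get worse quickly.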


#8

Thanks Sreeks.
Hmm, then the way to go should be regex-based search for partial phrases.
I presume regex should be enough to do a prefix, suffix or middle match on a certain field.
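
A quick way to reason about this is to run the pattern over a list of indexed terms. The sketch below uses Python's `re` module on a made-up term list and assumes the pattern has to match a whole term; that assumption, and any differences between Python's regex syntax and the one FTS uses, should be verified against the FTS documentation.

```python
import re

# Hypothetical set of lowercased terms, as a simple analyser might index them.
terms = ["today", "rainy", "day", "umbrella", "required", "yesterday", "weekdays"]

def regexp_terms(pattern, terms):
    """Return the terms that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

def contains_pattern(text):
    """Build a 'contains' pattern from literal search text, lowercased and escaped."""
    return ".*" + re.escape(text.lower()) + ".*"

print(regexp_terms(contains_pattern("day"), terms))  # ['today', 'day', 'yesterday', 'weekdays']
print(regexp_terms("day", terms))                    # ['day']
print(regexp_terms(".*Day.*", terms))                # []
```

One pattern covers the prefix, suffix and middle cases, as long as the search text is lowercased to match the indexed terms.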


#9

Hi Sreeks,
Need clarification on the below.

  1. When an index is defined in the FTS console, we index a “country” field for a given type, and in the Java SDK we write, for example, SearchQuery.match(“usa”), so it simply matches against the indexed country field.
    If we write SearchQuery.match(“FL”).field(“state”), does it bypass the index and do a key-value search, or do we need to index the “state” field as well?

  2. If I search using multiple fields, let's say 3 fields in the same document type, is it good practice to create 3 indexes and combine 3 queries in a conjunction query, or is there some other method?

  3. Is there any way to search by fields other than the indexed fields? This might be related to the 1st concern.

  4. SearchQuery.regexp(".*texttosearch.*") (dot asterisk texttosearch dot asterisk) fetches all the partial phrases matching “texttosearch”. Prefixes and suffixes seem to be included too. I just need to ensure that lower-case text is passed. So this resolves the partial text search issue.

Please clarify the first 3 points; I would also appreciate your feedback on regex-based partial text search.

Thanks
Isuru


#10

Sreek,
After further exploration I found that we can index multiple fields under the same FTS index, and I can use the field name to search on the desired indexed field.

You can ignore my previous question. However, your input is still welcome.
Thanks
Isuru


#11

Hey,
Let me try to give a brief answer to each of your concerns:

  1. FTS is capable of serving search requests only on indexed fields. It won’t do anything extra, nor will it throw an error saying the field is not indexed. It is up to the administrator to ensure that the right fields are indexed. For SDK-specific questions, it is better to post in the SDK forums to get quick and precise answers.

  2. For just three fields, it is normally better to create a single FTS index comprising all three of them and perform suitable queries on it. Having said this, there could be exceptional cases arising from scaling/performance/data impedance reasons to do it otherwise.

[There is a feature called FTS index alias. An alias can cover multiple FTS indexes; once you submit a query against an index alias, the search is performed on all the indexes behind it and the results are returned. Refer to the documentation.]

  3. Searching un-indexed fields is not a capability FTS provides.

  4. Passing lower case is expected, as that results from the analysers used. Please refresh yourself on those documentation links. :)
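
For point 2, a query against a single index that covers several fields could be sketched as a conjunction of field-scoped match queries. The JSON body is built in Python here; “country” and “state” come from the question above, while the “city” field, its value and the helper names are assumptions made up for illustration.

```python
import json

def field_match(text, field):
    """A match query restricted to one indexed field."""
    return {"match": text, "field": field}

def all_of(*queries):
    """A conjunction: every child query must match the document."""
    return {"conjuncts": list(queries)}

body = {
    "query": all_of(
        field_match("usa", "country"),
        field_match("FL", "state"),
        field_match("miami", "city"),  # hypothetical third field
    )
}

print(json.dumps(body, indent=2))
```

All three fields have to be part of the index definition for this to return anything, in line with points 1 and 3.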

Cheers!,
Sreekanth


#12

Sreeks,
Thanks for the reply. I have now worked out what is needed for my app. For lower case I can use the to_lower filter so that terms are stored in the index in lower case.

Thanks for the continuous support

Regards
Isuru


#13

Sreeks, based on your inputs and the Couchbase URLs, I gave a public talk on Couchbase FTS.

If you get time, go through it and let me know if I got anything wrong. I am a newbie to Couchbase and FTS.

Thanks
Isuru