Character Filter


#1

Hi,

I’m experimenting with character filters. The idea is to replace German umlauts and some other characters. I can’t use the standard “de” filter, because I need prefix/wildcard search. So the plan is to replace them in Couchbase via a character filter (ü -> u etc.) and to apply the same replacement manually to the query string (because FTS doesn’t apply the analyzer to wildcard/prefix queries).
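For the manual query-side step, here is a minimal Python sketch. The mapping table and the name normalize_query are hypothetical and would have to mirror whatever character filters the index actually defines. It composes the input to NFC first, so a decomposed "u + combining diaeresis" is handled the same way as a precomposed "ü".

```python
import unicodedata

# Hypothetical mapping: keep it in sync with the index's character filters.
CHAR_MAP = str.maketrans({"ü": "u", "ö": "o", "ä": "a", "ß": "ss"})

def normalize_query(text: str) -> str:
    """Apply the same replacements to a query string as the index-side filter."""
    # Compose first, so "u" + U+0308 becomes the single code point "ü".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(CHAR_MAP)

print(normalize_query("müll*"))
```

The wildcard character is left untouched, so the result can be passed straight into a wildcard or prefix query.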

I indexed some documents with texts like “hello mister müller”.

Standard (no filter): wildcard query with müll* works.
Character filter with “regular expression = ü, replace = u”: wildcard query with mull* does NOT work.
Character filter with “regular expression = ü, replace = [empty]”: wildcard query with mll* works.
Character filter with “regular expression = e, replace = a”: wildcard query with hall* works.

So something seems to be wrong with the ü -> u replacement, while ü -> empty and e -> a work perfectly.

Am I doing something wrong? Or could this be a UTF-8 problem, because ü is a 2-byte character while u is 1 byte?
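To make the byte-length point concrete, here is a quick Python check. It is not Couchbase-specific and only confirms the encoding sizes, not the cause of the behaviour; the helper name utf8_len is made up for this example.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Compare the characters used in the three filter experiments.
for ch in ("ü", "u", "e", "a"):
    print(repr(ch), utf8_len(ch), ch.encode("utf-8").hex())

# The ü -> u replacement shortens the term by one byte,
# while e -> a keeps the byte length unchanged.
print(utf8_len("müller"), utf8_len("muller"))
```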

Thanks, Pascal


#2

I have now created an index with an edge_ngram token filter. It works as expected with a match query and is also faster than prefix queries… I’ll continue testing that for my use case.
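As a rough illustration of why this helps, the sketch below shows the kind of terms an edge n-gram filter produces for one token. It is plain Python, not Couchbase code, and the function name and the min_len/max_len defaults are assumptions chosen for the example.

```python
def edge_ngrams(token: str, min_len: int = 2, max_len: int = 10) -> list[str]:
    """Return the prefixes of token whose length lies between min_len and max_len."""
    upper = min(max_len, len(token))
    return [token[:n] for n in range(min_len, upper + 1)]

# Every prefix is indexed as its own term, so an exact term lookup
# for "mull" can find the document without a prefix scan.
print(edge_ngrams("muller"))
```

Since each prefix is stored as a separate term at index time, the prefix lookup becomes an ordinary term match, which fits the observation that it is faster than a prefix query.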