Full text search configuration for French names with accents

Hi,

I have a Couchbase Server with French names (with accents, like Jérémie), and I want to configure an FTS index with autocomplete-like features.

I am having trouble understanding how to configure my index so that it ignores accents. I am using the French analyzer with normalize_unicode as a filter, but how do I configure my index so that “je”, “jé”, “jérém”, “jérem”, “jrmie”, “jeremi”… all match “jérémie”?

Any hints?
Thanks a lot

PS: I have used ElasticSearch, so maybe I’m expecting too much from Couchbase? I’m not sure CB FTS is as performant as ES FTS.

Hey @jeremieburtin, Couchbase FTS does not natively support autocomplete, but it can be implemented within your application using some of the token filters that we provide, as highlighted in this article …

As for your question on how to configure your index to match the leading substrings (prefixes) of “jérémie”, I’d try setting up an index mapping over the field that carries the term, with a custom analyzer that uses the edge ngram token filter (min length: 2, max length: 7) and also the to_lower token filter (to ignore case).

Setting up such an index mapping would store the following tokens for the term “jérémie” …

 jé - 423 (1a7) posting byteSize: 18 cardinality: 1
 jér - 462 (1ce) posting byteSize: 18 cardinality: 1
 jéré - 501 (1f5) posting byteSize: 18 cardinality: 1
 jérém - 540 (21c) posting byteSize: 18 cardinality: 1
 jérémi - 579 (243) posting byteSize: 18 cardinality: 1
 jérémie - 618 (26a) posting byteSize: 18 cardinality: 1

Any of these tokens can be searched for.
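
To make the token list above concrete, here is a minimal Node.js sketch of what a front edge n-gram filter with min 2 and max 7 does to a lowercased term. The function name `edgeNgrams` is hypothetical and this is not the FTS implementation, only an illustration of the idea, assuming lengths are counted in characters rather than bytes.

```javascript
// Sketch of front edge n-grams: every prefix whose length is between min and max.
// Lengths are counted in Unicode code points, so "é" counts as one character.
function edgeNgrams(term, min, max) {
  const chars = Array.from(term.normalize('NFC').toLowerCase());
  const grams = [];
  for (let len = min; len <= Math.min(max, chars.length); len++) {
    grams.push(chars.slice(0, len).join(''));
  }
  return grams;
}

console.log(edgeNgrams('Jérémie', 2, 7));
// [ 'jé', 'jér', 'jéré', 'jérém', 'jérémi', 'jérémie' ]
```

A term longer than the max length only contributes prefixes up to that length, which is why the max should cover the longest prefix a user is likely to type.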

Note that if you’d like to match terms that aren’t indexed here but are close, like “jérem”, “jrmie”, “jeremi”, “je” etc., you could employ fuzziness (edit distance) within your match query.
Here’s a sample match query with a fuzziness factor of 2 …

{"query": {"match": "jeremi", "field": "name", "fuzziness": 2}}

or via a query string …

name:jeremi~2
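
To see what a fuzziness of 2 buys here, the following Node.js sketch computes a plain Levenshtein edit distance between the search text and two of the indexed tokens. The `levenshtein` helper is hypothetical and only illustrates the edit-distance idea; it assumes distance is counted per character and says nothing about how FTS evaluates fuzzy queries internally.

```javascript
// Classic dynamic-programming Levenshtein distance, counted in code points.
function levenshtein(a, b) {
  const s = Array.from(a.normalize('NFC'));
  const t = Array.from(b.normalize('NFC'));
  // prev[j] holds the distance between the first i-1 chars of s and the first j chars of t.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(levenshtein('jeremi', 'jérémi'));  // 2
console.log(levenshtein('jeremi', 'jérémie')); // 3
```

Under these assumptions, “jeremi” is within distance 2 of the indexed prefix “jérémi” but not of the full term “jérémie”, so the prefix tokens are what let the fuzzy match succeed.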

Here’s the sample index mapping …

{
  "type": "fulltext-index",
  "name": "default",
  "uuid": "",
  "sourceType": "couchbase",
  "sourceName": "default",
  "sourceUUID": "",
  "planParams": {
    "maxPartitionsPerPIndex": 171
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {
        "analyzers": {
          "custom_unicode": {
            "token_filters": [
              "to_lower",
              "edge_ngram_min_2_max_7"
            ],
            "tokenizer": "unicode",
            "type": "custom"
          }
        },
        "token_filters": {
          "edge_ngram_min_2_max_7": {
            "back": "false",
            "max": 7,
            "min": 2,
            "type": "edge_ngram"
          }
        }
      },
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": false,
        "enabled": true,
        "properties": {
          "name": {
            "dynamic": false,
            "enabled": true,
            "fields": [
              {
                "analyzer": "custom_unicode",
                "docvalues": true,
                "include_in_all": true,
                "include_term_vectors": true,
                "index": true,
                "name": "name",
                "store": true,
                "type": "text"
              }
            ]
          }
        }
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceParams": {}
}

Hey @jeremieburtin,

It would be a great help to have your feedback about these experiments with FTS, especially since you have tried ES.

Cheers!
Sreekanth

Hello @abhinav, thanks for your answer!

My configuration matches your example, so it seems I managed to understand how FTS works in couchbase :slight_smile:

I have more questions, though:
When using fuzziness, I get too many matches. For instance, amfkenfjemfjn (I just typed random letters) matches “Alfred”.
With the highlight option, I can see that alf is the part that matches. I understand that, but I would expect that adding random letters after it would prevent any match.

Second question: to get rid of the accents in French names, I was looking for an ASCII folding filter and found this merged PR on the Bleve GitHub repository (https://github.com/blevesearch/bleve/pull/1070). It is not yet available in Couchbase; do you know if it will be soon?

Third question (related to the previous one): without an ASCII folding filter, I tried using regexp character filters to transform “éèÉÈëËêË” into “e”, but the result seems to be “ee” instead. I found this post (Character Filter) where the person has a similar problem and asks:

Or could it be a problem with UTF-8, because ü is a 2-byte character while u is 1 byte?

Could that be the reason?

@sreeks: I’ll tell you more about my experience with Couchbase FTS once I have used it a little more :slight_smile:

Thanks a lot
Jérémie

Hi Jérémie,

The person in the other thread was me :slight_smile: After a lot of testing I ended up with a solution that works with type-ahead and finds jeremie or jérémie, but not jrmie (but I don’t really need that).

The solution was:

1.) Create token filter named “edge_ngram” with type edge_ngram (min 3, max 15)
2.) Create analyzer “store” with: to_lower, stop_fr, edge_ngram, stemmer_fr_light
3.) Create analyzer “query” with: to_lower, stemmer_fr_light
4.) Index all fields with analyzer “store”
5.) query using analyzer “query”

That does not work perfectly in every case (for example, because of the stemmer you sometimes get more results if you type more characters), but I’m happy with it so far.
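
For readers who want to see the five steps as configuration, here is a sketch of what the "analysis" part of the index mapping might look like, written as a JavaScript object and printed as JSON. The filter and analyzer names come from the steps above; the "unicode" tokenizer and the "back" setting are assumptions borrowed from the sample mapping earlier in the thread, not something this post states.

```javascript
// Hypothetical "analysis" section mirroring the five steps above.
// Assumption: the "unicode" tokenizer and the "back" value are taken
// from the sample mapping earlier in the thread.
const analysis = {
  token_filters: {
    edge_ngram: { type: 'edge_ngram', min: 3, max: 15, back: 'false' },
  },
  analyzers: {
    // Used at index time: produces the prefixes that type-ahead matches against.
    store: {
      type: 'custom',
      tokenizer: 'unicode',
      token_filters: ['to_lower', 'stop_fr', 'edge_ngram', 'stemmer_fr_light'],
    },
    // Used at query time: no edge_ngram, so the typed text is not split into prefixes.
    query: {
      type: 'custom',
      tokenizer: 'unicode',
      token_filters: ['to_lower', 'stemmer_fr_light'],
    },
  },
};

console.log(JSON.stringify(analysis, null, 2));
```

The point of the split is that only the indexed side is expanded into prefixes, so a query matches a document when the typed text itself is one of the stored prefixes.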

Pascal

Hey @jeremieburtin,

- “alf” would have fuzzy-matched “amf” at an edit distance of 1 when the first search request was triggered while you were typing.
As you rightly guessed, if you re-triggered the search with more characters to narrow the set of candidate terms, it would give more relevant results. The application decides when to refresh or retry the searches behind the text box and refreshes the options shown to the user.

- The ASCII folding filter would solve the accent problem you describe, and it will be available in the upcoming Couchbase 6.5.0 release (a beta is already available: https://docs.couchbase.com/server/6.5/release-notes/relnotes.html).
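
Until that filter is available, accents could also be removed on the application side before indexing and before querying. The sketch below uses standard Unicode decomposition in Node.js, which is a different technique from the FTS ASCII folding filter; the helper name `foldAccents` is hypothetical. The last line only shows that “é” occupies two bytes in UTF-8, as raised in the quoted question; it does not confirm that this is the cause of the “ee” result.

```javascript
// Decompose accented letters (NFD), then drop the combining marks.
function foldAccents(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(foldAccents('Jérémie'));   // Jeremie
console.log(foldAccents('éèÉÈëËêË'));  // eeEEeEeE

// "é" is one character but two bytes in UTF-8.
console.log(Buffer.byteLength('é'.normalize('NFC'), 'utf8')); // 2
```

This approach only handles letters that decompose into a base letter plus marks; ligatures such as “œ” are left unchanged.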

Cheers!
Sreekanth

@sreeks thanks :slight_smile:

@gizmo74 thanks for your answers too! I’ll try your example. One question, though: how do you specify which analyzer must be used when you query? I can’t find it in the docs right now.
EDIT: OK, I think I found the option to specify the analyzer. Using the API, I have to add "analyzer": "analyzer_name" in the "query" part.
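
As an illustration of that EDIT, here is a small Node.js sketch that builds such a request body. The helper name `buildSearchBody` is hypothetical, and the field name "name" and analyzer name "query" are just the examples used earlier in this thread.

```javascript
// Build the JSON body for a search request whose match query names its analyzer.
function buildSearchBody(text, field, analyzer) {
  return {
    query: { match: text, field, analyzer },
  };
}

const body = buildSearchBody('jeremi', 'name', 'query');
console.log(JSON.stringify(body));
// {"query":{"match":"jeremi","field":"name","analyzer":"query"}}
```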

Thanks again
Jérémie

The query part in Go is something like:

field = append(field, cbft.NewMatchQuery(v).Analyzer("query").Field("title"))

Using Node.js, I ended up doing this:

    // Specify which fields to search in
    const match = SearchQuery
        .match(search)
        // Allow a margin of error
        .fuzziness(1)
        // Use the "query" analyzer for the request
        .analyzer('pulse_query')
    ;

    // Search on the exact email
    const email = SearchQuery
        .term(search)
        .field('mefPatient.mefPatientEmail')
    ;

    // Prepare a filter on the agency
    const filterAgeId = SearchQuery.term(ageId).field('ageId');

    // Combine the queries
    const shouldQuery = SearchQuery.disjuncts(match, email);
    const mustQuery = SearchQuery.conjuncts(filterAgeId, shouldQuery);
    
    const booleanQuery = SearchQuery.boolean()
        .must(mustQuery);

    const query = SearchQuery.new(this._ftsMedicalFile, booleanQuery)
        //.highlight()
        // Returned fields
        .fields(
            'mftLabel',
            'mefNonCompliant',
            'mefNonCompliantForce',
            'mefStatus',
            'sysCredate',
            'mefPatient.mefPatientFirstName',
            'mefPatient.mefPatientLastName',
            'mefPatient.mefPatientDateOfBirth',
            'mefPatient.mefPatientEmail'
        ).limit(50);

In case someone else is struggling with FTS.

Does the solution with the 2 analyzers work for your needs?