FTS Index to find improperly concatenated terms or phrases

We are using Couchbase 6.0.4 and have an FTS index covering a string field that holds organization and/or common names.
We are seeking assistance with a scenario like the one below:

The name “Aero Dynamic Freight” might be imported as “Aerodynamic Freight” or “Aero Dynamicfreight” in addition to “Aero Dynamic Freight”.

If one enters the search phrase “Aero Dynamic Freight”, we need the index to find not only the exact match for the phrase but also the variants “Aero Dynamicfreight”, “Aerodynamic Freight”, and “Aerodynamicfreight”.

I look forward to your reply,

JG

hi @The_Cimmerian,

One way to achieve this is to use a custom analyser with a shingle-based token filter on the field you want to query.
Ref: https://docs.couchbase.com/server/current/fts/fts-using-analyzers.html#token-filters

With word shingles of min 2 / max 3 words and an empty separator, the text “aero dynamic freight” produces the tokens below, provided you choose to keep the original input tokens (an option when creating the custom shingle token filter in the UI).

aero dynamic freight => aero, dynamic, aerodynamic, freight, dynamicfreight, aerodynamicfreight
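
To make the token list above concrete, here is a minimal Python sketch that imitates what a shingle filter with min 2, max 3, an empty separator, and the original tokens kept would emit. The function name `shingle_tokens` is hypothetical, and the whitespace split plus lowercasing only approximates the unicode tokenizer and `to_lower` filter; it is not the actual FTS implementation.

```python
def shingle_tokens(text, min_size=2, max_size=3, separator="", output_original=True):
    """Approximate a word-shingle token filter (illustrative only)."""
    # Assumption: whitespace splitting and lowercasing stand in for the
    # unicode tokenizer and the to_lower filter of the real analyzer.
    words = text.lower().split()
    tokens = []
    for start in range(len(words)):
        if output_original:
            tokens.append(words[start])
        # Join runs of min_size..max_size consecutive words into one token.
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                tokens.append(separator.join(words[start:start + size]))
    return tokens


print(shingle_tokens("Aero Dynamic Freight"))
# ['aero', 'aerodynamic', 'aerodynamicfreight', 'dynamic', 'dynamicfreight', 'freight']
```

The output holds the same six tokens as the list above, just in a different order.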

Attaching a sample index definition for the field “fieldName” with the travel-sample bucket here.

{
  "type": "fulltext-index",
  "name": "IndexName",
  "sourceType": "couchbase",
  "sourceName": "travel-sample",
  "planParams": {
    "maxPartitionsPerPIndex": 1024,
    "indexPartitions": 1
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {
        "analyzers": {
          "custom_analyser": {
            "token_filters": [
              "shingle_2_3",
              "to_lower"
            ],
            "tokenizer": "unicode",
            "type": "custom"
          }
        },
        "token_filters": {
          "shingle_2_3": {
            "filler": "",
            "max": 3,
            "min": 2,
            "output_original": true,
            "separator": "",
            "type": "shingle"
          }
        }
      },
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "airline": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "fieldName": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "custom_analyser",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "fieldName",
                  "type": "text"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceParams": {}
}

And a match query ought to work in this case.

{"query": {"match": "Aero Dynamic Freight", "field": "fieldName"}}
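
For anyone who wants to send the match query above from a script rather than curl or Postman, here is a rough Python sketch using only the standard library. It assumes the FTS service listens on its default port 8094 and that the index is queried via `POST /api/index/{indexName}/query`; the host, credentials, and helper names (`build_search_request`, `run_search`) are placeholders, not taken from this thread.

```python
import base64
import json
import urllib.request


def build_search_request(host, index_name, text, field, username, password):
    """Build (but do not send) a field-scoped match query for the FTS REST API."""
    # Same body as the match query above; "field" scopes the query so the
    # field's own analyzer is applied to the query text.
    body = {"query": {"match": text, "field": field}}
    # Assumption: FTS is reachable over plain HTTP on the default port 8094.
    url = f"http://{host}:8094/api/index/{index_name}/query"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def run_search(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (placeholder host and credentials, needs a running cluster):
# req = build_search_request("localhost", "IndexName", "Aerodynamicfreight",
#                            "fieldName", "Administrator", "password")
# print(run_search(req))
```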

Thank you very much. We will give this a shot and report back.

JG

@sreeks - is there a means by which to preview how an FTS index stores text it processes and indexes?

There are two options:

  1. http://analysis.blevesearch.com/analysis
  2. https://docs.couchbase.com/server/current/rest-api/rest-fts-indexing.html#p-api-index-name-analyzeDoc
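
As a rough illustration of the second option, the Python sketch below builds a request for the analyzeDoc endpoint described at that link, posting a sample document so the index's mapping can analyze it. Treat the details as assumptions: the endpoint may not exist on every server version, the host and credentials are placeholders, and the helper name `build_analyze_request` is hypothetical.

```python
import base64
import json
import urllib.request


def build_analyze_request(host, index_name, document, username, password):
    """Build (but do not send) an analyzeDoc request for a sample document."""
    # Assumption: FTS default port 8094 and the analyzeDoc path from the linked docs.
    url = f"http://{host}:8094/api/index/{index_name}/analyzeDoc"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Example usage (needs a running cluster). The "type" value is included on the
# assumption that the index uses the type_field mapping shown earlier:
# doc = {"type": "airline", "fieldName": "Aero Dynamic Freight"}
# req = build_analyze_request("localhost", "IndexName", doc, "Administrator", "password")
# with urllib.request.urlopen(req) as response:
#     print(json.load(response))
```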

Cheers!

I will review this. Thank you as always.

JG

I finally found some time to implement your suggested index. I compared the index’s JSON output with the one you provided, ensuring all the settings matched.

I did change “IndexName” to the name of the index and “custom_analyser” to the name of the custom analyzer I created. The construction of the analyzer was identical to your JSON output.

Of course, I also updated the type name and field name to the name of the field upon which the index would be applied. In short, I changed only the names of the items.

I repeated the search from the example in my opening post. The objective is to return all the variations of “Aero Dynamic Freight”. The desired results would be:

“Aero Dynamic Freight”
“Aerodynamicfreight”
“Aerodynamic Freight”
“Aero Dynamicfreight”

Interchangeably, each of the above, when searched, should return all the other variants listed above. So, searching “Aerodynamicfreight” would return the same or highly similar results as those above.

Unfortunately, what was returned were only exact matches. “Aerodynamicfreight” only returned “Aerodynamicfreight” and “Aero Dynamicfreight” only returned “Aero Dynamicfreight”, none of the other variants.

JG

How did you try your searches? Was it from the FTS search page?
Can you try a match query like the one in the example above, if you have not tried it already?

Yes. I tried it from the FTS search page input box. I will try it from the REST API. Actually, I will try it first just as you suggested, in Postman.

JG

Hey!! I think you nailed it, my friend. I ran the query you provided in Postman and it worked!! I entered “Aerodynamicfreight” and received all the variations, in the order of similarity expected.

Thank you sooo much!

Glad to hear that @The_Cimmerian,

This is one important rule to remember in FTS:
When querying, you have to specify the target field in the query so that the same analyser that tokenised the field values during indexing is also used to analyse the query text. Otherwise the query will not work correctly.
Even from the UI, this would have worked if you had scoped it to the field, like fieldName: “query str”.
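
A small Python sketch of why the field's analyser matters at query time. It is a simplified model, not the FTS engine: the hypothetical `analyze` helper lowercases and splits on whitespace, and optionally adds 2- and 3-word shingles with an empty separator, mimicking the custom analyser from the index definition earlier in the thread.

```python
def analyze(text, shingle):
    """Return the token set for text, with or without shingling (simplified model)."""
    words = text.lower().split()
    tokens = set(words)
    if shingle:
        # Add concatenations of 2 and 3 consecutive words, as the shingle filter would.
        for start in range(len(words)):
            for size in (2, 3):
                if start + size <= len(words):
                    tokens.add("".join(words[start:start + size]))
    return tokens


# Tokens stored for a document, indexed with the shingle analyser.
indexed = analyze("Aero Dynamicfreight", shingle=True)

# Query text analysed with the same shingle analyser: one token in common.
print(analyze("Aerodynamic Freight", shingle=True) & indexed)   # {'aerodynamicfreight'}

# Query text analysed without shingles: nothing in common.
print(analyze("Aerodynamic Freight", shingle=False) & indexed)  # set()
```

In this model the variants only meet on the fully concatenated token, which the query side produces only when it is run through the shingle analyser as well.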

Cheers!


Yes, thank you. I was doing that. The query works in the SDK and via the REST API. It would not work in the console. I presume this is a limitation?

Anyway, your recommendations worked and I am grateful to you for it.

Cheers!

JG