Wild card query not giving the desired result

Phuong_Hoang · August 1, 2019, 8:17pm

Hi,

I am running into a problem using Full Text Search for this data set.

[
    {
        "type": "hdr-type",
        "value": "Dense"
    },
    {
        "type": "hdr-type",
        "value": "Moby D"
    },
    {
        "type": "hdr-type",
        "value": "Land"
    },
    {
        "type": "hdr-type",
        "value": "Sample"
    }
]

This is the query that I use for this data set.

{
    "size": 10,
    "from": 0,
    "query": {
        "conjuncts": [
            {
                "field": "type",
                "match_phrase": "hdr-type"
            },
            {
                "disjuncts": [
                    {
                        "field": "valueMatch",
                        "match": "l*d"
                    },
                    {
                        "field": "valueMatch",
                        "wildcard": "l*d*"
                    }
                ]
            }
        ]
    }
}

With this data set and query, I expect the result to contain only the document with value = Land. However, I get back all 4 documents. Any suggestions about what I did wrong are welcome.

Note:
This is my index definition. I use an ngram token filter because I also want to support searching for a single character, i.e. searching for the character d should give back the documents with Moby D, Land, and Dense.

{
    "name": "fts_index",
    "params": {
        "doc_config": {
            "docid_prefix_delim": "",
            "docid_regexp": "",
            "mode": "type_field",
            "type_field": "type"
        },
        "mapping": {
            "analysis": {
                "analyzers": {
                    "keyword-tolower": {
                        "token_filters": [
                            "to_lower"
                        ],
                        "tokenizer": "single",
                        "type": "custom"
                    },
                    "standard-no-stop": {
                        "token_filters": [
                            "to_lower",
                            "token-filter-ngram"
                        ],
                        "tokenizer": "unicode",
                        "type": "custom"
                    }
                },
                "token_filters": {
                    "token-filter-ngram": {
                        "max": 20,
                        "min": 1,
                        "type": "ngram"
                    }
                }
            },
            "default_analyzer": "standard",
            "default_datetime_parser": "dateTimeOptional",
            "default_field": "_all",
            "default_mapping": {
                "dynamic": true,
                "enabled": false
            },
            "default_type": "_default",
            "docvalues_dynamic": true,
            "index_dynamic": true,
            "store_dynamic": false,
            "type_field": "_type",
            "types": {
                "hdr-type": {
                    "default_analyzer": "keyword-tolower",
                    "dynamic": false,
                    "enabled": true,
                    "properties": {
                        "type": {
                            "dynamic": false,
                            "enabled": true,
                            "fields": [
                                {
                                    "analyzer": "keyword-tolower",
                                    "include_term_vectors": true,
                                    "index": true,
                                    "name": "type",
                                    "type": "text"
                                }
                            ]
                        },
                        "value": {
                            "dynamic": false,
                            "enabled": true,
                            "fields": [
                                {
                                    "analyzer": "standard-no-stop",
                                    "include_term_vectors": true,
                                    "index": true,
                                    "name": "valueMatch",
                                    "type": "text"
                                },
                                {
                                    "analyzer": "keyword-tolower",
                                    "include_term_vectors": true,
                                    "index": true,
                                    "name": "value",
                                    "type": "text"
                                }
                            ]
                        }
                    }
                }
            }
        },
        "store": {
            "indexType": "scorch",
            "kvStoreName": ""
        }
    },
    "planParams": {
        "maxPartitionsPerPIndex": 171,
        "numReplicas": 0
    },
    "sourceName": "configuration",
    "sourceParams": {},
    "sourceType": "couchbase",
    "sourceUUID": "35abc23f87205e885d7423a1c4916e0f",
    "type": "fulltext-index",
    "uuid": "19ab3aa6834410d7"
}

abhinav · August 1, 2019, 10:12pm

@Phuong_Hoang I’m not sure what your intention with the first query within your disjuncts clause is, but it is the culprit…

{
    "field": "valueMatch",
    "match": "l*d"
}

What it does is apply the standard-no-stop analyzer (from the valueMatch field) to the match term “l*d”, which generates the tokens {“l”, “d”}. The valueMatch field of every document is analyzed with the same standard-no-stop analyzer, which has the to_lower and token-filter-ngram token filters. This means the match query matches all documents, because every document has either “l” or “d” among the tokens generated for the field “valueMatch”.
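To make the explanation above concrete, here is a minimal Python sketch that imitates the standard-no-stop analyzer. It is not bleve code: the `analyze` function is a hypothetical stand-in, and it assumes the unicode tokenizer can be approximated by splitting on non-word characters.

```python
import re

def analyze(text, min_n=1, max_n=20):
    """Rough stand-in for standard-no-stop: split into words, lowercase, emit n-grams."""
    terms = set()
    for word in re.findall(r"\w+", text):  # approximation of the unicode tokenizer
        word = word.lower()                # to_lower
        for n in range(min_n, min(max_n, len(word)) + 1):  # token-filter-ngram (1..20)
            for i in range(len(word) - n + 1):
                terms.add(word[i:i + n])
    return terms

values = ["Dense", "Moby D", "Land", "Sample"]

# The match query analyzes its input too, so "*" is dropped and "l*d" becomes {"l", "d"}.
query_terms = analyze("l*d")

# A match query hits a document if any query token is among the document's tokens.
matched = [v for v in values if analyze(v) & query_terms]

print(sorted(query_terms))  # ['d', 'l']
print(matched)              # ['Dense', 'Moby D', 'Land', 'Sample']
```

Every value contains an “l” or a “d”, so all four documents come back, which is the behavior reported in the original post.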

Phuong_Hoang · August 2, 2019, 2:09pm

Hi @abhinav ,

I have a library that generates the query from the search term that I supply. So if the search term is daniel, the query becomes

{
    "size": 10,
    "from": 0,
    "query": {
        "conjuncts": [
            {
                "field": "type",
                "match_phrase": "hdr-type"
            },
            {
                "disjuncts": [
                    {
                        "field": "valueMatch",
                        "match": "daniel"
                    },
                    {
                        "field": "valueMatch",
                        "wildcard": "daniel*"
                    }
                ]
            }
        ]
    }
}

So the intention is to search for the term daniel anywhere in the text and anywhere daniel is a prefix of a word. I now realize that, with the analyzer I defined, this does not work when the term contains wildcard symbols. Thanks for pointing out the problem quickly.
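One possible way to adapt such a query generator is sketched below in Python. The `build_query` function and its parameters are hypothetical (the actual library is not shown in this thread); the idea is to leave out the match clause whenever the search term contains a wildcard symbol, so the analyzer never gets a chance to break the term into “l” and “d”. The term is lowercased on the assumption that wildcard queries are not analyzed and therefore have to match the lowercased index terms directly.

```python
import json

WILDCARD_CHARS = ("*", "?")  # "*" = any sequence of characters, "?" = a single character

def build_query(term, size=10, offset=0):
    """Hypothetical generator: skip the match clause for terms that contain wildcards."""
    term = term.lower()  # assumption: wildcard terms are compared to lowercased index terms
    if any(c in term for c in WILDCARD_CHARS):
        # Wildcard term: use it as-is in a wildcard query only.
        clauses = [{"field": "valueMatch", "wildcard": term}]
    else:
        # Plain term: keep the original match + prefix-wildcard pair.
        clauses = [
            {"field": "valueMatch", "match": term},
            {"field": "valueMatch", "wildcard": term + "*"},
        ]
    return {
        "size": size,
        "from": offset,
        "query": {
            "conjuncts": [
                {"field": "type", "match_phrase": "hdr-type"},
                {"disjuncts": clauses},
            ]
        },
    }

print(json.dumps(build_query("l*d"), indent=4))
```

For daniel this produces the same query as shown above; for l*d the disjuncts list contains only the wildcard clause.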

This brings me to a couple of questions:

1. Is there any documentation describing the output of the response JSON when the explain is enabled? The documentation on this page (link here) does not say how I should interpret the result. I used the explain for my query but the response is very hard to understand.
2. I think that the analyzer treats * symbol as a character in the match query. You said otherwise. Is there a way to check what the analyzer will do for a given text?
3. Is my FTS index configured properly for my use cases (see below) and I only have to use the correct query?
- Using a single character d should give back results where the text contains the character d
- Using a search term land should give back the results containing land
- Using wild card in search term, e.g. l*d, should give back result containing land

abhinav · August 2, 2019, 6:58pm

Is there any documentation describing the output of the response JSON when the explain is enabled? The documentation on this page (link here) does not say how I should interpret the result. I used the explain for my query but the response is very hard to understand.

When explain is enabled, what is explained is how scoring from the subqueries affects the actual score of the hits returned. The “message” section will indicate how the aggregation of the sub queries is done (for example “sum of” or “product of”). I understand that the documentation doesn’t really cover this section - I’ve filed an internal ticket for us to improve it with the next release.

I think that the analyzer treats * symbol as a character in the match query. You said otherwise. Is there a way to check what the analyzer will do for a given text?

Yes, that’s correct. Feel free to use this service we host to understand the behavior of various analyzers: http://bleveanalysis.couchbase.com/analysis. You can also set up custom analyzers there and observe their behavior.

Is my FTS index configured properly for my use cases (see below) and I only have to use the correct query?

Using a single character d should give back results where the text contains the character d

Using a search term land should give back the results containing land

Using wild card in search term, e.g. l*d , should give back result containing land

Yes, your index mapping would match all 3 search terms above. Note that your ngram token filter has a max of 20, so it wouldn’t match terms longer than 20 characters.
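The three use cases and the length limit can be checked against a simplified model of the index terms. This Python sketch is only an approximation, not the FTS engine: `index_terms`, `term_hits` and `wildcard_hits` are hypothetical helpers, search text is looked up as an exact index term (the way a non-analyzed term query would), and the wildcard is emulated with Python's `fnmatch`.

```python
import re
import string
from fnmatch import fnmatchcase

def index_terms(text, min_n=1, max_n=20):
    """Approximate the terms stored for valueMatch: lowercase n-grams (1..20) of each word."""
    terms = set()
    for word in re.findall(r"\w+", text.lower()):
        for n in range(min_n, min(max_n, len(word)) + 1):
            terms.update(word[i:i + n] for i in range(len(word) - n + 1))
    return terms

docs = ["Dense", "Moby D", "Land", "Sample"]
index = {d: index_terms(d) for d in docs}

def term_hits(term):
    """Documents whose term set contains the exact term."""
    return [d for d in docs if term in index[d]]

def wildcard_hits(pattern):
    """Documents with at least one term matching the wildcard pattern."""
    return [d for d in docs if any(fnmatchcase(t, pattern) for t in index[d])]

print(term_hits("d"))        # ['Dense', 'Moby D', 'Land']
print(term_hits("land"))     # ['Land']
print(wildcard_hits("l*d"))  # ['Land']

# A 21-character word is never stored whole, because n-grams stop at length 20.
long_word = string.ascii_lowercase[:21]
print(long_word in index_terms(long_word))  # False
```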