Full text search for embedded words


#1

Given the following “test_doc”, I expected a full-text search for ‘tld’ to find this document. Searching for the entire string “http://server.org.tld/index.html”, for 3.45, or for ‘test’ works as expected. However, partial words (e.g. ‘tld’ or 3.4) return empty result sets. Am I expecting too much from ‘full text’ search? I found a prior post suggesting post-filtering all the results for ‘contains a substring’, but this seems to defeat the idea behind a full-text search.

test_doc:
{
  "f1": 1.2,
  "anobj": {
    "objf1": "3.45",
    "obst1": "anobj test text"
  },
  "i1": -1,
  "t2": "http://server.org.tld/index.html",
  "alist": [
    "one",
    "two"
  ],
  "t1": "text field t1",
  "type": "mytesttype"
}

I am using:
couchbase-server-4.6.2-3905
libcouchbase2-core-2.7.7-1
libcouchbase-devel-2.7.7-1


#2

Unfortunately, “full-text search” does not mean match any arbitrary sub-string anywhere in the document.

Instead, it is a way of applying techniques to match common variations of text. The specific techniques you choose depend on what you know about the data. If the text is English, we split it into words, convert them to lowercase, and apply English-language stemming. Obviously, those specific techniques do not help you find “tld” inside text containing a URL.
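
To illustrate why a term-based index misses 'tld', here is a minimal sketch in Python. It is not Couchbase's actual analyzer: the tokenizing rule (runs of letters and digits, kept together across single dots) is an assumption chosen to resemble word-boundary tokenizers, stemming is left out, and the names analyze, build_index and search are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a term is a run of letters/digits, optionally joined by single
# dots, so "server.org.tld" and "3.45" each stay one term.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def analyze(text):
    """Split text into terms and lowercase them (stemming omitted)."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def build_index(docs):
    """Build an inverted index mapping each term to the ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of docs that contain every term of the analyzed query."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Text values taken from test_doc, flattened into one string for brevity.
docs = {"test_doc": "http://server.org.tld/index.html 3.45 anobj test text"}
index = build_index(docs)

whole_word = search(index, "test")        # matches: "test" is an indexed term
whole_url = search(index, "http://server.org.tld/index.html")  # matches
partial = search(index, "tld")            # no match: only "server.org.tld" is indexed
partial_number = search(index, "3.4")     # no match: only "3.45" is indexed
```

The lookup is by whole term, so a query only succeeds when its analyzed terms equal terms that were stored at index time. A fragment of a stored term is simply a different key.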

If your goal was to match TLDs, there are probably some techniques you could use to make that work. But, if that was just coincidental in your example, and you want to match any arbitrary sub-string, then full-text search is not going to help much.

marty


#3

Marty,

Thanks for the explanation. Useful info.

In this case ‘tld’ was an arbitrary sub-string. So, it looks like post-filtering will be required here.
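
For anyone landing here later, a rough sketch of what such post-filtering could look like in Python, assuming the candidate documents have already been fetched and decoded into dicts. The helper names (iter_scalars, contains_substring, post_filter) are made up for illustration and are not part of any Couchbase SDK; only values are searched, not field names.

```python
def iter_scalars(value):
    """Yield every non-null scalar value in a decoded JSON document as a string."""
    if isinstance(value, dict):
        for child in value.values():
            yield from iter_scalars(child)
    elif isinstance(value, (list, tuple)):
        for child in value:
            yield from iter_scalars(child)
    elif value is not None:
        yield str(value)


def contains_substring(doc, needle, case_sensitive=False):
    """Return True if any scalar value in doc contains needle as a substring."""
    if not case_sensitive:
        needle = needle.lower()
    for text in iter_scalars(doc):
        haystack = text if case_sensitive else text.lower()
        if needle in haystack:
            return True
    return False


def post_filter(docs, needle):
    """Keep only the candidate documents that contain needle somewhere."""
    return [doc for doc in docs if contains_substring(doc, needle)]


test_doc = {
    "f1": 1.2,
    "anobj": {"objf1": "3.45", "obst1": "anobj test text"},
    "i1": -1,
    "t2": "http://server.org.tld/index.html",
    "alist": ["one", "two"],
    "t1": "text field t1",
    "type": "mytesttype",
}

candidates = [test_doc, {"t1": "unrelated"}]
matches = post_filter(candidates, "tld")  # keeps only test_doc
```

This scans every value of every candidate, so the cost grows with the size of the candidate set; it is only practical when some other query has already narrowed the results.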