Weird behavior from CB Lite Full Text Search on string lists

.net
#1

Hello,

I’ve been trying to use full text search queries with the .NET SDK. The N1QL approach works well against the online server, but the offline (Couchbase Lite) database gives weird responses.

My models look like this:

public class DummyModel
{
    public string Id { get; set; }
    public string SomeString { get; set; }
    public List<string> SomeList { get; set; }
}

I then add the following models to the database:

var top = new DummyModel()
{
	SomeString = "Top 100 movies",
	SomeList = new List<string>
	{
		"The Shawshank Redemption (1994)",
		"The Godfather (1972)",
		"The Godfather: Part II (1974)",
		"The Dark Knight (2008)",
		"12 Angry Men (1957)",
		"Schindler's List (1993)",
		"The Lord of the Rings: The Return of the King (2003)",
		"Pulp Fiction (1994)",
		"Avengers: Endgame (2019)",
		"The Good, the Bad and the Ugly (1966)",
	}
};

var topAction = new DummyModel()
{
	SomeString = "Top 100 Action movies",
	SomeList = new List<string>
	{
		"The Mountain II (2016)",
		"Avengers: Endgame (2019)",
		"The Dark Knight (2008)",
		"Inception (2010)",
		"The Matrix (1999)",
		"Star Wars: Episode V - The Empire Strikes Back (1980)",
		"Uri: The Surgical Strike (2019)",
		"Léon: The Professional (1994)",
		"Star Wars: Episode IV - A New Hope (1977)",
		"Dangal (2016)"
	}
};

var topThriller = new DummyModel()
{
	SomeString = "Top 100 Thriller movies",
	SomeList = new List<string>
	{
		"The Dark Knight (2008)",
		"Inception (2010)",
		"The Usual Suspects (1995)",
		"Se7en (1995)",
		"Léon: The Professional (1994)",
		"The Silence of the Lambs (1991)",
		"Andhadhun (2018)",
		"The Prestige (2006)",
		"The Departed (2006)",
		"Memento (2000)"
	}
};

var topHistory = new DummyModel()
{
	SomeString = "Top 100 History movies",
	SomeList = new List<string>
	{
		"Schindler's List (1993)",
		"Ayla: The Daughter of War (2017)",
		"Braveheart (1995)",
		"Amadeus (1984)",
		"Lawrence of Arabia (1962)",
		"Downfall (2004)",
		"Raise the Red Lantern (1991)",
		"The Message (1976)",
		"Andrei Rublev (1966)",
		"The Great Escape (1963)"
	}
};

And create full text indices like so:

var fullTextFields = new List<string>{ "SomeList" };
if (!fullTextFields.Any()) return;
var indices = new List<IFullTextIndexItem>();
foreach (var field in fullTextFields)
{
	indices.Add(FullTextIndexItem.Property(field));
}
var ftsIndex = IndexBuilder.FullTextIndex(indices.ToArray()).IgnoreAccents(false);

bucket.Context.CreateIndex("someIndex", ftsIndex);

Searching for “Avengers” gives one result, although it should be two. Searching for “Pulp Fiction” or “Pulp” gives zero results, but it should be one. Searching for “shawshank” gives one result, as expected.

Here’s the full result for each word in the database:

the, 0, 
shawshank, 1, Top 100 movies
redemption, 1, Top 100 movies
(1994), 1, Top 100 movies
godfather, 0, 
(1972), 0, 
godfather:, 0, 
part, 0, 
ii, 1, Top 100 Action movies
(1974), 0, 
dark, 2, Top 100 Action movies, Top 100 Thriller movies
knight, 2, Top 100 Action movies, Top 100 Thriller movies
(2008), 2, Top 100 Action movies, Top 100 Thriller movies
12, 0, 
angry, 0, 
men, 0, 
(1957), 0, 
schindler's, 1, Top 100 History movies
list, 1, Top 100 History movies
(1993), 1, Top 100 History movies
lord, 0, 
of, 0, 
the, 0, 
rings:, 0, 
return, 0, 
king, 0, 
(2003), 0, 
pulp, 0, 
fiction, 0, 
avengers:, 1, Top 100 Action movies
endgame, 1, Top 100 Action movies
(2019), 4, Top 100 movies, Top 100 Action movies, Top 100 Thriller movies, Top 100 History movies
good,, 0, 
bad, 0, 
and, 0, 
ugly, 0, 
(1966), 0, 
mountain, 1, Top 100 Action movies
(2016), 1, Top 100 Action movies
inception, 2, Top 100 Action movies, Top 100 Thriller movies
(2010), 2, Top 100 Action movies, Top 100 Thriller movies
matrix, 1, Top 100 Action movies
(1999), 1, Top 100 Action movies
star, 0, 
wars:, 0, 
episode, 0, 
v, 0, 
-, 0, 
empire, 0, 
strikes, 0, 
back, 0, 
(1980), 0, 
uri:, 0, 
surgical, 0, 
strike, 0, 
léon:, 0, 
professional, 0, 
iv, 0, 
a, 0, 
new, 0, 
hope, 0, 
(1977), 0, 
dangal, 0, 
ayla:, 0, 
daughter, 0, 
war, 0, 
(2017), 0, 
braveheart, 0, 
(1995), 1, Top 100 Thriller movies
amadeus, 0, 
(1984), 0, 
lawrence, 0, 
arabia, 0, 
(1962), 0, 
downfall, 0, 
(2004), 0, 
raise, 0, 
red, 0, 
lantern, 0, 
(1991), 0, 
message, 0, 
(1976), 0, 
andrei, 0, 
rublev, 0, 
great, 0, 
escape, 0, 
(1963), 0, 
usual, 1, Top 100 Thriller movies
suspects, 1, Top 100 Thriller movies
se7en, 0, 
silence, 0, 
lambs, 0, 
andhadhun, 0, 
(2018), 0, 
prestige, 0, 
(2006), 0, 
departed, 0, 
memento, 0, 
(2000), 0, 

I really don’t understand the pattern here. With the online database, it works as expected.

This is the search function, by the way:

var ftsExpression = FullTextExpression.Index(_index).Match($"'{_term}'");

var query = QueryBuilder.Select(SelectResult.Expression(Meta.ID))
	.From(DataSource.Database(bucket.Context))
	.Where(ftsExpression)
	.Limit(Expression.Int(_limiting.Limit == 0 ? 1000 : _limiting.Limit), Expression.Int(_limiting.Offset));

var res = query.Execute();
foreach (var result in res)
{
	var id = result.GetString("id");
	var model = someFunctionToGetModel(id);
	yield return JsonConvert.DeserializeObject<ttv>(JsonConvert.SerializeObject(model));
}

Thanks in advance.

#2

I don’t think FTS in CBL will work correctly with a property whose value isn’t a string. Looking through the code, it seems like the internal encoded value will be fed directly into the FTS indexer, which in the case of an array won’t tokenize into words very well.

I’ve filed an issue on this: https://github.com/couchbase/couchbase-lite-core/issues/772

The workaround would be to add another property to the document that’s a single string containing all the text, and to index that property.
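
A rough sketch of that workaround, assuming the DummyModel from the first post. The SomeListText property name is made up here, and how it ends up stored on the document depends on your own mapping code.

```csharp
using System.Collections.Generic;

public class DummyModel
{
    public string Id { get; set; }
    public string SomeString { get; set; }
    public List<string> SomeList { get; set; } = new List<string>();

    // Hypothetical extra property (not part of the original model): all
    // entries of SomeList flattened into one space-separated string, so the
    // FTS indexer receives plain text instead of an encoded array.
    public string SomeListText =>
        SomeList == null ? string.Empty : string.Join(" ", SomeList);
}
```

The full text index would then be created on "SomeListText" instead of "SomeList", using the same FullTextIndexItem.Property(...) call as in the first post.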

#3

Thanks for the suggestion. If I join those strings into another field, it works. But FTS in CB Lite still has some weird issues. For example, this is the online database’s answer for each word in those lists:

the, 0, 
shawshank, 1, Top 100 movies
redemption, 1, Top 100 movies
(1994), 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
godfather, 1, Top 100 movies
(1972), 1, Top 100 movies
part, 1, Top 100 movies
ii, 2, Top 100 Action movies, Top 100 movies
(1974), 1, Top 100 movies
dark, 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
knight, 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
(2008), 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
12, 1, Top 100 movies
angry, 1, Top 100 movies
men, 1, Top 100 movies
(1957), 1, Top 100 movies
schindler's, 2, Top 100 History movies, Top 100 movies
list, 2, Top 100 History movies, Top 100 movies
(1993), 2, Top 100 History movies, Top 100 movies
lord, 1, Top 100 movies
of, 0, 
the, 0, 
rings, 1, Top 100 movies
return, 1, Top 100 movies
king, 1, Top 100 movies
(2003), 1, Top 100 movies
pulp, 1, Top 100 movies
fiction, 1, Top 100 movies
avengers, 2, Top 100 Action movies, Top 100 movies
endgame, 2, Top 100 Action movies, Top 100 movies
(2019), 2, Top 100 Action movies, Top 100 movies
good,, 1, Top 100 movies
bad, 1, Top 100 movies
and, 0, 
ugly, 1, Top 100 movies
(1966), 2, Top 100 History movies, Top 100 movies
mountain, 1, Top 100 Action movies
(2016), 1, Top 100 Action movies
inception, 2, Top 100 Action movies, Top 100 Thriller movies
(2010), 2, Top 100 Action movies, Top 100 Thriller movies
matrix, 1, Top 100 Action movies
(1999), 1, Top 100 Action movies
star, 1, Top 100 Action movies
wars, 1, Top 100 Action movies
episode, 1, Top 100 Action movies
v, 1, Top 100 Action movies
empire, 1, Top 100 Action movies
strikes, 1, Top 100 Action movies
back, 1, Top 100 Action movies
(1980), 1, Top 100 Action movies
uri, 1, Top 100 Action movies
surgical, 1, Top 100 Action movies
strike, 1, Top 100 Action movies
léon, 2, Top 100 Action movies, Top 100 Thriller movies
professional, 2, Top 100 Action movies, Top 100 Thriller movies
iv, 1, Top 100 Action movies
a, 0, 
new, 1, Top 100 Action movies
hope, 1, Top 100 Action movies
(1977), 1, Top 100 Action movies
dangal, 1, Top 100 Action movies
ayla, 1, Top 100 History movies
daughter, 1, Top 100 History movies
war, 1, Top 100 History movies
(2017), 1, Top 100 History movies
braveheart, 1, Top 100 History movies
(1995), 2, Top 100 History movies, Top 100 Thriller movies
amadeus, 1, Top 100 History movies
(1984), 1, Top 100 History movies
lawrence, 1, Top 100 History movies
arabia, 1, Top 100 History movies
(1962), 1, Top 100 History movies
downfall, 1, Top 100 History movies
(2004), 1, Top 100 History movies
raise, 1, Top 100 History movies
red, 1, Top 100 History movies
lantern, 1, Top 100 History movies
(1991), 2, Top 100 History movies, Top 100 Thriller movies
message, 1, Top 100 History movies
(1976), 1, Top 100 History movies
andrei, 1, Top 100 History movies
rublev, 1, Top 100 History movies
great, 1, Top 100 History movies
escape, 1, Top 100 History movies
(1963), 1, Top 100 History movies
usual, 1, Top 100 Thriller movies
suspects, 1, Top 100 Thriller movies
se7en, 1, Top 100 Thriller movies
silence, 1, Top 100 Thriller movies
lambs, 1, Top 100 Thriller movies
andhadhun, 1, Top 100 Thriller movies
(2018), 1, Top 100 Thriller movies
prestige, 1, Top 100 Thriller movies
(2006), 1, Top 100 Thriller movies
departed, 1, Top 100 Thriller movies
memento, 1, Top 100 Thriller movies
(2000), 1, Top 100 Thriller movies

While this is the offline answer:

the, 0, 
shawshank, 1, Top 100 movies
redemption, 1, Top 100 movies
(1994), 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
godfather, 1, Top 100 movies
(1972), 1, Top 100 movies
part, 1, Top 100 movies
ii, 2, Top 100 Action movies, Top 100 movies
(1974), 1, Top 100 movies
dark, 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
knight, 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
(2008), 3, Top 100 Action movies, Top 100 movies, Top 100 Thriller movies
12, 1, Top 100 movies
angry, 1, Top 100 movies
men, 1, Top 100 movies
(1957), 1, Top 100 movies
schindler's, 2, Top 100 History movies, Top 100 movies
list, 2, Top 100 History movies, Top 100 movies
(1993), 2, Top 100 History movies, Top 100 movies
lord, 1, Top 100 movies
of, 0, 
the, 0, 
rings, 1, Top 100 movies
return, 1, Top 100 movies
king, 1, Top 100 movies
(2003), 1, Top 100 movies
pulp, 1, Top 100 movies
fiction, 1, Top 100 movies
avengers, 2, Top 100 Action movies, Top 100 movies
endgame, 2, Top 100 Action movies, Top 100 movies
(2019), 4, Top 100 Action movies, Top 100 History movies, Top 100 movies, Top 100 Thriller movies
good,, 1, Top 100 movies
bad, 1, Top 100 movies
and, 0, 
ugly, 1, Top 100 movies
(1966), 2, Top 100 History movies, Top 100 movies
mountain, 1, Top 100 Action movies
(2016), 1, Top 100 Action movies
inception, 2, Top 100 Action movies, Top 100 Thriller movies
(2010), 2, Top 100 Action movies, Top 100 Thriller movies
matrix, 1, Top 100 Action movies
(1999), 1, Top 100 Action movies
star, 1, Top 100 Action movies
wars, 2, Top 100 Action movies, Top 100 History movies
episode, 1, Top 100 Action movies
v, 1, Top 100 Action movies
empire, 1, Top 100 Action movies
strikes, 1, Top 100 Action movies
back, 1, Top 100 Action movies
(1980), 1, Top 100 Action movies
uri, 1, Top 100 Action movies
surgical, 1, Top 100 Action movies
strike, 1, Top 100 Action movies
léon, 2, Top 100 Action movies, Top 100 Thriller movies
professional, 2, Top 100 Action movies, Top 100 Thriller movies
iv, 1, Top 100 Action movies
a, 0, 
new, 1, Top 100 Action movies
hope, 1, Top 100 Action movies
(1977), 1, Top 100 Action movies
dangal, 1, Top 100 Action movies
ayla, 1, Top 100 History movies
daughter, 1, Top 100 History movies
war, 2, Top 100 Action movies, Top 100 History movies
(2017), 1, Top 100 History movies
braveheart, 1, Top 100 History movies
(1995), 2, Top 100 History movies, Top 100 Thriller movies
amadeus, 1, Top 100 History movies
(1984), 1, Top 100 History movies
lawrence, 1, Top 100 History movies
arabia, 1, Top 100 History movies
(1962), 1, Top 100 History movies
downfall, 1, Top 100 History movies
(2004), 1, Top 100 History movies
raise, 1, Top 100 History movies
red, 1, Top 100 History movies
lantern, 1, Top 100 History movies
(1991), 2, Top 100 History movies, Top 100 Thriller movies
message, 1, Top 100 History movies
(1976), 1, Top 100 History movies
andrei, 1, Top 100 History movies
rublev, 1, Top 100 History movies
great, 1, Top 100 History movies
escape, 1, Top 100 History movies
(1963), 1, Top 100 History movies
usual, 1, Top 100 Thriller movies
suspects, 1, Top 100 Thriller movies
se7en, 1, Top 100 Thriller movies
silence, 1, Top 100 Thriller movies
lambs, 1, Top 100 Thriller movies
andhadhun, 1, Top 100 Thriller movies
(2018), 1, Top 100 Thriller movies
prestige, 1, Top 100 Thriller movies
(2006), 1, Top 100 Thriller movies
departed, 1, Top 100 Thriller movies
memento, 1, Top 100 Thriller movies
(2000), 1, Top 100 Thriller movies

Notice that for “wars”, offline returns two results, one containing Star Wars and the other Ayla: The Daughter of War, so it applies English stemming even though I didn’t specify a language. And for (2019) it returns four results where it should return two; I’d be interested to know why. I don’t know whether parentheses have a special meaning in offline FTS.

Thanks again for your help, @jens

#4

Hm. Couchbase Lite Core (which I work on) doesn’t do any stemming unless the index options specify a language. Couchbase Lite may be setting the language option by default. @Sandy_Chuang or @borrrden may know the details of that.

I’m pretty sure the tokenizer ignores parentheses, so searching for “(1994)” should be the same as searching for “1994” unless, as you say, the SQLite FTS query syntax has some special meaning for parentheses.
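
One way to take the query-syntax question out of the picture while testing is to strip punctuation from the term before it reaches Match. This is only a sketch: the SearchTerms helper is hypothetical, and it has not been checked against what the CBL tokenizer actually does with these characters.

```csharp
using System;
using System.Linq;

public static class SearchTerms
{
    // Replaces every character that is not a letter or digit with a space,
    // then collapses runs of spaces. Note that apostrophes are dropped too,
    // so "schindler's" turns into the two words "schindler s".
    public static string StripPunctuation(string term)
    {
        if (string.IsNullOrWhiteSpace(term)) return string.Empty;

        var cleaned = new string(
            term.Select(c => char.IsLetterOrDigit(c) ? c : ' ').ToArray());

        return string.Join(" ",
            cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

With this in place, the search from the first post would call FullTextExpression.Index(_index).Match(SearchTerms.StripPunctuation(_term)). If “(2019)” and “2019” still return different results after that, the parentheses are probably not the cause.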

#5

On Couchbase Lite .NET, the default value for the locale inside a full text index is whatever is returned by CultureInfo.CurrentCulture.TwoLetterISOLanguageName.
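
To avoid depending on the machine's culture, the language can be set on the index explicitly. The sketch below assumes the Couchbase Lite .NET 2.x builder exposes SetLanguage next to IgnoreAccents; the FtsIndexes class and its method names are hypothetical. Passing null may switch stemming off altogether, but that is an assumption worth checking against your SDK version.

```csharp
using System.Globalization;
using Couchbase.Lite.Query;

public static class FtsIndexes
{
    // The value the index would pick up by default, per the post above.
    public static string DefaultLanguage() =>
        CultureInfo.CurrentCulture.TwoLetterISOLanguageName;

    // Builds the index from the first post with an explicit language
    // instead of relying on the current culture.
    public static IFullTextIndex BuildSomeListIndex(string language)
    {
        return IndexBuilder
            .FullTextIndex(FullTextIndexItem.Property("SomeList"))
            .IgnoreAccents(false)
            .SetLanguage(language); // assumed API, e.g. "en"
    }
}
```

Logging FtsIndexes.DefaultLanguage() on the device shows which language the index would otherwise use, which should explain the English stemming seen for “wars” and “war”.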