Text search for the poor

.net
n1ql

#1

Hi All,

Don’t know if this is the right place but lets try…
I need a way to do text search but before you recommend me to use Elastic Search I will say that the only thing i need is to be able to search documents of users by **part **of their first/last name.

So apparently indexing is not working since the server need to do a full scan over the entire user documents and try to match first/last name that contains the characters…

Is there a way to do it or the only option will be Elastic Search ?

Thanks


#2

Would that be truly random parts or can they be identified beforehand (eg. by splitting on spaces, dashes, etc… to query on compound names or the like)?

For example do you want something along the lines of matching “Pinkman” with the criteria “ink”, or more like matching “Anna-Lucia” with the criteria “Lucia” (because Lucia is half of the compound name, bounded by “-”)?

I’m asking because I can imagine a way to do the second option: that would involve building an index that splits the firstnames and lastnames according to this limited set of rules (spaces, dashes, any relevant separator) and index each splitted element. I’m sure it can be done with views (having multiple emits), less sure how to do that with N1QL (any idea @geraldss?).

The first alternative though would indeed benefit from full text search. We are working on CBFT which aims at integrating full text search into Couchbase, but this is not yet production ready.


#3

Hi, there are three options from Couchbase —

  • CBFT, the full-text search feature in Developer Preview

  • limit your search to the beginning of the first name and / last name; match “Pinkman” on “Pin%” or “Pinkm%”. In this case, you can use a regular N1QL index, and use LIKE in your query

  • otherwise, there will be a full scan

Gerald


#4

Hi geraldss,

Thanks for the reply.
I’ve tried cbft but its not production ready so I’m unable to use it.
I’ve tried to use the LIKE ‘xxx%’ solution and I found something quite disturbing.
When doing this query from workbench, It take about **100 **msec:

SELECT usr.fullName AS FullName,
usr.dateOfBirth AS DateOfBirth,
usr.userId AS UserId,
usr.roles AS Roles
FROM default usr
WHERE usr.documentType = ‘User’ AND usr.fullName LIKE ‘james%’

Running the following query from .net client takes 20 seconds:

var queryRequest = new QueryRequest(
@“SELECT usr.fullName,
usr.dateOfBirth,
usr.userId,
usr.roles
FROM default usr
WHERE usr.documentType = $documentType AND
usr.fullName LIKE $textToSearch”)
.AddNamedParameter(“documentType”, DocumentType.User)
.AddNamedParameter(“textToSearch”, “james%”);

when I removed the NamedParams and use string.Format(), It takes **3 **seconds…
Can you tell me why is this huge difference happen ?

Thanks


#5

The major difference between the two are that when you use AddNamedParameter, the key/value making up the substitution are sent separately from the query in the HTTP request body; the actual substitution takes place on the server.

When you use String.Format (I am assuming this what you meant), then substitution will take place on the client and the statement will be sent to the server for execution.

From tests I did on my machine (I am making assumptions about what you meant here), he see fairly similar results for both (actually slightly slower for local substitution). For example:

 using (var bucket = _cluster.OpenBucket())
 {
     var queryRequest = new QueryRequest(
     @"SELECT usr.fullName,
     usr.dateOfBirth,
     usr.userId,
     usr.roles
     FROM default usr
     WHERE usr.documentType = $documentType AND
     usr.fullName LIKE $textToSearch")
                .AddNamedParameter("documentType", "user")
                .AddNamedParameter("textToSearch", "james%");

    for (int i = 0; i < 10; i++)
    {
             var stopWatch = new Stopwatch();
             stopWatch.Start();
             var result = bucket.Query<dynamic>(queryRequest);
             stopWatch.Stop();
             Console.WriteLine("Completed in {0}ms - {1}", stopWatch.ElapsedMilliseconds, result.Status);
     }
}

The results:

Completed in 2218ms - Success
Completed in 43ms - Success
Completed in 48ms - Success
Completed in 31ms - Success
Completed in 35ms - Success
Completed in 31ms - Success
Completed in 32ms - Success
Completed in 31ms - Success
Completed in 28ms - Success
Completed in 30ms - Success

using (var bucket = _cluster.OpenBucket())
            {
                var queryRequest = new QueryRequest(
                string.Format(@"SELECT usr.fullName,
                usr.dateOfBirth,
                usr.userId,
                usr.roles
                FROM default usr
                WHERE usr.documentType = '{0}' AND
                usr.fullName LIKE '{1}'", "user", "james%"));

                for (int i = 0; i < 10; i++)
                {
                    var stopWatch = new Stopwatch();
                    stopWatch.Start();
                    var result = bucket.Query<dynamic>(queryRequest);
                    stopWatch.Stop();
                    Console.WriteLine("Completed in {0}ms - {1}", stopWatch.ElapsedMilliseconds, result.Status);
                }
            }

The results for String.Format:

Completed in 2058ms - Success
Completed in 53ms - Success
Completed in 35ms - Success
Completed in 41ms - Success
Completed in 35ms - Success
Completed in 57ms - Success
Completed in 105ms - Success
Completed in 107ms - Success
Completed in 34ms - Success
Completed in 35ms - Success

The first request within a process will always take more time because the ServicePointManager is creating and initializing a pool for subsequent requests as part of the .NET stack.

If this isn’t what you meant by using String.Format, then LMK with details. Also, if you can recreate the issue then submit a bug report and include an example project and I will look into it.

-Jeff


#6

Hi,

Did the same test on local Couchbase instance with indexed ‘fullName’ field and I get completely different scores.
I tried to check with other queries but they all worked fine as you said, It’s just with this specific query. Maybe it’s something to do with the 'LIKE’statement cause all other queries I have don’t use it.


#7

@israelolcha -

I don’t think it would be the client here making the difference, since the work is all server side if your using named parameters, but I cannot tell without looking at your code. You could use Fiddler are something similar to determine the request/response time without measuring client overhead. That might help isolate the performance degradation.

-Jeff


#8

Hi Jeff
I took your advise and used fiddler to try and understand why there is a performance difference.
You can see from the image there are 502 every time the call is made to 10.0.0.6 which actually increase the overall time since there is a retry mechanism. The server looks OK from the dashboard… any suggestions?


#9

@israelolcha -

What is the body of the 502? I don’t know why you would get a “Bad Gateway” - that would be a server-side response. You might want to check the server logs. Perhaps @geraldss could shed some light here?

-Jeff


#10

Hi,

Well the node was unreachable which caused a delay so I replaced that machine.
But still the query has this odd behavior … Will try and look again.
Thanks anyway