Full text search using fuzzy queries with multiple threads

java

#1

Hello,

I'm trying to run a full text search query using the Java API 2.5.9 on Couchbase Server 5.0.1 Community Edition. The request contains many MatchQueries with fuzziness set to 2, wrapped in a boolean query: 4 MatchQueries without fuzziness in the boolean query's must clause, and another set of 5-6 MatchQueries in its should clause. I have noticed that the more MatchQueries we add for fuzzy search, the longer FTS takes. The indexes are dynamic on a JSON object field and use the standard analyzer. Also, when I run tests hitting the FTS service from multiple threads (200+), I see CPU usage reach 99% on multiple nodes (there are 4 nodes in the cluster, all of which support full text search), and I get a RequestCancelledException for many requests. Can you help me with any configuration needed for such a high load?

Regards,
c.


#2

Hey, I'm not sure whether the query is expressed in the best way to get the right search results. But keep in mind that every fuzzy query ends up computing edit distances across a large number of the terms indexed for that field, and once we have the set of terms within the edit distance of a given term, we trigger searches on all of those candidate terms. So internally, every fuzzy query is a lot of work.
All this effort compounds as you add more fuzzy queries to a search request. And if you have a large number of documents or a lot of content indexed, this results in heavy computation on the FTS side. So heavy CPU usage for 200+ concurrent queries isn’t that surprising.
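
To make that cost a bit more concrete, here is a rough, brute-force sketch of what "find all terms within the edit distance, then search each of them" means. It is only an illustration and not how FTS implements it internally; the class name, method names, and the tiny term dictionary are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a brute-force stand-in for fuzzy term expansion.
// The names and the sample dictionary are hypothetical, not part of any Couchbase API.
class FuzzyExpansionSketch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Every indexed term within `fuzziness` edits becomes a candidate,
    // and each candidate would then need its own term search.
    static List<String> candidates(String term, List<String> indexedTerms, int fuzziness) {
        List<String> out = new ArrayList<>();
        for (String indexed : indexedTerms) {
            if (editDistance(term, indexed) <= fuzziness) {
                out.add(indexed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> indexedTerms =
                Arrays.asList("smith", "smyth", "smiths", "smithe", "south", "jones");
        // One fuzzy term fans out into several candidate term searches.
        System.out.println(candidates("smith", indexedTerms, 2));
    }
}
```

Even in this toy dictionary a single term with fuzziness 2 fans out into several candidates, so a request carrying 5-6 fuzzy match queries multiplies that work, and 200+ concurrent requests multiply it again.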

I'm not sure of your exact use case, but the first recommendation would be to cross-check your query and adjust it to get the best performance and results.

As for the request cancelled exceptions, are you seeing any errors like “TooManyClauses”?
Are you on an MDS-enabled cluster of 4 nodes, or do those nodes host other services as well?

regards,


#3

I don't see any TooManyClauses errors, but I will double-check. You are right that the more fuzzy search candidates a query has, the more time and resources it takes. The thing is, our current system runs similar queries under a similar load on Lucene, and it runs perfectly with very low latency (2 nodes running the query service using Lucene). I am trying to move from CouchDB to Couchbase, and one of our major use cases involves full text search, so I was trying to see how this works with Couchbase.
The 4 nodes I have set up run all services (data, index, full text and query).

Regards,
c