Can PrefixQuery work with a single-character search string?


#1

Hello.

We are unable to get the expected results when running a PrefixQuery with a search string that contains a single character. However, we do get the expected results when the search string is two or more characters.

The same thing appears to happen for RegexpQuery and WildcardQuery.

Is there a requirement for search strings/terms to be at least two characters?

[We are building the query with SearchQuery.prefix().field(). Our FTS index uses a custom analyzer with the “unicode” tokenizer and the “to_lower” filter.]

Thanks very much for your time.
Jeff


#2

There is no specific limitation to prevent a prefix query of a single letter, but there is a common reason the query types you describe are failing.

Internally, prefix, regexp and wildcard queries all have something in common. Based on the input, they find a number of terms in the dictionary (t1, t2, t3, …). Then when the query is executed, we’re actually searching a disjunction of those terms, like: t1 OR t2 OR t3 OR …
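To make the expansion step concrete, here is a minimal sketch in plain Java. It only models the idea described above and is not the search engine's actual code: the sorted `TreeSet` standing in for the term dictionary, the class name, and the `expandPrefix` helper are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixExpansionSketch {

    // Hypothetical helper: returns every dictionary term that starts with the prefix.
    public static List<String> expandPrefix(SortedSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous in sort order.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last term with this prefix
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Toy term dictionary (made-up data).
        SortedSet<String> dictionary = new TreeSet<>(
                Arrays.asList("mango", "maple", "melon", "zebra", "zinc"));

        List<String> terms = expandPrefix(dictionary, "m");

        // The prefix query is then treated as a disjunction of the expanded terms.
        System.out.println(String.join(" OR ", terms)); // prints "mango OR maple OR melon"
    }
}
```

With a real index the dictionary can hold many thousands of terms, so a one-character prefix can expand into a very large disjunction.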

The execution of these queries can get quite expensive and use a lot of memory, so we put a restriction on disjunction queries to limit the number of clauses to 1024. The choice of 1024 was arbitrary.

When you do a prefix query of a letter like “m”, the query may fail because it expands to more than 1024 terms. A similar prefix query for the letter “z” might work, though, because there are fewer terms starting with the letter “z”. In this way, the exact behavior you observe depends on your dataset.
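The dataset dependence can be illustrated with a small stand-alone Java sketch. It is a simplified model under stated assumptions, not the engine's implementation: the `TreeSet` dictionary, the made-up terms, the `countClauses` helper, and the use of `IllegalStateException` are all hypothetical. Only the limit of 1024 and the error text come from the explanation above.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class ClauseLimitSketch {

    // Clause limit quoted in the post.
    static final int MAX_CLAUSE_COUNT = 1024;

    // Hypothetical helper: counts the terms a prefix expands to and
    // fails once the expansion would exceed the clause limit.
    public static int countClauses(SortedSet<String> dictionary, String prefix) {
        int count = 0;
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            count++;
            if (count > MAX_CLAUSE_COUNT) {
                throw new IllegalStateException(
                        "TooManyClauses[maxClauseCount is set to " + MAX_CLAUSE_COUNT + "]");
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up dictionary: 2000 terms starting with "m", 10 starting with "z".
        SortedSet<String> dictionary = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            dictionary.add("m" + i);
        }
        for (int i = 0; i < 10; i++) {
            dictionary.add("z" + i);
        }

        System.out.println("z -> " + countClauses(dictionary, "z") + " clauses"); // z -> 10 clauses
        try {
            countClauses(dictionary, "m");
        } catch (IllegalStateException e) {
            System.out.println("m -> " + e.getMessage()); // m -> TooManyClauses[maxClauseCount is set to 1024]
        }
    }
}
```

The same one-character query succeeds or fails purely according to how many terms in the index share that first character.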

Another thing to note is that we did not intend for you to get an empty result set. What is supposed to happen is that you get an error clearly stating TooManyClauses[maxClauseCount is set to 1024]. However, we recently found a bug in the code that caused this error message to be hidden. This has been addressed in the upcoming release.

As a workaround, you can try to manually expand the prefix queries yourself. It is quite ugly, but you could do “ma” OR “mb” OR “mc” … etc. This is not a good solution, but it might work for the short term.
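A rough sketch of generating those longer prefixes in plain Java follows. The class name, the `subPrefixes` helper, and the a-z alphabet are assumptions for illustration. With a unicode tokenizer your terms may also contain digits or non-ASCII letters, and a term that is exactly “m” would not match any two-character prefix, so adjust the alphabet to your data. In the Java client each generated prefix would presumably be wrapped in its own prefix query and the queries combined in a disjunction; that part is left out so the example runs without the SDK.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualPrefixExpansion {

    // Hypothetical helper: appends each character of the assumed alphabet
    // to the original prefix, giving "ma", "mb", ..., "mz" for prefix "m".
    public static List<String> subPrefixes(String prefix, String alphabet) {
        List<String> result = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            result.add(prefix + c);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed alphabet: lowercase a-z only (terms are lowercased by the to_lower filter).
        List<String> prefixes = subPrefixes("m", "abcdefghijklmnopqrstuvwxyz");

        // Each entry would become its own prefix query, all combined with OR.
        System.out.println(String.join(" OR ", prefixes));
    }
}
```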

We are also investigating ways of making this limit of 1024 more configurable in the future.


#3

Thank you very much for the info!

And FYI, we just discovered we can see the “TooManyClauses[maxClauseCount is set to 1024]” errors with this logic (using v2.5.3 of the Java client):

SearchQuery theQuery=...

// Execute the query
SearchQueryResult result = theBucket.query (theQuery);

// Check the results
SearchStatus resultStatus = result.status();

if( !resultStatus.isSuccess() ) {
    logger.error ("Query FAILED!");
    logger.error ("-----");
    logger.error ("Errors: ");
    for( String errorStr : result.errors() ) {
        logger.error (errorStr);
    }

} else {
    // Extract the search results
    List<SearchQueryRow> hitRows = result.hits();

    if( hitRows.isEmpty() ) {
        // No (more) results

    } else {
        for( SearchQueryRow resultRow : hitRows ) {
        [...]

Thanks again!
Jeff