Proper way to FTS Index and search Meta().id or docids using regex or wildcard seaches

Using Node SDK 2.6.9 and Couchbase Server 6.0.x

Objective:
Illustration how to use FTS Search for documents with IDS’s matching a regular expression pattern without storing the Meta().id as a field within each target document.

What I have achieved
I have a working FTS index which includes an “id” field configured with the “keyword” analyzer. This id field is set from the Meta().id of the document. I am able to apply regex searches using this regex pattern: “^[a-z]{5,15}::dp::entity::[0-9]{4,10}$” successfully using the SDK regex search function in a variety of ways.

I would prefer not to store this ID within the document. I have reviewed the documentation for indexing as a “Doc ID with regex”. However, the documentation only illustrates examples using full document ids with the “docIds” function in the SDK. How does one implement a regex search against an index configured as Doc Id with regex"?

Thanks

JG

“Doc ID with regex” is a way couchbase FTS will allow you to extract “type identifiers” for indexing. Let me explain this better with an example …

Ex: Say you have these documents on your couchbase bucket…

  • doc-airline-23
  • doc-airline-421
  • doc-airport-992

Using the regexp: -[a-z]*-, the docIDs that would match the pattern are: “-airline-” and “-airport-”.
Setting the above regex in your “Doc ID with regex” field, you’ll be able to define top level type mappings: “-airline-” or “-airport-” to index only those documents whose Doc IDs contain the term.

So setting this as your index definition would index all attributes of documents with “-airline-” in their IDs …

{
  "name": "temp",
  "type": "fulltext-index",
  "params": {
    "doc_config": {
      "docid_regexp": "-[a-z]*-",
      "mode": "docid_regexp"
    },
    "mapping": {
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "types": {
        "-airline-": {
          "dynamic": true,
          "enabled": true
        }
      }
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceType": "couchbase",
  "sourceName": "bucket",
  "sourceUUID": "",
  "uuid": ""
}

Hope that explains what “Doc ID with regex” is intended to do. Note that this won’t assist you in searching the doc IDs based on a regexp - it’ll only index ones that you’ve matched using the regexp.

The Doc ID query would only assist you to retrieve documents if you specify the exact ID(s) - so I gather this isn’t what you want.

Now if you want to search across all documents by applying a regexp you would need to issue a regexp query against the “_id” field that FTS automatically indexes.

Taking the previous example again, the following query would fetch me all documents that have “airline” in it …

{
  "query": {
    "regexp": ".*airline.*",
    "field": "_id"
  }
}

Thank you very much @abhinav.

Yes, I need to be able to find all documents which match this pattern: “^[a-z]{5,15}::dp::entity::[0-9]{4,10}$”

In all be a few cases, the “[a-z]{5,15}” section will be set to a known customer account code:
^(sampleco|anotherco)::dp::entity::[0-9]{4,10}$

I found I must “\” escape each of the colons to work properly with the “keyword” analyzer setting in the index.

The mystery I believe you solved is against what may one index and you have stated it is the “_id” code.

From your reply I believe the update in the screen-cap below is what is needed in the FTS index. Please confirm my understanding of your instructions is correct:

Indexing “id” field isn’t necessary.
The meta().id field is already indexed (automatically) by FTS (as is => keyword) and can be searchable by specifying “_id” in the field section of your query.

1 Like

@abhinav - Ok. I removed the indexing on the “_id” field in the FTS index. I modified my Nodejs SDK implementation of the query to the below:

My intent is to find all documents with ID’s matching EITHER of these two patterns.
AND;
with the code field in the country object = “SA”

Does look correct?

{
    "query": {
        "conjuncts": [
            {
                "disjuncts": [
                    {
                        "regexp": "abc::dp::entity::[0-9]{4,10}", 
                        "field": "_id"
                    }, 
                    {
                        "regexp": "xyz::dp::entity::[0-9]{4,10}", 
                        "field": "_id"
                    }
                ]
            }, 
            {
                "match": "SA", 
                "field": "country.code"
            }
        ]
    }
}

JG

Looks good, or simply one regex to match either of those patterns …

{
  "query": {
    "conjuncts": [
      {
        "regexp": "(?:abc|xyz)::dp::entity::[0-9]{4,10}",
        "field": "_id"
      },
      {
        "match": "SA",
        "field": "country.code"
      }
    ]
  }
}
1 Like

I must have something else going on to block me. There query string is identical to yours now but I am still getting no results. How may one check this in the Couchbase UI?

{
    "query": {
        "conjuncts": [
            {
                "regexp": "(?:abc|xyz)::dp::entity::[0-9]{4,10}", 
                "field": "_id"
            }, 
            {
                "match": "SA", 
                "field": "country.code"
            }
        ]
    }
}

JG

@abhinav - I have marked your answer as the solution as I am sure it is the correct way. I have something outside of the FTS interfering with the successful result.

I will try to resolve this later. In the meantime, I have returned to the work around I was using where I am filtering in the dynamic N1QL query to which I am joining the results of the FTS query.

I do appreciate your assistance.

Regards,

JG

There isn’t a way you can run a regexp with the couchbase UI - as all it accepts is a query string query.
See here on supported UI syntax: https://docs.couchbase.com/server/current/fts/query-string-queries.html

If you’re using the SDK, you should perhaps look for any errors you may be getting while executing the regexp query.
You can POST a curl request directly to the REST endpoint to see your results …

curl -XPOST -H "Content-Type: application/json" \
-u <username>:<password> http://<ip>:8094/api/index/<index_name>/query \
-d '{
  "query": {
    "regexp": "(?:abc|xyz)::dp::entity::[0-9]{4,10}",
    "field": "_id"
  }
}'

Thank you. I will try the CURL request and see if anything shows up. To be clear, I am not receiving any errors. The query runs successfully but it doesn’t find matches.

The solution I was seeking to replace leverages FTS to search on country code, or, agency code, or both, limited by the type. The results are then injected into a dynamic N1QL query which selects most of the fields from the source documents and joined to the results from the FTS results on the id fields. I am using a REGEXP_LIKE function to filter those id’s by the regex pattern I provided in previous replies.

By applying the documentid filter at the FTS step, I would greatly reduce the number of records in the join.

As I mentioned in my opening post, I have successfully applied regex pattern matching against document ids in FTS but only when the Meta().id was already embedded in the target document and set to a field and that field was configured with the “keyword” analyzer in the index.

I believe your recommendations are sound. I just need to try it another day when my head is a little clearer.

Regards,

JG

As promised, I ran the FTS query using the CURL command and it does return a verbose error, “…TooManyClauses[maxClauseCount is set to 1024]”.

The SDK appears to swallow this and fail silently.

I have learned what causes this. Unfortunately, I cannot narrow the regex to bring the clause count under the 1024 threshold. We are also not able to upgrade to 6.5 yet. It appears we must continue with our work-around until we can complete our current release and take up the upgrade to 6.5 in our next development cycle.

Thank you for your assistance with this issue.

Regards,

JG

In 6.5, we support an easy way to change the maxClauseCount …

curl -XPUT -H "Content-type:application/json" \
http://<username>:<password>@<ip>:8094/api/managerOptions \
-d '{"bleveMaxClauseCount": "10000"}'

Right. I read that. Unfortunately, I cannot upgrade at this time due to testing has completed. We will upgrade in our next sprint.

JG

I am back, unfortunately.

The solution you suggested for finding URN-compliant docids using FTS regex matches worked great, until…
The number of records in our collection against which that search was applicable exceeded 1024.

This is curious because previously, we could search without the id filter just fine with only a date range required. This of course was insufficient because we maintain separation between tenants via a domain code encoded in the document id as explained in my original posts.

Now the document results exceed 1024 we get no results but only when we include the id filter. Weird. if we search only on date range results come back from all our customers - not acceptable. If we use the doc id filter the query breaks and we get nothing.Below is the response from the FTS REST API:

Why would FTS return results with the other filters but shut down once the ID filter is applied?

"status": {
        "total": 6,
        "failed": 6,
        "successful": 0,
        "errors": {
            "screened_name_request_test_3f33ca20613e41ce_13aa53f3": "TooManyClauses[maxClauseCount is set to 1024]",
            "screened_name_request_test_3f33ca20613e41ce_18572d87": "TooManyClauses[maxClauseCount is set to 1024]",
            "screened_name_request_test_3f33ca20613e41ce_54820232": "TooManyClauses[maxClauseCount is set to 1024]",
            "screened_name_request_test_3f33ca20613e41ce_6ddbfb54": "TooManyClauses[maxClauseCount is set to 1024]",
            "screened_name_request_test_3f33ca20613e41ce_aa574717": "TooManyClauses[maxClauseCount is set to 1024]",
            "screened_name_request_test_3f33ca20613e41ce_f4e0a48a": "TooManyClauses[maxClauseCount is set to 1024]"
        }
    },
    "request": {
        "query": {
            "must": {
                "conjuncts": [
                    {
                        "start": "2020-06-29T00:00:00-04:00",
                        "end": "2020-06-30T23:59:59-04:00",
                        "inclusive_start": true,
                        "inclusive_end": true,
                        "field": "completed"
                    },
                    {
                        "regexp": "biilthakatt::dp::some::thing::activity::[0-9]{4,10}",
                        "field": "_id"
                    }
                ]
            }
        },
        "size": 20,
        "from": 0,
        "highlight": null,
        "fields": [
            "*"
        ],
        "facets": null,
        "explain": false,
        "sort": [
            "-start"
        ],
        "includeLocations": false
    },
    "hits": null,
    "total_hits": 0,
    "max_score": 0,
    "took": 4438027,
    "facets": null
}

@The_Cimmerian The query …

{
  "query": {
    "must": {
      "conjuncts": [
        {
          "start": "2020-06-29T00:00:00-04:00",
          "end": "2020-06-30T23:59:59-04:00",
          "inclusive_start": true,
          "inclusive_end": true,
          "field": "completed"
        },
        {
          "regexp": "biilthakatt::dp::some::thing::activity::[0-9]{4,10}",
          "field": "_id"
        }
      ]
    }
  }
}

… is a conjunction of a date range query and a regexp query.

  • The date range query does NOT launch multiple term searchers to match all qualifying dates.
  • The regexp query on the other hand does spin off multiple term searchers - one for each unique term identified that matches the pattern. If the number of unique terms exceeds the bleveMaxClauseCount, you will see the error - “TooManyClauses”.

We’ve discussed this earlier that we’ve added support for users to raise the threshold for the setting from a default of 1024 - this being available from 6.5 on.

As always, I appreciate your reply. Yes, you have mentioned you have added support for users to raise the threshold for the settings from 1024 to 10,000 - in ver 6.5. My reply to you was we are not yet in a position to make that update as it requires we update our node SDK to 3.0.x from 2.6.x. Not until we get through or list of core features of our app may we entertain a version upgrade.

In addition, the 10,000 limit you mentioned will not work either. If we have over 10,000 matches for the same document ID pattern, we will still run into the same limitation.

Given this, I have reworked the query to bypass the FTS and utilize a purely N1QL query indexed to filter the documents by their ID as well as the other conditions. This turned out surprisingly well, and, the performance was improved nearly 25%. This surprised me.

Anyway, this issue is resolved for us now.

Again, thank you for you help .

JG