Proper way to FTS Index and search Meta().id or docids using regex or wildcard seaches

Using Node SDK 2.6.9 and Couchbase Server 6.0.x

Objective:
Illustration how to use FTS Search for documents with IDS’s matching a regular expression pattern without storing the Meta().id as a field within each target document.

What I have achieved
I have a working FTS index which includes an “id” field configured with the “keyword” analyzer. This id field is set from the Meta().id of the document. I am able to apply regex searches using this regex pattern: “^[a-z]{5,15}::dp::entity::[0-9]{4,10}$” successfully using the SDK regex search function in a variety of ways.

I would prefer not to store this ID within the document. I have reviewed the documentation for indexing as a “Doc ID with regex”. However, the documentation only illustrates examples using full document ids with the “docIds” function in the SDK. How does one implement a regex search against an index configured as Doc Id with regex"?

Thanks

JG

“Doc ID with regex” is a way couchbase FTS will allow you to extract “type identifiers” for indexing. Let me explain this better with an example …

Ex: Say you have these documents on your couchbase bucket…

  • doc-airline-23
  • doc-airline-421
  • doc-airport-992

Using the regexp: -[a-z]*-, the docIDs that would match the pattern are: “-airline-” and “-airport-”.
Setting the above regex in your “Doc ID with regex” field, you’ll be able to define top level type mappings: “-airline-” or “-airport-” to index only those documents whose Doc IDs contain the term.

So setting this as your index definition would index all attributes of documents with “-airline-” in their IDs …

{
  "name": "temp",
  "type": "fulltext-index",
  "params": {
    "doc_config": {
      "docid_regexp": "-[a-z]*-",
      "mode": "docid_regexp"
    },
    "mapping": {
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "types": {
        "-airline-": {
          "dynamic": true,
          "enabled": true
        }
      }
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceType": "couchbase",
  "sourceName": "bucket",
  "sourceUUID": "",
  "uuid": ""
}

Hope that explains what “Doc ID with regex” is intended to do. Note that this won’t assist you in searching the doc IDs based on a regexp - it’ll only index ones that you’ve matched using the regexp.

The Doc ID query would only assist you to retrieve documents if you specify the exact ID(s) - so I gather this isn’t what you want.

Now if you want to search across all documents by applying a regexp you would need to issue a regexp query against the “_id” field that FTS automatically indexes.

Taking the previous example again, the following query would fetch me all documents that have “airline” in it …

{
  "query": {
    "regexp": ".*airline.*",
    "field": "_id"
  }
}

Thank you very much @abhinav.

Yes, I need to be able to find all documents which match this pattern: “^[a-z]{5,15}::dp::entity::[0-9]{4,10}$”

In all be a few cases, the “[a-z]{5,15}” section will be set to a known customer account code:
^(sampleco|anotherco)::dp::entity::[0-9]{4,10}$

I found I must “\” escape each of the colons to work properly with the “keyword” analyzer setting in the index.

The mystery I believe you solved is against what may one index and you have stated it is the “_id” code.

From your reply I believe the update in the screen-cap below is what is needed in the FTS index. Please confirm my understanding of your instructions is correct:

Indexing “id” field isn’t necessary.
The meta().id field is already indexed (automatically) by FTS (as is => keyword) and can be searchable by specifying “_id” in the field section of your query.

1 Like

@abhinav - Ok. I removed the indexing on the “_id” field in the FTS index. I modified my Nodejs SDK implementation of the query to the below:

My intent is to find all documents with ID’s matching EITHER of these two patterns.
AND;
with the code field in the country object = “SA”

Does look correct?

{
    "query": {
        "conjuncts": [
            {
                "disjuncts": [
                    {
                        "regexp": "abc::dp::entity::[0-9]{4,10}", 
                        "field": "_id"
                    }, 
                    {
                        "regexp": "xyz::dp::entity::[0-9]{4,10}", 
                        "field": "_id"
                    }
                ]
            }, 
            {
                "match": "SA", 
                "field": "country.code"
            }
        ]
    }
}

JG

Looks good, or simply one regex to match either of those patterns …

{
  "query": {
    "conjuncts": [
      {
        "regexp": "(?:abc|xyz)::dp::entity::[0-9]{4,10}",
        "field": "_id"
      },
      {
        "match": "SA",
        "field": "country.code"
      }
    ]
  }
}
1 Like

I must have something else going on to block me. There query string is identical to yours now but I am still getting no results. How may one check this in the Couchbase UI?

{
    "query": {
        "conjuncts": [
            {
                "regexp": "(?:abc|xyz)::dp::entity::[0-9]{4,10}", 
                "field": "_id"
            }, 
            {
                "match": "SA", 
                "field": "country.code"
            }
        ]
    }
}

JG

@abhinav - I have marked your answer as the solution as I am sure it is the correct way. I have something outside of the FTS interfering with the successful result.

I will try to resolve this later. In the meantime, I have returned to the work around I was using where I am filtering in the dynamic N1QL query to which I am joining the results of the FTS query.

I do appreciate your assistance.

Regards,

JG

There isn’t a way you can run a regexp with the couchbase UI - as all it accepts is a query string query.
See here on supported UI syntax: https://docs.couchbase.com/server/current/fts/query-string-queries.html

If you’re using the SDK, you should perhaps look for any errors you may be getting while executing the regexp query.
You can POST a curl request directly to the REST endpoint to see your results …

curl -XPOST -H "Content-Type: application/json" \
-u <username>:<password> http://<ip>:8094/api/index/<index_name>/query \
-d '{
  "query": {
    "regexp": "(?:abc|xyz)::dp::entity::[0-9]{4,10}",
    "field": "_id"
  }
}'

Thank you. I will try the CURL request and see if anything shows up. To be clear, I am not receiving any errors. The query runs successfully but it doesn’t find matches.

The solution I was seeking to replace leverages FTS to search on country code, or, agency code, or both, limited by the type. The results are then injected into a dynamic N1QL query which selects most of the fields from the source documents and joined to the results from the FTS results on the id fields. I am using a REGEXP_LIKE function to filter those id’s by the regex pattern I provided in previous replies.

By applying the documentid filter at the FTS step, I would greatly reduce the number of records in the join.

As I mentioned in my opening post, I have successfully applied regex pattern matching against document ids in FTS but only when the Meta().id was already embedded in the target document and set to a field and that field was configured with the “keyword” analyzer in the index.

I believe your recommendations are sound. I just need to try it another day when my head is a little clearer.

Regards,

JG

As promised, I ran the FTS query using the CURL command and it does return a verbose error, “…TooManyClauses[maxClauseCount is set to 1024]”.

The SDK appears to swallow this and fail silently.

I have learned what causes this. Unfortunately, I cannot narrow the regex to bring the clause count under the 1024 threshold. We are also not able to upgrade to 6.5 yet. It appears we must continue with our work-around until we can complete our current release and take up the upgrade to 6.5 in our next development cycle.

Thank you for your assistance with this issue.

Regards,

JG

In 6.5, we support an easy way to change the maxClauseCount …

curl -XPUT -H "Content-type:application/json" \
http://<username>:<password>@<ip>:8094/api/managerOptions \
-d '{"bleveMaxClauseCount": "10000"}'

Right. I read that. Unfortunately, I cannot upgrade at this time due to testing has completed. We will upgrade in our next sprint.

JG