Full Text Search relevance scores in a multi-tenant environment

Hi,

Our application uses a single bucket to store documents from multiple tenants. We will create one FTS index on this bucket. When we search the FTS index, we will always search within a particular tenant, as in the following query (all documents have a tenantId field):

///////////////
{
  "explain": true,
  "fields": [
    "*"
  ],
  "sort": [
    "type",
    "-_score"
  ],
  "highlight": {},
  "query": {
    "must": {
      "conjuncts": [
        {"field": "tenantId", "match": "test123456"},
        {"field": "_all", "match": "Highway"}
      ]
    }
  }
}
/////////////////////

Based on the Couchbase online documentation, a search against the FTS index returns a relevance score that is calculated using the
tf-idf method of scoring (https://en.wikipedia.org/wiki/Tf–idf). In this method, the inverse document frequency also contributes to the score. The inverse document frequency is defined as

///////////////////////
a measure of how much information the word provides, i.e., if it’s common or rare across all documents.
//////////////////////
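
For reference, the textbook definition on the linked Wikipedia page can be written as below, where N is the total number of documents in the corpus D. This is the generic form only; it is not necessarily the exact formula Couchbase applies.

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
```

The corpus D is the part that matters for this question: whichever set of documents the index treats as D determines how rare or common a term is considered to be.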

Since the search result is scored based on whether the search term is common or rare across all documents, I assume that “all documents” means all documents in our FTS index. In our product scenario, the FTS index contains data from all tenants. Does that mean the search score returned from our query (as I mentioned earlier, our query always searches text for one tenant) will be polluted by the documents from other tenants? Please let me know if I have misunderstood anything. Please advise on how to obtain an accurate search relevance score for each individual tenant. Thank you.

Hey @jessyang, the score is estimated based on the search term across the mentioned fields across all indexed documents.
So assuming I’ve understood your question correctly … even though the score takes into account all the documents, it should be looked at relatively for each individual tenant. What I mean to say here is, rather than looking into the absolute value of the score - check how the scores look in relation to each other for each individual tenant.

Taking your query …
"conjuncts": [{"field": "tenantId", "match": "test123456"}, {"match": "Highway"}]

This would return results only for tenantId: test123456.
The score will be estimated for each of the sub queries above and aggregated. So rather than looking at the value of the score for each hit, check how they appear in relation to each other. The documents will be sorted based on relevance (highest to lowest).

Hi, @abhinav

I am still a bit confused. For my query:

//////////////
"must": {
"conjuncts": [{"field": "tenantId", "match": "test123456"}, {"field": "_all", "match": "Highway"}]}
//////////

Let’s say that there are 100 tenants in the bucket and one FTS index is created for this bucket. The search will estimate a score for each of the above sub-queries and aggregate them. The score of {"field": "_all", "match": "Highway"} will be impacted by the documents of all 100 tenants. Is this correct? If my understanding is correct, this is different from what our application intends to do: we don’t want the score of {"field": "_all", "match": "Highway"} to be impacted by the documents of the other 99 tenants. When our application searches for any document/field containing the string ‘Highway’, it is only in the context of a particular tenant. Each tenant in our application considers the documents of the other 99 tenants totally irrelevant. If the score of {"field": "_all", "match": "Highway"} is also based on documents from the other 99 tenants, the search score is not very useful for our application. If this is the case, I am trying to see whether there is any way to make the search score more relevant to the current tenant and not impacted by the documents of the other 99 tenants.

Let’s say that there are 100 tenants in the bucket and one FTS index is created for this bucket. The search will estimate a score for each of the above sub-queries and aggregate them. The score of {"field": "_all", "match": "Highway"} will be impacted by the documents of all 100 tenants. Is this correct?

Yes that is accurate.
So if you don’t want the score to be affected by the documents that belong to other tenants - the index will need to hold only documents that belong to the particular tenant, meaning the type mapping needs to be built over the tenantId.

Hi, @abhinav regarding your comment on

///////////
meaning the type mapping needs to be built over the tenantId.
//////////

Do you actually mean that the document type of our documents will need to include tenant ID as part of the document type? Let’s say that our application supports the following document types:

SalesOrder
Inventory
Product

We will need to change our document type to be specific per tenant. So for Tenant with Tenant_id=test123456, the list of document types will be something like:

test123456:SalesOrder
test123456:Inventory
test123456:Product

Then we will need to create an FTS index for each tenant and configure the index to include only documents with the above document types, which contain the tenant ID value (e.g. test123456:SalesOrder, test123456:Inventory, test123456:Product). Please let me know if my understanding is incorrect.
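
As a rough illustration of the per-tenant setup described above, the sketch below generates one index definition per tenant whose type mappings are the tenant-prefixed document types. The helper name `tenant_index_definition`, the bucket name `app-bucket`, the index naming scheme, and the assumption that the document type lives in a field called `type` are all hypothetical. The key names follow the general layout of an FTS index definition and should be checked against the Couchbase documentation for your server version before use.

```python
import json

# Document types taken from the example in this thread.
DOC_TYPES = ["SalesOrder", "Inventory", "Product"]


def tenant_index_definition(tenant_id, bucket, doc_types=DOC_TYPES):
    """Build a (hypothetical) index definition covering one tenant only.

    Assumes every document carries a "type" field whose value is
    "<tenantId>:<DocumentType>", e.g. "test123456:SalesOrder".
    """
    # One type mapping per tenant-prefixed document type.
    types = {
        f"{tenant_id}:{doc_type}": {"enabled": True, "dynamic": True}
        for doc_type in doc_types
    }
    return {
        "type": "fulltext-index",
        "name": f"fts_{tenant_id}",  # assumed naming scheme
        "sourceType": "couchbase",
        "sourceName": bucket,
        "params": {
            # Pick the type mapping from the document's "type" field.
            "doc_config": {"mode": "type_field", "type_field": "type"},
            "mapping": {
                # Disable the default mapping so other tenants' documents
                # are not indexed (and so do not affect scoring).
                "default_mapping": {"enabled": False},
                "types": types,
            },
        },
    }


definition = tenant_index_definition("test123456", "app-bucket")
print(json.dumps(definition, indent=2))
```

Because each such index would hold only one tenant's documents, document frequencies would be computed per tenant, at the cost of one index per tenant.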

If my understanding is correct, I have the following concerns:

  • If there are lots of tenants, there seem to be lots of FTS indexes to create and manage.
  • The other issue is that we also allow our customers to upload documents of their own document types. In this case, we don’t really know in advance what document types the customers are going to need, so our FTS index will not be able to return the correct query result, which is to find any document containing a particular string for a specific tenant.

  • If there are lots of tenants, there seem to be lots of FTS indexes to create and manage.

Yes, unfortunately that’s the only way to achieve the pure per-tenant score you’re looking for at the moment. We do intend to support custom scoring in the future, where the user can determine the score for each document based on their own algorithm/needs.

  • The other issue is that we also allow our customers to upload documents of their own document types. In this case, we don’t really know in advance what document types the customers are going to need, so our FTS index will not be able to return the correct query result, which is to find any document containing a particular string for a specific tenant.

If the “type” of the document is not known and the users are not restricted to uploading documents that carry the necessary field, then I’d recommend you set up a default dynamic mapping. In this case all documents will be indexed. You will still be able to choose fields of interest, however, by setting up child fields and mappings under the default type mapping (to reduce the size of the index and to reduce search time).

Seeing that an accurate score based on only one sub-query is your most important requirement, the default mapping may not be the best option again. Note that results obtained from each sub-query within the main query will play a role when the aggregate score is determined.

Hi, @abhinav, Based on our application’s top priority requirement:

  • Allow our customers to upload documents of their own custom document types. In this case, we don’t really know in advance what document types the customers are going to need. Our application needs to allow a customer to find any documents which contain a specific string/text for a particular tenant.

Due to the custom document types which can be provided by our customers, it seems that we cannot create one FTS index per tenant. We will need to create one single FTS index which includes documents for all the tenants. We will configure the index as follows to avoid storing too much data in the FTS index:

  • exclude any document types which don’t need to be searched by customers since we have some document types which are used internally and not meant for the customers to search on.
  • exclude any fields which don’t need to be searched on
  • Only store document ID in the index

Due to the current limitation in how Couchbase computes the search score in a multi-tenant environment, our application will ignore the search score for now. When Couchbase provides advanced configuration options for the search score in the future, we can enhance our application’s search function based on the search score. Please let me know if you disagree with my proposed approach or have additional suggestions. Thanks.

Hi @jessyang,

Just out of curiosity, can you briefly explain why you are worried about the score when you have conjuncts which explicitly identify documents by tenantId?
Do you have 1000s of documents for that tenantId, and are you concerned about the lost relevancy among those 1000s of documents?

Another dimension you could explore: you can always boost your sub-queries to give them more weight in the score computation. For example, try a boost factor of 0 for the sub-query you want to ignore while computing the score.

Cheers!

@jessyang I think I may have another way for you.

As my colleague @sreeks suggests, you could look into boosting the second query, which would give more weight to the query that matters.

Now - since you don’t want the tenantId search to impact the score at all, you could set a boost of “0” to it, so this could be your new query:

"conjuncts": [{"field": "tenantId", "match": "test123456", "boost": 0}, {"match": "Highway"}]

And this would fetch you scores based on the second query in the conjunction query, without the tenantId impacting them. However, scores would still be affected by other documents that matched “Highway”, and not the first query.
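
A minimal sketch of building such a request body programmatically is shown below. The helper name `tenant_query` and its parameters are hypothetical; the field names (`conjuncts`, `field`, `match`, `boost`, `sort`) are the ones used in the queries in this thread. The sketch only constructs and prints the JSON and does not send it anywhere.

```python
import json


def tenant_query(tenant_id, text, field=None, tenant_boost=0):
    """Build a search request that filters by tenant and scores on the text.

    A boost of 0 on the tenantId clause keeps that clause as a filter
    while removing its contribution to the aggregated score.
    """
    tenant_clause = {"field": "tenantId", "match": tenant_id, "boost": tenant_boost}
    text_clause = {"match": text}
    if field is not None:
        # Restrict the text match to one field, e.g. "_all".
        text_clause["field"] = field
    return {
        "query": {"conjuncts": [tenant_clause, text_clause]},
        "sort": ["-_score"],  # highest relevance first
    }


body = tenant_query("test123456", "Highway")
print(json.dumps(body, indent=2))
```

Note that this only removes the tenantId clause from the score. The term statistics used for the text clause still come from every document in the index.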

The only way to get the accurate score for documents that apply only to a specific tenant - is to set up an index for each tenant, and since that isn’t feasible for you, what you summarize sounds OK to me.

Hi, @abhinav @sreeks
Thanks for the suggestion, adding the "boost": 0 will remove the impact to the score for the tenantId field.

However, I think there is still a possibility that the search score for the other, non-tenant-ID sub-queries is impacted by documents which belong to unrelated tenants (e.g. all the other tenants whose tenant ID is not test123456). For example, we can have a query like the following:

"conjuncts": [{"field": "tenantId", "match": "test123456", "boost": 0}, {"field": "_all", "match": "Green Highway"}]

In the tf-idf method of scoring (https://en.wikipedia.org/wiki/Tf–idf), the inverse document frequency also contributes to the score. In our example, Couchbase will calculate the search score for the following sub-query:

{"field": "_all", "match": "Green Highway"}

The search score will be the sum of the search score of “Green” and the search score of “Highway”.

Assuming that for our tenant test123456:

  • the text Green is NOT very common among the documents belonging to this tenant test123456.
  • the text Highway is common among the documents belonging to this tenant test123456.

For the other 99 tenants which are not test123456, their documents (which means the majority of the documents in the bucket) have the following characteristics:

  • the text Green is very common among all documents in the bucket
  • the text Highway is NOT common among all documents in the bucket

For our search query, we would like the search score to be computed based only on the documents of tenant test123456, since our users in tenant test123456 do not have access to, and do not care about, documents belonging to other tenants. However, in the above situation, the documents belonging to other tenants will impact the search score and make it not what we expect.
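
The scenario above can be checked with a small, self-contained calculation. The sketch below uses the textbook idf, log(N / df), on a made-up corpus of 100 documents (10 for tenant test123456 and 90 for other tenants). The formula, the helper name `idf`, and the document contents are all assumptions for illustration; Couchbase's actual scoring formula may differ in detail.

```python
import math


def idf(term, docs):
    """Textbook inverse document frequency: log(N / df).

    Illustrative only; not necessarily the formula Couchbase uses.
    """
    df = sum(1 for doc in docs if term in doc["text"].lower().split())
    return math.log(len(docs) / df) if df else 0.0


# Made-up corpus. Within tenant test123456, "highway" appears in every
# document and "green" in only one of ten.
tenant_docs = [{"tenantId": "test123456", "text": "Highway report"} for _ in range(9)]
tenant_docs.append({"tenantId": "test123456", "text": "Green Highway report"})

# In the other tenants, "green" appears in every document and "highway"
# in only one of ninety.
other_docs = [{"tenantId": "other", "text": "Green field"} for _ in range(89)]
other_docs.append({"tenantId": "other", "text": "Green Highway"})

all_docs = tenant_docs + other_docs

for term in ("green", "highway"):
    print(
        term,
        "tenant-only idf:", round(idf(term, tenant_docs), 3),
        "shared-index idf:", round(idf(term, all_docs), 3),
    )
```

With these numbers, “Green” is the more informative term when only the tenant's documents are counted, while “Highway” is the more informative term when the whole shared index is counted, so the relative weight of the two terms flips.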

Well summarized @jessyang. Your analysis is accurate.

Thanks, @abhinav. Regarding our application’s specific requirement on the search score in a multi-tenant environment, do you happen to know whether the upcoming release of Couchbase will be able to address this requirement?

Cheers. Support for custom scoring is most certainly on our road map here at Couchbase. However, the scope of the next release has already been defined … so I’d stay tuned for the release after the next.