Find non unique fields across large database

I have a large(ish) database, say 10M+, where I want to find documents in which some field is equal (ie find all documents in which field X has arbitrary value Y)

I have created indexes on the relevant fields by

CREATE INDEX `idx_covering_breed` ON `bucket`((meta().`id`),`breed`) WHERE ((meta().`id`) like "animal:%")

and are attempting to get the result by

SELECT d.breed, ARRAY_AGG(META(d).id) AS targets
FROM `bucket` AS d
WHERE META().id LIKE 'animal:%'
GROUP BY d.breed
HAVING COUNT(1) > 1

But this query is inefficient for a large database/eats my memory.

Looking for a way to improve upon my query to achieve the goal efficiently

You might try (in addition to your idx_covering_breed index):

CREATE INDEX idx_breed ON `bucket`(`breed`);

And run:

SELECT meta(d2).id
FROM ( SELECT d1.breed
                FROM `bucket` AS d1
                WHERE d1.breed IS VALUED
                AND meta(d1).id LIKE 'animal:%'
                GROUP BY d1.breed
                HAVING COUNT(1) > 1
              ) AS breeds
JOIN `bucket` d2 ON d2.breed = breeds.breed ;

The aim is to be able to stream the results rather than have to hold all keys in memory (your ARRAY_AGG).

HTH.

@dh Query with Hash Join , single index

CREATE INDEX `idx_covering_breed` ON `default`(`breed`) WHERE meta().`id` like "animal:%";

SELECT RAW META(d2).id
FROM ( SELECT d1.breed
       FROM default AS d1
       WHERE d1.breed IS VALUED AND meta(d1).id LIKE 'animal:%'
       GROUP BY d1.breed
       HAVING COUNT(1) > 1
      ) AS breeds
JOIN default d2 USE HASH (PROBE) ON d2.breed = breeds.breed AND meta(d2).id LIKE 'animal:%';

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.