Problems getting unnest query to use nested array index

johan.andren · December 20, 2018, 4:44pm

Hi, working on an Akka Persistence plugin for Couchbase and struggling with creating an index for nested arrays to be used. The documents looks like this:

{
  "persistence_id":"...",
  "messages": [
    {
      "ordering": "2018-12-20T09:10:53.368000000_13706215326216848386",
      "tags": [
        "tag-1"
      ]
    }, ...
  ]
}

And we are trying to make a query for a specific “tag” in a range of “ordering”, we have it working with two indexes, one for tag and one for ordering, but realized that the paging that we do (using LIMIT) doesn’t scale to the amount of events we expect, we hope that if we can achieve one combined index that would push the limit operation to the index (ofc we may be wrong about this).

Latest try to do a single index has been like this:

CREATE INDEX `tags-ordering` ON `akka` 
  (ALL ARRAY 
    (ALL ARRAY [t.tag, m.ordering]
       FOR t IN m.tag_seq_nrs END) 
    FOR m IN messages END) 
WHERE `type` = "journal_message"

But we can’t even get the simplest query to use this index:

SELECT a.persistence_id, m.* FROM akka a UNNEST messages m UNNEST m.tag_seq_nrs t
WHERE m.ordering > "2000-12-20T08:10:53.368000000_13706215326216848386" 
AND a.type = "journal_message"
AND t.tag = "tag-1"

Would be awesome if someone could tell us what it is we are missing or have misunderstood.

vsr1 · December 20, 2018, 5:36pm

You don’t have predicate on leading index key.
The leading array index key you have [t.tag, m.ordering] this is array. This is complex , If equality predicate it is easy for others it is complex and need to aware how arrays are compared able to translate.

Option 1:

As you need whole m only index m.ordering and apply tags post fetch

CREATE INDEX `tags-ordering` ON `akka`
  (DISTINCT ARRAY m.ordering
    FOR m IN messages END)
WHERE `type` = "journal_message";

SELECT a.persistence_id, m.*
FROM akka a
UNNEST messages a.m
WHERE m.ordering > "2000-12-20T08:10:53.368000000_13706215326216848386"
AND a.type = "journal_message"
AND ANY t IN m.tags SATISFIES t IN ["tag-1"] END;

Option 2:

As you need whole m only index t.tag and apply m.ordering post fetch

CREATE INDEX `tags-ordering` ON `akka`
  (DISTINCT ARRAY
    (DISTINCT ARRAY t
       FOR t IN m.tags END)
    FOR m IN messages END)
WHERE `type` = "journal_message";

SELECT a.persistence_id, m.*
FROM akka a
UNNEST messages a.m
UNNEST m.tags t
WHERE m.ordering > "2000-12-20T08:10:53.368000000_13706215326216848386"
AND a.type = "journal_message"
AND t = "tag-1";

Option 3:

Index as you have but change predicate

CREATE INDEX `tags-ordering` ON `akka`
  (DISTINCT ARRAY
    (DISTINCT ARRAY [t, m.ordering]
       FOR t IN m.tags END)
    FOR m IN messages END)
WHERE `type` = "journal_message"

SELECT a.persistence_id, m.*
FROM akka a
UNNEST messages a.m
UNNEST m.tags t
WHERE a.type = "journal_message"
AND [t, m.ordering] >= ["tag-1", "2000-12-20T08:10:53.368000000_13706215326216848386"]
AND [t, m.ordering] < [SUCCESSOR("tag-1")] ;

johan.andren · December 21, 2018, 8:10am

That is some super fast help @vsr1 thanks a lot, will try these out and see which one fits.

chbatey · December 21, 2018, 9:22am

Hi @vsr1 thanks for the help

The 3rd index has us nearly there.

This allows the following query to complete fast:

SELECT a.persistence_id, m.*
FROM akka a
UNNEST messages m
UNNEST m.tags t
WHERE a.type = "journal_message"
AND [t, m.ordering] >= ["tag1", "1582-10-15T00:00:00.000000000_                   0"]
AND [t, m.ordering] <= ["tag1", "2018-12-21T08:59:00.713000000_                   0"]
LIMIT 100

However once we try and get the results back ordered it takes many minutes with a few million documents.

SELECT a.persistence_id, m.*
FROM akka a
UNNEST messages m
UNNEST m.tags t
WHERE a.type = "journal_message"
AND [t, m.ordering] >= ["tag1", "1582-10-15T00:00:00.000000000_                   0"]
AND [t, m.ordering] <= ["tag1", "2018-12-21T08:59:00.713000000_                   0"]
ORDER BY [t, m.ordering]
LIMIT 100

Is there a way to get the ordering with a limit done in the index? Does the [t, m.ordering] index mean it is sorted by t and then m.ordering or do we need to add something for that?

Thanks for your help

vsr1 · December 21, 2018, 12:39pm

Unnest is JOIN it can’t use index order if not covered. If your array elements are small you can try covered index.

[t, m.ordering] index mean it is sorted by t and with in the same value of t then m.ordering.

CREATE INDEX `tags-ordering-cover` ON `akka`
  (ALL ARRAY
    (ALL ARRAY [t, m.ordering, m]
       FOR t IN m.tags END)
    FOR m IN messages END, persistence_id)
WHERE `type` = "journal_message";

SELECT a.persistence_id, [t, m.ordering, m][2].*
FROM aka a
UNNEST a.messages AS m
UNNEST m.tags t
WHERE a.type = "journal_message"
AND [t, m.ordering, m] > [$tag1, $fromOffset]
AND [t, m.ordering, m] < [$tag1, $toOffset]
ORDER BY [t, m.ordering, m]
LIMIT $limit;

perry · December 29, 2018, 12:28am

Just to elaborate a bit on vsr1’s answer (who is by FAR the expert, I’m just adding some explanation):

When using UNNEST, the ORDER BY operator can only be leveraged by the index (pushed down) in this case if the whole query can be fully answered using only the data in the index (covered). In addition to the filtering fields (a.type, m.ordering and the tags), the query is asking for a.persistence_id and m.*, so those must also be in the index. ALL ARRAY instead of DISTINCT ARRAY is also necessary to fully cover the query.

If you didn’t need the whole m.* each time, you could save some space by replacing it with a sub field in the index and the query:

CREATE INDEX `tags-ordering-cover` ON `akka`
  (ALL ARRAY
    (ALL ARRAY [t, m.ordering, m.payload]
       FOR t IN m.tags END)
    FOR m IN messages END, persistence_id)
WHERE `type` = "journal_message";

SELECT a.persistence_id, [t, m.ordering, m.payload][2]
FROM akka a
UNNEST a.messages AS m
UNNEST m.tags t
WHERE a.type = "journal_message"
AND [t, m.ordering, m.payload] > [$tag1, $fromOffset]
AND [t, m.ordering, m.payload] < [$tag1, $toOffset]
ORDER BY [t, m.ordering, m.payload]
LIMIT $limit;

Another option would be to query only the key of each document and then manually get() each one which would allow you to push down the ORDER and LIMIT to the index without keeping all of that data within the index itself:

CREATE INDEX `tags-ordering-cover-meta` ON `akka`
  (ALL ARRAY
    (ALL ARRAY [t, m.ordering]
       FOR t IN m.tags END)
    FOR m IN messages END)
WHERE `type` = "journal_message";

SELECT DISTINCT meta(a).id
FROM akka a
UNNEST a.messages AS m
UNNEST m.tags t
WHERE a.type = "journal_message"
AND [t, m.ordering] > [$tag1, $fromOffset]
AND [t, m.ordering] < [$tag1, $toOffset]
ORDER BY [t, m.ordering]
LIMIT $limit;

Then batch the resulting list of keys to retrieve the messages field of each:

DocumentFragment<Lookup> result = bucket
    .lookupIn(<key>)
    .get("messages")
    .execute();

This is a bit more work within the client but would be more space-efficient within the index. There shouldn’t be much of a performance difference unless the messages array is MUCH larger than the set of individual messages that the more selective N1QL query would return.

johan.andren · January 6, 2019, 6:59pm

We already speculated if an index with just the ids and then fetching the documents would be a possible solution before we got the reply and implementing this seems to give sufficient performance. Thanks for all the help @vsr1 and @perry