Move documents between buckets efficiently

n1ql

#1

[using CB 4.5 Enterprise]

We are trying to implement a solution to reduce load from our queries on our main application bucket(s).
What we have decided to do was create an “archive” buckets and move documents from main buckets to it when they are not needed (say, older than X days)

We want to implement that in N1QL.

Ideally, it can be done in a single N1QL statement, so it’s “atomic”.
However, we could not find a way to do it.

We thought to use “DELETE … RETURNING” on the main bucket, and wrap that in an INSERT.
Something like:

INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT _k, _v FROM 
(
    DELETE from `main` a
    WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
    RETURNING meta(a).id _k, a _v
)

But that fails with “syntax error - at DELETE”

Or, as a second option, “UPDATE … RETURNING” and again wrap that in the INSERT to archive
(and now at least the docs that were copied are “marked” for deletion):

INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT _k, _v FROM 
(
    UPDATE `main` a
    SET `archive_candidate` = 1
    WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
    RETURNING meta(a).id _k, a _v
) 

that too fails the same with “syntax error - at UPDATE”.

For some reason, the “DELETE/UPDATE” statements with “RETURNING”, even tho they return a very real looking result sets, can’t be used in an inner SELECT statement (with or without the wrapping INSERT).

Which leaves us with the least favorable option - to do it in 3 steps:

  • PASS 1: Update candidates for deletion
  • PASS 2: “copy” candidates (INSERT with SELECT)
  • PASS 3: Delete those documents

Any way possible to make it work in 1 (or 2) steps?
Any reason for the inner UPDATE/DELETE not to be allowed?

thanks


#2

Hi @josh21, you need to use two statements. Here is the safest approach, which should also perform well.

UPSERT INTO dest_bucket ( ... )
SELECT ... FROM source_bucket ...;

DELETE FROM source_bucket
USE KEYS ( SELECT RAW META().id FROM dest_bucket ... );

#3

Hi @geraldss - thanks for the reply.

However, I have two comments:

  1. You didn’t explain why an UPDATE/DELETE with a “RETURNING” can’t be used as an inner query (although it seems to return a result-set).
  2. You option, although should work just fine - seems not very good for scale. As described in the first post, this is to implement an “Archive” bucket. Imagine we move documents every day that are older than X days. Say we move 50k documents every time. Now, think of the query you mentioned running after a year - in order to perform the DELETE part on a mere 50k document, your inner SELECT will scan AND return close to 20 million IDs. That sounds very inefficient. Am I missing something?

Thanks


#4

Any update on this? has something changed?


#5

If you want copy whole bucket you can use backup/restore with rename bucket during restore.
Otherwise you need to use options suggested above


#6

No, copy whole bucket is not interesting… you can also use XDCR for it.
What I’m (and I presume @josh21 was/is) looking for was a way to archive old(er) documents to a different bucket. So the “main” bucket would be fast, smaller, with multiple indexes and serve the day to day operation, while the “Archive” bucket will be growing and only be used if someone has to dive into old data.
so if I have a bucket that produces 50MIL documents a month, i can maintain in it a moving window of say, 60 days to work with.