Easier way to mass update with n1ql?


#1

Hi,

We currently have 30+ millions documents. I would like to delete one of the values in the documents.

This query works for a smaller set of documents.

update bucket unset myvalue

This obviously does not work for 30+ millions documents. I always get index timeout error and I think that is because couhcbase has to fetch documents.

Are there any other easier ways other than looping through 30+ millions documents and update them one by one?


#2

You can use the LIMIT clause with UPDATE, and then use that in a loop. The LIMIT allows you to work around the timeouts.

UPDATE mybucket
UNSET myvalue 
LIMIT 30000;

#3

It’s a little bit tricker than that I think.

If I simply unset and limit, then it is possible that it overwrites the same document over and over as there is no offset.


#4
CREATE INDEX my index ON mybucket(myvalue);

UPDATE mybucket
UNSET myvalue 
WHERE myvalue IS NOT MISSING
LIMIT 30000;

#5

Please disregard - DELETE


#6

I actually have where clause in my index definition.

I think I also have a different use case for this, where I’m creating an index with undefined value.

This is from my local test machine.

Documents

{
  "published_at": "value1",
  "form": "test"
}

{
  "published_at": "value2",
  "form": "test"
}

Index

create index tmp_idx2 on catalog(publishedAt, form) where form='test';

Explain

explain update catalog set publishedAt = published_at where publishedAt is missing and form='test';

Output

{
    "requestID": "4f74a77b-f3c7-4147-a10d-73778aff5863",
    "signature": "json",
    "results": [
        {
            "#operator": "Sequence",
            "~children": [
                {
                    "#operator": "PrimaryScan",
                    "index": "#primary",
                    "keyspace": "catalog",
                    "namespace": "default",
                    "using": "gsi"
                },
                {
                    "#operator": "Parallel",
                    "~child": {
                        "#operator": "Sequence",
                        "~children": [
                            {
                                "#operator": "Fetch",
                                "keyspace": "catalog",
                                "namespace": "default"
                            },
                            {
                                "#operator": "Filter",
                                "condition": "(((`catalog`.`publishedAt`) is missing) and ((`catalog`.`form`) = \"test\"))"
                            },
                            {
                                "#operator": "Clone"
                            },
                            {
                                "#operator": "Set",
                                "set_terms": [
                                    {
                                        "path": "(`catalog`.`publishedAt`)",
                                        "value": "(`catalog`.`published_at`)"
                                    }
                                ]
                            },
                            {
                                "#operator": "SendUpdate",
                                "alias": "catalog",
                                "keyspace": "catalog",
                                "limit": null,
                                "namespace": "default"
                            }
                        ]
                    }
                },
                {
                    "#operator": "Discard"
                }
            ]
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "10.611063ms",
        "executionTime": "10.5791ms",
        "resultCount": 1,
        "resultSize": 1998
    }
}

Using “use index” does not help in this case.

Do I have to take out where clause?


#7

With 4.1, you need to remove the WHERE clause.

4.5 will work much better for your needs. I understand the challenges of using EE or building from source.


#8

Thank you for the confirmation.

It’s not really challenges of using EE or building from source. I asked a question on 4.5 Beta thread if it is safe to use on production and I was told it is not (at least that was my impression).

As this can be done with view, I will move on with that solution for now.

I hope to see stable version of 4.5 soon :slight_smile: