N1QL cbq UPDATE of millions of records is not working

Couchbase Team,
The bucket user-bucket has 30 million+ documents in it. I need to update the domain of the emailId in each document from xxmail.com to yymail.com.
I created an index and executed the statements in cbq as below:

cbq> CREATE INDEX `USER_BKT-EMAIL-IDX` ON `user-bucket`(emailId) WHERE emailId LIKE "%xxmail.com";

cbq> SELECT COUNT(*) AS count FROM `user-bucket` WHERE emailId LIKE "%xxmail.com";
{
    "requestID": "64633c5c-4559-4335-b363-4376d11d4ee6",
    "signature": {
        "count": "number"
    },
    "results": [
        {
            "count": 3063621
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "11.009255457s",
        "executionTime": "11.009217044s",
        "resultCount": 1,
        "resultSize": 40
    }
}
cbq> UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com") where emailId LIKE "%xxmail.com";
cbq>

This UPDATE statement ran for quite a long time (15+ minutes) and returned without any output message.
Then I executed the count query again and found that only about 100k records had been updated.
Kindly help me get the update query to run across all 30 million+ records.

Thanks
Ismail

@ismail.iqbal.ap,

A mass update of millions of documents takes time because a N1QL UPDATE rewrites the full document and the data is distributed across the cluster.

I would recommend the solution from @jon.strabala described here: https://stackoverflow.com/questions/65312247/add-field-to-existing-documents-over-million-records/65365278#65365278

OR

Use a prefix search as described below:

CREATE INDEX ix1 ON `user-bucket` (emailId) WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");

UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com")
WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");

Use scan_consistency request_plus, and on a timeout or error simply retry, so that it continues from where it left off.
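
For example, a minimal sketch of that from the cbq shell (the \SET parameter syntax and the "30m" timeout value are assumptions, check them against your cbq version): set the consistency and a longer timeout once, then simply re-run the statement after any timeout or error.

cbq> \SET -scan_consistency "request_plus";
cbq> \SET -timeout "30m";
cbq> UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com") WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");

Because the WHERE clause only matches documents still on the old domain, each re-run picks up the remaining documents.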

I tried the prefix search (REVERSE) as you mentioned, and the update query again ran for quite a long time (15+ minutes) and returned without any output message. I found the error below in the Couchbase logs:

Service 'query' exited with status 134. Restarting. Messages:
os.(*File).read(0xc42000c050, 0xc4200f0000, 0x10000, 0x10000, 0xc4204f8ee8, 0x0, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/os/file_unix.go:165 +0x4f fp=0xc4204f8e60 sp=0xc4204f8e18
os.(*File).Read(0xc42000c050, 0xc4200f0000, 0x10000, 0x10000, 0x0, 0x0, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/os/file.go:101 +0x76 fp=0xc4204f8ec8 sp=0xc4204f8e60
bufio.(*Reader).Read(0xc420014840, 0xc4200f0000, 0x10000, 0x10000, 0x56, 0x0, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/bufio/bufio.go:199 +0x198 fp=0xc4204f8f58 sp=0xc4204f8ec8
main.(*childReader).loop.func1(0xc42000e540, 0xc4200f0000, 0x10000, 0x10000, 0xc4204b57a0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/ns_server/deps/gocode/src/goport/main.go:73 +0x50 fp=0xc4204f8fb8 sp=0xc4204f8f58
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4204f8fc0 sp=0xc4204f8fb8
created by main.(*childReader).loop
/home/couchbase/jenkins/workspace/couchbase-server-unix/ns_server/deps/gocode/src/goport/main.go:79 +0x11c

As @vsr1 indicated, Eventing can provide a high-performance solution.

In addition, I would encourage you to look at Function: fixEmailDomains | Couchbase Docs; it should be close to the exact Eventing code that you are looking for.

Hi,
As we are using Couchbase Server 5.1, I don't think fixEmailDomains is available.

Right you are, Eventing was introduced in version 5.5.

In the 6.5 release, the handler code can directly mutate (or write back to) the source bucket; prior to 6.5 you needed to use an intermediate bucket (or write the KEYs to an intermediate bucket and let Eventing use that as its source to update the true target).

I don't see the Eventing Service with Couchbase Server 5.1.

You can also update in chunks and repeat with request_plus:

    UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com")
    WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com") LIMIT 100000;
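
Each run reports how many documents it changed in the metrics section of the response, so you can keep repeating the statement until mutationCount drops to 0. Illustrative output only, not from a real run:

    "metrics": {
        "elapsedTime": "...",
        "executionTime": "...",
        "resultCount": 0,
        "resultSize": 0,
        "mutationCount": 100000
    }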

Hi,
I tried as you said, and the update time is huge. To update 10k records, it's taking more than 2 minutes. The indexes below were created on the bucket before executing the update script:

CREATE INDEX `USERID-IDX` ON `user-bucket`(emailId) WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");
CREATE INDEX `USERIDIQ-IDX` ON `user-bucket`(emailId) WHERE REVERSE(emailId) LIKE REVERSE("%yymail.com");

Post the explain plan.

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "index": "USERID-IDX",
        "index_id": "fa3a305f87cade40",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "user-bucket",
        "namespace": "default",
        "spans": [
          {
            "range": [
              {
                "inclusion": 0,
                "low": "null"
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Fetch",
                  "keyspace": "user-bucket",
                  "namespace": "default"
                },
                {
                  "#operator": "Filter",
                  "condition": "(reverse((`user-bucket`.`emailId`)) like reverse(\"%xxmail.com\"))"
                }
              ]
            }
          },
          {
            "#operator": "Limit",
            "expr": "10000"
          },
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Clone"
                },
                {
                  "#operator": "Set",
                  "set_terms": [
                    {
                      "path": "(`user-bucket`.`emailId`)",
                      "value": "replace((`user-bucket`.`emailId`), \"xxmail.com\", \"yymail.com\")"
                    }
                  ]
                },
                {
                  "#operator": "SendUpdate",
                  "alias": "user-bucket",
                  "keyspace": "user-bucket",
                  "limit": "10000",
                  "namespace": "default"
                }
              ]
            }
          }
        ]
      },
      {
        "#operator": "Discard"
      }
    ]
  },
  "text": "UPDATE `user-bucket` SET emailId = REPLACE(emailId, \"xxmail.com\", \"yymail.com\") WHERE REVERSE(emailId) LIKE  REVERSE(\"%xxmail.com\") LIMIT 10000"
}

Couchbase team,
Any findings on the explain plan?

The plan looks correct and it is using the right index.

Then why is it taking more than 2 minutes for 10k records? Any fixes, please?

A N1QL UPDATE is a full-document update.
With REQUEST_PLUS, it needs to wait for the index to catch up to the latest mutation.
Then it gets the keys, fetches each document, and updates them one by one.

Can you please advise me on a more performant approach to this?

You can try this.

Get the document keys using the following query.
Then use KV reactive/async calls with sub-document operations to get the emailId and mutate it. You can do that in parallel.

SELECT RAW META(d).id
FROM `user-bucket` AS d
WHERE REVERSE(d.emailId) LIKE  REVERSE("%xxmail.com") ;

The sub-document approach is helping me complete my task, but I am getting a DocumentTooDeepException for multiple documents:
com.couchbase.client.java.error.subdoc.DocumentTooDeepException: JSON is too deep in document D1D0AA2C42628D474D8EDF18012E5D1E

How do I handle such a case? Do I need to retry the operation again?

@ismail.iqbal.ap,

SDKs are out of my expertise.

I assume this is a one-time job, so skip those documents and continue. At the end, once the index has caught up, run the same query again. You will have a few skipped documents; retry them or handle them manually outside of this process.

cc @graham.pople, @david.nault

Here is my piece of code:

private static void createOrUpdateObjects(List<String> docIDs, AsyncBucket asyncBucket) {

    Observable
            .from(docIDs)
            .flatMap(docID -> {
                // Sub-document lookup of just the emailId field; errors are logged and the document is skipped
                DocumentFragment<Lookup> userResult = asyncBucket.lookupIn(docID)
                        .get("emailId")
                        .execute()
                        .retryWhen(RetryBuilder.anyOf(DocumentTooDeepException.class, BackpressureException.class)
                                .delay(Delay.exponential(TimeUnit.SECONDS, 3))
                                .max(RETRY_COUNT)
                                .build())
                        .doOnError(e -> System.out.println(docID + ": " + e.getMessage()))
                        .onErrorResumeNext(Observable.empty())
                        .toBlocking().singleOrDefault(null);

                // Lookup failed or retries were exhausted: skip this document so it can be handled later
                if (userResult == null) {
                    return Observable.empty();
                }

                String emailId = userResult.content(0).toString();
                // Literal replacement (replaceAll would treat the argument as a regex)
                emailId = emailId.replace("@xxmail.com", "@yymail.com");

                // Sub-document mutation writes back only the emailId field, not the whole document
                return asyncBucket.mutateIn(docID)
                        .replace("emailId", emailId)
                        .execute()
                        .retryWhen(RetryBuilder.anyOf(DocumentTooDeepException.class, BackpressureException.class)
                                .delay(Delay.exponential(TimeUnit.SECONDS, 3))
                                .max(RETRY_COUNT)
                                .build())
                        .doOnError(e -> System.out.println(docID + ": " + e.getMessage()))
                        .onErrorResumeNext(Observable.empty());
            })
            .toList()
            .toBlocking()
            .single();
}