N1QL cbq UPDATE of millions of records is not working

Couchbase Team,
The bucket user-bucket has 30 million+ documents in it. I need to update the domain of the emailId in each document from xxmail.com to yymail.com.
I created an index and executed the statements in cbq as below:

cbq> CREATE INDEX `USER_BKT-EMAIL-IDX` ON `user-bucket`(emailId) WHERE emailId LIKE "%xxmail.com";

cbq> SELECT COUNT(*) AS count FROM `user-bucket` WHERE emailId LIKE "%xxmail.com";
{
    "requestID": "64633c5c-4559-4335-b363-4376d11d4ee6",
    "signature": {
        "count": "number"
    },
    "results": [
        {
            "count": 3063621
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "11.009255457s",
        "executionTime": "11.009217044s",
        "resultCount": 1,
        "resultSize": 40
    }
}
cbq> UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com") where emailId LIKE "%xxmail.com";
cbq>

This UPDATE statement ran for quite a long time (15+ minutes) and returned without any output message.
Then I executed the count query again and found that only about 100k records had been updated.
Kindly help me get the update query to run across all 30 million+ records.

Thanks
Ismail

@ismail.iqbal.ap,

A mass update of millions of documents takes time because a N1QL UPDATE rewrites the full document and the data is distributed across the cluster.

I would recommend the solution from @jon.strabala described here: https://stackoverflow.com/questions/65312247/add-field-to-existing-documents-over-million-records/65365278#65365278

OR

Use a prefix search as described below:

CREATE INDEX ix1 ON `user-bucket` (emailId) WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");

UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com")
WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");

Use scan_consistency request_plus, and on a timeout or error simply retry, so that it continues from where it left off.
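
For example, a minimal sketch of that from the cbq shell (the \SET parameter syntax and the "30m" timeout value are assumptions, check them against your cbq version): set the consistency and a longer timeout once, then simply re-run the statement after any timeout or error.

cbq> \SET -scan_consistency "request_plus";
cbq> \SET -timeout "30m";
cbq> UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com") WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");

Because the WHERE clause only matches documents still on the old domain, each re-run picks up the remaining documents.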

I tried the prefix search (REVERSE) as you mentioned, and the update query again ran for quite a long time (15+ minutes) and returned without any output message. I found the error below in the Couchbase logs:

Service 'query' exited with status 134. Restarting. Messages:
os.(*File).read(0xc42000c050, 0xc4200f0000, 0x10000, 0x10000, 0xc4204f8ee8, 0x0, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/os/file_unix.go:165 +0x4f fp=0xc4204f8e60 sp=0xc4204f8e18
os.(*File).Read(0xc42000c050, 0xc4200f0000, 0x10000, 0x10000, 0x0, 0x0, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/os/file.go:101 +0x76 fp=0xc4204f8ec8 sp=0xc4204f8e60
bufio.(*Reader).Read(0xc420014840, 0xc4200f0000, 0x10000, 0x10000, 0x56, 0x0, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/bufio/bufio.go:199 +0x198 fp=0xc4204f8f58 sp=0xc4204f8ec8
main.(*childReader).loop.func1(0xc42000e540, 0xc4200f0000, 0x10000, 0x10000, 0xc4204b57a0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/ns_server/deps/gocode/src/goport/main.go:73 +0x50 fp=0xc4204f8fb8 sp=0xc4204f8f58
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.1/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4204f8fc0 sp=0xc4204f8fb8
created by main.(*childReader).loop
/home/couchbase/jenkins/workspace/couchbase-server-unix/ns_server/deps/gocode/src/goport/main.go:79 +0x11c

As @vsr1 indicated, Eventing can provide a high-performance solution.

In addition, I would encourage you to look at Function: fixEmailDomains | Couchbase Docs; it should be close to the exact Eventing code that you are looking for.

Hi,
As we are using Couchbase Server 5.1, I don't think fixEmailDomains is available.

Right you are, Eventing was introduced in version 5.5.

In the 6.5 release, the handler code can directly mutate (or write back to) the source bucket; prior to 6.5 you needed to use an intermediate bucket (or write the KEYs to an intermediate bucket and let Eventing use that as its source to update the true target).

I don't see the Eventing Service with Couchbase Server 5.1.

You can also update in chunks and repeat with request_plus:

    UPDATE `user-bucket` SET emailId = REPLACE(emailId, "xxmail.com", "yymail.com")
    WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com") LIMIT 100000;
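
Each run reports how many documents it changed in the metrics section of the response, so you can keep repeating the statement until mutationCount drops to 0. Illustrative output only, not from a real run:

    "metrics": {
        "elapsedTime": "...",
        "executionTime": "...",
        "resultCount": 0,
        "resultSize": 0,
        "mutationCount": 100000
    }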

Hi,
I tried as you said, and the update time is huge. To update 10k records, it's taking more than 2 minutes. The indexes below were created on the bucket before executing the update script:

CREATE INDEX `USERID-IDX` ON `user-bucket`(emailId) WHERE REVERSE(emailId) LIKE REVERSE("%xxmail.com");
CREATE INDEX `USERIDIQ-IDX` ON `user-bucket`(emailId) WHERE REVERSE(emailId) LIKE REVERSE("%yymail.com");

Post the explain plan.

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan2",
        "index": "USERID-IDX",
        "index_id": "fa3a305f87cade40",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "user-bucket",
        "namespace": "default",
        "spans": [
          {
            "range": [
              {
                "inclusion": 0,
                "low": "null"
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Fetch",
                  "keyspace": "user-bucket",
                  "namespace": "default"
                },
                {
                  "#operator": "Filter",
                  "condition": "(reverse((`user-bucket`.`emailId`)) like reverse(\"%xxmail.com\"))"
                }
              ]
            }
          },
          {
            "#operator": "Limit",
            "expr": "10000"
          },
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Clone"
                },
                {
                  "#operator": "Set",
                  "set_terms": [
                    {
                      "path": "(`user-bucket`.`emailId`)",
                      "value": "replace((`user-bucket`.`emailId`), \"xxmail.com\", \"yymail.com\")"
                    }
                  ]
                },
                {
                  "#operator": "SendUpdate",
                  "alias": "user-bucket",
                  "keyspace": "user-bucket",
                  "limit": "10000",
                  "namespace": "default"
                }
              ]
            }
          }
        ]
      },
      {
        "#operator": "Discard"
      }
    ]
  },
  "text": "UPDATE `user-bucket` SET emailId = REPLACE(emailId, \"xxmail.com\", \"yymail.com\") WHERE REVERSE(emailId) LIKE  REVERSE(\"%xxmail.com\") LIMIT 10000"
}

Couchbase team,
Any findings on the explain plan?

The plan looks correct and it is using the right index.

Then why is it taking more than 2 minutes for 10k records? Any fixes, please?

A N1QL UPDATE is a full-document update.
With REQUEST_PLUS, it needs to wait for the index to catch up to the latest mutation.
Then it gets the keys, fetches each document, and updates them one by one.

Can you please advise me on a more performant approach to this?

You can try this.

Get the document keys using the following query.
Then use KV reactive/async calls with sub-document operations to get the emailId and mutate it. You can do that in parallel.

SELECT RAW META(d).id
FROM `user-bucket` AS d
WHERE REVERSE(d.emailId) LIKE  REVERSE("%xxmail.com") ;

The sub-document approach is helping me complete my task, but I am getting a DocumentTooDeepException for multiple documents:
com.couchbase.client.java.error.subdoc.DocumentTooDeepException: JSON is too deep in document D1D0AA2C42628D474D8EDF18012E5D1E

How do I handle such a case? Do I need to retry the operation again?

@ismail.iqbal.ap,

SDKs are out of my expertise.

I assume this is a one-time job, so skip those documents and continue. At the end, once the index has caught up, run the same query again. You will have a few skipped documents; retry them or handle them manually outside of this process.

cc @graham.pople, @david.nault

Here is my piece of code:

private static void createOrUpdateObjects(List<String> docIDs, AsyncBucket asyncBucket) {

    Observable
            .from(docIDs)
            .flatMap(docID -> {
                // Sub-document lookup of just the emailId field; errors are logged and the document is skipped
                DocumentFragment<Lookup> userResult = asyncBucket.lookupIn(docID)
                        .get("emailId")
                        .execute()
                        .retryWhen(RetryBuilder.anyOf(DocumentTooDeepException.class, BackpressureException.class)
                                .delay(Delay.exponential(TimeUnit.SECONDS, 3))
                                .max(RETRY_COUNT)
                                .build())
                        .doOnError(e -> System.out.println(docID + ": " + e.getMessage()))
                        .onErrorResumeNext(Observable.empty())
                        .toBlocking().singleOrDefault(null);

                // Lookup failed or retries were exhausted: skip this document so it can be handled later
                if (userResult == null) {
                    return Observable.empty();
                }

                String emailId = userResult.content(0).toString();
                // Literal replacement (replaceAll would treat the argument as a regex)
                emailId = emailId.replace("@xxmail.com", "@yymail.com");

                // Sub-document mutation writes back only the emailId field, not the whole document
                return asyncBucket.mutateIn(docID)
                        .replace("emailId", emailId)
                        .execute()
                        .retryWhen(RetryBuilder.anyOf(DocumentTooDeepException.class, BackpressureException.class)
                                .delay(Delay.exponential(TimeUnit.SECONDS, 3))
                                .max(RETRY_COUNT)
                                .build())
                        .doOnError(e -> System.out.println(docID + ": " + e.getMessage()))
                        .onErrorResumeNext(Observable.empty());
            })
            .toList()
            .toBlocking()
            .single();
}