Indexes suddenly getting hundreds of thousands of mutations and increasing

I am using Couchbase Server Community Edition 6.6.0 build 7909 on an Ubuntu server. A few times in the past month, the indexes on my server have suddenly started showing thousands of mutations remaining, and this increases into the millions.

The indexes are showing as ‘ready’ and are queryable.

This does eventually go back to normal, but while all the mutations are being processed, any new documents aren’t returned when the index is queried.

The Sync Gateway requests time out while there are loads of mutations remaining, and once it comes back up the changes feed starts processing a load of objects that were deleted previously.

We previously had 4 servers: 3 were on Docker and 1 wasn’t. The one that wasn’t also had all the indexes on it, as the Docker servers had only been added recently. I ran a failover on that server and spread the indexes out across the other servers and thought that was it, but it happened again today on the Docker servers.

I’ve attached the logs from the non-Docker server; it happened on the 17th at about 10:20.
CBServerLogs.zip (3.1 MB)

The screenshot is from today on the Docker server, but a similar thing happened on the other server.

I did have the ejection mode set to full and changed this back to value-only today. I also reduced the memory quota for the indexes, as the server was using ~21GB of ~23GB, which doesn’t help.

@richard ,

There can be multiple reasons for seeing a large number of mutations remaining on the indexes page. One could be a rollback on the indexer side; I do not see any rollback-related messages in the logs you shared. Another could be documents being pumped/updated into the bucket. When you see the increase in “mutations remaining”, are there any documents being written into the “main” bucket?
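One quick way to check is to watch the bucket’s write rate while the mutations count is climbing. Below is a rough Python sketch against the cluster REST stats endpoint - the host, credentials and the “main” bucket name are placeholders for your setup, and the exact stat names can vary slightly by version:

import requests
import time

HOST = "http://localhost:8091"           # cluster admin endpoint (placeholder)
AUTH = ("Administrator", "password")     # replace with your credentials

def write_rates(bucket="main"):
    # Latest per-second samples of write and delete rates for the bucket.
    stats = requests.get(f"{HOST}/pools/default/buckets/{bucket}/stats", auth=AUTH).json()
    samples = stats["op"]["samples"]
    return samples["cmd_set"][-1], samples["delete_hits"][-1]

while True:
    sets, deletes = write_rates()
    print(f"sets/sec: {sets:.1f}  deletes/sec: {deletes:.1f}")
    time.sleep(10)

If those rates stay near zero while “mutations remaining” keeps climbing, then the growth is not coming from new writes to the bucket.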

Also, 1,426,675 mutations remaining does not necessarily mean 1,426,675 documents are to be processed, as multiple updates to the same document will be de-duplicated before being sent. This number is just an upper limit on how many mutations may have to be processed. In practice, the actual number of documents being processed could be a lot less than this.

Also, the resident percent of all the indexes seems to be “0”. This means that the majority of the index data is on disk. Any update, read or write of index data will require a disk access, and this can slow down the system. I see that you have reduced the memory quota for indexes - this does not help at all. I think you should consider increasing the memory quota for the indexes.
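For reference, the index memory quota can be raised either from the Settings page in the UI or via the REST API. A minimal sketch of the REST route, assuming the default admin port and an example quota of 8192 MB (size it to what your nodes can actually spare):

import requests

HOST = "http://localhost:8091"           # cluster admin endpoint (placeholder)
AUTH = ("Administrator", "password")     # replace with your credentials

# indexMemoryQuota is specified in MB; 8192 here is only an example value.
resp = requests.post(f"{HOST}/pools/default",
                     auth=AUTH,
                     data={"indexMemoryQuota": 8192})
print(resp.status_code, resp.text)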

Thanks,
Varun

Thanks for checking the logs, I wasn’t sure if the error had anything to do with it.

If you think something is sending lots of changes, do you know anything about this warning from the Sync Gateway?
There’s a lot of it from when it happened on the 24th, just before I got notified that our API had lost connection to the Sync Gateway (at 10:40).

I have previously turned the Sync Gateway off once I was notified of the issue and the mutations were still increasing, but I guess it is possible the Sync Gateway had already sent the changes and the number increased as they were being processed.

2021-08-24T10:26:25.420+01:00 [WRN] Null doc body/rawBody FDB920E0-7968-11EB-8B4C-917618DB0436/ from – db.(*Document).BodyBytes() at document.go:306

sg_info-2021-08-24T09-28-21.183.log.zip (2.3 MB)

With the resident ratio at 0%, I think it may just be this issue: https://issues.couchbase.com/browse/MB-44400
I realise that issue is against version 7, but it seems to be the same?

The resident ratio is normally between 97-100%.

@richard Thanks for pointing out the MB. It totally skipped my mind.

@adamf Can you please look at the Sync Gateway message mentioned by @richard?

That log warning indicates that no body was found for the document listed. I haven’t seen this message (with this frequency) in the past. This is generally unexpected, unless document bodies are intentionally being set to null.

I think these are documents that have been deleted previously and have already gone through the sync gateway, but maybe another device that hasn’t synced in a while is processing all these deletes again and is pushing them back up again?

When I query the Sync Gateway for the ID I get the response: {"error":"not_found","reason":"deleted"}
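Something like this, against the Sync Gateway admin REST API (the hostname, port and database name here are placeholders; the document ID is just the one from the warning above):

import requests

SG_ADMIN = "http://localhost:4985"   # Sync Gateway admin port (placeholder)
DB = "db"                            # placeholder database name
DOC_ID = "FDB920E0-7968-11EB-8B4C-917618DB0436"   # ID taken from the warning above

resp = requests.get(f"{SG_ADMIN}/{DB}/{DOC_ID}")
print(resp.status_code, resp.json())
# For a tombstoned (deleted) document this prints:
# 404 {'error': 'not_found', 'reason': 'deleted'}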

We did delete a few million documents a few months ago, which is why there are so many.

It did also occur in our test environment when a device that hadn’t connected for a while started syncing.

It’s happening now on our live environment. I’ve checked the users who have logged into the API and they should all be up to date though :frowning:

@adamf I am not sure if this is expected behaviour with the Sync Gateway. Can you please share your thoughts?

@richard “Mutations remaining” can also go high if there are new documents pushed into the cluster. Are you sure there are no new documents in the cluster?

A client isn’t going to push back deleted/purged documents unless that client’s checkpoint has also been deleted. This might happen for a client that has been disconnected for months - the default expiry for client checkpoints is 90 days. Generally you’ll want to ensure that tombstones are being expired on the client as well as the server.
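On the server side, tombstone lifetime is governed by the bucket’s metadata purge interval. As a rough sketch of adjusting it via the REST API - the bucket name and the 3-day value here are placeholders; you generally want it comfortably longer than the longest period your clients stay offline, otherwise deletes may never reach them:

import requests

HOST = "http://localhost:8091"           # cluster admin endpoint (placeholder)
AUTH = ("Administrator", "password")     # replace with your credentials
BUCKET = "main"                          # placeholder bucket name

# purgeInterval is the metadata purge interval in days; tombstones older than
# this become eligible for removal during compaction.
resp = requests.post(f"{HOST}/pools/default/buckets/{BUCKET}",
                     auth=AUTH,
                     data={"purgeInterval": 3})
print(resp.status_code, resp.text)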