Slow operations and timeouts

Server Build: 7.0.2 build 6703

Java SDK: 3.2.6

Scenario:

  • 3-node cluster, replicas=2, durability=majority
  • Cluster quota: 24 GB RAM, 8 CPUs per VM
  • 2 collections, ~100M documents
  • kvTimeout=10 seconds
  • Replace or insert all documents sequentially, one by one, at ~1,000 transactions per second (see the sketch after this list)
  • On each VM: average %idle 35-40%, average %memused 65-75%
  • Bucket resident ratio: ~40%
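For context, the write path looks roughly like the sketch below (Java SDK 3.2). The connection string, credentials, and the "lastSeen" sub-document path are placeholders rather than the actual code; bucket, scope, and collection names match the logs further down.

```java
import java.time.Duration;
import java.util.Collections;

import com.couchbase.client.core.env.TimeoutConfig;
import com.couchbase.client.core.msg.kv.DurabilityLevel;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.ClusterOptions;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.env.ClusterEnvironment;
import com.couchbase.client.java.kv.MutateInSpec;
import com.couchbase.client.java.kv.StoreSemantics;

import static com.couchbase.client.java.kv.MutateInOptions.mutateInOptions;

public class ProfileWriter {

    public static void main(String[] args) {
        // kvTimeout raised to 10 s (SDK default is 2.5 s); kvDurableTimeout governs
        // durable writes and also defaults to 10 s.
        ClusterEnvironment env = ClusterEnvironment.builder()
                .timeoutConfig(TimeoutConfig
                        .kvTimeout(Duration.ofSeconds(10))
                        .kvDurableTimeout(Duration.ofSeconds(10)))
                .build();

        // Connection string and credentials are placeholders.
        Cluster cluster = Cluster.connect("couchbase://cb-node1",
                ClusterOptions.clusterOptions("Administrator", "password").environment(env));

        // Bucket/scope/collection names as used in this scenario.
        Collection userProfiles = cluster.bucket("profiles")
                .scope("data")
                .collection("user_profile");

        // One sub-document mutation per profile; StoreSemantics.UPSERT covers the
        // "replace or insert" case and durability(MAJORITY) matches the bucket policy.
        // The document id is a sample key; the "lastSeen" path is illustrative.
        userProfiles.mutateIn("_sXK-QJ8iJSOUq5aiW1sKQ",
                Collections.singletonList(MutateInSpec.upsert("lastSeen", System.currentTimeMillis())),
                mutateInOptions()
                        .durability(DurabilityLevel.MAJORITY)
                        .storeSemantics(StoreSemantics.UPSERT));

        cluster.disconnect();
        env.shutdown();
    }
}
```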

Observation:

For a few seconds, the following exception is thrown multiple times:

May 6 04:49:58 XXXza-profiling za-profiling-profiles[-]: log:com.couchbase.client.core.error.AmbiguousTimeoutException: SubdocMutateRequest, Reason: TIMEOUT {"cancelled":true,"clientContext":{"isIdempotent":true},"completed":true,"coreId":"0x3eb621a800000004","idempotent":false,"lastChannelId":"3EB621A800000004/00000000B45256BE","lastDispatchedFrom":"10.233.90.21:45798","lastDispatchedTo":"lhvmsrv3196-lb.lh.upstreamsystems.com:11210","reason":"TIMEOUT","requestId":76219536,"requestType":"SubdocMutateRequest","retried":0,"service":{"bucket":"profiles","collection":"user_profile","documentId":"_sXK-QJ8iJSOUq5aiW1sKQ","opaque":"0x48aabb2","scope":"data","type":"kv","vbucket":624},"timeoutMs":10000,"timings":{"encodingMicros":33,"totalMicros":10006238}}

So, mutations were canceled after the configured timeout of 10 seconds without any retry. All timeouts originated from the same VM.
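The only recovery path that seems available is retrying at the application level once an AmbiguousTimeoutException surfaces, and that is only safe when the mutation is idempotent (with an ambiguous timeout the write may or may not have been applied). A minimal sketch of that idea; class and method names are illustrative, not production code:

```java
import java.util.List;

import com.couchbase.client.core.error.AmbiguousTimeoutException;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.kv.MutateInSpec;

public final class RetryingWriter {

    private static final int MAX_ATTEMPTS = 3;

    /**
     * Re-issues the sub-document mutation after an AmbiguousTimeoutException.
     * This is only safe because the specs are idempotent upserts, so applying
     * them a second time leaves the document in the same state.
     */
    public static void mutateWithRetry(Collection collection, String id, List<MutateInSpec> specs)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                collection.mutateIn(id, specs);
                return;
            } catch (AmbiguousTimeoutException e) {
                if (attempt >= MAX_ATTEMPTS) {
                    throw e; // give up and surface the timeout
                }
                Thread.sleep(1000L * attempt); // simple linear backoff before retrying
            }
        }
    }
}
```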

At the same time, Couchbase Server prints the following:

memcached.log

2022-05-06T04:49:51.220646+00:00 WARNING 181: Slow operation: {"bucket":"profiles","cid":"3EB621A800000004/00000000B45256BE/48aabb2","command":"SUBDOC_MULTI_MUTATION","duration":"13 s","packet":{"bodylen":514,"cas":0,"datatype":"raw","extlen":0,"key":"._sXK-QJ8iJSOUq5aiW1sKQ","keylen":23,"magic":"ClientRequest","opaque":2997586436,"opcode":"SUBDOC_MULTI_MUTATION","vbucket":624},"peer":"{"ip":"10.168.104.52","port":20810}","trace":"request=2124376153275375:12597635 execute=2124376153275375:30 execute=2124376172404327:4026 execute=2124388750889725:21","worker_tid":140409586050816}
2022-05-06T04:49:51.225152+00:00 WARNING 181: Slow operation: {"bucket":"profiles","cid":"3EB621A800000004/00000000B45256BE/48ab206","command":"GET_CLUSTER_CONFIG","duration":"10 s","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"key":"","keylen":0,"magic":"ClientRequest","opaque":112364036,"opcode":"GET_CLUSTER_CONFIG","vbucket":0},"peer":"{"ip":"10.168.104.52","port":20810}","response":"Success","trace":"request=2124378363635916:10396399 execute=2124388760002294:32","worker_tid":140409586050816}
2022-05-06T04:49:51.316574+00:00 INFO (profiles) Compaction of vb:980 done (ok). purged tombstones:0, prepares:3310, prepareBytes:2208372 collection_items_erased:alive:0,deleted:0, size/items/tombstones/purge_seqno pre{69107811, 100594, 0, 0}, post{38113379, 97284, 0, 0}

Is this the expected behavior? Could you please elaborate on how it could be solved? Do we need more CPU or RAM?

Other memcached.log entries during timeouts:

2022-05-06T01:51:16.912275+00:00 WARNING 278: Slow operation: {"bucket":"profiles","cid":"922D88E200000004/00000000CC379779/10d0826","command":"SUBDOC_MULTI_MUTATION","duration":"13 s","packet":{"bodylen":447,"cas":0,"datatype":"raw","extlen":0,"key":".NSAuHgG_mOysWsdmZtGFXQ","keylen":23,"magic":"ClientRequest","opaque":638061825,"opcode":"SUBDOC_MULTI_MUTATION","vbucket":618},"peer":"{"ip":"10.168.104.34","port":23027}","trace":"request=2113661864500353:12579017 execute=2113661864500353:29 execute=2113661869268788:3369 execute=2113674443490934:26","worker_tid":140409681430272}
2022-05-06T01:51:16.915075+00:00 WARNING 278: Slow operation: {"bucket":"profiles","cid":"922D88E200000004/00000000CC379779/10d084f","command":"GET_CLUSTER_CONFIG","duration":"12 s","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"key":"","keylen":0,"magic":"ClientRequest","opaque":1325927681,"opcode":"GET_CLUSTER_CONFIG","vbucket":0},"peer":"{"ip":"10.168.104.34","port":23027}","response":"Success","trace":"request=2113662086593708:12363373 execute=2113674449949489:18","worker_tid":140409681430272}
2022-05-06T02:46:48.799027+00:00 WARNING 206: Slow operation: {"bucket":"profiles","cid":"DD2D9FC700000004/000000006EF5156E/4300fee","command":"SUBDOC_MULTI_MUTATION","duration":"1742 ms","packet":{"bodylen":581,"cas":0,"datatype":"raw","extlen":0,"key":".OLb7xePfXaZ_xhrjaMg1oQ","keylen":23,"magic":"ClientRequest","opaque":3993972740,"opcode":"SUBDOC_MULTI_MUTATION","vbucket":507},"peer":"{"ip":"10.168.104.37","port":21623}","trace":"request=2117004586707730:1741527 execute=2117004586707730:274 execute=2117004595922489:1900 execute=2117006328214807:20","worker_tid":140409681430272}
2022-05-06T02:46:48.800708+00:00 WARNING 206: Slow operation: {"bucket":"profiles","cid":"DD2D9FC700000004/000000006EF5156E/4301216","command":"GET_CLUSTER_CONFIG","duration":"970 ms","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"key":"","keylen":0,"magic":"ClientRequest","opaque":370290692,"opcode":"GET_CLUSTER_CONFIG","vbucket":0},"peer":"{"ip":"10.168.104.37","port":21623}","response":"Success","trace":"request=2117005365845751:969758 execute=2117006335587993:16","worker_tid":140409681430272}
2022-05-06T02:47:34.723142+00:00 WARNING (No Engine) Slow runtime for 'Item pager on vb:394' on thread NonIoPool0: 8144 ms

Hi @Antonis

I'm pinging someone from the engineering team to see if I can get an answer back for you on this. The error is stating that you are getting a timeout. A few more questions:

I know this is an obvious question, but you have confirmed that your nodes can talk to each other, correct?

Is this environment in containers, VMs, or just stand-alone servers?

What platform and OS are the servers using?

Thanks
Aaron

@Antonis, did anyone from your company open a support case on this?

Thanks
Aaron

Hi @biozal,

  1. Looking at curl -u '<>:<>' http://server:8091/logs, no communication_issue entries were reported.
  2. The environment runs in VMs.
  3. The servers run on Ubuntu 20.04.

FYI, there is an open JIRA ticket here: https://issues.couchbase.com/browse/MB-52065


I recommend the support team continue to work on this via the ticket, then. They have WAY more resources than the people who monitor the forums.

-Aaron