Monitoring persistence of a batch of upserts

timeout

#1

Our application is written in Node.js and sends batches of 40,000 documents to Couchbase Server 5.5.2 using SDK methods (bucket.upsert, no N1QL). We are currently using a single-node cluster. The application cannot tolerate any data loss, so it asks the Couchbase server for persistence ("PersistTo = 1"). Infrequently, the application receives an error like this:

{ CouchbaseError: Durability requirements failed
    at endureError (/home/ubuntu/XXXXX/node_modules/couchbase/lib/bucket.js:1456:19)
    at /home/ubuntu/XXXXX/node_modules/couchbase/lib/bucket.js:1490:24
  message: 'Durability requirements failed',
  code: undefined,
  innerError:
   { CouchbaseError: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout
     message:
      'Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout',
     code: 23 } }
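To make errors of this shape easier to count in logs, they can be sorted into buckets by looking at the nested innerError. This is only a minimal sketch: the helper name `classifyUpsertError` and the labels it returns are made up for illustration, and the only value taken from the dump above is the code 23 on the inner timeout error.

```javascript
// Timeout code as it appears in the innerError of the dump above.
const TIMEOUT_CODE = 23;

// Hypothetical helper (not part of the SDK): labels an error passed to an
// upsert callback so that timeouts can be counted separately in logs.
function classifyUpsertError(err) {
  if (!err) {
    return 'ok';
  }
  const inner = err.innerError;
  if (inner && inner.code === TIMEOUT_CODE) {
    // Outer "Durability requirements failed" wrapping a client-side timeout.
    return 'durability-timeout';
  }
  if (err.code === TIMEOUT_CODE) {
    // Assumption: a plain operation timeout carries the same code at top level.
    return 'operation-timeout';
  }
  return 'other';
}
```

Logging this label together with the document key and a timestamp should show whether the timeouts cluster in time or hit the same keys repeatedly.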

So far, we have not been able to find the cause of this error. We don't see any performance bottlenecks in the system, and the timeout is set to 5 seconds, which seems generous. It usually takes hours to reproduce the problem. Both OS and Couchbase monitoring show no apparent bottlenecks; in fact, the system is far from busy.

As we understand it, the Couchbase server writes the whole batch sent from the application to memory and then starts polling (?) to check whether the requested durability has been achieved across all vBuckets. However, we don't understand how to monitor this process or why we sometimes get timeouts. Is it possible to find out which vBuckets have not yet reached the requested durability, and to estimate how much time or I/O is required to get there?
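To make the question concrete, here is a rough model of a poll-until-persisted loop with a deadline. It is a sketch under stated assumptions, not the actual SDK or server logic: `checkPersisted` is a hypothetical stand-in for whatever round trip reports that a document has reached disk, and the function name, option names and defaults are invented for illustration.

```javascript
// Toy model of durability polling. checkPersisted is a hypothetical async
// function that resolves to true once the document has been persisted.
async function waitForPersistence(checkPersisted, options = {}) {
  const timeoutMs = options.timeoutMs === undefined ? 5000 : options.timeoutMs;
  const intervalMs = options.intervalMs === undefined ? 100 : options.intervalMs;
  const started = Date.now();
  let polls = 0;

  while (true) {
    polls += 1;
    if (await checkPersisted()) {
      return { persisted: true, polls, elapsedMs: Date.now() - started };
    }
    // Give up if the next poll would start after the deadline.
    if (Date.now() - started + intervalMs > timeoutMs) {
      return { persisted: false, polls, elapsedMs: Date.now() - started };
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In this model a timeout only means that no poll saw the document on disk before the deadline. The write may still complete later, which is why we would like to see the actual persistence latency rather than just pass or fail.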

We understand that using multiple nodes (3+) in the cluster, with replication between the nodes, may solve the problem. However, we'd still like to understand what's going on with persistence. Has anybody else run into this kind of problem?

Thanks in advance


#2

Hey @LeonidGvirtz,

This doesn't appear to be directly related to one of the SDKs. It looks more like some form of delay in replication, which is causing your durability requirements to fail to be satisfied in the time you have provided. A number of things can cause this, and the information available in the cluster UI (the replication log and so on) can usually help pinpoint the cause. I have moved this to the Couchbase Server forum and pinged the server team; I suspect they can provide more detailed information on how to diagnose intermittently slow replication.

Cheers, Brett


#3

An update: we now use a 3-node cluster with 1 replica for each bucket, and setting "PersistTo = 1" still causes timeouts, sometimes quite quickly, without waiting for hours. I wonder whether it is possible to find out how long persistence takes for a batch submitted from the SDK. Actually, I don't even know how long a batch takes on average. It would also be helpful to know what the server spends its time on while writing all the documents in a batch to disk.
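On the client side, at least, the batch time can be measured by timing each upsert from call to callback. The sketch below assumes only that `bucket.upsert(key, doc, options, callback)` has the callback-style shape we already use; `timedUpsert`, `summarize` and `timedBatch` are hypothetical helper names, and the option name `persist_to` in the usage comment is how I believe the 2.x Node.js SDK spells it, so please verify it against your SDK version.

```javascript
// Times one callback-style upsert. Never rejects, so a single failed
// document does not hide the timings of the rest of the batch.
function timedUpsert(bucket, key, doc, options) {
  return new Promise((resolve) => {
    const start = process.hrtime();
    bucket.upsert(key, doc, options, (err) => {
      const [sec, nanos] = process.hrtime(start);
      resolve({ key, ms: sec * 1e3 + nanos / 1e6, err: err || null });
    });
  });
}

// Pure summary over per-document results: count, failures, max and p99.
function summarize(results) {
  const sorted = results.map((r) => r.ms).sort((a, b) => a - b);
  const last = sorted.length - 1;
  const p99Index = Math.min(last, Math.ceil(0.99 * sorted.length) - 1);
  return {
    count: results.length,
    failed: results.filter((r) => r.err).length,
    maxMs: sorted.length ? sorted[last] : 0,
    p99Ms: sorted.length ? sorted[p99Index] : 0,
  };
}

// docs is a plain object mapping document key to document body.
async function timedBatch(bucket, docs, options) {
  const start = Date.now();
  const results = await Promise.all(
    Object.keys(docs).map((key) => timedUpsert(bucket, key, docs[key], options))
  );
  return Object.assign(summarize(results), { batchMs: Date.now() - start });
}

// Usage (assumed option name): timedBatch(bucket, docs, { persist_to: 1 })
//   .then((stats) => console.log(stats));
```

This only shows latency as seen from the application, so it includes network and client queueing time. It does not answer what the server is doing during that time, which is still the open question.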