Timeouts when using replicate_to=1 in delete() calls


#1

Hi,

I am running into timeouts when using replicate_to=1 in delete() calls.
Here is the simple Python code snippet that I use to reproduce the error:

from couchbase.bucket import Bucket

def raise_timeout():
    cn = Bucket(connection_string='couchbase://192.168.0.3,192.168.0.4,192.168.0.5/my_bucket?operation_timeout=60')
    key = 'bar'
    value = 'baz'
    cn.set(key=key, value=value, cas=0, replicate_to=1)
    # here couchbase.exceptions._TimeoutError_0x17 is raised
    cn.delete(key=key, replicate_to=1)

if __name__ == '__main__':
    raise_timeout()   

The raised error is:

Traceback (most recent call last):
  File "tasks/couchbase_timeout_test_script.py", line 58, in <module>
    produce_timeout()
  File "tasks/couchbase_timeout_test_script.py", line 55, in produce_timeout
    cn.delete(key=key, replicate_to=1)
  File "<..>/lib/python3.4/site-packages/couchbase/bucket.py", line 1317, in _tmpmeth
    return _dst(self, *args, **kwargs)
  File "<..>/lib/python3.4/site-packages/couchbase/bucket.py", line 661, in remove
    persist_to=persist_to, replicate_to=replicate_to)
couchbase.exceptions._TimeoutError_0x17 (generated, catch TimeoutError): <Key='bar', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=1, C Source=(src/multiresult.c,309)>

The couchbase cluster consists of three Couchbase instances, with one bucket ‘my_bucket’, configured with number of replicas = 1.
When using replicate_to=0 in the delete method, everyting works fine. Using replicate_to=1 results in a timeout.
By the way, the parameter ‘operation_timeout’ is apparently ignored. The script exits after ~6 seconds, regardless if I set operation_timeout to 60, 6000 or 4.

Why does the setting replicate_to=1 in the delete call result in a timeout?

The couchbase servers are running on Ubuntu:

# apt-show-versions libcouchbase-dev
libcouchbase-dev:amd64/trusty 2.5.0-1 uptodate
# apt-show-versions couchbase-server
couchbase-server:amd64/unknown 3.0.1 uptodate

Many thanks for your help!


#2

http://issues.couchbase.com/browse/CCBC-607 This might apply to you. Try using libcouchbase 2.4.8 and see if the problem goes away?

Though in this case I don’t see why this should succeed on ‘set’ but fail on ‘persist_to’.

btw, trying it locally works fine for me (with a 4 node cluster, 2 replicas; replicate_to=2)


#3

Thank you for your fast reply!
I tried libcouchbase 2.4.8, then I got rid of the timeouts in my artificial Gist script (see https://gist.github.com/Fabma/fc9791c891d6cedab857).

However, when running some application code I still encounter many timeouts in set() and endure_multi(), when setting replicate_to=1 :confused:

Two things that appear strange to me:

  1. When using couchbase python client 2.0.1, the referenced Gist script does only work if I use the “old” version of connecting to Couchbase, i.e., Couchbase.connect(…). If I use the “new” version, i.e., Bucket(‘couchbase://…’), I immediately run into timeout errors.

  2. The client-side timeout parameter appears to be completely ignored. At least the “time to timeout exception” should be affected by this setting, but it is not. There is no difference when I set the timeout parameter to 10 or 1000.
    This blocks testing different timeout parameter values to find out a setting that is appropriate for our application. From my understanding, setting the timeout parameter to a high value of, e.g., 60 seconds, should result in almost no timeouts, since our application normally does not induce more than 1k operations per second.

This is how the Couchbase monitoring looks like when we constantly run into timeouts: (remind that this only occurs if I use replicate_to=1. If I use replicate_to=0, everything looks fine)

Many thanks!


#4

Note that your connection string should be operation_timeout=60.0 rather than operation_timeout=60. libcouchbase, until version 2.4.9, will treat values without a decimal period as microseconds. I think the disparity between Couchbase.connect() and Bucket() is that the former does not use the connection string.

Additionally, ultimately I’d like to figure out why this is failing with lcb 2.5.0… I redid the durability code there in 2.4.9, however – but your problem might be timeout related wrt the connection string… which itself is strange, since in >= 2.4.9 the connection string is correctly interpreted.

Finally, can you try to enable logging and re-run this with proper logs? See http://docs.couchbase.com/developer/python-2.0/logging.html


#5

We now tried the connection string

cn = Bucket('couchbase://192.168.0.3,192.168.0.4,192.168.0.5/<bucketname>?operation_timeout=120.0')

with couchbase-python-client 2.0.1 and libcouchbase 2.5.0. With the test script I provided we now get the following debug errorr logs:

2015-05-22 10:09:54,912 | INFO     | Initializing Couchbase logging. lcb_version=('2.5.0', 132352)
2015-05-22 10:09:54,913 | DEBUG    | [0] Preparing providers (this may be called multiple times) (L:89)
2015-05-22 10:09:54,914 | DEBUG    | [0] Provider FILE is DISABLED (L:99)
2015-05-22 10:09:54,914 | DEBUG    | [0] Provider CCCP is ENABLED (L:96)
2015-05-22 10:09:54,914 | DEBUG    | [0] Provider HTTP is ENABLED (L:96)
2015-05-22 10:09:54,914 | DEBUG    | [0] Provider MCRAW is DISABLED (L:99)
2015-05-22 10:09:54,915 | DEBUG    | [0] Start refresh requested (L:292)
2015-05-22 10:09:54,915 | DEBUG    | [0] Current provider is CCCP (L:271)
2015-05-22 10:09:54,915 | INFO     | [0] Requesting connection to node 192.168.0.3:11210 for CCCP configuration (L:118)
2015-05-22 10:09:54,916 | DEBUG    | [0] <192.168.0.3:11210> (HE=0x102b480) Creating new connection because none are available in the pool (L:416)
2015-05-22 10:09:54,916 | DEBUG    | [0] <192.168.0.3:11210> (HE=0x102b480) Starting connection on I=0x10056b0 (L:321)
2015-05-22 10:09:54,916 | INFO     | [0] <192.168.0.3:11210> (SOCK=0xf02390) Starting. Timeout=2000000us (L:450)
2015-05-22 10:09:54,916 | DEBUG    | [0] <192.168.0.3:11210> (SOCK=0xf02390) Got event handler for new connection (L:267)
2015-05-22 10:09:54,917 | DEBUG    | [0] <192.168.0.3:11210> (SOCK=0xf02390) Scheduling asynchronous watch for socket. (L:314)
2015-05-22 10:09:54,917 | DEBUG    | [0] <192.168.0.3:11210> (SOCK=0xf02390) Got event handler for new connection (L:267)
2015-05-22 10:09:54,917 | INFO     | [0] <192.168.0.3:11210> (SOCK=0xf02390) Connected  (L:116)
2015-05-22 10:09:54,917 | DEBUG    | [0] <192.168.0.3:11210> (SOCK=0xf02390) Successfuly set TCP_NODELAY (L:123)
2015-05-22 10:09:54,918 | DEBUG    | [0] <192.168.0.3:11210> (HE=0x102b480) Received result for I=0x10056b0,C=0xf02390; E=0x0 (L:271)
2015-05-22 10:09:54,918 | DEBUG    | [0] <192.168.0.3:11210> (HE=0x102b480) Assigning R=0x100bcf0 SOCKET=0xf02390 (L:223)
2015-05-22 10:09:54,918 | DEBUG    | [0] <192.168.0.3:11210> (CTX=0x102c430,unknown) Pairing with SOCK=0xf02390 (L:101)
2015-05-22 10:09:54,919 | DEBUG    | [0] <192.168.0.3:11210> (CTX=0x102c430,sasl) Destroying. PND=0,ENT=1,SORC=1 (L:151)
2015-05-22 10:09:54,919 | DEBUG    | [0] <192.168.0.3:11210> (CTX=0xeb19b0,unknown) Pairing with SOCK=0xf02390 (L:101)
2015-05-22 10:09:54,919 | DEBUG    | [0] <192.168.0.3:11210> (CTX=0xeb19b0,bc_cccp) Destroying. PND=0,ENT=1,SORC=1 (L:151)
2015-05-22 10:09:54,920 | INFO     | [0] <192.168.0.3:11210> (HE=0x102b480) Placing socket back into the pool. I=0x10056b0,C=0xf02390 (L:491)
2015-05-22 10:09:54,921 | INFO     | [0] Setting new configuration. Received via CCCP (L:174)
2015-05-22 10:09:54,921 | DEBUG    | [0] Instance configured! (L:55)
2015-05-22 10:09:54,921 | DEBUG    | [0] Preparing providers (this may be called multiple times) (L:89)
2015-05-22 10:09:54,922 | DEBUG    | [0] Provider FILE is DISABLED (L:99)
2015-05-22 10:09:54,922 | DEBUG    | [0] Provider CCCP is ENABLED (L:96)
2015-05-22 10:09:54,922 | DEBUG    | [0] Provider HTTP is DISABLED (L:99)
2015-05-22 10:09:54,922 | DEBUG    | [0] Provider MCRAW is DISABLED (L:99)
2015-05-22 10:09:54,923 | INFO     | [0] <192.168.0.3:11210> (HE=0x102b480) Found ready connection in pool. Reusing socket and not creating new connection (L:407)
2015-05-22 10:09:54,923 | DEBUG    | [0] <192.168.0.3:11210> (HE=0x102b480) Assigning R=0x1008080 SOCKET=0xf02390 (L:223)
2015-05-22 10:09:54,923 | DEBUG    | [0] <192.168.0.3:11210> (CTX=0xeb19b0,unknown) Pairing with SOCK=0xf02390 (L:101)
2015-05-22 10:09:54,923 | DEBUG    | [0] <192.168.0.3:11210> (SRV=0x103d190,IX=0) Setting initial timeout=119999ms (L:543)
2015-05-22 10:09:54,924 | DEBUG    | [0] <192.168.0.4:11210> (HE=0x10385f0) Creating new connection because none are available in the pool (L:416)
2015-05-22 10:09:54,924 | DEBUG    | [0] <192.168.0.4:11210> (HE=0x10385f0) Starting connection on I=0xff26f0 (L:321)
2015-05-22 10:09:54,925 | INFO     | [0] <192.168.0.4:11210> (SOCK=0x10390e0) Starting. Timeout=120000000us (L:450)
2015-05-22 10:09:54,925 | DEBUG    | [0] <192.168.0.4:11210> (SOCK=0x10390e0) Got event handler for new connection (L:267)
2015-05-22 10:09:54,925 | DEBUG    | [0] <192.168.0.4:11210> (SOCK=0x10390e0) Scheduling asynchronous watch for socket. (L:314)
2015-05-22 10:09:54,925 | DEBUG    | [0] <192.168.0.4:11210> (SOCK=0x10390e0) Got event handler for new connection (L:267)
2015-05-22 10:09:54,926 | INFO     | [0] <192.168.0.4:11210> (SOCK=0x10390e0) Connected  (L:116)
2015-05-22 10:09:54,926 | DEBUG    | [0] <192.168.0.4:11210> (SOCK=0x10390e0) Successfuly set TCP_NODELAY (L:123)
2015-05-22 10:09:54,926 | DEBUG    | [0] <192.168.0.4:11210> (HE=0x10385f0) Received result for I=0xff26f0,C=0x10390e0; E=0x0 (L:271)
2015-05-22 10:09:54,926 | DEBUG    | [0] <192.168.0.4:11210> (HE=0x10385f0) Assigning R=0xeb6550 SOCKET=0x10390e0 (L:223)
2015-05-22 10:09:54,927 | DEBUG    | [0] <192.168.0.4:11210> (SRV=0x103e090) Session not yet negotiated. Negotiating (L:522)
2015-05-22 10:09:54,927 | DEBUG    | [0] <192.168.0.4:11210> (CTX=0x10502c0,unknown) Pairing with SOCK=0x10390e0 (L:101)
2015-05-22 10:09:54,927 | DEBUG    | [0] <192.168.0.4:11210> (CTX=0x10502c0,sasl) Destroying. PND=0,ENT=1,SORC=1 (L:151)
2015-05-22 10:09:54,928 | DEBUG    | [0] <192.168.0.4:11210> (CTX=0x10368e0,unknown) Pairing with SOCK=0x10390e0 (L:101)
2015-05-22 10:09:54,928 | DEBUG    | [0] <192.168.0.4:11210> (SRV=0x103e090,IX=1) Setting initial timeout=119996ms (L:543)
set value 0
2015-05-22 10:09:59,848 | WARNING  | [0] Polling durability timed out! (L:572)
Traceback (most recent call last):
  File "tasks/benchmarking/couchbase_timeout_test.py", line 49, in <module>
    repeated_get_set_and_delete(iterations=100)
  File "tasks/benchmarking/couchbase_timeout_test.py", line 44, in repeated_get_set_and_delete
    cn.delete(key=key, replicate_to=1, persist_to=0)
  File "/home/peter/.virtualenvs/peter-beta/lib/python3.4/site-packages/couchbase/bucket.py", line 1317, in _tmpmeth
    return _dst(self, *args, **kwargs)
  File "/home/peter/.virtualenvs/peter-beta/lib/python3.4/site-packages/couchbase/bucket.py", line 661, in remove
    persist_to=persist_to, replicate_to=replicate_to)
couchbase.exceptions._TimeoutError_0x17 (generated, catch TimeoutError): <Key='foo', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=1, C Source=(src/multiresult.c,309)>
2015-05-22 10:09:59,857 | DEBUG    | [0] <192.168.0.3:11210> (SRV=0x103d190,IX=0) Finalizing ctx 0xeb19b0 (L:758)
2015-05-22 10:09:59,857 | DEBUG    | [0] <192.168.0.3:11210> (CTX=0xeb19b0,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
2015-05-22 10:09:59,857 | DEBUG    | [0] <192.168.0.4:11210> (SRV=0x103e090,IX=1) Finalizing ctx 0x10368e0 (L:758)
2015-05-22 10:09:59,858 | DEBUG    | [0] <192.168.0.4:11210> (CTX=0x10368e0,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)

The error does not take the operation_timeout config parameter (120s) into account, but it appears after 5s.
Remind that the test script works with the setting replicate_to=0, the error only occurs if we set replicate_to=1.

Thanks again for your great help!


#6

The operation timeout is a per-operation timeout, the durability timeout works on a different system (specifically, the operation timeout depends on connectivity between your client and the cluster nodes, while the durability timeout depends on how long you want to wait for persistence/replication).

You can set the durability_timeout easily enough in the connection string as well. As you’ve indeed observed, this interval is by default 5 seconds.


#7

With
"?operation_timeout=60.0&durability_timeout=60.0"
the small gist script (https://gist.github.com/Fabma/fc9791c891d6cedab857) runs without a timeout exception. Thanks for revealing these parameters!

Measuring the total execution time (100 get(), 100 set(), 10 delete() operations) results in:

with to_replicate=0:
0.7 seconds

with to_replicate=1:
10 seconds (libcouchbase 2.4.8)

with to_replicate=1:
25 seconds (libcouchbase 2.5.0)

Q1) The total execution time seems to depend on the libcouchbase version
Q2) The “waiting for replication” overhead appears to be quite high to me(?)

Our application - using also endure_multi - still runs into multiple timeout exceptions, when setting the durability_timeout to 120.0 seconds. We run a cluster of three Couchbase nodes, the CPU utilization of the Couchbase servers is about 10-20%.
Q3) Is there a proper way to debug why the timeouts appear and see why/where Couchbase is waiting?

Many thanks!


#8

I don’t see why this should differ between libcouchbase versions (I might have some idea as to what might be happening). Generally time delays are dependent on network/disk speed; so in other words, the faster the network connection (and disk) on each of the replicas, the smaller the time it would take for the items to actually get persisted/replicated.

It is actually interesting that the newer libcouchbase version takes longer; specifically the code there was modified to not send probes to any nodes which have already confirmed their persistence/replication status. I take these times are for the total of all these operations (so in other words, 210 ops).

If actually debugging the durability overhead itself, endure_multi is recommended since this will provide a clear distinction between how long the actual K/V operation took, and how long it took to wait for its persistence/replication to complete.

The logging output (as shown above) would show you if there are any unexpected delays, but in general we don’t provide logging on such a granular level.

The C library provides an nresponses field for each item which would show how many actual packets were sent out until the requirements were satisfied, however this information is not exposed to the Python client. If you’re not afraid of C, you can probably run with a modified version of the library; inserting various print/logging statements whenever a reply is recieved. This is what I normally do when debugging these things myself. Unfortunately the logging mechanism is not complex or fine-grained enough to allow an option to explicitly enable/disable durability logging (or any kind of per-operation logging where there isn’t a clear error anywhere).

Finally, the durability requirements (both persist_to, and the standalone endure_multi) use something called observe internally; in Python this is exposed as the observe() method. Keep in mind though that the Python-level durability functions go through the C library directly (via lcb_durability_poll) rather than use the Python (or C-level) observe. That being said, you can perform this loop twice (perhaps checking “manually” with observe; filtering the proper persistence/replication numbers). This will help detect any bugs or discrepancies that might exist. I’ve tested the durability functionality with both versions of libcouchbase, and they both functioned as expected, but this might be because my network isn’t very loaded and/or the cluster’s disks are not under load.