Random, unexplained timeout on get()?


#1

I am running a two-server Couchbase cluster, both servers are running a Python application querying and writing data with a total 100 message/second average 24/7.
Once in about every 2-3 days, I get a timeout error from this Python app on trying to receive one or another document from Couchbase.

I have no idea why this happens, the timeout parameters has been set to very high (60.0).

This is the simple get by document id which throws the timeout:
doc = self.bucket.get(docid)

The following error gets thrown:
Uncaught exception: <Key=‘app-job-2016-01-26 19:29:54.893916’, RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=1, C Source=(src/multiresult.c,309)>
Type: <class ‘couchbase.exceptions._TimeoutError_0x17 (generated, catch TimeoutError)’>

doc = self.bucket.get(docid)
File “…/venv1/lib64/python3.4/site-packages/couchbase/bucket.py”, line 489, in get
replica=replica, no_format=no_format)

I open up the connection with a large enough timeout:

cb_connection_string = 'couchbase://127.0.0.1/default?operation_timeout=60.0&durability_timeout=60.0’
try:
self.bucket = Bucket(cb_connection_string)

No network or disk issues observed the the time.

Has anybody else experienced similar un-explainable timeout errors while using large timeout settings?


#2

Timeouts are largely handled by the underlying C library. You might have some firewall or idle connection being closed under the hood.

Can you enable logging? (there should be a logging section in the API documentation…)


#3

Sorry, enable logging of which component, ns_server, couchdb, etc?
What log levels would be required?
I assume this is what you meant?: http://docs.couchbase.com/admin/admin/Misc/Trbl-logs.html


#4

By logging I mean the SDK level logging: http://pythonhosted.org/couchbase/api/logging.html


#5

Thanks, sure, it has been permissioned, now just waiting for the error to show again. I have seen in several times in the last few hours, so hopefully it will show again soon.


#6

I have got the error - see the logs below.
Honestly I still don’t see what goes wrong.
What seems to be interesting that for the second server it shows timeout of 957 ms instead of 60000 ms.

Maybe the connection string is wrong? As it seems only one of the server (xx.xx.xx.61) is connected with the requested timeout, the other one seems to be showing 957 ms timeout. This is the full connection string:
cb_connection_string = 'couchbase://127.0.0.1,xx.xx.xx.61,xx.xx.xx.35/default?operation_timeout=60.0&durability_timeout=60.0’
self.bucket = Bucket(cb_connection_string)

The logs (from server xx.xx.xx.61):

DEBUG:couchbase.server:[0] <xx.xx.xx.61:11210> (SRV=0x25ce110,IX=1) Scheduling next timeout for 60000 ms (L:467)
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x25cd320,IX=0) Scheduling next timeout for 957 ms (L:467)
WARNING:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x25cd320,IX=0) Failing command (pkt=0x25cf920, opaque=29457, opcode=0x0) with error 0x17 (L:380)
WARNING:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x25cd320,IX=0) Failing command (pkt=0x25cf970, opaque=29458, opcode=0xb5) with error 0x17 (L:380)
DEBUG:couchbase.confmon:[0] Start refresh requested (L:292)
ERROR:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x25cd320,IX=0) Server timed out. Some commands have failed (L:463)
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x25cd320,IX=0) Scheduling next timeout for 60000 ms (L:467)
DEBUG:couchbase.server:[1] <xx.xx.xx.35:11210> (SRV=0x260a310,IX=0) Finalizing ctx 0x2b020c0145c0 (L:738)
DEBUG:couchbase.ioctx:[1] <xx.xx.xx.35:11210> (CTX=0x2b020c0145c0,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[1] <xx.xx.xx.61:11210> (SRV=0x260aa70,IX=1) Finalizing ctx 0x2b020c01e860 (L:738)a
DEBUG:couchbase.ioctx:[1] <xx.xx.xx.61:11210> (CTX=0x2b020c01e860,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[2] <xx.xx.xx.35:11210> (SRV=0x261c740,IX=0) Finalizing ctx 0x2b0218014590 (L:738)
DEBUG:couchbase.ioctx:[2] <xx.xx.xx.35:11210> (CTX=0x2b0218014590,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[2] <xx.xx.xx.61:11210> (SRV=0x265e950,IX=1) Finalizing ctx 0x2b0218026f80 (L:738)
DEBUG:couchbase.ioctx:[2] <xx.xx.xx.61:11210> (CTX=0x2b0218026f80,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
2016-01-26 22:06:00,560 (3977) [0:00:36.047459] - ERROR - Uncaught exception: <Key='ingest-app-job-2016-01-26 22:03:56.473585', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=1, C Source=(src/multiresult.c,309)>
                     Type: <class 'couchbase.exceptions._TimeoutError_0x17 (generated, catch TimeoutError)'>
                     Traceback: ['  File "ingest-app.py", line 256, in <module>\n    main(current_folder)\n', '  File "ingest-app.py", line 217, in main\n    main_app_cb_durability_timeout=main_app_cb_durability_timeout)\n', '  File "modules/functions.py", line 200, in check_process_jobs\n    job_doc = cb.get(job.docid)  # get the job\'s details from Couchbase\n', '  File "modules/db_couchbase_class.py", line 49, in get\n    doc = self.bucket.get(docid)\n', '  File "venv1/lib64/python3.4/site-packages/couchbase/bucket.py", line 489, in get\n    replica=replica, no_format=no_format)\n']
                     Traceback line number: 256 - [module:functions, line:398, functionname:my_handler]
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x25cd320,IX=0) Finalizing ctx 0x24f7360 (L:738)
DEBUG:couchbase.ioctx:[0] <xx.xx.xx.35:11210> (CTX=0x24f7360,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[0] <xx.xx.xx.61:11210> (SRV=0x25ce110,IX=1) Finalizing ctx 0x25ec710 (L:738)
DEBUG:couchbase.ioctx:[0] <xx.xx.xx.61:11210> (CTX=0x25ec710,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)

#7

The other server (xx.xx.xx.35) a half hour later has timed out with similar logs:

DEBUG:couchbase.server:[3] <xx.xx.xx.35:11210> (SRV=0x1dd0660,IX=0) Finalizing ctx 0x1deb710 (L:738)
DEBUG:couchbase.ioctx:[3] <xx.xx.xx.35:11210> (CTX=0x1deb710,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[0] <xx.xx.xx.61:11210> (SRV=0x1d73590,IX=1) Scheduling next timeout for 60000 ms (L:467)
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Scheduling next timeout for 38478 ms (L:467)
WARNING:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Failing command (pkt=0x1d6bd10, opaque=92451, opcode=0x0) with error 0x17 (L:380)
DEBUG:couchbase.confmon:[0] Start refresh requested (L:292)
ERROR:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Server timed out. Some commands have failed (L:463)
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Scheduling next timeout for 60000 ms (L:467)
INFO:couchbase.confmon:[0] Not applying configuration received via CCCP. No changes detected. A.rev=6322, B.rev=6322 (L:166)
DEBUG:couchbase.confmon:[0] Current provider is CCCP (L:271)
INFO:couchbase.cccp:[0] Re-Issuing CCCP Command on server struct 0x1d727a0 (xx.xx.xx.35:11210) (L:110)
2016-01-26 22:39:01,141 (19228) [0:01:02.422383] - INFO - New file has been detected for ETL job 'apphits'. New file: raw-2016-01-26T223726.9169389Z.json.gz.enc - [module:functions, line:66, functionname:process_file]
INFO:Kafka Ingest App:New file has been detected for ETL job 'apphits'. New file: raw-2016-01-26T223726.9169389Z.json.gz.enc
DEBUG:couchbase.server:[1] <xx.xx.xx.35:11210> (SRV=0x1d88620,IX=0) Scheduling next timeout for 59999 ms (L:467)
DEBUG:couchbase.server:[1] <xx.xx.xx.61:11210> (SRV=0x1daa680,IX=1) Scheduling next timeout for 60000 ms (L:467)
2016-01-26 22:39:01,143 (19228) [0:00:00.001828] - INFO - The file's modified date wasn't newer than the one already in the 'ingest-app-apphits-last_file' document in Couchbase, so we didn't update this document with the new modified date. Old modified date '2016-01-26 22:38:56; new modified date: '2016-01-26 22:38:56' - [module:functions, line:120, functionname:process_file]
INFO:Kafka Ingest App:The file's modified date wasn't newer than the one already in the 'ingest-app-apphits-last_file' document in Couchbase, so we didn't update this document with the new modified date. Old modified date '2016-01-26 22:38:56; new modified date: '2016-01-26 22:38:56'
ERROR:couchbase.cccp:[0] <NOHOST:NOPORT> Got I/O Error=0x17 (L:133)
INFO:couchbase.cccp:[0] Re-Issuing CCCP Command on server struct 0x1d73590 (xx.xx.xx.61:11210) (L:110)
INFO:couchbase.confmon:[0] Not applying configuration received via CCCP. No changes detected. A.rev=6322, B.rev=6321 (L:166)
DEBUG:couchbase.server:[0] <xx.xx.xx.61:11210> (SRV=0x1d73590,IX=1) Scheduling next timeout for 60000 ms (L:467)
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Scheduling next timeout for 966 ms (L:467)
WARNING:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Failing command (pkt=0x1d6bd60, opaque=92452, opcode=0x0) with error 0x17 (L:380)
DEBUG:couchbase.confmon:[0] Start refresh requested (L:292)
ERROR:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Server timed out. Some commands have failed (L:463)
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Scheduling next timeout for 20 ms (L:467)
DEBUG:couchbase.server:[1] <xx.xx.xx.35:11210> (SRV=0x1d88620,IX=0) Finalizing ctx 0x2b14b8027530 (L:738)
DEBUG:couchbase.ioctx:[1] <xx.xx.xx.35:11210> (CTX=0x2b14b8027530,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[1] <xx.xx.xx.61:11210> (SRV=0x1daa680,IX=1) Finalizing ctx 0x2b14b8014b40 (L:738)
DEBUG:couchbase.ioctx:[1] <xx.xx.xx.61:11210> (CTX=0x2b14b8014b40,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[2] <xx.xx.xx.35:11210> (SRV=0x1d860a0,IX=0) Finalizing ctx 0x2b14c4014370 (L:738)
DEBUG:couchbase.ioctx:[2] <xx.xx.xx.35:11210> (CTX=0x2b14c4014370,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[2] <xx.xx.xx.61:11210> (SRV=0x1d871e0,IX=1) Finalizing ctx 0x2b14c401e610 (L:738)
DEBUG:couchbase.ioctx:[2] <xx.xx.xx.61:11210> (CTX=0x2b14c401e610,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
2016-01-26 22:40:00,258 (19228) [0:00:59.114823] - ERROR - Uncaught exception: <Key='ingest-app-job-2016-01-26 22:37:57.085263', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=1, C Source=(src/multiresult.c,309)>
                     Type: <class 'couchbase.exceptions._TimeoutError_0x17 (generated, catch TimeoutError)'>
                     Traceback: ['  File "ingest-app.py", line 256, in <module>\n', '  File "ingest-app.py", line 217, in main\n    main_app_cb_operation_timeout=main_app_cb_operation_timeout,\n', '  File "modules/functions.py", line 200, in check_process_jobs\n    job_doc = cb.get(job.docid)  # get the job\'s details from Couchbase\n', '  File "modules/db_couchbase_class.py", line 53, in get\n    time.sleep(1)\n', '  File "venv1/lib64/python3.4/site-packages/couchbase/bucket.py", line 489, in get\n    replica=replica, no_format=no_format)\n']
                     Traceback line number: 256 - [module:functions, line:398, functionname:my_handler]
ERROR:Kafka Ingest App:Uncaught exception: <Key='ingest-app-job-2016-01-26 22:37:57.085263', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], Operational Error, Results=1, C Source=(src/multiresult.c,309)>
                     Type: <class 'couchbase.exceptions._TimeoutError_0x17 (generated, catch TimeoutError)'>
                     Traceback: ['  File "ingest-app.py", line 256, in <module>\n', '  File "ingest-app.py", line 217, in main\n    main_app_cb_operation_timeout=main_app_cb_operation_timeout,\n', '  File "modules/functions.py", line 200, in check_process_jobs\n    job_doc = cb.get(job.docid)  # get the job\'s details from Couchbase\n', '  File "modules/db_couchbase_class.py", line 53, in get\n    time.sleep(1)\n', '  File "venv1/lib64/python3.4/site-packages/couchbase/bucket.py", line 489, in get\n    replica=replica, no_format=no_format)\n']
                     Traceback line number: 256
DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Finalizing ctx 0x1d6d850 (L:738)
DEBUG:couchbase.ioctx:[0] <xx.xx.xx.35:11210> (CTX=0x1d6d850,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)
DEBUG:couchbase.server:[0] <xx.xx.xx.61:11210> (SRV=0x1d73590,IX=1) Finalizing ctx 0x1d6b650 (L:738)
DEBUG:couchbase.ioctx:[0] <xx.xx.xx.61:11210> (CTX=0x1d6b650,memcached) Destroying. PND=0,ENT=0,SORC=1 (L:151)

Any idea what the below would be, which preceded the timeout:

ERROR:couchbase.cccp:[0] <NOHOST:NOPORT> Got I/O Error=0x17 (L:133)
...
ERROR:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x1d727a0,IX=0) Server timed out. Some commands have failed (L:463)

#8

The 957ms timeout is not in itself an error, since the timer tends to sometimes get scheduled a bit earlier than the actual timeout; in such cases it gets re-triggered for the remaining time left to the actual timeout (meaning the timeout was fired at +59s).

The following

DEBUG:couchbase.server:[0] <xx.xx.xx.35:11210> (SRV=0x25cd320,IX=0) Finalizing ctx 0x24f7360 (L:738)

suggests there is something actually wrong with the socket (this is a message of a socket being closed). A timeout in itself would not cause the socket to be closed… unless the exception is unhandled and the finalization is a result of your app crashing and the actual Bucket being destroyed.

It seems you have multiple Bucket objects. Is this a threaded application, or are you forking?


#9

Yes, actually this is a multiprocess app, a scale out data processing app using multiple CPUs on multiple servers.
That’s why you might be seeing logs from multiple Bucket objects in the log.

There are some places where the upsert failing has to crash the app, because otherwise it would be in an incomplete stage.
In other places, like that get(), I actually handle the timeout and try again the second time, but if it fails the second time too, then it throws the error and stops the app.
I have now multiple cases when it fails the second time too.
What error handling would you suggest? Just issuing the get again seems to be not helping. Doing a reconnect to Couchbase?


#10

You can try to reconnect… and I’m interested in seeing the full logs (you can send me an email).

Ensure you are creating the Bucket only after you fork!

The interesting thing I realize is perhaps that of all buckets closing their connections within the proximity of you getting those timeout errors. They might be related?


#11

I have added 1-time reconnect after a failed get(), but not for the upsert(), so if going forward I will be only seeing this error for the upsert() and not the get() anymore, then I know that that has solved it.
I will email you a longer set of the logs, when I have a newer outage, the old ones already got overwritten.


#12

btw I don’t think the timeouts would be related to the types of operations performed…


#13

I don’t think either that there would be a difference by the type of operations.
I have applied the “reconnect-fix” only to the get() and not to the upsert(), so I will know if it made a difference, as then I have a comparison.
Without it there are periods of weeks when the issue doesn’t come up, so I would be left wondering whether the fix has worked or it is just one of those periods when the issue doesn’t show.


#14

Hi zfedor, I am having exactly the same issue, do you get a solution?