Client-Side Timeout exceeded


#1

environment:
OS: CentOS Linux release 7.5.1804
Couchbase Server: 4.5.1-2844 Community Edition
C-SDK: libcouchbase-devel.x86_64_2.9.3-1.el6, libcouchbase2-core.x86_64_2.9.3-1.el6
Python-SDK: couchbase 2.4.2
Cluster Node Number: 4

I saw many occurrences of this error in my project's log while it was running:

After I changed my environment to:
C-SDK: libcouchbase-devel.x86_64_2.6.4-1.el7, libcouchbase2-core.x86_64_2.6.4-1.el7
Python-SDK: couchbase 2.2.1

After that, the error above no longer appeared.

So I think the problem is a compatibility issue between the server and the SDK versions listed above. Is that right?
If so, where can I find documentation on which SDK versions match which server versions?

thx~


#2

No, I wouldn’t say the cause is running a newer python lib or libcouchbase. The latest Couchbase SDK is compatible with older releases of the cluster.

From the listing there, it’s not really possible to say what the cause is. The one log line you supplied tells what node the timeout is from. Are they all from the same node? Is it intermittent, or consistent?

Also, did you run both versions on the same system and in the same environment? Perhaps something in the environment, like occasional network congestion on something like an EC2 availability zone, is causing the occasional timeout.

I hope that helps!


#3

Another thing you could do to get a clearer picture, if this happens fairly quickly, is to run your app with LCB_LOGLEVEL at 5 in your environment variables e.g.

export LCB_LOGLEVEL=5

And turn on logging with:

import couchbase
couchbase.enable_logging()

And look at the resulting output for any libcouchbase diagnostics about what may be going wrong.

Thanks,

Ellis


#4

OK. I will try this, thx~


#5

Yep. I first used the latest version and hit the problem. Then I uninstalled the SDK (libcouchbase and couchbase) and installed the old version, and the problem disappeared.


#6

Interesting. We’d surely like to know the cause as well if there’s something wrong. Let us know what you see from the more verbose logs as @ellis.breen suggested.


#7

I added export LCB_LOGLEVEL=5 to my environment and turned on logging,
but I can’t find any additional logs in my output file from when the error occurred.


#8

Ah, perhaps you don’t have logging in your Python application enabled?

Doing something like this may be necessary if so:
import logging
import couchbase
couchbase.enable_logging()
ch = logging.StreamHandler()  # put your logging stream handle here if not logging to stdout
ch.setLevel(logging.WARNING)
logging.getLogger().addHandler(ch)
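
One thing to check with the snippet above: a handler only sees records that its logger lets through, and the root logger defaults to WARNING, so debug-level diagnostics can be dropped before they reach the handler. A minimal stdlib-only sketch of that effect, assuming the SDK forwards libcouchbase messages through Python's standard logging module (the logger name demo.lcb and the two messages are made up for the illustration, not real SDK output):

```python
import io
import logging


def capture(logger_level, handler_level):
    """Return what a handler captures for one DEBUG and one WARNING record."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setLevel(handler_level)

    # Hypothetical logger name; the real SDK chooses its own names.
    logger = logging.getLogger("demo.lcb")
    logger.propagate = False  # keep the demo isolated from the root logger
    logger.setLevel(logger_level)
    logger.addHandler(handler)
    try:
        logger.debug("debug-level diagnostic")
        logger.warning("warning-level diagnostic")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().splitlines()


warning_only = capture(logging.WARNING, logging.WARNING)
everything = capture(logging.DEBUG, logging.DEBUG)
print(warning_only)
print(everything)
```

In the application this would mean lowering both the handler level and the root logger level, for example with logging.getLogger().setLevel(logging.DEBUG), before reproducing the timeout.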

Hope that helps,

Ellis


#9

In the previous test, there were some logs when I started my application:


but there was no log output when the error occurred.

I will try again following your suggestion.


#10

Hmm… Looks like the bucket possibly failed to connect earlier - these Python exceptions look like responses to a get command, but we need to see what happens when the initial bucket connection occurs. Please send that part of the log.


#11

It may also be useful to know what syntax you are using to connect - are you using couchbase.cluster.Cluster.open_bucket or just the couchbase.bucket.Bucket constructor?


#12

This is the newest log:

My connection code:
from couchbase import Couchbase
xxxx_connect = Couchbase.connect(bucket='xxxx', host=my_couchbase_host_list)  # my Couchbase host list


#13

Interesting - looks like negotiation failed while trying to connect to 10.23.191.132:11210 due to a timeout.

The syntax you provided looks like a wrapper for our SDKs. I have not seen this sort of syntax before.

It looks like you are using something called Pyrex to access our APIs, via a ‘data_context.c’ - so looks like the actual command is dispatched somewhere deep in here. Without access to that code, it’s hard for me to discern what is going on.

Could you try increasing the timeout on the bucket, and see if that makes a difference? If not, could you try temporarily disabling tracing using ?enable_tracing=false at the end of connection string and see if that makes a difference?
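
A sketch of what both suggestions could look like with the couchbase.bucket.Bucket constructor, assuming the 2.x Python SDK. The helper build_connstr and the host and bucket values are hypothetical; the Bucket calls are left as comments so the snippet runs without a cluster:

```python
def build_connstr(hosts, bucket, **options):
    """Build a couchbase:// connection string from a host list and options."""
    connstr = "couchbase://{0}/{1}".format(",".join(hosts), bucket)
    if options:
        # Sort for a stable result; values are passed through as text.
        query = "&".join(
            "{0}={1}".format(name, options[name]) for name in sorted(options)
        )
        connstr += "?" + query
    return connstr


# Placeholder hosts and bucket name, not taken from the thread.
connstr = build_connstr(["host1", "host2"], "xxxx", enable_tracing="false")
print(connstr)

# With the SDK installed and a reachable cluster (assumption: SDK 2.x):
# from couchbase.bucket import Bucket
# cb = Bucket(connstr)
# cb.timeout = 10  # key-value operation timeout, in seconds
```

Note that the timeout property on the bucket object is expressed in seconds, so a value such as 10000 would not mean ten seconds there.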

Thanks,

Ellis


#14

Yes, data_context.pyx is compiled by Cython to data_context.c. It’s a base module in our project: it provides general functions and properties that read data from the DB and transform it into business data structures for the business logic, and that transform business data to JSON and write it back to the bucket the data is assigned to.

In your suggestion, is the bucket timeout an argument of the connect function, or of each request (get or set) function?
Is it like:
Couchbase.connect(bucket='xxxx', host=host_list, timeout=10000)
Couchbase.connect(bucket='xxxx', host=host_list, enable_trace=False)


#15

Again, this is not a syntax I’m familiar with. It looks like it probably forwards to the Bucket constructor or Cluster constructor. What is happening in data_context.pyx?


#16

data_context contains a class that is the base class of our business data structures. We pass the Couchbase connection object as an argument to the class’s __init__ function when instantiating a data object.
Simplified code follows:
class DataBase(object):

    def __init__(self, cls, db, key, ttl, must_lock, ... and so on ...):
        """
        :param cls: real class of business data
        :param db: couchbase db object
        :param key: doc's key in db
        :param ttl: doc's ttl
        :param must_lock: if lock when load data
        """

        # assignment logic here

        pass

    def save(self):
        """
        function for saving operation (delete or update)
        """

        # some logic

        if (if delete operation):
            if self.must_lock:
                self.unlock()

            self._db.delete(self.key)

        elif (if update operation):
            doc = self.get_doc_for_save()       # format data class object to JSON str
            self._db.set(self.key, doc, cas=self.cas, ttl=ttl)
            self.unlock()

        # some logic

    def get_doc_for_save(self):
        """
        format data class object to JSON str
        """

        # transform logic here

        return doc

    def load(self):
        """
        load data from db, then transform JSON str to business data class object
        """

        # some logic

        if self.must_lock:
            result = self._db.lock(self.key, ttl=self.get_lock_ttl())
        else:
            result = self._db.get(self.key)

        self.doc = self.cls.new_from_data(result.value)         # transform JSON str to business data class object

        # some logic

    def unlock(self):
        """
        release the lock
        """

        self._db.unlock(self.key, self.cas)

#17

These are just simple functions calling the APIs of the Couchbase connection object.


#18

Sorry, I can’t see any reference to a couchbase.bucket.Bucket or couchbase.cluster.Cluster object. This just shows me get and lock operations.

Where is the Bucket/Cluster constructed?

I need to see the response when this happens, not when you are trying to do gets/other operations.

Thanks,

Ellis


#19

I think I misunderstood you. Is this what you mean?
database.py

from common.config.config import conf as common_config
from couchbase import Couchbase

_db_connector = Couchbase.connect

_db_mail = None
_db_op = None

def db_mail():
    global _db_mail

    if not _db_mail:
        print 'DB:mail:Connecting:"', common_config.couchbase_host
        _db_mail = _db_connector(bucket=common_config.get_bucket_name('mail'), host=common_config.couchbase_host)

    return _db_mail

def db_op():
    global _db_op

    if not _db_op:
        print 'DB:op:Connecting:"', common_config.couchbase_host
        _db_op = _db_connector(bucket=common_config.get_bucket_name('op'), host=common_config.couchbase_host)

    return _db_op

And in the data object file, we import the connect function, like:
from common.model.database import db_op

class TestData(BaseDo):

    def __init__(self, context, key):
        # super() takes the current class, so that BaseDo.__init__ runs
        super(TestData, self).__init__(context, key)

    @classmethod
    def cls(cls):
        return TestData

    @classmethod
    def get_db(cls):
        return db_op()      # this object is passed to the data_context mentioned above

#20

Ah, OK… It appears you are using the deprecated Couchbase.connect() method. You should be using the couchbase.bucket.Bucket() constructor instead, which is currently equivalent. Couchbase.connect() will most likely not appear in SDK3.
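
A minimal sketch of how the lazy db_mail()/db_op() helpers could move to the Bucket constructor, assuming SDK 2.x connection strings of the form couchbase://host1,host2/bucket. The names get_bucket, _buckets and factory are hypothetical, and a stand-in factory replaces the real Bucket class so the snippet runs without a cluster:

```python
_buckets = {}  # cache: one connection object per bucket name


def get_bucket(name, hosts, factory):
    """Return a cached connection for `name`, creating it on first use.

    `factory` is whatever builds the connection from a connection string;
    in the real application it would be couchbase.bucket.Bucket.
    """
    if name not in _buckets:
        connstr = "couchbase://{0}/{1}".format(",".join(hosts), name)
        _buckets[name] = factory(connstr)
    return _buckets[name]


# Stand-in factory that records the connection strings it receives.
created = []


def fake_bucket(connstr):
    created.append(connstr)
    return object()


first = get_bucket("op", ["host1", "host2"], fake_bucket)
second = get_bucket("op", ["host1", "host2"], fake_bucket)
print(created)
```

Keying the cache on the bucket name keeps the one-connection-per-bucket behaviour of the original module-level globals.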

I need to see what happens at the time this connection occurs on the failing node (10.23.191.132:11210), so please send all the logs I requested from around this time.

Out of interest, is this an SSL connection? Which protocol is being used?

Thanks,

Ellis