Intermittent memcached error on CB 3.0 Enterprise

I see the same error every few days in the Couchbase console logs:

Control connection to memcached on 'ns_1@10.0.0.10' disconnected: {badmatch,
                                                                   {error,
                                                                    timeout}}

Then in the same second it will say:

Bucket "auth" loaded on node 'ns_1@10.0.0.10' in 0 seconds.

I looked into the error.log and saw these details for the first couple of occurrences:

[ns_doctor:error,2015-03-22T8:21:30.829,ns_1@10.0.0.30:ns_log<0.11472.0>:ns_doctor:get_node:189]Error attempting to get node 'ns_1@10.0.0.10': {exit,
                                                {noproc,
                                                 {gen_server,call,
                                                  [ns_doctor,
                                                   {get_node,
                                                    'ns_1@10.0.0.10'}]}}}
[stats:error,2015-04-22T17:13:30.067,ns_1@10.0.0.30:<0.29087.6>:stats_collector:handle_info:124]Exception in stats collector: {exit,
                                  {{badmatch,{error,timeout}},
                                   {gen_server,call,
                                       ['ns_memcached-auth',
                                        {stats,<<>>},
                                        180000]}},
                                  [{gen_server,call,3,
                                       [{file,"gen_server.erl"},{line,188}]},
                                   {ns_memcached,do_call,3,
                                       [{file,"src/ns_memcached.erl"},
                                        {line,1399}]},
                                   {stats_collector,grab_all_stats,1,
                                       [{file,"src/stats_collector.erl"},
                                        {line,84}]},
                                   {stats_collector,handle_info,2,
                                       [{file,"src/stats_collector.erl"},
                                        {line,116}]},
                                   {gen_server,handle_msg,5,
                                       [{file,"gen_server.erl"},{line,604}]},
                                   {proc_lib,init_p_do_apply,3,
                                       [{file,"proc_lib.erl"},{line,239}]}]}

And these for subsequent occurrences:

[ns_server:error,2015-05-04T19:57:03.107,ns_1@10.0.0.30:ns_doctor<0.11518.0>:ns_doctor:update_status:229]The following buckets became not ready on node 'ns_1@10.0.0.10': ["auth"], those of them are active ["auth"]
[ns_server:error,2015-05-04T19:57:03.112,ns_1@10.0.0.30:ns_doctor<0.11518.0>:ns_doctor:update_status:229]The following buckets became not ready on node 'ns_1@10.0.0.30': ["auth"], those of them are active ["auth"]

We’re running Couchbase Server Enterprise 3.0.2 on Ubuntu, in a two-node cluster with Sync Gateway on each node. We’ve noticed only one possible client-side error that could be related: a mobile Couchbase Lite client wasn’t able to replicate all of its documents around the time one of these errors occurred. We’re not sure whether that was connected, though. Is this a normal Couchbase error that doesn’t affect anything and can safely be ignored, or do we have something misconfigured on the server?

Thanks in advance for any help anyone can give on this topic.

It looks like intermittent networking issues. I don’t see anything there about processes restarting, so I think it’s safe to say nothing crashed. Do you see any evidence of a crash in the debug.log? That’s where I’d look next.

Since you’re using Enterprise, if you have a subscription you can also ask Couchbase support to look over your full cbcollect_info output.

Thanks! We didn’t notice any crashes; it just seemed to coincide with users trying to connect and sync to Couchbase. We’re in the Azure cloud, so we should have a good network connection, but Microsoft has had issues with their cloud before. I’ll dig into the debug.log and see if I can find anything.

Did you ever get to the bottom of this? I’m seeing this every so often as well, twice in one day just now, and an automatic failover is initiated after it.

We’re on Google Cloud networks; wondering if there’s anything special we have to tune for memcached, like TCP keepalives.

On 3.0.1

We’re on Azure, and we stopped looking into this because Couchbase support didn’t find anything bad in our logs. The default idle timeout on Azure VMs is 4 or 5 minutes, though, so I generally override it to 15 minutes to allow for heartbeats between our iOS clients and the server during continuous pull replication. Even so, it doesn’t seem to help: Sync Gateway will sometimes just stop sending keepalives for no discernible reason.

OK, thanks for the info.

Supposedly Google has a socket idle timeout of 10 minutes. The default Debian tcp_keepalive_time is 7200 seconds (2 hours), well above those 10 minutes. I’m going to set it to 5 minutes to see if it makes any difference; the settings I have in mind are sketched after the link below.

https://groups.google.com/forum/#!searchin/gce-discussion/compute$20network$20timeout|sort:relevance/gce-discussion/AxaHhT_Q2LY/dSw-rk5KDQAJ
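
For reference, here is a sketch of the /etc/sysctl.conf entries I’m planning to use. The parameter names are the standard Linux ones; the values are just my guess at sensible numbers for the 5-minute target above:

    # Shorten TCP keepalive so idle connections are probed well before
    # the cloud network's ~10 minute idle timeout (assumed values).
    net.ipv4.tcp_keepalive_time = 300     # seconds idle before the first probe
    net.ipv4.tcp_keepalive_intvl = 60     # seconds between unanswered probes
    net.ipv4.tcp_keepalive_probes = 5     # unanswered probes before the kernel drops the connection

Then apply with sysctl -p.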

Can someone from Couchbase confirm that the inter-node communication sockets in either Couchbase or memcached enable the keepalive option when they are opened? Otherwise, this configuration is useless…
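
For anyone else wondering why this matters: the kernel-wide tcp_keepalive_* settings only apply to connections that were opened with SO_KEEPALIVE enabled, and even then an application can override the timers per socket, in which case the sysctl tuning is ignored for that socket. A minimal Python sketch of both cases (illustrative only, not Couchbase’s actual code):

    import socket

    # Case 1: enable keepalive and inherit the kernel-wide
    # net.ipv4.tcp_keepalive_* timers.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Case 2: additionally override the timers per socket (Linux-specific
    # options); sysctl tuning then has no effect on this socket.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)  # idle seconds before probing
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before drop

If netstat shows a keepalive timer counting down on the memcached connections (netstat -o prints the timer column), case 1 applies and the sysctl tuning should take effect.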

Just confirmed with netstat that keepalive is used on the memcached sockets, so we’ll see if tuning it helps.

It’s probably worth highlighting that 3.x Enterprise reached end of life in February 2017. I’d recommend moving to the most recent release: 4.6.1 EE (or 4.5.0 CE).