Couchbase server randomly disconnects KV endpoints/KeepAlive not working

java

#1

Not long ago we started to use the 2.x Java SDK over the 1.x series. Since, we got a lot of TimeoutExceptions (as many others have also reported) on simple KV operations. We’ve been able to work around these using a retry algorithm.

However, I’ve also noticed that Couchbase Server (4.1.0-5005 CE) is also closing connections though we have a keep-alive of 60 seconds.

2016-08-17 23:25:10,280 INFO cb-io-1-2 com.couchbase.client.core.endpoint.AbstractGenericHandler - [/127.0.0.1:11210][KeyValueEndpoint]: Connection reset by peer: An existing connection was forcibly closed by the remote host
2016-08-17 23:25:10,280 INFO cb-io-1-1 com.couchbase.client.core.endpoint.AbstractGenericHandler - [/127.0.0.1:11210][KeyValueEndpoint]: Connection reset by peer: An existing connection was forcibly closed by the remote host
2016-08-17 23:25:10,280 INFO cb-io-1-4 com.couchbase.client.core.endpoint.AbstractGenericHandler - [/127.0.0.1:11210][KeyValueEndpoint]: Connection reset by peer: An existing connection was forcibly closed by the remote host
2016-08-17 23:25:10,291 INFO cb-io-1-4 com.couchbase.client.core.endpoint.AbstractGenericHandler - [/127.0.0.1:11210][KeyValueEndpoint]: Connection reset by peer: An existing connection was forcibly closed by the remote host

When that happens, after that, none of the KV endpoints are connected and any new upsert request will fail with a RequestCancelledException. Again, we had to catch and build a retry algorithm to work around this.

I understand that there’s a big redesign of the Java SDK between 1.4.x and the 2.x series to use RxJava, but how should I attack these TimeoutExceptions and connections resets? And how can I check if the keep alive is actually working or not?


#2

also, sometimes, I get timeout exceptions on cluster.openBucket(…) as it disconnects immediately:

2016-08-18 11:52:11,109 [cb-io-1-2] (Slf4JLogger.java) INFO  - Connected to Node localhost
2016-08-18 11:52:11,110 [cb-io-1-2] (Slf4JLogger.java) INFO  - Disconnected from Node localhost

again, I’m just retrying this after a timeout, but it’s weird.


#3

It seems there is probably something more fundamental going on. Does your server show any processes crashing? That’d explain the occasional timeouts and connection drops.

I can see it’s localhost in the logs above. Just a single node and just localhost?

Poke around the logs (documentation lists where they are per platform) and see if there’s anything unusual there. Maybe @daschl has another idea? Unfortunately, a timeout and a disconnect just don’t give us much data at the client end. It’s more indicative of another problem.


#4

in the example above I’m working on a small tool to read documents from couchbase and index them in elasticsearch for an upcoming migration away from using XDCR with the couchbase-elasticsearch plugin.

The tool basically connects to a single node of a couchbase cluster and a single node of an elasticsearch cluster, so yeah, locally I test it with localhost for both.

if I run it, every 1 in 3 tries or so connects and immediately disconnects but idles until there’s a TimeoutException.

in the server I do see these:

Control connection to memcached on 'ns_1@127.0.0.1' disconnected: {error,
closed}	ns_memcached000	ns_1@127.0.0.1	14:16:20 - Thu Aug 18, 2016
Service 'memcached' exited with status -1073740940. Restarting. Messages: 2016-08-18T14:01:29.649279+10:00 WARNING 2044 Closing connection [127.0.0.1:62352 - 127.0.0.1:11210] due to read error: An existing connection was forcibly closed by the remote host. 
2016-08-18T14:12:09.684143+10:00 WARNING 1992 Closing connection [127.0.0.1:63781 - 127.0.0.1:11210] due to read error: An existing connection was forcibly closed by the remote host. 
2016-08-18T14:12:09.684143+10:00 WARNING 2016 Closing connection [127.0.0.1:63783 - 127.0.0.1:11210] due to read error: An existing connection was forcibly closed by the remote host. 
2016-08-18T14:12:09.684143+10:00 WARNING 1980 Closing connection [127.0.0.1:63782 - 127.0.0.1:11210] due to read error: An existing connection was forcibly closed by the remote host. 
2016-08-18T14:12:09.684143+10:00 WARNING 1984 Closing connection [127.0.0.1:63784 - 127.0.0.1:11210] due to read error: An existing connection was forcibly closed by the remote host.	ns_log000	ns_1@127.0.0.1	14:16:20 - Thu Aug 18, 2016

some conflict between couchbase buckets and memcached maybe?


#5

maybe more useful information from the trace log from when this quick disconnect happens:

2016-08-18 14:34:22,203 [cb-io-1-1] (Slf4JLogger.java) INFO  - Connected to Node localhost
2016-08-18 14:34:22,203 [cb-io-1-1] (Slf4JLogger.java) DEBUG - Connected (DISCONNECTED) to Node localhost/127.0.0.1
2016-08-18 14:34:22,203 [cb-io-1-1] (Slf4JLogger.java) TRACE - State (CouchbaseNode) DISCONNECTED -> CONNECTED
2016-08-18 14:34:22,203 [cb-io-1-1] (Slf4JLogger.java) TRACE - State (KeyValueService) CONNECTED -> DEGRADED
2016-08-18 14:34:22,204 [cb-io-1-1] (Slf4JLogger.java) TRACE - State (ServiceStateZipper) CONNECTED -> DISCONNECTED
2016-08-18 14:34:22,204 [cb-io-1-1] (Slf4JLogger.java) INFO  - Disconnected from Node localhost
2016-08-18 14:34:22,204 [cb-io-1-1] (Slf4JLogger.java) DEBUG - Disconnected (CONNECTED) from Node localhost/127.0.0.1
2016-08-18 14:34:22,204 [cb-io-1-1] (Slf4JLogger.java) TRACE - State (CouchbaseNode) CONNECTED -> DISCONNECTED
2016-08-18 14:34:22,205 [cb-io-1-1] (Slf4JLogger.java) DEBUG - [localhost/127.0.0.1:11210][KeyValueEndpoint]: Channel Active.
2016-08-18 14:34:22,205 [cb-computations-4] (Slf4JLogger.java) DEBUG - Successfully enabled Service BINARY on Node localhost/127.0.0.1
2016-08-18 14:34:22,205 [cb-computations-4] (Slf4JLogger.java) DEBUG - Starting to discover config through Carrier Bootstrap

and after that you just get these:

2016-08-18 14:34:22,206 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,207 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,208 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,209 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,210 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,211 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,212 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,213 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100 MICROSECONDS
2016-08-18 14:34:22,214 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 128 MICROSECONDS
2016-08-18 14:34:22,215 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 256 MICROSECONDS
2016-08-18 14:34:22,216 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 512 MICROSECONDS
2016-08-18 14:34:22,217 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 1024 MICROSECONDS
2016-08-18 14:34:22,219 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 2048 MICROSECONDS
2016-08-18 14:34:22,222 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 4096 MICROSECONDS
2016-08-18 14:34:22,227 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 8192 MICROSECONDS
2016-08-18 14:34:22,236 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 16384 MICROSECONDS
2016-08-18 14:34:22,253 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 32768 MICROSECONDS
2016-08-18 14:34:22,287 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 65536 MICROSECONDS
2016-08-18 14:34:22,354 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100000 MICROSECONDS
2016-08-18 14:34:22,461 [cb-core-3-1] (Slf4JLogger.java) TRACE - Retrying GetBucketConfigRequest{observable=rx.subjects.AsyncSubject@76d95e68, bucket='genbook'} with a delay of 100000 MICROSECONDS

until it times out:

Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:71)
        at com.couchbase.client.java.CouchbaseCluster.openBucket(CouchbaseCluster.java:290)
        at com.couchbase.client.java.CouchbaseCluster.openBucket(CouchbaseCluster.java:265)
        at com.genbook.utilities.actions.IndexCouchbaseAction.connectCouchbase(IndexCouchbaseAction.java:331)
        at com.genbook.utilities.actions.IndexCouchbaseAction.isRunnable(IndexCouchbaseAction.java:209)
        at com.genbook.utilities.UtilityBase.main(UtilityBase.java:147)
Caused by: java.util.concurrent.TimeoutException
        ... 6 more

Now, I’ve seen many posts already about TimeoutExceptions and that Blocking.blockForSingle, it seems to be called from a lot of different places. For example, this might be the most extensive one on that subject: Java Client 2.3.1 : N1QL queries throwing TimeoutException [upgraded to Couchbase 4.5]


#6

on the upside: the trace logging does show keepalive requests now:

2016-08-18 14:34:52,204 [cb-io-1-2] (Slf4JLogger.java) DEBUG - [localhost/127.0.0.1:11210][KeyValueEndpoint]: KeepAlive fired

and then

2016-08-18 14:34:52,212 [cb-io-1-2] (Slf4JLogger.java) TRACE - [localhost/127.0.0.1:11210][KeyValueEndpoint]: Started decoding of KeepAliveRequest{observable=rx.subjects.AsyncSubject@6b11ea46, bucket='null'}

not sure why it thinks it’s for bucket ‘null’ though, it should have been connected to a bucket named ‘genbook’


#7

keepAlive trace shows null for bucket, but it’s just because the keep alive message extends a message class that expect to have a bucket name, but doesn’t provide one in the constructor. very minor omission though.

about Blocking.blockForSingle and TimeoutException, once again I need to reiterate that timeouts are a layer of protection that covers a wide range of operations (basically all of them when called from the sync API).

so a TimeoutException can be a manifestation of a myriad of underlying problems. Anything that causes a slowdown in an operation execution (client side or server side) can lead to such an exception, so further investigation in finding the root issue and root cause is always needed! Never assume that two issues are the same because they both lead to a timeout…

on your issue, it seems like both side report that the other side forcibly closed the socket connection… (unless I’m badly interpreting the server side log). that is weird…

have you found anything else indicative of server-side problems in the logs? like matt said, you should look for process crashing, etc…


#8

well, if I’m not using the server (it’s a local dev 4.1 on my localhost) it happens every 20 minutes, so one entry from this morning in the couchbase server log:

Bucket "microservice" loaded on node 'ns_1@127.0.0.1' in 0 seconds.	ns_memcached000	ns_1@127.0.0.1	09:04:56 - Fri Aug 19, 2016
Bucket "genbook" loaded on node 'ns_1@127.0.0.1' in 0 seconds.	ns_memcached000	ns_1@127.0.0.1	09:04:56 - Fri Aug 19, 2016
Bucket "default" loaded on node 'ns_1@127.0.0.1' in 0 seconds.	ns_memcached000	ns_1@127.0.0.1	09:04:56 - Fri Aug 19, 2016
Control connection to memcached on 'ns_1@127.0.0.1' disconnected: {error,
closed}	ns_memcached000	ns_1@127.0.0.1	09:04:52 - Fri Aug 19, 2016
Service 'memcached' exited with status -1073740940. Restarting. Messages: 2016-08-19T08:45:03.501053+10:00 WARNING (genbook) DCP (Producer) eq_dcpq:secidx:proj-genbook-MAINT_STREAM_TOPIC_bb:33:9:28:af:93:5f:c2-13900620800435568061 - (vb 1019) stream created with start seqno 0 and end seqno 18446744073709551615
2016-08-19T08:45:03.502054+10:00 WARNING (genbook) DCP (Producer) eq_dcpq:secidx:proj-genbook-MAINT_STREAM_TOPIC_bb:33:9:28:af:93:5f:c2-13900620800435568061 - (vb 1020) stream created with start seqno 4 and end seqno 18446744073709551615
2016-08-19T08:45:03.503054+10:00 WARNING (genbook) DCP (Producer) eq_dcpq:secidx:proj-genbook-MAINT_STREAM_TOPIC_bb:33:9:28:af:93:5f:c2-13900620800435568061 - (vb 1021) stream created with start seqno 8 and end seqno 18446744073709551615
2016-08-19T08:45:03.504055+10:00 WARNING (genbook) DCP (Producer) eq_dcpq:secidx:proj-genbook-MAINT_STREAM_TOPIC_bb:33:9:28:af:93:5f:c2-13900620800435568061 - (vb 1022) stream created with start seqno 17 and end seqno 18446744073709551615
2016-08-19T08:45:03.505056+10:00 WARNING (genbook) DCP (Producer) eq_dcpq:secidx:proj-genbook-MAINT_STREAM_TOPIC_bb:33:9:28:af:93:5f:c2-13900620800435568061 - (vb 1023) stream created with start seqno 14 and end seqno 18446744073709551615	ns_log000	ns_1@127.0.0.1	09:04:51 - Fri Aug 19, 2016

is probably related to this in the error log:

[stats:error,2016-08-19T09:04:52.119+10:00,ns_1@127.0.0.1:<0.473.0>:base_stats_collector:handle_info:109](Collector: stats_collector) Exception in stats collector: {exit,
                                                            {{error,closed},
                                                             {gen_server,
                                                              call,
                                                              ['ns_memcached-default',
                                                               {stats,<<>>},
                                                               180000]}},
                                                            [{gen_server,
                                                              call,3,
                                                              [{file,
                                                                "gen_server.erl"},
                                                               {line,188}]},
                                                             {ns_memcached,
                                                              do_call,3,
                                                              [{file,
                                                                "src/ns_memcached.erl"},
                                                               {line,1433}]},
                                                             {stats_collector,
                                                              grab_stats,1,
                                                              [{file,
                                                                "src/stats_collector.erl"},
                                                               {line,68}]},
                                                             {base_stats_collector,
                                                              handle_info,2,
                                                              [{file,
                                                                "src/base_stats_collector.erl"},
                                                               {line,89}]},
                                                             {gen_server,
                                                              handle_msg,5,
                                                              [{file,
                                                                "gen_server.erl"},
                                                               {line,604}]},
                                                             {proc_lib,
                                                              init_p_do_apply,
                                                              3,
                                                              [{file,
                                                                "proc_lib.erl"},
                                                               {line,239}]}]}

[stats:error,2016-08-19T09:04:52.119+10:00,ns_1@127.0.0.1:<0.448.0>:base_stats_collector:handle_info:109](Collector: stats_collector) Exception in stats collector: {exit,
                                                            {{error,closed},
                                                             {gen_server,
                                                              call,
                                                              ['ns_memcached-genbook',
                                                               {stats,<<>>},
                                                               180000]}},
                                                            [{gen_server,
                                                              call,3,
                                                              [{file,
                                                                "gen_server.erl"},
                                                               {line,188}]},
                                                             {ns_memcached,
                                                              do_call,3,
                                                              [{file,
                                                                "src/ns_memcached.erl"},
                                                               {line,1433}]},
                                                             {stats_collector,
                                                              grab_stats,1,
                                                              [{file,
                                                                "src/stats_collector.erl"},
                                                               {line,68}]},
                                                             {base_stats_collector,
                                                              handle_info,2,
                                                              [{file,
                                                                "src/base_stats_collector.erl"},
                                                               {line,89}]},
                                                             {gen_server,
                                                              handle_msg,5,
                                                              [{file,
                                                                "gen_server.erl"},
                                                               {line,604}]},
                                                             {proc_lib,
                                                              init_p_do_apply,
                                                              3,
                                                              [{file,
                                                                "proc_lib.erl"},
                                                               {line,239}]}]}

[stats:error,2016-08-19T09:04:52.135+10:00,ns_1@127.0.0.1:<0.412.0>:base_stats_collector:handle_info:109](Collector: stats_collector) Exception in stats collector: {exit,
                                                            {{error,closed},
                                                             {gen_server,
                                                              call,
                                                              ['ns_memcached-microservice',
                                                               {stats,<<>>},
                                                               180000]}},
                                                            [{gen_server,
                                                              call,3,
                                                              [{file,
                                                                "gen_server.erl"},
                                                               {line,188}]},
                                                             {ns_memcached,
                                                              do_call,3,
                                                              [{file,
                                                                "src/ns_memcached.erl"},
                                                               {line,1433}]},
                                                             {stats_collector,
                                                              grab_stats,1,
                                                              [{file,
                                                                "src/stats_collector.erl"},
                                                               {line,68}]},
                                                             {base_stats_collector,
                                                              handle_info,2,
                                                              [{file,
                                                                "src/base_stats_collector.erl"},
                                                               {line,89}]},
                                                             {gen_server,
                                                              handle_msg,5,
                                                              [{file,
                                                                "gen_server.erl"},
                                                               {line,604}]},
                                                             {proc_lib,
                                                              init_p_do_apply,
                                                              3,
                                                              [{file,
                                                                "proc_lib.erl"},
                                                               {line,239}]}]}

[ns_server:error,2016-08-19T09:04:56.150+10:00,ns_1@127.0.0.1:ns_doctor<0.221.0>:ns_doctor:update_status:308]The following buckets became not ready on node 'ns_1@127.0.0.1': ["default",
                                                                  "genbook",
                                                                  "microservice"], those of them are active ["default",
                                                                                                             "genbook",
                                                                                                             "microservice"]

But I have no idea how to interpret that or see what’s going wrong. I’m running a single node couchbase 4.1: Version: 4.1.0-5005 Community Edition (build-5005) on windows 10 (anniversary edition), version 1607 (Build 14393.51)


#9

Actually, both of those probably have the same root cause. It looks like the memcached process is either dropping connections or exiting.

I’d recommend doing a cbcollect_info (the documentation covers how) and posting it to an issue on the issue tracker. Or, if you have an enterprise subscription the support folks may be able to look at it more quickly.

I’d also mention, it might just be worth looking at the memcached process logs (docs have details) to see if you can spot a cause.


#10

yeah… looked at the memcached ones, they just restart every 20 minutes and then abruptly stop, pretty consistent with the other logging. but the reports.log has more

[error_logger:error,2016-08-19T09:04:51.353+10:00,ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]Supervisor received unexpected message: {ack,<0.3200.27>,
                                         {error,
                                          {shutdown,
                                           {memcached_port_server_down,
                                            <11435.263.0>,
                                            {abnormal,-1073740940}}}}}

so I guess I’ll do that cbcollect_info and post this on the issue tracker.


#11

Indeed. The port_server is what starts and monitors the memcache process. So if it goes down, it’ll restart it.


#12

hmm, seems I need an account to post there?


#13

Yes, there is a consolidation of logins the IT team is working on. Maybe @igor can assist? Let me include him in an email note.


Unable to deserialize JsonObject and getting TimeoutException when extreme load test on Couchbase?
#14

it’s been a while since the last post, not being able to provide those logs either (they are huge), but it’s still happening, every 20 minutes sharp the same exit -1073740940