Error while consuming KeepAliveResponse

snmueller · November 16, 2017, 11:55pm

Hi,
I am occasionally seeing the following error when using the Java SDK:

2017-11-16 20:29:05,635 [RxComputationScheduler-1] WARN AbstractGenericHandler - [KeyValueEndpoint]: Got error while consuming KeepAliveResponse.
java.util.concurrent.TimeoutException
at rx.internal.operators.OperatorTimeoutBase$TimeoutSubscriber.onTimeout(OperatorTimeoutBase.java:177)
at rx.internal.operators.OperatorTimeout$1$1.call(OperatorTimeout.java:41)
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189)
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Does anyone know what causes this? I am using Java SDK 2.47 and Couchbase Server 4.2

Thanks,
Stephen

synesty · September 20, 2018, 11:31am

We have this too sporadically:
Couchbase 5.0.1
Java SDK 2.5.2

2018-09-20 13:03:58,736 WARN  - [/IPOfCBServer:8093][QueryEndpoint]: Got error while consuming KeepAliveResponse. - [RxComputationScheduler-2] c.c.c.c.l.JdkLogger
java.util.concurrent.TimeoutException: null
        at rx.internal.operators.OnSubscribeTimeoutTimedWithFallback$TimeoutMainSubscriber.onTimeout(OnSubscribeTimeoutTimedWithFallback.java:166) [rxjava-1.3.3.jar:1.3.3]
        at rx.internal.operators.OnSubscribeTimeoutTimedWithFallback$TimeoutMainSubscriber$TimeoutTask.call(OnSubscribeTimeoutTimedWithFallback.java:191) [rxjava-1.3.3.jar:1.3.3]
        at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [rxjava-1.3.3.jar:1.3.3]
        at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [rxjava-1.3.3.jar:1.3.3]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_131]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_131]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
Sep 20, 2018 1:03:58 PM com.couchbase.client.core.endpoint.AbstractGenericHandler$KeepAliveResponseAction onError
WARNING: [/IPOfCBServer:11210][KeyValueEndpoint]: Got error while consuming KeepAliveResponse.
java.util.concurrent.TimeoutException

hard disks have enough space left
CPU load seems reasonble (no process / thread seems stuck at 100% )

Any hints where we could look into it?
Can we increase some timeouts dynamically at runtime? e.g. .keepAliveTimeout() ?

ravikrn.13 · September 26, 2018, 7:12am

We too have this.
We are using Couchbase 5.1.1
Java SDK 2.6.0

Please help !!

daschl · September 26, 2018, 7:25am

@synesty you can try to disable continuous keepalives on the environment and see if it makes a difference. the environment also has options to tweak those timeouts and settings IIRC.

It seems to be happening on 11210 (kv), do you have lots of traffic on this port? If it just happens that the keepalives are stuck behind kv ops, disabling the continuous keepalive might help.

@ravikrn.13 usually this is an indication of a slow network or something else going on - it is basically a safety net. See the answer above also for tips and pointers.

ingenthr · September 26, 2018, 6:40pm

MB-10291 could be involved here as well.

@ravikrn.13: you may want to see if it is consistently the same host, or if it’s multiple hosts. Since you’re on client 2.6, the Response Time Observability feature should give you more information in the logs. See the docs and the blogs for more info.

Also, a general recommendation, update to SDK 2.5.7 or later, as there is an issue where name resolution problems can block these operations. That’ll only be in play if you have name resolution issues in your environment, but Kubernetes and GKE are known to have some.

ravikrn.13 · October 5, 2018, 7:02am

This is happening in only one of our hosts (client side).

We have Operations over threshold warning with below output.

[  
   {  
      "top":[  
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFF5D0FC14",
            "encode_us":25,
            "last_local_address":"10.64.107.123:42300",
            "last_remote_address":"10.64.106.185:11210",
            "last_dispatch_us":176303,
            "last_operation_id":"0x378b86",
            "total_us":2607870
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFF5D0FC14",
            "encode_us":29,
            "last_local_address":"10.64.107.123:42300",
            "last_remote_address":"10.64.106.185:11210",
            "last_dispatch_us":176517,
            "last_operation_id":"0x378b7e",
            "total_us":2613914
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFF5D0FC14",
            "encode_us":27,
            "last_local_address":"10.64.107.123:42300",
            "last_remote_address":"10.64.106.185:11210",
            "last_dispatch_us":176531,
            "last_operation_id":"0x378b75",
            "total_us":2614746
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFFAC3DEF5",
            "encode_us":3460,
            "last_local_address":"10.64.107.123:55212",
            "last_remote_address":"10.64.106.186:11210",
            "last_dispatch_us":140,
            "last_operation_id":"0x378af5",
            "total_us":2736595
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFC538D2C9",
            "encode_us":19791,
            "last_local_address":"10.64.107.123:42340",
            "last_remote_address":"10.64.106.185:11210",
            "last_dispatch_us":64389,
            "last_operation_id":"0x378aed",
            "total_us":2868818
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFC538D2C9",
            "encode_us":2304,
            "last_local_address":"10.64.107.123:42340",
            "last_remote_address":"10.64.106.185:11210",
            "last_dispatch_us":24448,
            "last_operation_id":"0x378aef",
            "total_us":2905684
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFFAC3DEF5",
            "encode_us":507,
            "last_local_address":"10.64.107.123:55212",
            "last_remote_address":"10.64.106.186:11210",
            "last_dispatch_us":1164,
            "last_operation_id":"0x378ade",
            "total_us":3262958
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/00000000069C4B27",
            "encode_us":5051911,
            "last_local_address":"10.64.107.123:59646",
            "last_remote_address":"10.64.106.184:11210",
            "last_dispatch_us":22756,
            "last_operation_id":"0x378ae6",
            "total_us":7267507
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/00000000069C4B27",
            "encode_us":5105711,
            "last_local_address":"10.64.107.123:59646",
            "last_remote_address":"10.64.106.184:11210",
            "last_dispatch_us":95046,
            "last_operation_id":"0x378ae7",
            "total_us":7335759
         },
         {  
            "operation_name":"upsert",
            "last_local_id":"1C41D2DCB9C21B23/FFFFFFFFFAC3DEF5",
            "encode_us":473,
            "last_local_address":"10.64.107.123:55212",
            "last_remote_address":"10.64.106.186:11210",
            "last_dispatch_us":1350724,
            "last_operation_id":"0x378ae5",
            "total_us":10531245
         }
      ],
      "service":"kv",
      "count":770
   }
]

Appreciate any help.

daschl · October 5, 2018, 7:32am

@ravikrn.13 according to the debug info, you have upsert requests taking 7 or even 10 seconds to complete. that is a long time I think.

Now if you look at the upsert that has a total time of 10s, it seems to be spending roughly 1.3s on the network and in the IO layer, so there is a significant gap here. Have you experienced longer stalls in your app and/or are you performing GC logging to make sure there are no stop the world pauses in your app?

ravikrn.13 · October 5, 2018, 10:01am

@daschl : Thanks for the reply.

We are running services in 4 instances of same application and the issue is seen in only 1 of the instance. To rule out N/W issues, all the other 3 applications are fine during the same period .
We are taking GC logs at regular intervals (30 mins).

We are seeing spike in Couchbase objects.

209072        6690304  com.couchbase.client.core.env.CoreScheduler$EventLoopWorker
209070      113734080  rx.internal.util.unsafe.SpscArrayQueue

Number of instances is 209072
Number of bytes is 6690304

Is this a cause of concern?

ingenthr · October 6, 2018, 12:33am

I’ll defer to @subhashni and @daschl who know this area better, but it looks to me like it might be possible that your 4th instance has some slightly different code or environment that is somehow creating/leaking Bucket instances and thus the resources managed underneath? A quick heap dump will probably confirm.

ravikrn.13 · October 8, 2018, 5:03am

As suggested we took heap dump and we have a leak suspect:

One instance of "rx.subjects.PublishSubject$PublishSubjectState" loaded by "com.tfsc.ilabs.core.cache" occupies 532,077,400 (53.29%) bytes. The memory is accumulated in one instance of "rx.subjects.PublishSubject$PublishSubjectProducer[]" loaded by "com.tfsc.core.cache".

Keywords
rx.subjects.PublishSubject$PublishSubjectProducer[]
rx.subjects.PublishSubject$PublishSubjectState

daschl · October 8, 2018, 5:06am

@ravikrn.13 I think at this point we need to start to look at actual code. What code is different for this one application server? The keywords shared are generic to rx and don’t help with pinpointing what is going on.

One thing to double check is that you use the Bucket instance as a singleton. on startup create and only destroy on shutdown. Is that in place?

The PublishSubject is only internally used for config management and there should only be one around per Bucket instance so this in theory also points to the fact that these instances are created a lot (maybe on a request basis?)

ravikrn.13 · October 8, 2018, 6:20am

Thanks @daschl.

Found out the root cause. We are using eventBus in an non-singleton way. Overtime created way too many objects which led to the high heap usage of the application causing the operations to slow down including keepalive operations.

daschl · October 8, 2018, 6:47am

@ravikrn.13 glad to hear you found it!