Java client occasionally leaks (non-daemon) netty IO threads after shutdown

We’re using CouchbaseClient v1.4 from a Java server with moderately high concurrency.
When we shut down the Couchbase client, it usually comes down cleanly, but we occasionally see lingering Netty threads:

"New I/O  worker #1" prio=10 tid=0x00000000014a3000 nid=0x58bd runnable [0x00007fc7c2f88000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x000000059a590988> (a sun.nio.ch.Util$2)
        - locked <0x000000059a5909a0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000059a5af320> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:52)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:223)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"Memcached IO over {MemcachedConnection to /10.6.70.3:11210 /10.6.70.2:11210 /10.6.70.1:11210}" prio=10 tid=0x00007fc7dcd48800 nid=0x58b8 runnable [0x00007fc7c38f1000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x000000059e331fe0> (a sun.nio.ch.Util$2)
        - locked <0x000000059e331ff8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000059a513d90> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:222)
        at com.couchbase.client.CouchbaseMemcachedConnection.run(CouchbaseMemcachedConnection.java:158)

This prevents a clean shutdown of the Java server and forces us to resort to “kill -9” to clean up.
We figured out how to make the second thread start as ‘daemon’, but the “New I/O worker” thread can’t be converted to daemon without gutting the client libs pretty heavily.

Has anyone else seen this? Any suggestions for workarounds?
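
For anyone trying to confirm the same symptom, here is a minimal sketch of a helper that lists the live non-daemon threads still running after the client has been shut down. The class and method names (LingeringThreads, liveNonDaemonThreads) are made up for illustration and are not part of the Couchbase client; it only uses the standard Thread API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of any Couchbase library.
final class LingeringThreads {
  private LingeringThreads() {}

  /** Returns all live non-daemon threads except the calling thread. */
  static List<Thread> liveNonDaemonThreads() {
    List<Thread> result = new ArrayList<Thread>();
    // getAllStackTraces() returns a snapshot keyed by every live thread.
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
        result.add(t);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Call this after shutting the client down; any thread listed here
    // can keep the JVM from exiting.
    for (Thread t : liveNonDaemonThreads()) {
      System.out.println(t.getName() + " state=" + t.getState());
    }
  }
}
```

Logging this list right after shutdown should show whether a “New I/O worker” or “Memcached IO” thread is the one keeping the JVM alive, without needing a full thread dump.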

Patch to force Couchbase to start NIO threads as ‘daemons’ to permit clean server shutdown:

--- /tmp/BucketMonitor.java	2014-05-16 10:04:58.000000000 -0700
+++ src/main/java/com/couchbase/client/vbucket/BucketMonitor.java	2014-05-16 10:04:16.000000000 -0700
@@ -97,13 +97,28 @@
     this.configParser = configParser;
     this.host = cometStreamURI.getHost();
     this.port = cometStreamURI.getPort() == -1 ? 80 : cometStreamURI.getPort();
-    factory = new NioClientSocketChannelFactory(Executors.newCachedThreadPool(),
-      Executors.newCachedThreadPool());
+    factory = new NioClientSocketChannelFactory(newThreadPool(),
+      newThreadPool());
     this.headers = new HttpMessageHeaders();
       this.provider = provider;
   }
 
   /**
+   * Creates an executor based on a simple thread pool that only
+   * uses 'daemon' threads.
+   */
+  private java.util.concurrent.Executor newThreadPool() {
+    return Executors.newCachedThreadPool(
+      new java.util.concurrent.ThreadFactory() {
+        public Thread newThread(Runnable r) {
+          Thread t = new Thread(r);
+          t.setDaemon(true);
+          return t;
+        }
+      });
+  }
+
+  /**
    * Take any action required when the monitor appears to be disconnected.
    */
   protected void notifyDisconnected() {
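
The same idea as a standalone sketch, in case it helps to try it outside the client source: a cached thread pool whose ThreadFactory marks every thread as a daemon. The names here (DaemonPools, daemonFactory, newDaemonCachedPool, the "cb-io" prefix) are made up for illustration. Naming the threads is an addition that the patch above does not make, and Netty may rename worker threads anyway (note ThreadRenamingRunnable in the stack trace).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the Couchbase client.
final class DaemonPools {
  private DaemonPools() {}

  /** Returns a factory whose threads are daemons named "<prefix>-<n>". */
  static ThreadFactory daemonFactory(final String prefix) {
    final AtomicInteger counter = new AtomicInteger();
    return new ThreadFactory() {
      @Override
      public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
        // Daemon threads do not keep the JVM alive at shutdown.
        t.setDaemon(true);
        return t;
      }
    };
  }

  /** Cached thread pool backed by daemon threads only. */
  static ExecutorService newDaemonCachedPool(String prefix) {
    return Executors.newCachedThreadPool(daemonFactory(prefix));
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = newDaemonCachedPool("cb-io");
    boolean daemon = pool.submit(new Callable<Boolean>() {
      @Override
      public Boolean call() {
        return Thread.currentThread().isDaemon();
      }
    }).get();
    System.out.println("worker is daemon: " + daemon);
    pool.shutdown();
  }
}
```

This only hides the leak: the threads still outlive the client, they just no longer block JVM exit. Whatever is keeping the channel factory from being released on shutdown would still need a proper fix.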

@Dengberg, do you have some example code that shows this behaviour? I tried to reproduce it, but no luck so far.

I haven’t been able to reproduce this in isolated unit tests. We (Evernote) run around 500 Java7+Tomcat servers that receive a lot of activity over the course of a week, and see this problem come up in the wild around 20% of the time when we try to shut down. So it’s a bit hard to narrow down the exact conditions that cause the thread leakage.

Our current workaround:

--- /tmp/BucketMonitor.java 2014-05-16 10:04:58.000000000 -0700
+++ src/main/java/com/couchbase/client/vbucket/BucketMonitor.java 2014-05-16 10:04:16.000000000 -0700
@@ -97,13 +97,28 @@
     this.configParser = configParser;
     this.host = cometStreamURI.getHost();
     this.port = cometStreamURI.getPort() == -1 ? 80 : cometStreamURI.getPort();
-    factory = new NioClientSocketChannelFactory(Executors.newCachedThreadPool(),
-      Executors.newCachedThreadPool());
+    factory = new NioClientSocketChannelFactory(newThreadPool(),
+      newThreadPool());
     this.headers = new HttpMessageHeaders();
     this.provider = provider;
   }
 
   /**
+   * Creates an executor based on a simple thread pool that only
+   * uses 'daemon' threads.
+   */
+  private java.util.concurrent.Executor newThreadPool() {
+    return Executors.newCachedThreadPool(
+      new java.util.concurrent.ThreadFactory() {
+        public Thread newThread(Runnable r) {
+          Thread t = new Thread(r);
+          t.setDaemon(true);
+          return t;
+        }
+      });
+  }
+
+  /**
    * Take any action required when the monitor appears to be disconnected.

Hi, there is a high chance that this is a bug. I’ll go investigate - can you open a ticket here in the meantime? http://www.couchbase.com/issues/browse/JCBC

Was there a bug filed for this issue?
I have been experiencing similar problems with 1.4.4, where Netty IO threads are stuck in a WAIT state. It’s frustrating that we cannot reproduce this in LnP, yet we hit it in production during heavy traffic hours. When it occurs, the application server stops taking traffic, and restarting the app server alone doesn’t help; we have to reboot the VM to kill the dangling connections.
Here is the stack from the thread dump we took when this happened.

Thread Name: Couchbase View Thread for node cbnibslc02-289848/10.120.159.104:8092
State: Waiting on condition
Java Stack
at sun/nio/ch/EPollArrayWrapper.epollWait(Native Method)
at sun/nio/ch/EPollArrayWrapper.poll(EPollArrayWrapper.java:228(Compiled Code))
at sun/nio/ch/EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81(Compiled Code))
at sun/nio/ch/SelectorImpl.lockAndDoSelect(SelectorImpl.java:87(Compiled Code))
at sun/nio/ch/SelectorImpl.select(SelectorImpl.java:98(Compiled Code))
at org/apache/http/impl/nio/reactor/AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:305)
at com/couchbase/client/http/AsyncConnectionManager.execute(AsyncConnectionManager.java:89)
at com/couchbase/client/ViewNode$1.run(ViewNode.java:89)
at java/lang/Thread.run(Thread.java:780)
Native Stack
(0x00007F0D06106052 [libj9prt26.so+0x13052])
(0x00007F0D061136CF [libj9prt26.so+0x206cf])
(0x00007F0D06105D9B [libj9prt26.so+0x12d9b])
(0x00007F0D06105E97 [libj9prt26.so+0x12e97])
(0x00007F0D061136CF [libj9prt26.so+0x206cf])
(0x00007F0D061059BB [libj9prt26.so+0x129bb])
(0x00007F0D060FF812 [libj9prt26.so+0xc812])
(0x00007F0D07252B40 [libpthread.so.0+0xfb40])
pthread_cond_wait+0xca (0x00007F0D0724EA9A [libpthread.so.0+0xba9a])
(0x00007F0D0634D0CF [libj9thr26.so+0x80cf])
(0x00007F0D064BA859 [libj9vm26.so+0x63859])
(0x00007F0D052DE6D9 [libj9jit26.so+0x5b96d9])

thanks