Java SDK memory leak with a bad cluster


#1

Found a memory leak in Java SDK 2.1.6. It happens only under specific conditions, namely when the cluster itself has problems.
Preconditions:

  1. Created cluster_1 with 3 nodes.
  2. The 3rd node has a problem with HDD capacity: the disk is full.
  3. On the client we can occasionally get the error com.couchbase.client.java.error.CouchbaseOutOfMemoryException, but there is no memory leak yet.
  4. We have a Couchbase client (Java SDK) that periodically reconnects to the Couchbase cluster.

Conditions:
5. Start XDCR replication from an external Couchbase cluster_2 to our problem cluster_1.
6. On Couchbase cluster_1 we can see messages like these:

[14:38:25] - Approaching full disk warning. Usage of disk “F:” on node “172.20.112.93” is around 91%.
[14:38:25] - Approaching full disk warning. Usage of disk “F:” on node “srv3” is around 100%.
[14:38:25] - Hard Out Of Memory Error. Bucket “ufm_2” on node srv3 is full. All memory allocated to this bucket is used for metadata.

  7. And now we can see that the following classes leak memory (a coarse way to watch the growth from inside the test loop is sketched below):
    com.couchbase.client.deps.io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
    com.couchbase.client.core.ResponseEvent
    com.couchbase.client.core.RequestEvent
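
Per-class instance counts like the ones above need a profiler or a jmap -histo heap histogram, but a coarse heap trend can be logged from the loop itself. A minimal sketch using the standard JMX memory bean (the HeapWatch class is just for illustration):

package com.ufm.api;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public final class HeapWatch {

    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    // Logs used heap; with the leak, even the post-GC value keeps climbing
    // across connect/disconnect cycles.
    public static void log(int iteration) {
        System.gc(); // only a hint, but good enough to expose a monotonic trend
        long used = MEMORY.getHeapMemoryUsage().getUsed();
        System.out.printf("iteration %d: heap used = %d bytes%n", iteration, used);
    }
}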

Example code to model the problem:

package com.ufm.api;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
import org.junit.Test;

public class MemoryLeakTest {

    @Test
    public void testCBMemoryLeak() throws Exception {
        // Endless connect/disconnect cycle against the unhealthy cluster;
        // heap usage grows over time while this runs.
        while (true) {
            ConnectionContainer connectionContainer = getConnectionContainer();
            Thread.sleep(200);
            closeConnection(connectionContainer);
            Thread.sleep(200);
        }
    }

    private void closeConnection(ConnectionContainer connectionContainer) {
        connectionContainer.getBucket().close();
        connectionContainer.getCouchbaseCluster().disconnect();
        // Note: the custom CouchbaseEnvironment is not shut down here; a real
        // application should also shut it down when finished, since each
        // environment owns its own I/O threads and buffer pools.
    }

    private ConnectionContainer getConnectionContainer() {
        ConnectionContainer connectionContainer = new ConnectionContainer();

        // A fresh environment per cycle, with a 5-second key/value timeout.
        CouchbaseEnvironment couchbaseEnvironment = DefaultCouchbaseEnvironment.builder()
                .kvTimeout(5000L)
                .build();
        connectionContainer.setCouchbaseEnvironment(couchbaseEnvironment);

        CouchbaseCluster cluster = CouchbaseCluster.create(couchbaseEnvironment, "http://srv3:8091");
        connectionContainer.setCouchbaseCluster(cluster);

        Bucket bucket = cluster.openBucket("ufm_2", "1111");
        connectionContainer.setBucket(bucket);
        return connectionContainer;
    }

    // Simple holder for the environment, cluster, and bucket of one cycle.
    private class ConnectionContainer {
        private CouchbaseEnvironment couchbaseEnvironment;
        private CouchbaseCluster couchbaseCluster;
        private Bucket bucket;

        public CouchbaseEnvironment getCouchbaseEnvironment() {
            return couchbaseEnvironment;
        }

        public void setCouchbaseEnvironment(CouchbaseEnvironment couchbaseEnvironment) {
            this.couchbaseEnvironment = couchbaseEnvironment;
        }

        public CouchbaseCluster getCouchbaseCluster() {
            return couchbaseCluster;
        }

        public void setCouchbaseCluster(CouchbaseCluster couchbaseCluster) {
            this.couchbaseCluster = couchbaseCluster;
        }

        public Bucket getBucket() {
            return bucket;
        }

        public void setBucket(Bucket bucket) {
            this.bucket = bucket;
        }
    }
}

#2

Hi @dpozhidaev, sorry for the late answer…
Looks similar to something reported in a Netty bug: https://github.com/netty/netty/issues/4134

There is a possible workaround, but it would require you to upgrade the SDK from 2.1.6 to 2.2.7 (the latest in the current series), because it requires a newer version of Netty.

The workaround is to disable the Netty recycler (its object pooling) by starting the JVM with the following option:

-Dcom.couchbase.client.deps.io.netty.recycler.maxCapacity=0

This will, however, produce more GC pressure. The upstream Netty bug and its impact on the SDK are tracked in our own ticket, JCBC-951.
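
If changing the JVM launch flags isn't convenient, the same property can be set programmatically instead, as long as it happens before the first Couchbase (and therefore shaded Netty) class is loaded, since the recycler reads the property once at class initialization. A minimal sketch (the Bootstrap class name is just for illustration):

public final class Bootstrap {

    public static void main(String[] args) {
        // Must run before any com.couchbase.client class is loaded, because
        // the shaded Netty recycler reads this property only once, statically.
        System.setProperty(
                "com.couchbase.client.deps.io.netty.recycler.maxCapacity", "0");

        // ... only now create the CouchbaseEnvironment / CouchbaseCluster ...
    }
}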

:warning: Keep in mind there have been a few behavioral changes in 2.2.0 (most notably, in the async API no request is triggered until you call subscribe(...) on an Observable). See the release notes for the 2.2.x series.
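
For example, a sketch assuming the 2.2.x async API (RxJava 1.x); the document id is made up:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import rx.Observable;

class LazyAsyncExample {

    static void fetch(Bucket bucket) {
        // Building the Observable does not send anything over the network yet.
        Observable<JsonDocument> pending = bucket.async().get("some-doc-id");

        // The GET request is dispatched only once a subscriber attaches.
        pending.subscribe(
                doc -> System.out.println("got " + doc.id()),
                Throwable::printStackTrace);
    }
}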


#3

Thank you.
Moved to the newest SDK (2.2.7), but it didn’t help. Thanks for the advice about maxCapacity - I will try it.


#4

@dpozhidaev did setting the capacity help? I saw that there have been some upstream changes to the Netty recycler, which we can pick up once they are released in later Java SDK versions…