java.lang.OutOfMemoryError: Direct buffer memory

Background:
- Java 6
- Couchbase Server 3.0.1
- Couchbase Java client API 2.1.4
- Ubuntu Server
- 25-node cluster

We ran into an out-of-memory exception in the logs and believe it was triggered by a Couchbase node going down or getting into a bad state. Even after we recovered the bad node, this exception eventually shows up in the service logs (we believe restarting the service fixes the problem). The logs keep generating the same exceptions and the service GCs continuously.

I found this bug that seems to be related: https://issues.couchbase.com/browse/KAFKAC-9, but it doesn’t apply to the Java client SDK.

This is the exception that was found:

WARNING endpoint.Endpoint [tid-49]: Error during IO write phase.
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at com.couchbase.client.deps.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:437)
at com.couchbase.client.deps.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179)
at com.couchbase.client.deps.io.netty.buffer.PoolArena.allocate(PoolArena.java:168)
at com.couchbase.client.deps.io.netty.buffer.PoolArena.allocate(PoolArena.java:98)
at com.couchbase.client.deps.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:250)
at com.couchbase.client.deps.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
at com.couchbase.client.deps.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel.newDirectBuffer(AbstractNioChannel.java:394)
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioByteChannel.filterOutboundMessage(AbstractNioByteChannel.java:273)
at com.couchbase.client.deps.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:663)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1113)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:691)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:626)
at com.couchbase.client.deps.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:266)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:691)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:626)
at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:127)
at com.couchbase.client.deps.io.netty.channel.CombinedChannelDuplexHandler.write(CombinedChannelDuplexHandler.java:192)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:691)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:626)
at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:112)
at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageCodec.write(MessageToMessageCodec.java:116)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
(…more…)

@chan00 can you clarify a bit on your use case?

  • Are you using the Java SDK through its Kafka connector only?
  • Can you share the code for how you are using it?

Pinging @avsej just in case, since he is in charge of the Kafka connector.

@daschl
can you clarify a bit on your use case?
We’re using the Java client API to perform CRUD operations on two different buckets. Both share the same DefaultCouchbaseEnvironment. There’s about ~3,000 QPS per bucket right now.
We have been running the 25-node Couchbase cluster in production for about 1-2 months now, and this is the first time this has happened. One of the Couchbase nodes had “CPU soft lockups”, and ~30 minutes later our service started to GC continuously, around the same time the exceptions started. Eventually all of our service instances (25 instances) ended up in this state, even after we recovered the Couchbase node. We needed to restart all the service instances, and then everything was happy again.

  • Are you using the Java SDK through its Kafka connector only?
    No, we’re only using com.couchbase.client.java-client v2.1.4. The exception in KAFKAC-9 was very similar, so I believe it might be related.

  • Can you share the code for how you are using it?
    We create the connection to the bucket like so:

    val env = DefaultCouchbaseEnvironment
      .builder()
      .kvTimeout(1000L)
      .computationPoolSize(32)
      .ioPoolSize(24)
      .build()

    // the cluster is created once from this shared environment,
    // roughly: CouchbaseCluster.create(env, nodes)
    val bucket = cluster.openBucket(bucket.name, bucket.password)

Here are a couple of samples of how we’re using the bucket (we have a whole set of CRUD operations):

    Observable.from(docs.toArray)
      .flatMap(new Func1[JsonDocument, Observable[JsonDocument]]() {
        override def call(docToInsert: JsonDocument): Observable[JsonDocument] = {
          bucket.async().upsert(docToInsert)
        }
      })
      .toList()
      .single()

    Observable.from(ids.toArray)
      .flatMap(new Func1[String, Observable[JsonDocument]]() {
        override def call(id: String): Observable[JsonDocument] = {
          bucket.async().remove(id)
        }
      })
      .toList()
      .single()

The stack trace from the exception doesn’t provide enough information about what triggered the issue.

A bit more data. In addition to the exceptions in the previous reply, a bunch of these exceptions show up as well:

com.couchbase.client.core.RequestCancelledException: Request cancelled in-flight.
    at com.couchbase.client.core.endpoint.AbstractGenericHandler.handleOutstandingOperations(AbstractGenericHandler.java:281)
    at com.couchbase.client.core.endpoint.AbstractGenericHandler.handlerRemoved(AbstractGenericHandler.java:262)
    at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.callHandlerRemoved0(DefaultChannelPipeline.java:527)
    at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.callHandlerRemoved(DefaultChannelPipeline.java:521)
    at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.remove0(DefaultChannelPipeline.java:351)
    at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.destroyDown(DefaultChannelPipeline.java:798)
    at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.destroyUp(DefaultChannelPipeline.java:767)
    at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.destroy(DefaultChannelPipeline.java:759)
    at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:743)
    at com.couchbase.client.deps.io.netty.channel.AbstractChannel$AbstractUnsafe$6.run(AbstractChannel.java:615)
    at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
    at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

Is it possible the memory is not released when requests are cancelled in flight?

Hi @chan00,

so a Request cancelled in-flight exception happens when the socket is closed and all requests currently on that socket are cancelled. Normally they should be released properly, but we can do the following to make sure:

  • What types of API are you exactly using?
  • In the logs, do you see leaks reported by netty?
  • How often do you see those cancelledExceptions? Do you have an environment to test your application?

One thing I’d recommend: if you have an integration system, enable Netty’s paranoid leak detection (don’t do it in prod, it will slow things down a lot) and then check the logs after keeping it running for some time.

    static {
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }

Alternatively you could try setting it to ADVANCED, even in production. I think it will slow things down a little bit, but it might be an acceptable tradeoff. Again, please check how much your performance decreases before putting your prod instances at risk :smile:
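A minimal sketch of the ADVANCED variant, assuming the SDK’s repackaged Netty classes (the com.couchbase.client.deps package prefix is taken from the stack traces above; with stock Netty it would be io.netty.util):

    // Sketch: enable Netty's ADVANCED leak detection at class-load time.
    // The import assumes the repackaged Netty shipped inside the Couchbase Java SDK.
    import com.couchbase.client.deps.io.netty.util.ResourceLeakDetector;

    public class LeakDetectionConfig {
        static {
            ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.ADVANCED);
        }
    }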

  • What types of API are you exactly using?
    Here are all the Couchbase APIs we’re using. All of the requests are made async:

    bucket.async().replace(doc).subscribe(…)

    bucket.async().insert(doc).subscribe(…)

    Observable.from(docs.toArray)
      .flatMap(new Func1[JsonDocument, Observable[JsonDocument]]() {
        override def call(docToInsert: JsonDocument): Observable[JsonDocument] = {
          bucket.async().upsert(docToInsert)
        }
      })
      .toList()
      .single()

    Observable.from(ids.toArray)
      .flatMap(new Func1[String, Observable[JsonDocument]]() {
        override def call(id: String): Observable[JsonDocument] = {
          bucket.async().remove(id)
        }
      })
      .toList()
      .single()
      .subscribe(…)

    Observable.from(ids.toArray)
      .flatMap(new Func1[String, Observable[JsonDocument]]() {
        override def call(id: String): Observable[JsonDocument] = {
          bucket.async().get(id)
        }
      })
      .toList()
      .single()
      .subscribe()

  • In the logs, do you see leaks reported by netty?
    No, we don’t see any Netty-related errors in the logs.

  • How often do you see those cancelledExceptions? Do you have an
    environment to test your application?
    We only see this after we go into GC hell. On a single node, we saw ~30 CancelledExceptions compared to over 5,000 OutOfMemoryErrors, so the cancellations might actually be a result of the OutOfMemoryError; we see the OutOfMemoryError before the CancelledException. We believe we don’t allocate enough direct memory: the old gen is not full, yet we still GC, and java.nio.Bits.reserveMemory explicitly calls System.gc(). Does Couchbase have a recommendation on the size we should set for the JVM’s MaxDirectMemorySize? (A rough sketch for watching direct buffer usage is just below.)
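For reference, a minimal monitoring sketch for comparing actual direct buffer usage against whatever -XX:MaxDirectMemorySize is set to. This is only an assumption on how one could check it, not something from this thread, and java.lang.management.BufferPoolMXBean requires Java 7+, while the service above runs Java 6:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class DirectMemoryStats {
        public static void main(String[] args) {
            // The "direct" pool is what java.nio.Bits accounts against
            // -XX:MaxDirectMemorySize; "mapped" covers memory-mapped files.
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                        pool.getName(), pool.getCount(),
                        pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }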

I’ll give the ResourceLeakDetector a try, but we haven’t been able to repro these types of issues in our test environment. We ran into this issue after ~2 months of running the service in production. Thanks.

Here’s the reserveMemory code if you’re interested:

static void reserveMemory(long size, int cap) {
    synchronized (Bits.class) {
        if (!memoryLimitSet && VM.isBooted()) {
            maxMemory = VM.maxDirectMemory();
            memoryLimitSet = true;
        }
        // -XX:MaxDirectMemorySize limits the total capacity rather than the
        // actual memory usage, which will differ when buffers are page
        // aligned.
        if (cap <= maxMemory - totalCapacity) {
            reservedMemory += size;
            totalCapacity += cap;
            count++;
            return;
        }
    }
    // No room left under the limit, so hint the JVM to run a full GC; that lets
    // cleaners of unreachable DirectByteBuffers release their native memory
    // (a no-op if -XX:+DisableExplicitGC is set).
    System.gc();

    try {
        Thread.sleep(100);
    } catch (InterruptedException x) {
        // Restore interrupt status
        Thread.currentThread().interrupt();
    }
    synchronized (Bits.class) {
        if (totalCapacity + cap > maxMemory)
            throw new OutOfMemoryError("Direct buffer memory");
        reservedMemory += size;
        totalCapacity += cap;
        count++;
    }
}