Invalid connection behavior when using load balancer hostname


#1

Hi everyone,

I have a big issue with the Couchbase client 2,7,x. I don’t know whether prior client versions have the same issue.

In our context, we just updated our Couchbase servers from 2,2 to 5,1,1. We use a load balancer in front of our Couchbase servers: cb,domain,com. This LB is linked to two servers: cb1,domain,com and cb2,domain,com. The servers have 3 buckets. The problem is that we are unable to connect and open buckets using cb,domain,com. Everything runs fine using cb1,domain,com or cb2,domain,com.

At the application startup, we use this piece of code to open our buckets and share them:

    Stream,of(bucketNames),forEach(bucketName -> {
        System,out,println("Opening " + bucketName);
        // Key the shared map by the bucket name being opened
        this,bucketMap,put(bucketName, this,cluster,openBucket(bucketName, "pazzword"));
    });

If bucketNames contains only 1 entry, it works fine. As soon as we add more bucket names, it fails at the second openBucket with this error:

java,lang,RuntimeException: java,util,concurrent,TimeoutException
at com,couchbase,client,core,utils,Blocking,blockForSingle(Blocking,java:74)
at com,couchbase,client,java,CouchbaseCluster,openBucket(CouchbaseCluster,java:349)
at com,couchbase,client,java,CouchbaseCluster,openBucket(CouchbaseCluster,java:333)
at com,couchbase,client,java,CouchbaseCluster,openBucket(CouchbaseCluster,java:322)
at com,domain,test,NewCouchbase,lambda$0(NewCouchbase,java:52)
at java,util,Spliterators$ArraySpliterator,forEachRemaining(Spliterators,java:948)
at java,util,stream,ReferencePipeline$Head,forEach(ReferencePipeline,java:580)
at com,domain,test,NewCouchbase,<init>(NewCouchbase,java:50)
at com,domain,test,CouchbaseTest,testCouchbase(CouchbaseTest,java:36)
at sun,reflect,NativeMethodAccessorImpl,invoke0(Native Method)
at sun,reflect,NativeMethodAccessorImpl,invoke(NativeMethodAccessorImpl,java:62)
at sun,reflect,DelegatingMethodAccessorImpl,invoke(DelegatingMethodAccessorImpl,java:43)
at java,lang,reflect,Method,invoke(Method,java:498)
at org,junit,runners,model,FrameworkMethod$1,runReflectiveCall(FrameworkMethod,java:50)
at org,junit,internal,runners,model,ReflectiveCallable,run(ReflectiveCallable,java:12)
at org,junit,runners,model,FrameworkMethod,invokeExplosively(FrameworkMethod,java:47)
at org,junit,internal,runners,statements,InvokeMethod,evaluate(InvokeMethod,java:17)
at org,junit,runners,ParentRunner,runLeaf(ParentRunner,java:325)
at org,junit,runners,BlockJUnit4ClassRunner,runChild(BlockJUnit4ClassRunner,java:78)
at org,junit,runners,BlockJUnit4ClassRunner,runChild(BlockJUnit4ClassRunner,java:57)
at org,junit,runners,ParentRunner$3,run(ParentRunner,java:290)
at org,junit,runners,ParentRunner$1,schedule(ParentRunner,java:71)
at org,junit,runners,ParentRunner,runChildren(ParentRunner,java:288)
at org,junit,runners,ParentRunner,access$000(ParentRunner,java:58)
at org,junit,runners,ParentRunner$2,evaluate(ParentRunner,java:268)
at org,junit,runners,ParentRunner,run(ParentRunner,java:363)
at org,eclipse,jdt,internal,junit4,runner,JUnit4TestReference,run(JUnit4TestReference,java:86)
at org,eclipse,jdt,internal,junit,runner,TestExecution,run(TestExecution,java:38)
at org,eclipse,jdt,internal,junit,runner,RemoteTestRunner,runTests(RemoteTestRunner,java:459)
at org,eclipse,jdt,internal,junit,runner,RemoteTestRunner,runTests(RemoteTestRunner,java:675)
at org,eclipse,jdt,internal,junit,runner,RemoteTestRunner,run(RemoteTestRunner,java:382)
at org,eclipse,jdt,internal,junit,runner,RemoteTestRunner,main(RemoteTestRunner,java:192)
Caused by: java,util,concurrent,TimeoutException
, 32 more

After investigating and tracing the log, I’ve seen that the connection to cb,domain,com is being removed at the end of the openBucket process in a reconfiguration event:

com,couchbase,client,core,RequestHandler,reconfigure at line 500

                for (Node node : nodes) {
                    if (!configNodes,contains(node,hostname())) {
                        LOGGER,debug("Removing and disconnecting node {},", node,hostname());
                        removeNode(node);
                        node,disconnect(),subscribe(new Subscriber<LifecycleState>() {
                            @Override
                            public void onCompleted() {}

                            @Override
                            public void onError(Throwable e) {
                                LOGGER,warn("Got error during node disconnect,", e);
                            }

                            @Override
                            public void onNext(LifecycleState lifecycleState) {}
                        });
                    }
                }

Once this disconnection is triggered, the next openBucket in the loop fails, because it also tries to connect through the cb,domain,com hostname. Since the async part of the client isn’t exposed to the implementor, I assumed that openBucket would return with a clean context. If that were the case, calling it multiple times would make no difference. I would expect the client either to detect a front-facing server automatically or to offer a specific configuration parameter for this architecture, so that the LB is not added as a real node. There may be other problems, but that is how I assumed the client worked.
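To make the effect of that loop concrete, here is a minimal, self-contained sketch of the same filtering logic. It is not the SDK's code: the class and method names are hypothetical, nodes are reduced to plain hostname strings, and the hostnames are written with real dots. It assumes, as described above, that the server-reported configuration lists only cb1 and cb2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical model of the reconfigure loop quoted above.
// Illustration only; the real client works on Node objects, not strings.
class ReconfigureSketch {

    // Returns the connected hostnames that survive a reconfiguration:
    // any host missing from the server-reported config is dropped.
    static List<String> reconfigure(List<String> connected, Set<String> configNodes) {
        List<String> kept = new ArrayList<>();
        for (String host : connected) {
            if (configNodes.contains(host)) {
                kept.add(host);
            } else {
                System.out.println("Removing and disconnecting node " + host);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The cluster only knows its real members, never the LB name.
        Set<String> config = new HashSet<>(Arrays.asList("cb1.domain.com", "cb2.domain.com"));
        List<String> connected = Arrays.asList("cb.domain.com", "cb1.domain.com", "cb2.domain.com");
        System.out.println(reconfigure(connected, config));
    }
}
```

Under these assumptions the LB hostname is the only entry removed, which matches the "Removing and disconnecting node" message seen in the trace.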

I think that having a front facing server is a common configuration, so there must be something I’m missing in the configuration, or there’s a real bug in the client connection process.

Any idea or workaround? Or is the only supported configuration to use “cb1,domain,com and cb2,domain,com” directly in the CouchbaseCluster,create method parameters?

Note: I can’t upload the complete case log since new users can’t attach files. I may be able to send it on request, or add a response with the 200+ lines. Sorry for replacing all the periods with commas. This forum detects everything as links and only allows two links per post for new members. A bit weird for a technical forum.


#2

My hypothesis was that the Couchbase client still had an unfinished task after the openBucket call returned, probably the removal of the cb,domain,com host since it’s not in the Servers list. With a simple “Thread.sleep(10000);” in the openBucket loop, everything runs fine.

The code fails in Blocking.blockForSingle, called by CouchbaseCluster.openBucket, at the marked line:

public static <T> T blockForSingle(final Observable<? extends T> observable, final long timeout,
                                   final TimeUnit tu) {
    final CountDownLatch latch = new CountDownLatch(1);
    TrackingSubscriber<T> subscriber = new TrackingSubscriber<T>(latch);

    observable.subscribe(subscriber);

    try {
        if (!latch.await(timeout, tu)) {
            throw new RuntimeException(new TimeoutException());    // <-- fails here
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted while waiting for subscription to complete.", e);
    }

    if (subscriber.returnException() != null) {
        if (subscriber.returnException() instanceof RuntimeException) {
            throw (RuntimeException) subscriber.returnException();
        } else {
            throw new RuntimeException(subscriber.returnException());
        }
    }

    return subscriber.returnItem();
}

This means that neither TrackingSubscriber.onCompleted nor TrackingSubscriber.onError is called, so latch.await waits for the full timeout. I tried raising the timeouts to 25000 ms and the problem is still there; it just takes longer to get the stack trace. Cluster.openBucket should be a blocking call, so everything should be finished when it returns. Since a few things still appear to be in progress after the return, it behaves more like the async version, Cluster.async().openBucket. When I run in debug mode in Eclipse with a breakpoint in TrackingSubscriber, my tests always pass.
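The latch behavior described above can be reproduced in isolation. This is a minimal sketch using only java.util.concurrent. The class and method names are hypothetical, and the background thread merely stands in for a subscriber callback firing; whether the real TrackingSubscriber releases the latch exactly this way is my reading of the code, not something confirmed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the wait inside blockForSingle.
class LatchTimeoutSketch {

    // Returns true if the latch was released in time, false on timeout.
    static boolean awaitCompletion(boolean signalCompletion, long timeoutMs)
            throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        if (signalCompletion) {
            // Stands in for onCompleted/onError firing on another thread.
            new Thread(latch::countDown).start();
        }
        // With no callback, nothing ever counts down and await times out.
        return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(awaitCompletion(true, 1000));  // callback arrives
        System.out.println(awaitCompletion(false, 100));  // callback never arrives
    }
}
```

If no callback ever arrives, a longer timeout only delays the failure, which is consistent with what I saw at 25000 ms.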

Any idea?


#3

Hey @carl.duranleau, because of the way Couchbase bootstraps and dynamically adjusts to topology, we cannot support putting a load balancer in front of the cluster. It’s actually not a common configuration and I think this is covered well in the docs (open to feedback on improvements). We do have a solution though.

If you’re looking to maintain the list of nodes in one place as your cluster topology may change over time, DNS SRV is the ideal solution for that. SRV records are “service” records, and allow you to cheaply (cheaper than an LB!) maintain a list of nodes for a given cluster.
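As a rough illustration of what an SRV record set provides, the sketch below turns SRV record values (priority, weight, port, target, per RFC 2782) into the plain host list a client would bootstrap from. The record values, the port, and the class name are assumptions made up for this example. The records would typically be published under a name like `_couchbase._tcp.cb.domain.com`, but check the SDK documentation for the exact naming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: extracts target hosts from SRV record values.
class SrvRecordSketch {

    // Each value has the form "priority weight port target".
    static List<String> targets(List<String> srvValues) {
        List<String> hosts = new ArrayList<>();
        for (String value : srvValues) {
            String[] parts = value.trim().split("\\s+");
            if (parts.length != 4) {
                throw new IllegalArgumentException("Not an SRV value: " + value);
            }
            String target = parts[3];
            // DNS targets are usually fully qualified with a trailing dot.
            if (target.endsWith(".")) {
                target = target.substring(0, target.length() - 1);
            }
            hosts.add(target);
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Example values only; not taken from a real zone.
        List<String> records = Arrays.asList(
            "0 0 11210 cb1.domain.com.",
            "0 0 11210 cb2.domain.com.");
        System.out.println(targets(records));
    }
}
```

The client performs this lookup itself when DNS SRV is enabled (in the 2.x Java SDK this is, as far as I know, the `dnsSrvEnabled(true)` option on the environment builder), so the application passes only the single service name while the client still ends up with the real node hostnames.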

A load balancer may “work” in some cases, but it’s problematic in many others: the bootstrap has to fall back to a less preferred method, the node you’re bootstrapping against isn’t in the node list, and so on.

Further reasoning here is that each node in Couchbase typically has unique services on it. During bootstrap, the client retrieves the cluster topology from the list or DNS SRV record supplied, then connects to the resources on each node. Load balancers are more appropriate when every node is like every other node (not the case in Couchbase) or where requests can be proxied internally (which Couchbase does not do to maintain efficiency/performance).