Java SDK can not access Query node in couchbase 5.0

java
sdk

#1

Hi all,

My Environment :
Couchbase -> Version 5.0 Enterprise
Java SDK -> :java-client:2.5.2
Python pip -> 2.2.6.dev7-ge8b2c7e

The application using Java SDK can connect Data node but can not access Query node.
Our Application team referred https://developer.couchbase.com/documentation/server/current/sdk/java/start-using-sdk.html

when application connect Query node occurred Error message about Authentication.
Is this what it is? Why is it a data node and not a query node? In python, both the data node and the query node work well.

Error Message is follow:
2017-11-16 15:52:12.022 WARN 2189 — [ cb-io-1-6] c.c.client.core.endpoint.Endpoint : [null][KeyValueEndpoint]: Authentication Failure.
2017-11-16 15:52:12.025 WARN 2189 — [ cb-io-1-6] c.c.client.core.endpoint.Endpoint : [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.core.endpoint.kv.AuthenticationException: Bucket not found on Select Bucket command
at com.couchbase.client.core.endpoint.kv.KeyValueSelectBucketHandler.channelRead0(KeyValueSelectBucketHandler.java:154)
at com.couchbase.client.core.endpoint.kv.KeyValueSelectBucketHandler.channelRead0(KeyValueSelectBucketHandler.java:41)
at com.couchbase.client.deps.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
at com.couchbase.client.deps.io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
at com.couchbase.client.deps.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
at com.couchbase.client.deps.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
at com.couchbase.client.deps.io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
at com.couchbase.client.deps.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1302)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
2017-11-16 15:52:12.025 INFO 2189 — [ cb-io-1-6] c.c.client.core.endpoint.Endpoint : [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.


#2

Hm, the error message printed in the stack trace is actually referring to KV (11210) and not to Query, but it should also be transient. Can you share the full logs maybe?


#3

@daschl,

This log is repeated continuously. If you send a query to the data node, there is no problem. Do I need port or other settings?


#4

@notemusic are you bootstrapping in the SDK (the list of nodes) only against nodes which have the KV service enabled? If you have a MDS setup (so i.e. nodes with only the query service), they need to be excluded from the bootstrap list.

Also, can you still share the full log from startup so we can see whats going on? this one stack doesnt help much unfortunately.


#5

@daschl thanks for your response. i have a more question about MDS.

if i use MDS, when setting the cluster in the SDK, should not you use a query node? I wonder if I need to write a data node.


#6

@notemusic no I’m only saying in the bootstrap list, when doing CouchbaseCluster.create("…") you should only pass in nodes with KV on them, the rest is fully transparent.


#7

Looks like this got broken with the upgrade to 5.x. This wasn’t a problem with 4.x. What was the design decision behind this change?


#8

@unhuman what is the problem you are seeing?


#9

With our upgrade to 5.1 (from 4.5.1) we are seeing lots of WARN logs of “Bucket not found on Select Bucket command.” We are not sure if these are real problems yet, but they do not occur with 4.5.1, and do in both 5.0.1 and 5.1.0.

We currently have all nodes in the cluster in our bootstrap process. Our code, through a VIP, reads the cluster map and generates the configuration with all the nodes in the cluster. We will be changing that to have only “kv” nodes in the bootstrap, which seems to work. That’s a deployment of > 100 services.

But… The question is why was the decision made to change that?


#10

@unhuman 5.0 changed the way the SDK works against the server through KV since we added RBAC - the authentication is now separate from bucket selection… since you can have more than one bucket with your user creds accessible…


#11

@daschl Hey - so these logged messages are warnings. Do they affect our ability to interact with the cluster in any way? We’re trying to figure out how we upgrade our applications to the fixes (do we need to mandate do this ASAP or can we take a more moderate approach?). The affected applications are using the migrated bucket credentials.


#12

If you want 5.x to behave compatibly with 4.x, the guidelines are to create a user whose name is the same as the bucket. With that, you can take a moderate approach and then shift the app servers over to the newer client using the newer authentication approach. See the docs for details!

I’d also recommend against using a VIP that resolves to all of the nodes. There is a subtle thing incorrect there in that the config generated has an entry that the client substitutes for the bootstrap hostname. It’ll probably ‘correct’ itself after some time as it updates the config, but it’ll be incorrect initially. DNS SRV is our supported way to maintain the list in a single location as you add/remove nodes. Again, see the docs.

Also, to investigate what’s happening, you may want to try running sdk-doctor in that env? It may not find anything, but it will validate that common connectivity problems aren’t there.

You’ll find pre-built binaries on the release page.

Usually the summary at the end is pretty easy to interpret.


#13

@ingenthr We’re experiencing this issue with latest Java SDK 2.5.8 connecting to the latest version of couchbase server 5.1.0. This seems to only happen when the server is in the middle of a rebalance. This is a large cluster with > 1TB of data so the rebalances are long and slow.

WARN [2018-06-01 17:04:03,865] com.couchbase.client.core.endpoint.Endpoint: [][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS: com.couchbase.client.core.endpoint.kv.AuthenticationException: Bucket not found on Select Bucket command
WARN [2018-06-01 17:04:03,866] com.couchbase.client.core.endpoint.Endpoint: [][KeyValueEndpoint]: Authentication Failure.
WARN [2018-06-01 17:04:03,866] com.couchbase.client.core.endpoint.Endpoint: [][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS: com.couchbase.client.core.endpoint.kv.AuthenticationException: Bucket not found on Select Bucket command
WARN [2018-06-01 17:04:03,884] com.couchbase.client.core.endpoint.Endpoint: [][KeyValueEndpoint]: Authentication Failure.
WARN [2018-06-01 17:04:03,884] com.couchbase.client.core.endpoint.Endpoint: [][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS: com.couchbase.client.core.endpoint.kv.AuthenticationException: Bucket not found on Select Bucket command

We see this over and over again and it doesn’t recover.

It does seem like things are still functionally working but the logs are spamming this over and over again and it’s not clear what’s going on. Is this a bug in the java client SDK or is something else broken here?


#14

It looks like all these AuthenticationException: Bucket not found on Select Bucket command warnings are because these particular endpoints in the cluster map are the nodes that are being added to the cluster during the rebalance and the buckets/data do not exist on the cluster yet.

This seems like a bug in the java sdk client in that it shouldn’t log these as warnings given that is expected/normal behavior while adding a new node to an existing cluster (rebalance).


#15

I don’t think it is expected, but will need to check. The bug may be that they are in the configuration (as other than fast forward) and are taking connections before they’re ready to service those connections. There might be nothing we can do other than WARN and retry, since the client doesn’t know when it’ll be ready and silently retrying might mask a different issue.

You said this is with 5.1? Let us look into this a bit.


#16

Yes, this is with 5.1.0 enterprise server.


#17

@ingenthr Any update on this issue? Just curious if this is being addressed in a later version of the java client SDK? We’re currently using version “2.5.8”


#18

Let me check with the developer who was going to review it, either they or I will update here.


#19

@bryan the root cause of this is that there is a delay when the server sends the SDK that a new node is available and the time the kv engine is able to serve that bucket.

When you say it doesn’t recover… do you mean during the rebalance? I was expecting that it at least vanishes once rebalance is done? I’ll see if there is a way we can improve that experience so that it doesn’t spam the logs. Also to double check: actual traffic should not be impacted right.


#20

@bryan are the new nodes being rebalanced in, non-kv nodes (like query/index only)?