Cluster issue after failing over and re-adding node

jda · September 19, 2019, 11:54am

I have an issue with a demo setup where I was trying to work out the best way to upgrade the servers (OS software updates and fixes that require a reboot). In the admin console (port 8091) I went to Servers and chose “Failover” for the first server, using the default option, Graceful Failover. Once that had completed I applied the OS updates and rebooted.

When the server was up again I added it back (selecting “Full recovery”, which may be wrong?) and rebalanced.

I did the same thing for the other server.
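
For reference, the console steps described above (graceful failover, choosing a recovery type, rebalance) can also be driven through the cluster manager's REST interface on port 8091. The sketch below only builds the three POST requests and prints them; it does not contact a cluster. The node names (ns_1@db1.example, ns_1@db2.example) are hypothetical placeholders, and the class and method names are made up for illustration. It is not a verified maintenance procedure, just the same sequence written down as code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Sketch: builds the REST calls for taking one node out for maintenance and adding it back. */
final class NodeMaintenancePlan {

    /** One HTTP POST against the cluster manager (port 8091), as path plus form body. */
    static final class Step {
        final String path;
        final String formBody;

        Step(String path, String formBody) {
            this.path = path;
            this.formBody = formBody;
        }

        @Override
        public String toString() {
            return "POST " + path + "  " + formBody;
        }
    }

    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Step 1: graceful failover of the node, before the OS update and reboot. */
    static Step gracefulFailover(String otpNode) {
        return new Step("/controller/startGracefulFailover", "otpNode=" + enc(otpNode));
    }

    /** Step 2: once the node is back up, mark it for "full" or "delta" recovery. */
    static Step setRecoveryType(String otpNode, String recoveryType) {
        if (!recoveryType.equals("full") && !recoveryType.equals("delta")) {
            throw new IllegalArgumentException("recoveryType must be full or delta");
        }
        return new Step("/controller/setRecoveryType",
                "otpNode=" + enc(otpNode) + "&recoveryType=" + recoveryType);
    }

    /** Step 3: rebalance across all known nodes, ejecting none. */
    static Step rebalance(String... knownOtpNodes) {
        String joined = String.join(",", knownOtpNodes);
        return new Step("/controller/rebalance",
                "knownNodes=" + enc(joined) + "&ejectedNodes=");
    }

    public static void main(String[] args) {
        String db1 = "ns_1@db1.example"; // hypothetical node name
        String db2 = "ns_1@db2.example"; // hypothetical node name

        System.out.println(gracefulFailover(db1));
        System.out.println("-- wait for the failover to finish, patch and reboot db1 --");
        System.out.println(setRecoveryType(db1, "full"));
        System.out.println(rebalance(db1, db2));
    }
}
```

Each request would need admin credentials, and the rebalance should be allowed to finish before the same sequence is started on the next node.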

Now, when everything is up and running again I have a strange issue… I discovered it by the application failing with errors like this:

Caused by: java.util.concurrent.TimeoutException: {"b":"data","r":"192.168.42.212:11210","s":"kv","c":"104D161EBC55AA58/000000006A7CD142","t":2500000,"i":"0x3eb","l":"192.168.42.226:38640"}
	at com.couchbase.client.java.bucket.api.Utils$1.call(Utils.java:131)
	at com.couchbase.client.java.bucket.api.Utils$1.call(Utils.java:127)
	at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140)
	at rx.internal.operators.OnSubscribeTimeoutTimedWithFallback$TimeoutMainSubscriber.onTimeout(OnSubscribeTimeoutTimedWithFallback.java:166)
	at rx.internal.operators.OnSubscribeTimeoutTimedWithFallback$TimeoutMainSubscriber$TimeoutTask.call(OnSubscribeTimeoutTimedWithFallback.java:191)
	at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
	at java.util.concurrent.FutureTask.run(FutureTask.java:277)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:191)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:811)
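
The JSON fragment in the TimeoutException message above carries the context of the timed-out operation. A small helper like the following can make it easier to read. The meanings given to the short keys (b = bucket, r = remote address, s = service, c = connection id, t = timeout in microseconds, i = operation id, l = local address) are my reading of the SDK 2.x log format and should be treated as an assumption; the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pulls the flat key/value context out of a TimeoutException message. */
final class TimeoutContext {

    // Matches "key":"string value" or "key":123 in a flat JSON object.
    private static final Pattern FIELD =
            Pattern.compile("\"(\\w+)\":(?:\"([^\"]*)\"|(\\d+))");

    /** Returns the fields in the order they appear in the message. */
    static Map<String, String> parse(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(message);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            fields.put(m.group(1), value);
        }
        return fields;
    }

    /** Assumes "t" is the configured timeout in microseconds. */
    static long timeoutMillis(Map<String, String> fields) {
        return Long.parseLong(fields.get("t")) / 1000;
    }

    public static void main(String[] args) {
        String message = "{\"b\":\"data\",\"r\":\"192.168.42.212:11210\",\"s\":\"kv\","
                + "\"c\":\"104D161EBC55AA58/000000006A7CD142\",\"t\":2500000,"
                + "\"i\":\"0x3eb\",\"l\":\"192.168.42.226:38640\"}";

        Map<String, String> ctx = parse(message);
        System.out.println("bucket:  " + ctx.get("b"));
        System.out.println("service: " + ctx.get("s"));
        System.out.println("remote:  " + ctx.get("r"));
        System.out.println("local:   " + ctx.get("l"));
        System.out.println("timeout: " + timeoutMillis(ctx) + " ms");
    }
}
```

Read this way, the failing call is a key-value operation on bucket data against 192.168.42.212 on the data port 11210, and it gave up after 2.5 seconds.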

On server db1, when I go to “Documents”, enter a document type in the “Where” field (type='Article') and press [Retrieve docs], I get no documents and the message “No Results” is shown in red. On the other server, db2, I get the first 10 docs of that type.

If I run the query select * from data where type='Article', it correctly returns the documents (on both servers).

All indexes are built. It’s the same issue if I try another doc type. If I remove the “Where” clause it also shows the first documents on db1.
I have now tried removing the db1 server from the cluster entirely and re-adding it (and recreating the indexes). After the add and the rebalance had completed I still saw the same issue. I have also tried running a “compact”, to no avail.

How can this happen? And what can I do to solve it?

I’m not too keen on trying this in my production environment without knowing what is going on and how to resolve it.

Couchbase server is: Community Edition 6.0.0 build 1693 ‧ IPv4
Java SDK (on application server): couchbase-core-io-1.7.9 & couchbase-java-client-2.7.9
Running on CentOS Linux 7.7.1908 (upgraded from 7.6)

EDIT:
Ok, now I tried to change the application server to only contact db2. And now it throws a different error.

 Caused by: com.ibm.jscript.InterpretException: Script interpreter error, line=1, col=7: Error calling method 'submit()' on java class 'dk.dtu.aqua.catchlog.bean.LogonBean'
    	at com.ibm.jscript.types.JavaAccessObject.call(JavaAccessObject.java:335)
    	at com.ibm.jscript.types.FBSObject.call(FBSObject.java:161)
    	at com.ibm.jscript.ASTTree.ASTCall.interpret(ASTCall.java:197)
    	at com.ibm.jscript.ASTTree.ASTProgram.interpret(ASTProgram.java:119)
    	at com.ibm.jscript.ASTTree.ASTProgram.interpretEx(ASTProgram.java:139)
    	at com.ibm.jscript.JSExpression._interpretExpression(JSExpression.java:435)
    	at com.ibm.jscript.JSExpression.access$1(JSExpression.java:424)
    	at com.ibm.jscript.JSExpression$2.run(JSExpression.java:414)
    	at java.security.AccessController.doPrivileged(AccessController.java:730)
    	at com.ibm.jscript.JSExpression.interpretExpression(JSExpression.java:410)
    	at com.ibm.jscript.JSExpression.evaluateValue(JSExpression.java:251)
    	at com.ibm.jscript.JSExpression.evaluateValue(JSExpression.java:234)
    	at com.ibm.xsp.javascript.JavaScriptInterpreter.interpret(JavaScriptInterpreter.java:222)
    	at com.ibm.xsp.binding.javascript.JavaScriptMethodBinding.invoke(JavaScriptMethodBinding.java:111)
    	... 32 more
    Caused by: com.couchbase.client.core.CouchbaseException: java.lang.RuntimeException: Could not decode snappy-compressed value.
    	at com.couchbase.client.core.endpoint.AbstractGenericHandler.decode(AbstractGenericHandler.java:369)
    	at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageCodec$2.decode(MessageToMessageCodec.java:81)
    	at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    	at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    	at com.couchbase.client.deps.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    	at com.couchbase.client.deps.io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
    	at com.couchbase.client.deps.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
    	at com.couchbase.client.deps.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
    	at com.couchbase.client.deps.io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    	at com.couchbase.client.deps.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    	at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1304)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    	at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    	at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:921)
    	at com.couchbase.client.deps.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:135)
    	at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
    	at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
    	at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    	at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
    	at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    	at com.couchbase.client.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    	at java.lang.Thread.run(Thread.java:811)
    Caused by: java.lang.RuntimeException: Could not decode snappy-compressed value.
    	at com.couchbase.client.core.endpoint.kv.KeyValueHandler.handleSnappyDecompression(KeyValueHandler.java:397)
    	at com.couchbase.client.core.endpoint.kv.KeyValueHandler.decodeResponse(KeyValueHandler.java:949)
    	at com.couchbase.client.core.endpoint.kv.KeyValueHandler.decodeResponse(KeyValueHandler.java:132)
    	at com.couchbase.client.core.endpoint.AbstractGenericHandler.decode(AbstractGenericHandler.java:338)
    	... 33 more
    Caused by: com.couchbase.client.deps.org.iq80.snappy.CorruptionException: Invalid copy offset for opcode starting at 1
    	at com.couchbase.client.deps.org.iq80.snappy.SnappyDecompressor.decompressAllTags(SnappyDecompressor.java:165)
    	at com.couchbase.client.deps.org.iq80.snappy.SnappyDecompressor.uncompress(SnappyDecompressor.java:47)
    	at com.couchbase.client.deps.org.iq80.snappy.Snappy.uncompress(Snappy.java:85)
    	at com.couchbase.client.core.endpoint.kv.KeyValueHandler.handleSnappyDecompression(KeyValueHandler.java:395)
    	... 36 more

The application itself has not been changed for a while…

jda · September 19, 2019, 2:34pm

I have now deleted and recreated all indexes. That got rid of the last error (…Snappy…).

I can also now retrieve documents by type from “Documents” on both servers.

I still get the timeout error now and again, so I would like some feedback on that.

I would also very much like to understand why this happened. The “solution” here took a very long time and would not be acceptable in production.

So I would appreciate at least some “best practice” for taking a node out of the cluster, doing the upgrade and adding it back, and for what to check before taking the other node out.

jda · October 4, 2019, 1:08pm

The timeout and “Snappy” errors turned out to be an issue with the Java SDK etc.

See more in this post.

… but it would still be nice to have some feedback on the right way to take the cluster nodes down for maintenance while keeping the service running.

ingenthr · October 4, 2019, 3:58pm

Actually, so far the error seems to point to something in the Community Edition of Couchbase Server. It could ultimately turn out to be an issue in the Java SDK or some kind of interaction, but so far the investigation shows that the compression bit is set on responses coming from the cluster even though the payload is not compressed.

More details in MB-36299. @graham.pople is going to see if he can reproduce it.
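
To illustrate what “compression bit set but payload not compressed” means for a client, here is a minimal sketch. It assumes the key-value protocol's datatype byte uses 0x01 for JSON and 0x02 for Snappy, and it uses a made-up JSON document as the payload; the class and method names are hypothetical. It does not reproduce the SDK's decoder. It only shows how the first bytes of plain JSON would be interpreted if they were treated as a Snappy stream.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: what a Snappy decoder sees when handed uncompressed JSON. */
final class SnappyFlagCheck {

    static final int DATATYPE_JSON = 0x01;   // assumed datatype bit for JSON
    static final int DATATYPE_SNAPPY = 0x02; // assumed datatype bit for Snappy

    /** True if the response's datatype byte claims the value is Snappy-compressed. */
    static boolean claimsSnappy(int datatype) {
        return (datatype & DATATYPE_SNAPPY) != 0;
    }

    /** Names the Snappy element type encoded in the low two bits of a tag byte. */
    static String tagType(byte tag) {
        switch (tag & 0x03) {
            case 0:
                return "literal";
            case 1:
                return "copy, 1-byte offset";
            case 2:
                return "copy, 2-byte offset";
            default:
                return "copy, 4-byte offset";
        }
    }

    public static void main(String[] args) {
        // Hypothetical uncompressed document, flagged as JSON + Snappy.
        byte[] payload = "{\"type\":\"Article\"}".getBytes(StandardCharsets.UTF_8);
        int datatype = DATATYPE_JSON | DATATYPE_SNAPPY;

        System.out.println("claims snappy: " + claimsSnappy(datatype));

        // A Snappy stream starts with the uncompressed length as a varint.
        // '{' is 0x7B, which has the high bit clear, so it reads as a one-byte length.
        System.out.println("declared length: " + (payload[0] & 0x7F));

        // The next byte is taken as the first tag. '"' is 0x22, whose low bits are 10.
        System.out.println("tag at index 1: " + tagType(payload[1]));
    }
}
```

A copy element as the very first tag has nothing to copy from, because no output has been produced yet. That would be consistent with the “Invalid copy offset for opcode starting at 1” message in the trace above, although the thread does not confirm that this is the exact failure path.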