Is Couchbase really highly available?


#1

I have a problem with high availability.

Here is my setup:
I have 2 Linux servers (192.168.2.91 and 192.168.2.92), each running the latest Couchbase Server and keepalived (also used for HAProxy, Squid, and Dante; VIP: 192.168.2.90).
Client: a C# application that puts key-value pairs into Couchbase. In the app config it is set up to use 192.168.2.91 and 192.168.2.92.

Test 1.
A Couchbase cluster is created using the web wizard: 1 master + 1 replica.

So far it is all tip-top: client and server work with no problems. But when I shut down server 192.168.2.91, I expect traffic to be routed to 192.168.2.92. That does not happen; the client gets a timeout. OK, restart the client. The client takes about 30 seconds to establish a connection (new CouchbaseClient()), and once CouchbaseClient() returns, all calls to put/get data fail. I tried playing with the auto-failover feature, setting it to 30 seconds (the minimum allowed). Still no good. In production such a long failover does not make sense anyway; the data is replicated, after all.

So there seems to be no way to get HA here.

Test 2.
There are no replicas, only standalone Couchbase servers. I set up XDCR between the servers: from .91 to .92 and from .92 to .91.
I can put data into .91 and it appears in .92, but when I put data into .92, nothing appears in .91.
Then I tried playing with keepalived, using one VIP (.90) to connect to either server. That scenario did not work either: data was not transferred to the other server.

So, where is the high availability here? Am I missing something in the setup?

Thank you,
Dima


#2

Hi Dima, for #1 you can initiate the failover yourself. We allow 30 s by default for better protection against network hiccups and the like, but your client app can trigger the failover through the REST API in a shorter amount of time.
For #2, you need one replication defined from .91 to .92 and another defined from .92 to .91. Do you have both defined?
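As a minimal sketch of the REST-triggered failover mentioned above: Couchbase exposes failover as `POST /controller/failOver` on the admin port (8091), taking the `otpNode` name of the node to fail over. The helper below only builds the request; the credentials and node names are illustrative, and the call must be sent to a surviving node.

```python
from urllib.parse import urlencode

def build_failover_request(surviving_node, failed_otp_node):
    """Build the Couchbase REST call that fails over a dead node.

    POST /controller/failOver must be sent to a node still in the
    cluster; otpNode identifies the node being failed over
    (e.g. 'ns_1@192.168.2.91').
    """
    url = f"http://{surviving_node}:8091/controller/failOver"
    body = urlencode({"otpNode": failed_otp_node})
    return url, body

# Example: node .91 died, so we ask .92 to fail it over.
url, body = build_failover_request("192.168.2.92", "ns_1@192.168.2.91")
print(url)
print(body)
# The actual call would be, e.g.:
#   curl -u Administrator:password \
#        -d 'otpNode=ns_1@192.168.2.91' \
#        http://192.168.2.92:8091/controller/failOver
```

Because the client app decides when to issue this call, it can fail over in seconds rather than waiting out the 30 s auto-failover window.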


#3

Hi, cihan!

Thank you for reply!

For #1: auto-failover is set to 30 seconds. The problem is that the client app uses the Couchbase .NET library and has no way of knowing there was a timeout: no exception of any kind is thrown, the call just blocks for about 30 seconds before returning.
Even after I wait for 2 minutes (while it should be 30 seconds per the auto-failover rule), I am still not able to connect to .92, even if I explicitly specify only one IP (192.168.2.92) for the Couchbase server in app.config. It just times out, and every other API call then returns immediately.

For #2, as I said above, I set up XDCR between the servers from .91 to .92 and from .92 to .91, so replication should be working both ways. Replication worked once; then I rebooted one server (simulating a failure), the whole thing broke, and replication was never restored. I tried starting/stopping the Couchbase services, rebooting the servers, and reconfiguring the XDCR settings - no results.
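For reference, "both ways" means each node carries its own replication definition: after registering the other node as a remote cluster (`POST /pools/default/remoteClusters`), a continuous stream is created with `POST /controller/createReplication` on each side. The sketch below only builds those requests; the remote-cluster names (`node91`/`node92`) and the `default` bucket are assumptions for illustration.

```python
from urllib.parse import urlencode

def xdcr_replication_request(source_host, remote_cluster_name, bucket):
    """Build the POST /controller/createReplication call that starts a
    continuous XDCR stream from `bucket` on `source_host` to the same
    bucket on an already-registered remote cluster. For bidirectional
    replication this must be issued on BOTH nodes, each pointing at
    the other as its remote."""
    url = f"http://{source_host}:8091/controller/createReplication"
    body = urlencode({
        "fromBucket": bucket,
        "toCluster": remote_cluster_name,  # name from /pools/default/remoteClusters
        "toBucket": bucket,
        "replicationType": "continuous",
    })
    return url, body

# One request per direction: .91 -> .92 and .92 -> .91.
for src, remote in (("192.168.2.91", "node92"), ("192.168.2.92", "node91")):
    url, body = xdcr_replication_request(src, remote, "default")
    print(url, body)
    # send with: curl -u Administrator:password -d "$body" "$url"
```

If only one of these two definitions exists, writes to the other node will never replicate back, which matches the one-way behavior described above.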

Couchbase works fine as a single instance, or on multiple servers with no failures, but as soon as I start testing HA with simulated node failures, the whole solution becomes unusable.


#4

In case #1, data is replicated AND distributed across both servers. When you simulate the failure of node 1, the cluster is in a failed state: half of your distributed data is missing, and the replicas are still inaccessible until you hit the "fail over" button or auto-failover kicks in. Once failover has happened, the replicas become active, so a smart client will get the new vBucket map and things should be back to normal... but in practice you may actually need to restart your client to get back to normal.
Second thing I should mention: if your client auto-creates buckets when they don't exist, you will get a total mess on your hands. The failed-over node your config still points to will happily accept client connections and let you create a "ghost" bucket and write to it. Since the node is failed over and no longer part of the cluster, this data basically goes to /dev/null... I know, lovely. So make sure your client app checks the health of each server in the config and removes any uninitialized or failed-over nodes before passing them to the Couchbase smart client.
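That health check can be done against `GET /pools/default`, whose `nodes` array reports each node's `clusterMembership` and `status`. A minimal sketch of the filtering step (the sample response below is hand-made to mimic a cluster where .91 has been failed over):

```python
def healthy_active_nodes(pools_default):
    """Filter the node list from GET /pools/default down to nodes that
    are both part of the cluster ('active') and responding ('healthy'),
    so failed-over or uninitialized nodes never reach the client config."""
    return [
        n["hostname"]
        for n in pools_default.get("nodes", [])
        if n.get("clusterMembership") == "active" and n.get("status") == "healthy"
    ]

# Sample shaped like the REST response after .91 has been failed over:
sample = {
    "nodes": [
        {"hostname": "192.168.2.91:8091",
         "clusterMembership": "inactiveFailed", "status": "unhealthy"},
        {"hostname": "192.168.2.92:8091",
         "clusterMembership": "active", "status": "healthy"},
    ]
}
print(healthy_active_nodes(sample))  # ['192.168.2.92:8091']
```

Running this filter before constructing the client keeps the ghost-bucket scenario above from ever seeing a connection.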

As for #2, have you checked the cluster health of both nodes? XDCR will not resume until the bucket on both sides is healthy. Another possible source of trouble is that you're using 3.0.1 or 3.0.2 which are, let's face it, totally unusable in any production scenario due to corruption, stalls, leaks, and more fun stuff. My advice would be to go to 3.0.3 at a minimum, or go back to version 2.