Server node reported as down - cannot connect to the administrator console

jared · May 24, 2016, 8:14pm

Hi,

I have an urgent issue with our live Couchbase cluster. We have 2 nodes configured and one node is being reported as down. The cluster has been running for a couple of months.

Server specs:
Windows Server 2012 R2
Couchbase 2.5.1
Couchbase was configured to use hostnames and not ip addresses.
Only have around 8000 items in total across 5 buckets

I can see that the Couchbase Windows service is running on the failed node, but I cannot connect to the admin console. It just says “page cannot be found”. I have tried restarting the service as well as the server. The CouchbaseServer service seems to start up fine, but I still cannot connect.

Another issue is that I would like to fail over the node that’s down, but for some reason when I view the buckets, the “Replica” counts are 0. Does this mean that the data has never been replicated? The buckets are configured to have 1 replica.

I have had issues with Couchbase on a single node before, and the fix was easy enough as it just required a re-install and a data restore. But now I have an additional node which seems to only contain half the data…?

I’m still able to make a backup of the data on the failed node via the cbbackup tool using the “couchstore-files://” URL. I can also make a backup of the entire Couchbase installation folder, which includes the data and index folders.

What is the best way to proceed based on the above configuration? First prize would be to get connectivity back, but failing that, would setting up a 3rd server, restoring the failed node’s data there, adding it to the cluster and rebalancing work?

Please let me know what logs / files will be useful in trying to troubleshoot the failed node!

Thanks,
Jared

ingenthr · May 24, 2016, 8:21pm

If you have an urgent need and have an Enterprise subscription, it’s probably best to contact support.

As for the failed node, I’d recommend looking at the ns_server logs. The Windows service really just starts an Erlang process for ns_server, which in turn starts other processes and listens on port 8091. If you can’t pull up port 8091, something has gone wrong early in startup, and the log may indicate what.
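A quick way to tell "nothing is listening on 8091" apart from "the port is open but the console page fails" is a plain TCP connect. This is only a minimal sketch: the function name `is_port_open` and the host name `failed-node.example.local` are placeholders, not anything from the cluster described here.

```python
import socket


def is_port_open(host, port=8091, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical host name; substitute the name the node was configured with.
    host = "failed-node.example.local"
    state = "open" if is_port_open(host) else "not reachable"
    print(f"{host}:8091 is {state}")
```

If the port is not reachable even from the node itself, that would be consistent with ns_server failing before it starts listening, which is where its logs become the main source of information.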

Regarding your failover, it may show 0 replicas if you’ve already carried out a failover or had auto-failover enabled. Otherwise, data wasn’t replicating earlier. The UI on the remaining node should be able to show you how many vBuckets are available.

I would recommend doing a filesystem-level backup of the down node for sure. There are methods of restoring from that, which support can talk you through.
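Besides the UI, the bucket details returned by the REST API on port 8091 (`/pools/default/buckets/<bucket>`) include a `vBucketServerMap` with a `serverList` and a `vBucketMap`. Each `vBucketMap` entry lists server indexes, active copy first and replicas after, with `-1` where no copy is assigned. The sketch below assumes that JSON has already been fetched and parsed; `summarize_vbuckets` and the sample host names are illustrative. The map describes the configured layout, so it shows whether replicas were ever assigned, not whether replication has fully caught up.

```python
from collections import Counter


def summarize_vbuckets(bucket_info):
    """Count active and replica vBuckets per server from parsed bucket JSON."""
    server_map = bucket_info["vBucketServerMap"]
    servers = server_map["serverList"]
    active = Counter()
    replica = Counter()
    without_replica = 0
    for chain in server_map["vBucketMap"]:
        # chain[0] is the active copy; the rest are replica copies.
        if chain and chain[0] >= 0:
            active[servers[chain[0]]] += 1
        assigned = [idx for idx in chain[1:] if idx >= 0]
        for idx in assigned:
            replica[servers[idx]] += 1
        if not assigned:
            without_replica += 1
    return {
        "active": dict(active),
        "replica": dict(replica),
        "without_replica": without_replica,
    }


if __name__ == "__main__":
    # Tiny made-up map: 4 vBuckets split across 2 nodes, no replicas assigned.
    sample = {
        "vBucketServerMap": {
            "serverList": ["node1:11210", "node2:11210"],
            "vBucketMap": [[0, -1], [1, -1], [0, -1], [1, -1]],
        }
    }
    print(summarize_vbuckets(sample))
```

A result where `without_replica` equals the total number of vBuckets would match the "Replica count is 0" symptom: each node holds only its own active half of the data.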

I hope that helps a bit. I know it didn’t answer all of your questions, but might give you a couple of next steps.

p.s.: you probably know this already, but 2.5 is very old at this point. 4.5 is in beta now.

jared · May 24, 2016, 8:41pm

Thanks for the quick response!

Unfortunately we’re not on an Enterprise subscription… I have made a backup of the logs directory in the meantime. It’s around 50MB zipped.

I can see that auto-failover is not enabled. Where can I check “how many vBuckets” are available?

It’s been a while since I’ve been on the Couchbase site, so I didn’t actually realise we were a couple of versions behind. We can definitely upgrade once we get this up and running!

jared · May 24, 2016, 8:43pm

I see this was literally just posted: Upgrading from 2.1.5 to 4.0

I’ll follow it closely…

jared · May 25, 2016, 10:05am

We’ve managed to get back online by doing the following:

The primary problem was that the data replication never occurred, so the data was split 50/50 between the 2 nodes. Failing over in this case wouldn’t have worked because there was no replica on the healthy server.

1. Took a backup of all the buckets on each server using the cbbackup tool, using the “couchstore-files://” prefix.
2. Set up a new server with the same version.
3. Created new buckets.
4. Restored BOTH backups to the buckets. By using the --add flag on the cbrestore tool, you can append to the data instead of overwriting. Any existing data with the same key is not touched.
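The backup and restore steps above can be sketched as a small command planner. It only builds and prints the command lines; nothing is executed. All paths, host names, bucket names and credentials are hypothetical placeholders. The file-based source scheme is written as `couchstore-files://`, which is how I understand cbbackup spells it, and `-b`/`-B` select the source and destination bucket.

```python
def cbbackup_cmd(data_dir, backup_dir):
    """Build a cbbackup command that reads directly from a node's data files."""
    source = "couchstore-files://" + data_dir
    return ["cbbackup", source, backup_dir]


def cbrestore_cmd(backup_dir, cluster_url, bucket, user, password, add=True):
    """Build a cbrestore command for one bucket; --add keeps existing keys."""
    cmd = ["cbrestore", "-u", user, "-p", password, "-b", bucket, "-B", bucket]
    if add:
        # With --add, items whose key already exists are left untouched.
        cmd.append("--add")
    cmd += [backup_dir, cluster_url]
    return cmd


if __name__ == "__main__":
    # Hypothetical values: one backup directory per old node, restored in turn.
    backups = ["/backups/node1", "/backups/node2"]
    buckets = ["default"]
    for backup_dir in backups:
        for bucket in buckets:
            cmd = cbrestore_cmd(backup_dir, "http://new-node:8091",
                                bucket, "Administrator", "password")
            print(" ".join(cmd))
```

Because each old node held a disjoint half of the keys, the order in which the two backups are restored should not matter; `--add` only makes a difference if a key appears in both.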

We still don’t have a resolution as to why the node has failed, but at least our site is up and running.

Hopefully this will help if anyone experiences this issue in future.