Failure Recovery - Can't Rebalance

travisgreer · October 27, 2014, 11:18pm

Hi,

We’ve got a Couchbase cluster that is stuck. It’s a 5-node cluster with one data bucket, running Couchbase Server (Community) v2.2.0 (build-837) on Ubuntu in AWS (manually installed).

After some issues (will detail below), we can’t get a successful rebalance. From the logs, rebalance starts:
Bucket “football5” rebalance does not seem to be swap rebalance
ns_vbucket_mover000
ns_1@IP.47
14:02:52 - Mon Oct 27, 2014

Followed by:
Control connection to memcached on ‘ns_1@10.101.186.47’ disconnected: {badmatch, {error, closed}}
ns_memcached004
ns_1@IP.47
14:15:30 - Mon Oct 27, 2014

There are various errors, but I’m not sure what is relevant.

We got to this point when we attempted to start replicating to a separate cluster via XDCR. After a minute, we canceled the XDCR because it appeared to be affecting performance. It went downhill fast from there. One node was automatically failed over. I stopped all incoming traffic to our app, then restarted Couchbase on the failed node, at which point it was added back. I then attempted the first rebalance, which failed. I thought I could bounce the Couchbase instances and a rebalance would then work, so I bounced two more (three of five in total). One failed over; the other did not.

So now we’ve lost some data (20%?), which sucks. But we still can’t get the cluster up (can’t rebalance).

I apologize if I missed other threads, please point me to them if that’s the case. Or if I need to post more information, I’m happy to do so.

Obviously I’m a couchbase newbie and have already shot myself in the foot. I’m looking to avoid further data loss and get our cluster back up.

Thanks,
Travis

PS: We don’t have a backup; creating one was what we were trying to accomplish when we started the XDCR.

ingenthr · October 28, 2014, 1:27am

If you had replicas, you probably can recover. Let me get a colleague to have a look at this topic.

alkondratenko · October 28, 2014, 1:44am

A control connection to memcached disconnecting is potentially an indicator of something quite severe.

I’ll need more details to diagnose exactly what. The best way to provide diagnostics is to file a bug in JIRA and attach cbcollect_info output from all nodes.

travisgreer · October 28, 2014, 2:23am

Thank you for the feedback!

@ingenthr - we do have 1 replica, but since we lost data on two of our nodes (through my own mistakes), we appear to have lost about 20% of our data (still figuring it out).

@alkondratenko - I can file a bug. Though the only issue that may actually be a bug is that our cluster did not handle it well when we canceled the XDCR after letting it run for only a minute or two. The rest of our issues stemmed from my inexperience with Couchbase, not from anything wrong with Couchbase itself. Would it still be worth creating a bug?

Thanks again for the feedback, much appreciated!
Travis

ingenthr · October 28, 2014, 4:37am

If it’s not too much trouble for you, it would be great to get an issue filed with cbcollect_info output from all nodes. This will give the team a chance to see if there’s something new here that we need to be able to recover from. To my knowledge, there’s no known scenario where you shouldn’t be able to rebalance and get back to running once you’ve decided you’re willing to give up data. That’s why both @alkondratenko and I think there must be an issue.

As far as recovering from here is concerned, it may be best to get a copy of what’s left on the cluster with cbtransfer. At least with that, you can constitute a new cluster and load the data.

You probably already know this, but Couchbase, Inc.'s support folks can give you more real-time help if you have contracted for assistance when needed. We’re glad to help here and in the issue tracker, though the service level is not the same for obvious reasons.

travisgreer · October 28, 2014, 5:31pm

We got our cluster to rebalance and it is now back up and running! Two of the five nodes didn’t have swap enabled, though they should’ve had plenty of RAM and disk. We didn’t do much else other than give it some time (while we snapshotted the disks).

I looked up cbcollect_info. How disruptive is it to run? I’m a bit hesitant, as we still don’t have a backup and are fresh off our last disaster. I was looking up details on the page below. Is there a better one to check out?
http://www.couchbase.com/wiki/display/couchbase/Working+with+the+Couchbase+Technical+Support+Team

Again, I appreciate the assistance!

ingenthr · November 9, 2014, 9:02pm

cbcollect_info is intended to be run alongside a production workload, but it does use additional system resources for certain parts of its operation. Unless you’re right on the edge resource-wise, it should be fine to run it during your lowest-workload period.