Idle server got "confused", now won't respond/takes much CPU after start (3.0.1)


#1

I have been hoping to migrate some deployments to Couchbase Server, and to that end I figured I would get a daemon running on an EC2 instance and just see how it goes. Well, it hasn’t gone particularly well.

I had started with a pre-release version of 3.0 and ran into trouble there with it pegging the machine at full CPU. I may have accidentally killed all Erlang processes on the machine at one point, so I chalked it up to either that or running a pre-release version. The main notes I have from that are this log snippet: https://gist.github.com/natevw/ea4748ce91b5d7408bd2

Anyway, when I noticed that, I installed the 3.0.1 release instead, and it seemed to resolve the problem. However, today I logged back into that server and noticed the CPU was getting chewed up by Couchbase again. The admin console was not responding on port 8091. I ran sudo /etc/init.d/couchbase-server stop and it took forever but did eventually work.

At this point I tried restarting with sudo /etc/init.d/couchbase-server start and got the same thing: CPU pegged, nothing responding on port 8091. The stop command took a long time again. Note that in the months between the original problem and noticing a similar issue today, the CB daemon was pretty much unused, maybe some occasional tire kicking. Single node, no significant data, no XDCR, mostly just playing around with geospatial views through the admin console, then logging out and leaving it alone…

I really can’t make out any clues from the logs. Any idea what is going on? How can I at least start completely fresh on this machine, so that I can start it up successfully again? For now I’ve just stopped the daemon and left its logs in place in case something else would be helpful.

The last 5000 lines of both debug.log and reports.log are here; do they show anything in particular? Thanks in advance! https://gist.github.com/natevw/e0f7759b317dfe87d800


#2

Hey Nathan, good to hear from you! I hope you remember me from a while back.

One bit of info not here is what the config of the system was. What kind of EC2 system and what OS?

Let’s do two things. First, can you do a cbcollect_info on the system and file an issue? Post here (or private message me) on it so I can follow up.

Second, as a debugging step, I might recommend trying to remove the design docs through curl. Maybe something odd was going on in there, and adding them back incrementally will get us to the right place.
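For the curl step, something along these lines is a rough sketch rather than an exact recipe. It assumes the design documents are served on the views port 8092, and the bucket name (default), design doc name (dev_geo), and credentials (Administrator:password) are placeholders that do not come from this thread; substitute whatever is actually on the node.

```shell
#!/bin/sh
# Sketch only. Assumed, not taken from the thread: the bucket and design
# document names used in the examples and the Administrator:password login.
CB_HOST="${CB_HOST:-localhost}"
CB_AUTH="${CB_AUTH:-Administrator:password}"

# Build the URL of one design document on the views port (8092).
ddoc_url() {
    printf 'http://%s:8092/%s/_design/%s\n' "$CB_HOST" "$1" "$2"
}

# List the design documents of a bucket through the admin port (8091).
list_ddocs() {
    curl --silent --show-error --max-time 10 -u "$CB_AUTH" \
        "http://$CB_HOST:8091/pools/default/buckets/$1/ddocs"
}

# Delete one design document. --max-time keeps curl from waiting
# indefinitely if the server accepts the connection but never answers.
delete_ddoc() {
    curl --silent --show-error --max-time 10 -u "$CB_AUTH" \
        -X DELETE "$(ddoc_url "$1" "$2")"
}

# Example with the placeholder names:
#   list_ddocs default
#   delete_ddoc default dev_geo
```

If the server is not listening at all, both calls will simply fail with a connection error, which is itself useful information for the issue.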


#3

Of course, Matt, and glad to see you’re still at it, making Couchbase harder/better/faster/stronger!

Looks like I can’t access the REST API any more than the admin console:

ubuntu@ip-N-N-N-N:~$ sudo /etc/init.d/couchbase-server start
 * Started couchbase-server
ubuntu@ip-N-N-N-N:~$ curl localhost:8092
curl: (7) couldn't connect to host

(Note the “started couchbase-server” appears after a very long delay…the first time it actually complained of timeout instead [though the processes seemed to be up], so I issued another stop and retried to get the above log.)
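A check like the one above can be repeated for both ports with a bounded wait, so a hung server cannot stall the shell. This is a minimal sketch, assuming curl is installed and that the node uses the 8091 and 8092 ports already mentioned in this thread; the function names are made up for illustration.

```shell
#!/bin/sh
# Report whether anything answers HTTP on a local port.
# curl exits 0 for any HTTP reply (even 401 or 404), so "responding"
# here only means the connection was accepted and a reply came back.
probe_port() {
    if curl --silent --output /dev/null --max-time 5 "http://localhost:$1/"; then
        echo "port $1: responding"
    else
        echo "port $1: not responding"
    fi
}

# Probe the admin/REST port (8091) and the views port (8092) in turn.
check_couchbase_ports() {
    for port in 8091 8092; do
        probe_port "$port"
    done
}

# Example:
#   check_couchbase_ports
```

Running it a few times after a start shows whether the ports ever come up or stay closed while the CPU is pegged.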

I am also unable to file an issue: JIRA will not accept my forum password, and both the “Register for couchbase.org?” and “Forgot password?” options just take me back to this forum. I will PM you a link to the output of sudo /opt/couchbase/bin/cbcollect_info --single-node-diag idle_hang.zip once it finishes.

Unless you have any objections, I will track down the data directories and delete them to see if this resolves the issue. All I had on this was a few “fake” documents and a little view to test some new geospatial indexing features (which worked when the server did).

Thanks for your concern!


#4

Update: found an older password maybe from a years-ago JIRA account, so that issue is halfway taken care of. I “filed” this forum post for the JIRA links: JIRA registration/reset links broken

(And looks like the cbcollect_info is done, so expect that soon! UPDATE: filed http://issues.couchbase.com/browse/MB-13403)