Couchbase Version 4.0.0-4051 Crashing Regularly on Ubuntu AWS Instance

Alfie · November 6, 2015, 1:17pm

Hello

We are running Couchbase Server 4 on an Ubuntu AWS instance. The server is crashing and failing to recover on a regular basis.

Here are some relevant messages from the log:

Service ‘goxdcr’ exited with status 1. Restarting. Messages: MetadataService 2015-11-04T23:32:58.053Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=2
MetadataService 2015-11-04T23:32:58.053Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=3
MetadataService 2015-11-04T23:32:58.053Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=4
RemoteClusterService 2015-11-04T23:32:58.053Z [ERROR] Failed to get all entries, err=metakv failed for max number of retries = 5
[goport] 2015/11/04 23:32:58 /opt/couchbase/bin/goxdcr terminated: exit status 1 ns_log000 ns_1@127.0.0.1 23:34:29 - Wed Nov 4, 2015
Service ‘ns_server’ exited with status 137. Restarting. Messages: working as port ns_log000 ns_1@127.0.0.1 23:34:29 - Wed Nov 4, 2015
Service ‘memcached’ exited with status 137. Restarting. Messages: 2015-11-04T22:56:41.120171Z WARNING 53: Slow STAT operation on connection (127.0.0.1:54121 => 127.0.0.1:11209): 10812 ms
2015-11-04T23:17:01.017534Z WARNING 46: Slow STAT operation on connection (127.0.0.1:46253 => 127.0.0.1:11209): 1702 ms
2015-11-04T23:20:49.409502Z WARNING 53: Slow STAT operation on connection (127.0.0.1:51925 => 127.0.0.1:11209): 15320 ms
2015-11-04T23:21:53.118576Z WARNING 49 Closing connection [127.0.0.1:44584 - 127.0.0.1:11209] due to read error: Connection reset by peer
2015-11-04T23:21:53.124936Z WARNING 46 Closing connection [127.0.0.1:53585 - 127.0.0.1:11209] due to read error: Connection reset by peer ns_log000 ns_1@127.0.0.1 23:34:28 - Wed Nov 4, 2015
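
A note on reading the exit statuses in the log above. By the usual Unix shell convention, a status above 128 means the process was ended by a signal whose number is the status minus 128, so 137 would correspond to signal 9 (SIGKILL). On Linux, a SIGKILL delivered to a large process is often, though not always, the work of the kernel's out-of-memory killer. The sketch below assumes the statuses in this log follow that convention; the function name `describe_exit_status` is made up for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (assumes the 128 + signal convention)."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited on its own with code {status}"


# Statuses copied from the log excerpt above.
for service, status in [("goxdcr", 1), ("ns_server", 137), ("memcached", 137)]:
    print(f"{service}: {describe_exit_status(status)}")
```

On a Linux machine this reports ns_server and memcached as killed by SIGKILL, while goxdcr exited on its own with code 1.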

Could you please advise and help?

Thank you.

shreyas · November 7, 2015, 6:26am

Same issue here with v4 cluster on Azure. Decided to move back to v3 now.

Alfie · November 8, 2015, 12:17pm

Hello

The Couchbase server has crashed again this morning. This is a real concern. I would hugely appreciate a response from you.

Thank you.

anil · November 9, 2015, 11:56pm

Hi, I would recommend running ‘collect info’, which can be done from the Web UI, and then filing an issue with the output attached, since the logs above aren’t enough to identify the problem.

Alfie · November 11, 2015, 9:18am

Hi Anil

Thanks very much for your reply and the links. I’ll raise an issue and attach the collected information.

lmajano · February 19, 2016, 4:41pm

Did you get any response on this issue? I am seeing the same symptoms.

Alfie · February 19, 2016, 6:54pm

Hi. I’ve sent all the requested logs and other information but have not heard back.

anil · February 22, 2016, 7:55pm

Hi Alfie,

Firstly, thanks for filing that issue; it is tracked here: MB-16766. Our engineer looked at the issue and, as you can see from his comments, we are working on fixing it in our next maintenance release. However, in your particular case, the reason you’re seeing this issue is the undersized AWS instance you’re running tests on.
Hope that helps.

Thanks
Anil Kumar

Alfie · February 23, 2016, 8:57am

Hi Anil

Thanks very much for the link to the tracked issue with comments from your engineer, and thank you for highlighting the fact that the bug is being triggered by an undersized AWS instance.

Alfie

vtomasr5 · March 17, 2016, 10:13am

Hi all,

We’ve been experiencing this issue constantly on the exact same version of Couchbase, and it’s frustrating because it’s a critical issue and there’s no patch out yet. Our environment is Ubuntu 14.04 LTS on vSphere. We’ve investigated further and found that the errors shown in the web admin log tab are not the cause of the issue.

In the system logs (syslog) we found a “segmentation fault” in the “beam.smp” process, which is essentially the main Couchbase process. The error mentioned by @Alfie is shown at the moment this happens.
The load on the 6 VMs is not heavy at all.

See syslog line:

Mar 16 12:05:01 cb01 kernel: [12682725.375185] beam.smp[57252]: segfault at 3fe18ee00019 ip 00007f33662b3f26 sp 00007f33979bc290 error 4 in libv8.so[7f3365f0d000+6da000]
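
For anyone comparing crashes across nodes, the fields in a kernel segfault line like the one above can be pulled apart mechanically. The following is a rough Python sketch, not an official tool: the names `SEGFAULT_RE` and `parse_segfault` are invented here, and the reading of the `error` value assumes the standard x86 page-fault error bits (bit 0: page present, bit 1: write access, bit 2: user mode). Subtracting the library's load address from the instruction pointer gives the offset of the faulting instruction inside libv8.so.

```python
import re

# Pattern for the x86 Linux kernel "segfault at ..." message format seen above.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line: str) -> dict:
    """Extract process, library, and in-library offset from a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m["err"])
    return {
        "process": m["proc"],
        "pid": int(m["pid"]),
        "library": m["lib"],
        # Offset of the faulting instruction relative to the library base.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "page_present": bool(err & 1),
    }


line = (
    "Mar 16 12:05:01 cb01 kernel: [12682725.375185] beam.smp[57252]: "
    "segfault at 3fe18ee00019 ip 00007f33662b3f26 sp 00007f33979bc290 "
    "error 4 in libv8.so[7f3365f0d000+6da000]"
)
info = parse_segfault(line)
print(info["process"], info["library"], hex(info["offset"]), info["mode"], info["access"])
```

For this line the offset comes out as 0x3a6f26, and error 4 reads as a user-mode read of an unmapped page. If the same offset shows up on every node, that would suggest one specific code path in libv8.so; the offset could in principle be passed to a tool such as addr2line, provided matching debug symbols are available.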

The apport crash report on Ubuntu (everything except the binary dump):

ProblemType: Crash
Architecture: amd64
Date: Tue Mar 15 17:42:49 2016
DistroRelease: Ubuntu 14.04
ExecutablePath: /opt/couchbase/lib/erlang/erts-5.10.4.0.0.1/bin/beam.smp
ExecutableTimestamp: 1442599866
ProcCmdline: /opt/couchbase/lib/erlang/erts-5.10.4.0.0.1/bin/beam.smp -P 327680 -K true -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -pa /opt/couchbase/lib/erlang/lib/appmon-2.1.14.2/ebin /opt/couchbase/lib/erlang/lib/asn1-2.0.4/ebin /opt/couchbase/lib/erlang/lib/common_test-1.7.4/ebin /opt/couchbase/lib/erlang/lib/compiler-4.9.4/ebin /opt/couchbase/lib/erlang/lib/cosEvent-2.1.14/ebin /opt/couchbase/lib/erlang/lib/cosEventDomain-1.1.13/ebin /opt/couchbase/lib/erlang/lib/cosFileTransfer-1.1.15/ebin /opt/couchbase/lib/erlang/lib/cosNotification-1.1.20/ebin /opt/couchbase/lib/erlang/lib/cosProperty-1.1.16/ebin /opt/couchbase/lib/erlang/lib/cosTime-1.1.13/ebin /opt/couchbase/lib/erlang/lib/cosTransactions-1.2.13/ebin /opt/couchbase/lib/erlang/lib/crypto-3.2/ebin /opt/couchbase/lib/erlang/lib/dialyzer-2.6.1/ebin /opt/couchbase/lib/erlang/lib/diameter-1.5/ebin /opt/couchbase/lib/erlang/lib/edoc-0.7.12.1/ebin /opt/couchbase/lib/erlang/lib/eldap-1.0.2/ebin /opt/couchbase/lib/erlang/lib/erl_docgen-0.3.4.1/ebin /opt/couchbase/lib/erlang/lib/erl_interface-3.7.15 /opt/couchbase/lib/erlang/lib/erts-5.10.4.0.0.1/ebin /opt/couchbase/lib/erlang/lib/et-1.4.4.5/ebin /opt/couchbase/lib/erlang/lib/eunit-2.2.6/ebin /opt/couchbase/lib/erlang/lib/gs-1.5.15.2/ebin /opt/couchbase/lib/erlang/lib/hipe-3.10.2.2/ebin /opt/couchbase/lib/erlang/lib/ic-4.3.4/ebin /opt/couchbase/lib/erlang/lib/inets-5.9.8/ebin /opt/couchbase/lib/erlang/lib/mnesia-4.11/ebin /opt/couchbase/lib/erlang/lib/orber-3.6.26.1/ebin /opt/couchbase/lib/erlang/lib/os_mon-2.2.14/ebin /opt/couchbase/lib/erlang/lib/otp_mibs-1.0.8/ebin /opt/couchbase/lib/erlang/lib/parsetools-2.0.10/ebin /opt/couchbase/lib/erlang/lib/percept-0.8.8.2/ebin /opt/couchbase/lib/erlang/lib/pman-2.7.1.4/ebin /opt/couchbase/lib/erlang/lib/public_key-0.21/ebin /opt/couchbase/lib/erlang/lib/reltool-0.6.4.1/ebin /opt/couchbase/lib/erlang/lib/runtime_tools-1.8.13/ebin /opt/couchbase/lib/erlang/lib/sasl-2.3.4/ebin
/opt/couchbase/lib/erlang/lib/snmp-4.25/ebin /opt/couchbase/lib/erlang/lib/ssh-3.0/ebin /opt/couchbase/lib/erlang/lib/ssl-5.3.3/ebin /opt/couchbase/lib/erlang/lib/syntax_tools-1.6.13/ebin /opt/couchbase/lib/erlang/lib/test_server-3.6.4/ebin /opt/couchbase/lib/erlang/lib/toolbar-1.4.2.3/ebin /opt/couchbase/lib/erlang/lib/tools-2.6.13/ebin /opt/couchbase/lib/erlang/lib/tv-2.1.4.10/ebin /opt/couchbase/lib/erlang/lib/typer-0.9.5/ebin /opt/couchbase/lib/erlang/lib/webtool-0.8.9.2/ebin /opt/couchbase/lib/erlang/lib/xmerl-1.3.6/ebin /opt/couchbase/lib/couchdb/plugins/gc-couchbase-1.0.0/ebin /opt/couchbase/lib/couchdb/plugins/vtree-0.1.0/ebin /opt/couchbase/lib/couchdb/plugins/wkb-1.2.0/ebin /opt/couchbase/lib/couchdb/erlang/lib/couch-1.2.0a-961ad59-git/ebin /opt/couchbase/lib/couchdb/erlang/lib/couch_dcp-1.0.0/ebin /opt/couchbase/lib/couchdb/erlang/lib/couch_index_merger-1.0.0/ebin /opt/couchbase/lib/couchdb/erlang/lib/couch_set_view-1.0.0/ebin /opt/couchbase/lib/couchdb/erlang/lib/couch_view_parser-1.0/ebin /opt/couchbase/lib/couchdb/erlang/lib/ejson-0.1.0/ebin /opt/couchbase/lib/couchdb/erlang/lib/erlang-oauth/ebin /opt/couchbase/lib/couchdb/erlang/lib/etap/ebin /opt/couchbase/lib/couchdb/erlang/lib/lhttpc-1.3/ebin /opt/couchbase/lib/couchdb/erlang/lib/mapreduce-1.0/ebin /opt/couchbase/lib/couchdb/erlang/lib/mochiweb-1.4.1/ebin /opt/couchbase/lib/couchdb/erlang/lib/snappy-1.0.4/ebin /opt/couchbase/lib/ns_server/erlang/lib/ale/ebin /opt/couchbase/lib/ns_server/erlang/lib/gen_smtp/ebin /opt/couchbase/lib/ns_server/erlang/lib/mlockall/ebin /opt/couchbase/lib/ns_server/erlang/lib/ns_babysitter/ebin /opt/couchbase/lib/ns_server/erlang/lib/ns_couchdb/ebin /opt/couchbase/lib/ns_server/erlang/lib/ns_server/ebin /opt/couchbase/lib/ns_server/erlang/lib/ns_ssl_proxy/ebin /opt/couchbase/lib/erlang/lib/stdlib-1.19.4/ebin /opt/couchbase/lib/erlang/lib/kernel-2.16.4/ebin . 
-couch_ini /opt/couchbase/etc/couchdb/default.ini /opt/couchbase/etc/couchdb/default.d/capi.ini /opt/couchbase/etc/couchdb/default.d/geocouch.ini /opt/couchbase/etc/couchdb/local.ini -setcookie mqquwpxpapuonogl -name couchdb_ns_1@127.0.0.1
ProcCwd: /opt/couchbase/var/lib/couchbase
ProcEnviron:
LC_TIME=es_ES.UTF-8
LD_LIBRARY_PATH=
LC_MONETARY=es_ES.UTF-8
TERM=xterm
PATH=(custom, no user)
LC_ADDRESS=es_ES.UTF-8
LC_TELEPHONE=es_ES.UTF-8
LANG=es_ES.UTF-8
SHELL=/bin/bash
LC_NAME=es_ES.UTF-8
LC_MEASUREMENT=es_ES.UTF-8
LC_IDENTIFICATION=es_ES.UTF-8
LC_NUMERIC=es_ES.UTF-8
LC_PAPER=es_ES.UTF-8

I hope this helps to solve this serious issue.

Thanks.

Alfie · March 18, 2016, 7:15pm

Thanks for your post.

We were able to work around the issue while it’s still outstanding by moving to a larger AWS instance with more memory. I hope that helps you until the bug is fixed.

vtomasr5 · March 19, 2016, 7:00pm

Thanks for answering so quickly. We’ll definitely try that.

Let’s hope for a quick fix.

atom_yang · May 5, 2016, 4:53am

I am using Version 4.1.0-5005 Enterprise Edition (build-5005) on Ubuntu 12.04.5 LTS (GNU/Linux 3.2.0-90-generic x86_64), and I have the same issue.
My Couchbase Server restarts automatically and frequently. The log in the admin console says:

Service 'goxdcr' exited with status 1. Restarting. Messages: MetadataService 2016-05-05T11:54:47.790+08:00 [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=3
MetadataService 2016-05-05T11:54:47.790+08:00 [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=4
RemoteClusterService 2016-05-05T11:54:47.791+08:00 [ERROR] Failed to get all entries, err=metakv failed for max number of retries = 5
Error starting remote cluster service. err=metakv failed for max number of retries = 5
[goport] 2016/05/05 11:54:47 /opt/couchbase/bin/goxdcr terminated: exit status 1
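
The "connection refused" in these goxdcr messages means nothing was accepting TCP connections on 127.0.0.1:8091 at that moment, which would fit the cluster manager itself being down, with goxdcr failing as a consequence. A quick way to check this while the errors are occurring is to probe the port. This is a minimal Python sketch; the helper name `port_is_open` is made up for illustration, and 8091 is taken from the log lines above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port 8091 is the address goxdcr fails to reach in the log above.
    print("8091 reachable:", port_is_open("127.0.0.1", 8091))
```

If this prints False at the times goxdcr exits, the goxdcr errors are more likely a symptom of the main server process going away than a separate fault.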