We are running Couchbase Server 4 on an Ubuntu AWS instance. The server is crashing and failing to recover on a regular basis.
Here are some relevant messages from the log:
Service ‘goxdcr’ exited with status 1. Restarting. Messages: MetadataService 2015-11-04T23:32:58.053Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=2
MetadataService 2015-11-04T23:32:58.053Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=3
MetadataService 2015-11-04T23:32:58.053Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=4
RemoteClusterService 2015-11-04T23:32:58.053Z [ERROR] Failed to get all entries, err=metakv failed for max number of retries = 5
[goport] 2015/11/04 23:32:58 /opt/couchbase/bin/goxdcr terminated: exit status 1 ns_log000 ns_1@127.0.0.1 23:34:29 - Wed Nov 4, 2015
Service ‘ns_server’ exited with status 137. Restarting. Messages: working as port ns_log000 ns_1@127.0.0.1 23:34:29 - Wed Nov 4, 2015
Service ‘memcached’ exited with status 137. Restarting. Messages: 2015-11-04T22:56:41.120171Z WARNING 53: Slow STAT operation on connection (127.0.0.1:54121 => 127.0.0.1:11209): 10812 ms
2015-11-04T23:17:01.017534Z WARNING 46: Slow STAT operation on connection (127.0.0.1:46253 => 127.0.0.1:11209): 1702 ms
2015-11-04T23:20:49.409502Z WARNING 53: Slow STAT operation on connection (127.0.0.1:51925 => 127.0.0.1:11209): 15320 ms
2015-11-04T23:21:53.118576Z WARNING 49 Closing connection [127.0.0.1:44584 - 127.0.0.1:11209] due to read error: Connection reset by peer
2015-11-04T23:21:53.124936Z WARNING 46 Closing connection [127.0.0.1:53585 - 127.0.0.1:11209] due to read error: Connection reset by peer ns_log000 ns_1@127.0.0.1 23:34:28 - Wed Nov 4, 2015
Hi, I would recommend running a ‘collect info’, which can be done from the Web UI, and then filing an issue, since the logs above aren’t enough to identify the problem.
Firstly, thanks for filing that issue; it is tracked as MB-16766. Our engineer looked at it and, as you can see from his comments, we are working on fixing it in our next maintenance release. However, in your particular case, the reason you’re seeing this issue is the undersized AWS instance you’re running tests on.
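As a rough sanity check on the statuses in the log above, here is a minimal sketch that decodes them. It assumes the "status 137" values follow the common shell convention of 128 plus the signal number; that is an assumption about how the status is reported, not something confirmed in this thread. The function name `decode_exit_status` is made up for illustration.

```python
import signal


def decode_exit_status(status):
    """Split a shell-style exit status into ("exit", code) or ("signal", name).

    Assumes the 128 + N convention for processes terminated by signal N.
    """
    if status > 128:
        signum = status - 128
        try:
            return ("signal", signal.Signals(signum).name)
        except ValueError:
            # Signal number not known on this platform; report it as-is.
            return ("signal", str(signum))
    return ("exit", status)


# Statuses taken from the log excerpt in the first post.
for service, status in [("goxdcr", 1), ("ns_server", 137), ("memcached", 137)]:
    print(service, decode_exit_status(status))
```

Under that convention, 137 is 128 + 9, which is SIGKILL on Linux. A SIGKILL is often (though not always) what the kernel sends when a machine runs out of memory, which would be consistent with an undersized instance.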
Hope that helps.
Thanks very much for the link to the tracked issue with comments from your engineer, and thanks for highlighting the fact that the bug is being triggered by an undersized AWS instance.
We’ve been experiencing this issue constantly on the exact same version of Couchbase, and it’s frustrating because it’s a critical issue and there’s no patch out yet. Our environment is Ubuntu 14.04 LTS on vSphere. We’ve investigated further and found that the errors shown in the web admin log tab are not the cause of the issue.
In the system log (syslog) we found a “segmentation fault” in the “beam.smp” process, which is essentially the main Couchbase process. The error mentioned by @Alfie shows up at the moment the segfault happens.
The load of the 6 VMs is not heavy at all.
See syslog line:
Mar 16 12:05:01 cb01 kernel: [12682725.375185] beam.smp[57252]: segfault at 3fe18ee00019 ip 00007f33662b3f26 sp 00007f33979bc290 error 4 in libv8.so[7f3365f0d000+6da000]
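In case it helps anyone comparing crashes across nodes, here is a small sketch that pulls the fields out of a kernel segfault line like the one above and computes the faulting offset inside the library (instruction pointer minus the library's load address). The regex and the name `parse_segfault` are my own and only cover this line format.

```python
import re

# Matches kernel lines of the form:
#   proc[pid]: segfault at ADDR ip IP sp SP error N in lib[BASE+SIZE]
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict of fields from a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "process": m["proc"],
        "pid": int(m["pid"]),
        "library": m["lib"],
        # Offset of the faulting instruction relative to the library's load address.
        "offset": ip - base,
    }


LINE = (
    "Mar 16 12:05:01 cb01 kernel: [12682725.375185] beam.smp[57252]: "
    "segfault at 3fe18ee00019 ip 00007f33662b3f26 sp 00007f33979bc290 "
    "error 4 in libv8.so[7f3365f0d000+6da000]"
)

info = parse_segfault(LINE)
print(info["process"], info["pid"], info["library"], hex(info["offset"]))
```

If the same offset turns up in every crash, that suggests the fault is always in the same spot in libv8.so. With debug symbols for that library, the offset could be passed to a tool such as addr2line, though I have not tried that here.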
The apport crash (except from the bindump) on Ubuntu:
We were able to get around the issue while it’s still outstanding by moving to a larger sized AWS instance with more memory. I hope that helps you until the bug is fixed.
I am using Version 4.1.0-5005 Enterprise Edition (build-5005) on Ubuntu 12.04.5 LTS (GNU/Linux 3.2.0-90-generic x86_64), and I have the same issue.
My Couchbase Server restarts automatically and frequently. The log in the admin console says:
Service 'goxdcr' exited with status 1. Restarting. Messages: MetadataService 2016-05-05T11:54:47.790+08:00 [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=3
MetadataService 2016-05-05T11:54:47.790+08:00 [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=4
RemoteClusterService 2016-05-05T11:54:47.791+08:00 [ERROR] Failed to get all entries, err=metakv failed for max number of retries = 5
Error starting remote cluster service. err=metakv failed for max number of retries = 5
[goport] 2016/05/05 11:54:47 /opt/couchbase/bin/goxdcr terminated: exit status 1