Cluster stability

Hello
I've recently been having issues with cluster stability. During peak hours one of the servers restarts itself, making the whole cluster unavailable.
Before I paste the log entries, I have a question about automatic failover (it is currently off). If a server is failed over, will it rejoin the cluster on its own once it is back up?
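
In case it matters, this is roughly the flow I had in mind for turning auto-failover on and then bringing a failed-over node back in afterwards. It is only a sketch against the REST API with placeholder credentials and guessed hostnames for the other nodes (only cb2 and cb4 show up in my logs), so please correct me if the endpoints or the flow are wrong:

```python
# Sketch of the flow I have in mind (placeholder credentials, guessed hostnames).
import requests

BASE = "http://cb1.savecart:8091"      # any node of the cluster (cb1 is a guess)
AUTH = ("Administrator", "password")   # placeholder credentials

# 1. Enable auto-failover with a 120 s timeout.
requests.post(f"{BASE}/settings/autoFailover", auth=AUTH,
              data={"enabled": "true", "timeout": "120"}).raise_for_status()

# 2. Once the failed-over node is healthy again, mark it for recovery
#    ("full" rebuilds it from replicas; not sure if "delta" is available to us).
requests.post(f"{BASE}/controller/setRecoveryType", auth=AUTH,
              data={"otpNode": "ns_1@cb2.savecart",
                    "recoveryType": "full"}).raise_for_status()

# 3. Rebalance to take the node back into the cluster.
known = ",".join(f"ns_1@cb{i}.savecart" for i in range(1, 5))
requests.post(f"{BASE}/controller/rebalance", auth=AUTH,
              data={"knownNodes": known, "ejectedNodes": ""}).raise_for_status()
```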

I have a 4-node cluster running Couchbase 4.1.0-5005 Community Edition (build-5005). Each machine has 32 GB RAM and SSD storage, and the Data RAM quota is set to 25 GB. There are 3 buckets: one main bucket with 145M docs and two auxiliary ones which are mostly empty (~2K docs). We see about 700-800 ops per second during normal usage and 3-4K ops per second when CRON'ed jobs are running.
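
In case it helps, I can pull the same numbers with a quick read-only script against the REST API (hostname and credentials below are placeholders; I'm reading memoryQuota and the per-bucket basicStats, which I believe is roughly what the UI shows):

```python
# Quick read-only check of the quota and per-bucket numbers quoted above
# (placeholder hostname and credentials).
import requests

BASE = "http://cb1.savecart:8091"
AUTH = ("Administrator", "password")

pool = requests.get(f"{BASE}/pools/default", auth=AUTH).json()
print("Per-node data RAM quota:", pool["memoryQuota"], "MB")

for bucket in requests.get(f"{BASE}/pools/default/buckets", auth=AUTH).json():
    stats = bucket["basicStats"]
    print(bucket["name"],
          "- items:", stats["itemCount"],
          "- mem used:", round(stats["memUsed"] / 1024 ** 3, 1), "GB",
          "- ops/s:", stats["opsPerSec"])
```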

Yesterday one server had a few restarts in a row. I see the following errors in the log section:

Service ‘memcached’ exited with status 137. Restarting. Messages: 2017-08-03T22:10:19.944176+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 274) stream created with start seqno 4546687 and end seqno 4546703
2017-08-03T22:10:20.011078+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 274) Sending disk snapshot with start seqno 4546687 and end seqno 4546703
2017-08-03T22:10:20.031143+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 274) Backfill complete, 0 items read from disk 11 from memory, last seqno read: 4546703
2017-08-03T22:10:20.031156+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 274) Backfill task (4546688 to 4546702) finished
2017-08-03T22:10:20.031243+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 274) Stream closing, 11 items sent from backfill phase, 0 items sent from memory phase, 4546703 was last seqno sent, reason: The stream ended due to all items being streamed

Followed by:

Control connection to memcached on ‘ns_1@cb2.savecart’ disconnected: {error, closed}

Then the buckets started to come back up. When the two small buckets were up, I again got Control connection to memcached on ‘ns_1@cb2.savecart’ disconnected: {error, closed} (repeated 2 times).

But the third bucket managed to come up.

6 minutes later:

Service ‘memcached’ exited with status 137. Restarting. Messages: 2017-08-03T21:56:22.632935+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 975) Stream closing, 0 items sent from backfill phase, 1 items sent from memory phase, 4542276 was last seqno sent, reason: The stream ended due to all items being streamed
2017-08-03T21:56:22.633530+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 984) stream created with start seqno 4404746 and end seqno 4404747
2017-08-03T21:56:22.633621+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 984) Stream closing, 0 items sent from backfill phase, 1 items sent from memory phase, 4404747 was last seqno sent, reason: The stream ended due to all items being streamed
2017-08-03T21:56:22.633839+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 989) stream created with start seqno 4291912 and end seqno 4291913
2017-08-03T21:56:22.633945+02:00 WARNING (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/dev_analysis (prod/main) - (vb 989) Stream closing, 0 items sent from backfill phase, 1 items sent from memory phase, 4291913 was last seqno sent, reason: The stream ended due to all items being streamed

And the same situation with the buckets coming back up repeated itself.

After two more restarts, a new error appeared:

Service ‘memcached’ exited with status 137. Restarting. Messages: 2017-08-03T22:14:01.192572+02:00 WARNING 84: Slow STAT operation on connection (127.0.0.1:56477 => 127.0.0.1:11209): 4269 ms
2017-08-03T22:14:01.194012+02:00 WARNING 109: Slow STAT operation on connection (127.0.0.1:42572 => 127.0.0.1:11209): 1878 ms
2017-08-03T22:14:02.846741+02:00 WARNING 110: Slow STAT operation on connection (127.0.0.1:33995 => 127.0.0.1:11209): 1279 ms
2017-08-03T22:14:02.846741+02:00 WARNING 112: Slow STAT operation on connection (127.0.0.1:37866 => 127.0.0.1:11209): 1716 ms
2017-08-03T22:14:02.849621+02:00 WARNING 123: Slow STAT operation on connection (127.0.0.1:35774 => 127.0.0.1:11209): 1338 ms

Later, two other errors appeared at the same time:

Service ‘goxdcr’ exited with status 1. Restarting. Messages: XmemNozzle 2017-08-03T22:31:02.149+02:00 [ERROR] xmem_06189a4a72b9f0553d754f56502b5e0a/default/savecart_bkp_cbbackup.savecart:11210_0 Received recoverable error in response. Response status=TMPFAIL, err = , response=[129 162 0 0 0 0 0 134 0 0 0 0 0 0 2 202 0 0 0 0 0 0 0 0]
XmemNozzle 2017-08-03T22:31:02.149+02:00 [ERROR] xmem_06189a4a72b9f0553d754f56502b5e0a/default/savecart_bkp_cbbackup.savecart:11210_0 Received recoverable error in response. Response status=TMPFAIL, err = , response=[129 162 0 0 0 0 0 134 0 0 0 0 0 0 2 203 0 0 0 0 0 0 0 0]
XmemNozzle 2017-08-03T22:31:02.149+02:00 [ERROR] xmem_06189a4a72b9f0553d754f56502b5e0a/default/savecart_bkp_cbbackup.savecart:11210_0 Received recoverable error in response. Response status=TMPFAIL, err = , response=[129 162 0 0 0 0 0 134 0 0 0 0 0 0 2 204 0 0 0 0 0 0 0 0]
XmemNozzle 2017-08-03T22:31:02.149+02:00 [ERROR] xmem_06189a4a72b9f0553d754f56502b5e0a/default/savecart_bkp_cbbackup.savecart:11210_0 Received recoverable error in response. Response status=TMPFAIL, err = , response=[129 162 0 0 0 0 0 134 0 0 0 0 0 0 2 205 0 0 0 0 0 0 0 0]
[goport] 2017/08/03 22:31:04 /opt/couchbase/bin/goxdcr terminated: signal: killed

Service ‘memcached’ exited with status 137. Restarting. Messages: 2017-08-03T22:29:52.215455+02:00 WARNING (reco) DCP (Producer) eq_dcpq:replication:ns_1@cb2.savecart->ns_1@cb4.savecart:reco - (vb 901) stream created with start seqno 6127 and end seqno 18446744073709551615
2017-08-03T22:29:52.215503+02:00 WARNING (reco) DCP (Producer) eq_dcpq:replication:ns_1@cb2.savecart->ns_1@cb4.savecart:reco - (vb 902) stream created with start seqno 6382 and end seqno 18446744073709551615
2017-08-03T22:29:52.215655+02:00 WARNING (reco) DCP (Producer) eq_dcpq:replication:ns_1@cb2.savecart->ns_1@cb4.savecart:reco - (vb 903) stream created with start seqno 6152 and end seqno 18446744073709551615
2017-08-03T22:29:52.215820+02:00 WARNING (reco) DCP (Producer) eq_dcpq:replication:ns_1@cb2.savecart->ns_1@cb4.savecart:reco - (vb 904) stream created with start seqno 6612 and end seqno 18446744073709551615
2017-08-03T22:30:37.297558+02:00 WARNING 70: Slow STAT operation on connection (127.0.0.1:34157 => 127.0.0.1:11209): 1398 ms
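
Since status 137 looks like the process getting SIGKILLed (128 + 9), I also want to check whether the kernel OOM killer is the one killing memcached on that node. This is the small check I plan to run there (it assumes a Linux box where the dmesg output is readable):

```python
# Look for OOM-killer activity in the kernel log on the affected node.
# Exit status 137 = 128 + 9, i.e. the process received SIGKILL.
import subprocess

kernel_log = subprocess.check_output(["dmesg"], universal_newlines=True)

for line in kernel_log.splitlines():
    lowered = line.lower()
    if "oom-killer" in lowered or "out of memory" in lowered:
        print(line)
```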

It took 40 minutes for the cluster to come back up. During the last attempt, it took 154 seconds for a small bucket to warm up and 214 seconds for the biggest one.
Our outsourced admins claim that we should upgrade to 4.5 and that it would solve the problem. Would it? The cluster has been up for about 200 days.