Node died on write commit failure, unable to restart/recover node. help!


#1

Hi guys, I have set up two Couchbase (version 3) nodes and was trying to insert about 100 million documents for testing purposes, but it failed half-way through (at about 30 million) with the error “Write Commit Failure. Disk write failed for item in Bucket "myhoney" on node”.
I’m not sure why this happened, but it caused one of the nodes to die (status: down), which prevents me from accessing the web admin on that node on port 8091. I can still connect to the web admin on the other node just fine, and I went on to fail over the dead node.

I’m not sure how to recover the dead node. What I did try was restarting it, but for some reason it keeps giving me “connection timed out”.

I’m still new to Couchbase, so does anyone know why I was getting the “Write commit failure” error in the first place? And how do I recover/restart this node without reinstalling Couchbase?

I’m not sure if this is useful, but this is what I found in info.log:

[user:info,2015-06-10T13:07:03.702,ns_1@ec2-54-66-131-63.ap-southeast-2.compute.amazonaws.com:<0.15588.56>:menelaus_web_alerts_srv:global_alert:81]Write Commit Failure. Disk write failed for item in Bucket "myhoney" on node ec2-54-66-131-63.ap-southeast-2.compute.amazonaws.com.
[ale_logger:error,2015-06-10T13:07:05.597,ns_1@ec2-54-66-131-63.ap-southeast-2.compute.amazonaws.com:ale<0.35.0>:ale:handle_info:253]ale_reports_handler terminated with reason {'EXIT',
                                            {noproc,
                                             {gen_server,call,
                                              ['sink-disk_debug',
                                               {log,
                                                <<"[error_logger:error,2015-06-10T13:07:05.591,ns_1@ec2-54-66-131-63.ap-southeast-2.compute.amazonaws.com:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]\n=========================CRASH REPORT=========================\n  crasher:\n    initial call: ale_disk_sink:-spawn_worker/1-fun-0-/0\n    pid: <0.193.57>\n    registered_name: []\n    exception error: no match of right hand side value {error,enospc}\n      in function  ale_disk_sink:'-write_data/3-fun-0-'/2 (src/ale_disk_sink.erl, line 487)\n      in call from ale_disk_sink:time_stat/3 (src/ale_disk_sink.erl, line 527)\n      in call from ale_disk_sink:write_data/3 (src/ale_disk_sink.erl, line 485)\n      in call from ale_disk_sink:worker_loop/1 (src/ale_disk_sink.erl, line 450)\n    ancestors: ['sink-disk_debug',ale_dynamic_sup,ale_sup,<0.31.0>]\n    messages: []\n    links: [<0.183.57>,#Port<0.197089>]\n    dictionary: []\n    trap_exit: false\n    status: running\n    heap_size: 610\n    stack_size: 27\n    reductions: 596\n  neighbours:\n\n">>},
                                               infinity]}}}; restarting

And here is what I found in memcached.log.13.txt:

Wed Jun 10 08:16:23.622715 UTC 3: 215 Closing connection due to read error: Connection timed out
Wed Jun 10 08:16:23.622857 UTC 3: 272 Closing connection due to read error: Connection timed out
Wed Jun 10 08:16:23.622872 UTC 3: 273 Closing connection due to read error: Connection timed out
Wed Jun 10 08:16:23.622893 UTC 3: 620 Closing connection due to read error: Connection timed out
Wed Jun 10 08:16:23.622904 UTC 3: 623 Closing connection due to read error: Connection timed out
Wed Jun 10 08:16:23.622879 UTC 3: 622 Closing connection due to read error: Connection timed out
Wed Jun 10 08:16:52.092113 UTC 3: (myhoney) Requst to vbucket 1023 deletion is in EWOULDBLOCK until the database file is removed from disk
Wed Jun 10 08:16:52.095526 UTC 3: (myhoney) Deletion of vbucket 1023 was completed.
Wed Jun 10 08:16:52.096006 UTC 3: (myhoney) Requst to vbucket 1022 deletion is in EWOULDBLOCK until the database file is removed from disk
Wed Jun 10 08:16:52.099495 UTC 3: (myhoney) Deletion of vbucket 1022 was completed.
Wed Jun 10 08:16:52.099809 UTC 3: (myhoney) Requst to vbucket 1021 deletion is in EWOULDBLOCK until the database file is removed from disk
Wed Jun 10 08:16:52.103598 UTC 3: (myhoney) Deletion of vbucket 1021 was completed.

#2

A Write Commit Failure means Couchbase Server cannot write data to the filesystem. This is normally caused by a hardware or operating system problem, such as a bad disk or bad file permissions. In your case, the crash report you posted shows {error,enospc} (ENOSPC, “no space left on device”), so the volume holding the data or log directory is most likely full.
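A quick way to confirm is to check free space on the partition holding the Couchbase data directory. This is just a sketch: /opt/couchbase/var/lib/couchbase is the default path on a Linux package install, so adjust it if your data path differs.

```shell
# Check usage of the filesystem holding the Couchbase data directory.
# /opt/couchbase/var/lib/couchbase is the default Linux install location;
# fall back to / if that path is not present on this machine.
DATA_DIR=/opt/couchbase/var/lib/couchbase
[ -d "$DATA_DIR" ] || DATA_DIR=/
df -hP "$DATA_DIR"   # a volume at 100% Use% explains {error,enospc}
```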

If the disk is bad, that could also prevent Couchbase Server from starting. Given that you are on AWS EC2, I personally would not spend any time trying to recover the node unless there is data on it you need. I assume that when you failed it over you had replicas enabled, so all your data should currently be on the single remaining node.

I would create a new node, add it to the cluster, rebalance, and simply terminate the old node.
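If you prefer the command line over the web UI, something like the following should do it. This is a sketch only: the host names, the Administrator user, and the password are placeholders, and it assumes couchbase-cli from a Couchbase Server 3.x install is on your path.

```shell
# 1. Fail over the dead node (skip if you already did this in the web UI).
couchbase-cli failover -c surviving-node:8091 -u Administrator -p password \
  --server-failover=dead-node:8091

# 2. Add a fresh node and rebalance in one step; the rebalance redistributes
#    the vbuckets (and recreates replicas) across the two healthy nodes.
couchbase-cli rebalance -c surviving-node:8091 -u Administrator -p password \
  --server-add=new-node:8091 \
  --server-add-username=Administrator \
  --server-add-password=password

# 3. Once the rebalance completes, terminate the old EC2 instance.
```

These commands need a live cluster to run against, so treat them as a template rather than something to copy verbatim.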


#3

Right, you’re spot on. I found out that I had run out of disk space, hence the write commit failure.
Thanks