Couchbase 3.0 node goes down every weekend


#1

Our Couchbase 3.0 node goes down every weekend.
Version: 3.0.0 Enterprise Edition (build-1209)
Cluster State ID: 03B-020-217

Please see the logs below.
Could you please let me know how to avoid this failover?

Event     Module Code     Server Node     Time
Remote cluster reference "Virginia_to_OregonS" updated. New name is "VirginiaM_to_OregonS".     menelaus_web_remote_clusters000     ns_1ec2-####104.compute-1.amazonaws.com     12:46:38 - Mon Nov 17, 2014
Client-side error-report for user undefined on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com':
User-Agent:Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
Got unhandled error:
Script error.
At:
http://ph.couchbase.net/v2?callback=jQuery162012552191850461902_1416204362614&launchID=8eba0b18a4e965daf1c3a0baecec994c-1416208180553-3638&version=3.0.0-1209-rel-enterprise&_=1416208180556:0:0
Backtrace:
<generated>
generateStacktrace@http://ec2-####108 -.compute-1.amazonaws.com:8091/js/bugsnag.js:411:7
bugsnag@http://ec2-####108 -.compute-1.amazonaws.com:8091/js/bugsnag.js:555:13

    menelaus_web102     ns_1@ec2-####108 -.compute-1.amazonaws.com     12:45:56 - Mon Nov 17, 2014
Replication from bucket "apro" to bucket "apro" on cluster "Virginia_to_OregonS" created.     menelaus_web_xdc_replications000     ns_1@ec2-####108 -.compute-1.amazonaws.com     12:38:49 - Mon Nov 17, 2014
Replication from bucket "apro" to bucket "apro" on cluster "Virginia_to_OregonS" removed.     xdc_rdoc_replication_srv000     ns_1@ec2-####108 -.compute-1.amazonaws.com     12:38:40 - Mon Nov 17, 2014
Rebalance completed successfully.
    ns_orchestrator001     ns_1@ec2-####107.compute-1.amazonaws.com     11:53:17 - Mon Nov 17, 2014
Bucket "ifa" rebalance does not seem to be swap rebalance     ns_vbucket_mover000     ns_1@ec2-####107.compute-1.amazonaws.com     11:53:04 - Mon Nov 17, 2014
Started rebalancing bucket ifa     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     11:53:02 - Mon Nov 17, 2014
Could not automatically fail over node ('ns_1@ec2-####108 -.compute-1.amazonaws.com'). Rebalance is running.     auto_failover001     ns_1@ec2-####107.compute-1.amazonaws.com     11:49:58 - Mon Nov 17, 2014
Bucket "apro" rebalance does not seem to be swap rebalance     ns_vbucket_mover000     ns_1@ec2-####107.compute-1.amazonaws.com     11:48:02 - Mon Nov 17, 2014
Started rebalancing bucket apro     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     11:47:59 - Mon Nov 17, 2014
Bucket "apro" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 366 seconds.     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     11:47:58 - Mon Nov 17, 2014
Bucket "ifa" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 96 seconds.     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     11:43:29 - Mon Nov 17, 2014
Starting rebalance, KeepNodes = ['ns_1ec2-####104.compute-1.amazonaws.com',
'ns_1@ec2-####107.compute-1.amazonaws.com',
'ns_1@ec2-####108 -.compute-1.amazonaws.com'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@ec2-####108 -.compute-1.amazonaws.com'], Delta recovery buckets = all     ns_orchestrator004     ns_1@ec2-####107.compute-1.amazonaws.com     11:41:52 - Mon Nov 17, 2014
Control connection to memcached on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' disconnected: {badmatch,
{error,
closed}}     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:54 - Sun Nov 16, 2014
Node ('ns_1@ec2-####108 -.compute-1.amazonaws.com') was automatically failovered.
[stale,
{last_heard,{1416,152978,82869}},
{stale_slow_status,{1416,152863,60088}},
{now,{1416,152968,80503}},
{active_buckets,["apro","ifa"]},
{ready_buckets,["ifa"]},
{status_latency,5743},
{outgoing_replications_safeness_level,[{"apro",green},{"ifa",green}]},
{incoming_replications_conf_hashes,
[{"apro",
[{'ns_1ec2-####104.compute-1.amazonaws.com',126796989},
{'ns_1@ec2-####107.compute-1.amazonaws.com',41498822}]},
{"ifa",
[{'ns_1ec2-####104.compute-1.amazonaws.com',126796989},
{'ns_1@ec2-####107.compute-1.amazonaws.com',41498822}]}]},
{local_tasks,
[[{type,xdcr},
{id,<<"949dcce68db4b6d1add4c033ec4e32a9/apro/apro">>},
{errors,
[<<"2014-11-16 19:35:03 [Vb Rep] Error replicating vbucket 201. Please see logs for details.">>]},
{changes_left,220},
{docs_checked,51951817},
{docs_written,51951817},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,210},
{time_working,1040792.401734},
{time_committing,0.0},
{time_working_rate,0.9101340661254117},
{num_checkpoints,53490},
{num_failedckpts,1},
{wakeups_rate,11.007892659036528},
{worker_batches_rate,20.514709046386255},
{rate_replication,22.015785318073057},
{bandwidth_usage,880.6314127229223},
{rate_doc_checks,22.015785318073057},
{rate_doc_opt_repd,22.015785318073057},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1271.0828664152195},
{docs_latency_wt,20.514709046386255}],
[{type,xdcr},
{id,<<"fc72b1b0e571e9c57671d6621cac6058/apro/apro">>},
{errors,[]},
{changes_left,278},
{docs_checked,51217335},
{docs_written,51217335},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,269},
{time_working,1124595.930738},
{time_committing,0.0},
{time_working_rate,1.019751359238166},
{num_checkpoints,54571},
{num_failedckpts,3},
{wakeups_rate,6.50472893793788},
{worker_batches_rate,16.51200422707308},
{rate_replication,23.01673316501096},
{bandwidth_usage,936.6809670630547},
{rate_doc_checks,23.01673316501096},
{rate_doc_opt_repd,23.01673316501096},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1500.9621995190503},
{docs_latency_wt,16.51200422707308}],
[{type,xdcr},
{id,<<"16b1afb33dbcbde3d075e2ff634d9cc0/apro/apro">>},
{errors,
[<<"2014-11-16 19:21:55 [Vb Rep] Error replicating vbucket 258. Please see logs for details.">>,
<<"2014-11-16 19:22:41 [Vb Rep] Error replicating vbucket 219. Please see logs for details.">>,
<<"2014-11-16 19:23:04 [Vb Rep] Error replicating vbucket 315. Please see logs for details.">>,
<<"2014-11-16 20:06:40 [Vb Rep] Error replicating vbucket 643. Please see logs for details.">>,
<<"2014-11-16 20:38:20 [Vb Rep] Error replicating vbucket 651. Please see logs for details.">>]},
{changes_left,0},
{docs_checked,56060297},
{docs_written,56060297},
{active_vbreps,0},
{max_vbreps,4},
{waiting_vbreps,0},
{time_working,140073.119377},
{time_committing,0.0},
{time_working_rate,0.04649055712180432},
{num_checkpoints,103504},
{num_failedckpts,237},
{wakeups_rate,21.524796565643623},
{worker_batches_rate,22.52594989427821},
{rate_replication,22.52594989427821},
{bandwidth_usage,913.0518357147434},
{rate_doc_checks,22.52594989427821},
{rate_doc_opt_repd,22.52594989427821},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,13.732319632216313},
{docs_latency_wt,22.52594989427821}],
[{type,xdcr},
{id,<<"b734095ad63ea9832f9da1b1ef3449ac/apro/apro">>},
{errors,
[<<"2014-11-16 19:36:22 [Vb Rep] Error replicating vbucket 260. Please see logs for details.">>,
<<"2014-11-16 19:36:38 [Vb Rep] Error replicating vbucket 299. Please see logs for details.">>,
<<"2014-11-16 19:36:43 [Vb Rep] Error replicating vbucket 205. Please see logs for details.">>,
<<"2014-11-16 19:36:48 [Vb Rep] Error replicating vbucket 227. Please see logs for details.">>,
<<"2014-11-16 20:26:19 [Vb Rep] Error replicating vbucket 175. Please see logs for details.">>,
<<"2014-11-16 20:26:25 [Vb Rep] Error replicating vbucket 221. Please see logs for details.">>,
<<"2014-11-16 21:16:40 [Vb Rep] Error replicating vbucket 293. Please see logs for details.">>,
<<"2014-11-16 21:16:40 [Vb Rep] Error replicating vbucket 251. Please see logs for details.">>,
<<"2014-11-16 21:17:06 [Vb Rep] Error replicating vbucket 270. Please see logs for details.">>]},
{changes_left,270},
{docs_checked,50418639},
{docs_written,50418639},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,261},
{time_working,1860159.788732},
{time_committing,0.0},
{time_working_rate,1.008940755729142},
{num_checkpoints,103426},
{num_failedckpts,87},
{wakeups_rate,6.50782891818858},
{worker_batches_rate,16.01927118323343},
{rate_replication,23.027702325898055},
{bandwidth_usage,933.1225464233472},
{rate_doc_checks,23.027702325898055},
{rate_doc_opt_repd,23.027702325898055},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1367.9901922012182},
{docs_latency_wt,16.01927118323343}],
[{type,xdcr},
{id,<<"e213600feb7ec1dfa0537173ad7f2e02/apro/apro">>},
{errors,
[<<"2014-11-16 20:16:39 [Vb Rep] Error replicating vbucket 647. Please see logs for details.">>,
<<"2014-11-16 20:17:31 [Vb Rep] Error replicating vbucket 619. Please see logs for details.">>]},
{changes_left,854},
{docs_checked,33371659},
{docs_written,33371659},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,318},
{time_working,2421539.8537169998},
{time_committing,0.0},
{time_working_rate,1.7382361098734072},
{num_checkpoints,102421},
{num_failedckpts,85},
{wakeups_rate,3.0038659755104824},
{worker_batches_rate,7.009020609524459},
{rate_replication,30.539304084356573},
{bandwidth_usage,1261.6237097144026},
{rate_doc_checks,30.539304084356573},
{rate_doc_opt_repd,30.539304084356573},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1997.2249284829577},
{docs_latency_wt,7.009020609524459}]]},
{memory,
[{total,752400928},
{processes,375623512},
{processes_used,371957960},
{system,376777416},
{atom,594537},
{atom_used,591741},
{binary,94783616},
{code,15355960},
{ets,175831736}]},
{system_memory_data,
[{system_total_memory,64552329216},
{free_swap,0},
{total_swap,0},
{cached_memory,27011342336},
{buffered_memory,4885585920},
{free_memory,12694065152},
{total_memory,64552329216}]},
{node_storage_conf,
[{db_path,"/data/couchbase"},{index_path,"/data/couchbase"}]},
{statistics,
[{wall_clock,{552959103,4997}},
{context_switches,{8592101014,0}},
{garbage_collection,{2034857586,5985868018204,0}},
{io,{{input,270347194989},{output,799175854069}}},
{reductions,{833510054494,7038093}},
{run_queue,0},
{runtime,{553128340,5090}},
{run_queues,{0,0,0,0,0,0,0,0}}]},
{system_stats,
[{cpu_utilization_rate,2.5316455696202533},
{swap_total,0},
{swap_used,0},
{mem_total,64552329216},
{mem_free,44590993408}]},
{interesting_stats,
[{cmd_get,0.0},
{couch_docs_actual_disk_size,21729991305},
{couch_docs_data_size,11673379153},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,30268090},
{curr_items_tot,60625521},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,11032659776},
{ops,116.0},
{vb_replica_curr_items,30357431}]},
{per_bucket_interesting_stats,
[{"ifa",
[{cmd_get,0.0},
{couch_docs_actual_disk_size,611617800},
{couch_docs_data_size,349385716},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,1020349},
{curr_items_tot,2039753},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,307268040},
{ops,0.0},
{vb_replica_curr_items,1019404}]},
{"apro",
[{cmd_get,0.0},
{couch_docs_actual_disk_size,21118373505},
{couch_docs_data_size,11323993437},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,29247741},
{curr_items_tot,58585768},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,10725391736},
{ops,116.0},
{vb_replica_curr_items,29338027}]}]},
{processes_stats,
[{<<"proc/(main)beam.smp/cpu_utilization">>,0},
{<<"proc/(main)beam.smp/major_faults">>,0},
{<<"proc/(main)beam.smp/major_faults_raw">>,0},
{<<"proc/(main)beam.smp/mem_resident">>,943411200},
{<<"proc/(main)beam.smp/mem_share">>,6901760},
{<<"proc/(main)beam.smp/mem_size">>,2951794688},
{<<"proc/(main)beam.smp/minor_faults">>,0},
{<<"proc/(main)beam.smp/minor_faults_raw">>,456714435},
{<<"proc/(main)beam.smp/page_faults">>,0},
{<<"proc/(main)beam.smp/page_faults_raw">>,456714435},
{<<"proc/beam.smp/cpu_utilization">>,0},
{<<"proc/beam.smp/major_faults">>,0},
{<<"proc/beam.smp/major_faults_raw">>,0},
{<<"proc/beam.smp/mem_resident">>,108077056},
{<<"proc/beam.smp/mem_share">>,2973696},
{<<"proc/beam.smp/mem_size">>,1113272320},
{<<"proc/beam.smp/minor_faults">>,0},
{<<"proc/beam.smp/minor_faults_raw">>,6583},
{<<"proc/beam.smp/page_faults">>,0},
{<<"proc/beam.smp/page_faults_raw">>,6583},
{<<"proc/memcached/cpu_utilization">>,0},
{<<"proc/memcached/major_faults">>,0},
{<<"proc/memcached/major_faults_raw">>,0},
{<<"proc/memcached/mem_resident">>,17016668160},
{<<"proc/memcached/mem_share">>,6885376},
{<<"proc/memcached/mem_size">>,17812746240},
{<<"proc/memcached/minor_faults">>,0},
{<<"proc/memcached/minor_faults_raw">>,4385001},
{<<"proc/memcached/page_faults">>,0},
{<<"proc/memcached/page_faults_raw">>,4385001}]},
{cluster_compatibility_version,196608},
{version,
[{lhttpc,"1.3.0"},
{os_mon,"2.2.14"},
{public_key,"0.21"},
{asn1,"2.0.4"},
{couch,"2.1.1r-432-gc2af28d"},
{kernel,"2.16.4"},
{syntax_tools,"1.6.13"},
{xmerl,"1.3.6"},
{ale,"3.0.0-1209-rel-enterprise"},
{couch_set_view,"2.1.1r-432-gc2af28d"},
{compiler,"4.9.4"},
{inets,"5.9.8"},
{mapreduce,"1.0.0"},
{couch_index_merger,"2.1.1r-432-gc2af28d"},
{ns_server,"3.0.0-1209-rel-enterprise"},
{oauth,"7d85d3ef"},
{crypto,"3.2"},
{ssl,"5.3.3"},
{sasl,"2.3.4"},
{couch_view_parser,"1.0.0"},
{mochiweb,"2.4.2"},
{stdlib,"1.19.4"}]},
{supported_compat_version,[3,0]},
{advertised_version,[3,0,0]},
{system_arch,"x86_64-unknown-linux-gnu"},
{wall_clock,552959},
{memory_data,{64552329216,51966836736,{<13661.389.0>,147853368}}},
{disk_data,
[{"/",10309828,38},
{"/dev/shm",31519692,0},
{"/mnt",154817516,1},
{"/data",1056894132,3}]},
{meminfo,
<<"MemTotal: 63039384 kB\nMemFree: 12396548 kB\nBuffers: 4771080 kB\nCached: 26378264 kB\nSwapCached: 0 kB\nActive: 31481704 kB\nInactive: 17446048 kB\nActive(anon): 17750620 kB\nInactive(anon): 2732 kB\nActive(file): 13731084 kB\nInactive(file): 17443316 kB\nUnevictable: 0 kB\nMlocked: 0 kB\nSwapTotal: 0 kB\nSwapFree: 0 kB\nDirty: 13312 kB\nWriteback: 0 kB\nAnonPages: 17753376 kB\nMapped: 14516 kB\nShmem: 148 kB\nSlab: 1297976 kB\nSReclaimable: 1219296 kB\nSUnreclaim: 78680 kB\nKernelStack: 2464 kB\nPageTables: 39308 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nWritebackTmp: 0 kB\nCommitLimit: 31519692 kB\nCommitted_AS: 19222984 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 114220 kB\nVmallocChunk: 34359618888 kB\nHardwareCorrupted: 0 kB\nAnonHugePages: 17432576 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugePages_Surp: 0\nHugepagesize: 2048 kB\nDirectMap4k: 6144 kB\nDirectMap2M: 63993856 kB\n">>}]     auto_failover001     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:53 - Sun Nov 16, 2014
Failed over 'ns_1@ec2-####108 -.compute-1.amazonaws.com': ok     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:53 - Sun Nov 16, 2014
Skipped vbucket activations and replication topology changes because not all remaining node were found to have healthy bucket "ifa": ['ns_1@ec2-####107.compute-1.amazonaws.com']     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:53 - Sun Nov 16, 2014
Shutting down bucket "ifa" on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' for deletion     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:49 - Sun Nov 16, 2014
Starting failing over 'ns_1@ec2-####108 -.compute-1.amazonaws.com'     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:48 - Sun Nov 16, 2014
Bucket "apro" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 0 seconds.     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:44 - Sun Nov 16, 2014
Control connection to memcached on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' disconnected: {{badmatch,
{error,
timeout}},
[{mc_client_binary,
cmd_vocal_recv,
5,
[{file,
"src/mc_client_binary.erl"},
{line,
151}]},
{mc_client_binary,
select_bucket,
2,
[{file,
"src/mc_client_binary.erl"},
{line,
346}]},
{ns_memcached,
ensure_bucket,
2,
[{file,
"src/ns_memcached.erl"},
{line,
1269}]},
{ns_memcached,
handle_info,
2,
[{file,
"src/ns_memcached.erl"},
{line,
744}]},
{gen_server,
handle_msg,
5,
[{file,
"gen_server.erl"},
{line,
604}]},
{ns_memcached,
init,
1,
[{file,
"src/ns_memcached.erl"},
{line,
171}]},
{gen_server,
init_it,
6,
[{file,
"gen_server.erl"},
{line,
304}]},
{proc_lib,
init_p_do_apply,
3,
[{file,
"proc_lib.erl"},
{line,
239}]}]}     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:44 - Sun Nov 16, 2014

#2

Can you please run cbcollect_info and grab all the details needed to debug the issue? That would greatly help us identify what's going on. From these short logs it looks like an auto-failover happened, but the root cause of why the node went down is not immediately visible to me. Do you have more information about what is happening on this node? In particular, "every weekend" sounds like a cron job, or some other kind of scheduled job, that shuts down or kills the node in question.
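To follow up on the cron-job hypothesis above, a few generic checks one might run on the affected node (a sketch only; log file paths and cron locations vary by distribution, and the timestamps would need to be compared against the failover times in the logs):

```shell
# List per-user and system-wide scheduled jobs that could fire on weekends
crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly

# Look for kernel OOM-killer activity around the failover window
# (on some distros the log is /var/log/syslog instead of /var/log/messages)
grep -i "killed process" /var/log/messages

# Check for reboots or shutdowns near the time of the auto-failover
last -x reboot shutdown
```

If any of these show activity near 21:19 on Sunday (when the control connection to memcached dropped), that would point at an external cause rather than Couchbase itself.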


#3

Ashwini - Thanks for reporting the issue MB-12696. Please run cbcollect_info (http://docs.couchbase.com/admin/admin/CLI/cbcollect_info_tool.html) to gather the logs and attach them to the ticket so we can investigate the issue. Thanks
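For reference, a minimal cbcollect_info invocation looks like the following (assuming a default Linux install under /opt/couchbase; the output filename is arbitrary, and the tool must be run separately on each node in the cluster):

```shell
# Run as root on EACH cluster node; writes a zip of logs and diagnostics
/opt/couchbase/bin/cbcollect_info /tmp/node-collect.zip
```

The resulting zip files can then be attached to the MB ticket.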