Making sense of a failover alert?

#1

Hi,

All of a sudden we are seeing a lot of "node failed over" alerts being generated by our cluster, but I am not sure what the problem is. What in the following alert message indicates the cause of the node failing over? Any insight would be useful:

Node ('ns_1@10.63.49.44') was automatically failovered.
[{last_heard,{1375,54648,503637}},
{outgoing_replications_safeness_level,
[{"bucket1",green},{"bucket2",green},{"bucket3",green}]},
{incoming_replications_conf_hashes,
{"bucket1",
[{'ns_1@10.63.49.250',15072779},
{'ns_1@10.63.49.254',43514702},
{'ns_1@10.63.52.163',20025066},
{'ns_1@10.63.52.248',102664139},
{'ns_1@10.63.52.31',14267695},
{'ns_1@10.63.55.240',102513850},
{'ns_1@10.63.55.243',45156294}]},
{"bucket2",
[{'ns_1@10.63.49.250',21779653},
{'ns_1@10.63.49.254',5083309},
{'ns_1@10.63.52.163',52116347},
{'ns_1@10.63.52.248',14958607},
{'ns_1@10.63.52.31',118257563},
{'ns_1@10.63.55.240',133609835},
{'ns_1@10.63.55.243',116082712}]},
{"bucket3",
[{'ns_1@10.63.49.250',21779653},
{'ns_1@10.63.49.254',5083309},
{'ns_1@10.63.52.163',52116347},
{'ns_1@10.63.52.248',14958607},
{'ns_1@10.63.52.31',118257563},
{'ns_1@10.63.55.240',133609835},
{'ns_1@10.63.55.243',116082712}]}]},
{active_buckets,
["bucket1","default","bucket2","bucket4","bucket5","bucket3"]},
{ready_buckets,["bucket1","default","bucket2","bucket5","bucket3"]},
{local_tasks,[]},
{memory,
[{total,1343428856},
{processes,980505240},
{processes_used,978908392},
{system,362923616},
{atom,1407441},
{atom_used,1382419},
{binary,31379232},
{code,14124508},
{ets,296201944}]},
{system_memory_data,
[{system_total_memory,16726757376},
{free_swap,3682918400},
{total_swap,4194295808},
{cached_memory,4681129984},
{buffered_memory,457162752},
{free_memory,6667489280},
{total_memory,16726757376}]},
{node_storage_conf,
[{db_path,"/app/couchbase/data"},{index_path,"/app/couchbase/data"}]},
{statistics,
[{wall_clock,{3345314361,5205}},
{context_switches,{32058508120,0}},
{garbage_collection,{2173250971,14990191674832,0}},
{io,{{input,903639115866},{output,1650455515445}}},
{reductions,{6321524540155,6628061}},
{run_queue,0},
{runtime,{520645560,480}}]},
{system_stats,
[{cpu_utilization_rate,31.75},
{swap_total,4194295808},
{swap_used,519237632}]},
{interesting_stats,
[{couch_docs_actual_disk_size,2669386425},
{couch_docs_data_size,967832988},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,368376},
{curr_items_tot,1104688},
{mem_used,1130789848},
{vb_replica_curr_items,736312}]},
{cluster_compatibility_version,131072},
{version,
[{public_key,"0.13"},
{lhttpc,"1.3.0"},
{ale,"8cffe61"},
{os_mon,"2.2.7"},
{couch_set_view,"1.2.0a-8352437-git"},
{mnesia,"4.5"},
{inets,"5.7.1"},
{couch,"1.2.0a-8352437-git"},
{mapreduce,"1.0.0"},
{couch_index_merger,"1.2.0a-8352437-git"},
{kernel,"2.14.5"},
{crypto,"2.0.4"},
{ssl,"4.1.6"},
{sasl,"2.1.10"},
{couch_view_parser,"1.0.0"},
{ns_server,"2.0.1-170-rel-community"},
{mochiweb,"1.4.1"},
{oauth,"7d85d3ef"},
{stdlib,"1.17.5"}]},
{supported_compat_version,[2,0]},
{system_arch,"x86_64-unknown-linux-gnu"},
{wall_clock,3345314},
{memory_data,{16726757376,10096570368,{<18445.1379.0>,34498960}}},
{disk_data,
[{"/",24190092,15},
{"/dev/shm",8167360,0},
{"/app",40316280,14},
{"/boot",198337,26},
{"/log",63897468,1},
{"/tmp",2015824,5},
{"/var",20154236,2}]},
{meminfo,
<<"MemTotal: 16334724 kB\nMemFree: 6537708 kB\nBuffers: 446448 kB\nCached: 4570544 kB\nSwapCached: 130760 kB\nActive: 4693900 kB\nInactive: 2587256 kB\nActive(anon): 1914016 kB\nInactive(anon): 390216 kB\nActive(file): 2779884 kB\nInactive(file): 2197040 kB\nUnevictable: 1999588 kB\nMlocked: 0 kB\nSwapTotal: 4095992 kB\nSwapFree: 3589088 kB\nDirty: 3216 kB\nWriteback: 0 kB\nAnonPages: 4156136 kB\nMapped: 49308 kB\nShmem: 0 kB\nSlab: 306040 kB\nSReclaimable: 244240 kB\nSUnreclaim: 61800 kB\nKernelStack: 2328 kB\nPageTables: 14772 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nWritebackTmp: 0 kB\nCommitLimit: 12263352 kB\nCommitted_AS: 5216376 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 303640 kB\nVmallocChunk: 34359424464 kB\nHardwareCorrupted: 0 kB\nAnonHugePages: 2088960 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugePages_Surp: 0\nHugepagesize: 2048 kB\nDirectMap4k: 10240 kB\nDirectMap2M: 16766976 kB\n">>}]

regards,
-Piyush

#2

Hello,

It is quite hard to give you an answer with only these log entries.

To help us debug this issue, the best approach is to run the cbcollect_info command on each node.

I invite you to follow the steps documented here:

http://www.couchbase.com/wiki/display/couchbase/Working+with+the+Couchbase+Technical+Support+Team

and let me know when you have uploaded the log files.
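As a rough sketch (the install prefix and output path below are assumptions; your nodes appear to use /app/couchbase rather than the default /opt/couchbase), the collection step on each node looks like:

```shell
# Run on every node in the cluster, typically as root so the tool
# can read system logs. Each run produces one zip archive to upload.
/opt/couchbase/bin/cbcollect_info /tmp/cbcollect-$(hostname).zip
```

The tool gathers the Couchbase logs, stats, and system information into the named zip file, one archive per node.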

Regards
Tug
@tgrall

#3

Tugdual,

I tried collecting stats from the nodes. This caused the node to stop responding, and the cluster failed it over. We ran into this problem on our production cluster; nodes going offline like this is not acceptable. Any suggestions?

regards,
-Piyush

#4

Please ignore my previous question. I figured out a way: I increased the failover timeout and got the stats. I will post them soon, as described in the other link you sent.
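For anyone following along, the auto-failover timeout can be raised through the cluster REST API; this is a sketch where the host, port, credentials, and the 120-second value are placeholders, not values from this cluster:

```shell
# Raise the auto-failover timeout (in seconds) so a node that is busy,
# e.g. while cbcollect_info runs, is not failed over prematurely.
curl -u Administrator:password \
  -d 'enabled=true&timeout=120' \
  http://10.63.49.44:8091/settings/autoFailover
```

The same setting can be restored to its previous value once the collection is finished.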

regards,
-Piyush

#5

Done. The files have been uploaded to the Amazon URL given in the doc you mentioned, under the walmart folder.

filename: cb_cluster_cbcollectinfo_files.zip

Also, since we are on 2.0.1, cbhealthchecker could not be executed.

Eagerly awaiting any findings you may have.

thanks,
-Piyush

#6

Piyush,

Sorry I missed your previous comment (an issue with my spam filter)…
Do you still have the issue?

Regards
t