Q: Rebalance error on Operator (k8s)


#1

Hi,
I used the Operator to deploy a 3-node Couchbase Server cluster. The nodes ran normally for 16 days, but today I can't sync data from Couchbase Lite. When I logged in to the web console, the rebalance failed at 40% with this error:

Rebalance exited with reason {service_rebalance_failed,cbas, {badmatch, {error, {bad_nodes,cbas,get_agent, [{'ns_1@ydd-cbs-0002.ydd-cbs.default.svc', {exit, {{{case_clause, {error, {unknown_error, <<"failed_to_cancel">>}}}, [{service_agent,cancel_task,2, [{file,"src/service_agent.erl"}, {line,474}]}, {service_agent, '-cancel_tasks/2-fun-0-',2, [{file,"src/service_agent.erl"}, {line,484}]}, {lists,foreach,2, [{file,"lists.erl"},{line,1323}]}, {service_agent,cleanup_service,1, [{file,"src/service_agent.erl"}, {line,502}]}, {service_agent,do_handle_connection,2, [{file,"src/service_agent.erl"}, {line,331}]}, {service_agent,handle_connection,2, [{file,"src/service_agent.erl"}, {line,308}]}, {service_agent,handle_cast,2, [{file,"src/service_agent.erl"}, {line,192}]}, {gen_server,handle_msg,5, [{file,"gen_server.erl"}, {line,604}]}]}, {gen_server,call, [{'service_agent-cbas', 'ns_1@ydd-cbs-0002.ydd-cbs.default.svc'}, get_agent,infinity]}}}}]}}}}
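When triaging a failure like this, the useful facts are which service failed (`cbas` here, the Analytics service) and which node the orchestrator could not get a service agent from. A rough, regex-based sketch, assuming the reason string is pasted with straight quotes as above; `summarize_rebalance_error` is a hypothetical helper, not a Couchbase tool:

```python
import re


def summarize_rebalance_error(message: str) -> dict:
    """Heuristically pull the failing service, implicated nodes and the
    innermost error reason out of an ns_server rebalance error string."""
    # "{service_rebalance_failed,cbas," -> service name "cbas"
    service = re.search(r"service_rebalance_failed,\s*(\w+)", message)
    # Erlang node names look like 'ns_1@host'; keep first-seen order, no dupes.
    nodes = list(dict.fromkeys(re.findall(r"'(ns_1@[^']+)'", message)))
    # Binary reasons such as <<"failed_to_cancel">>.
    reason = re.search(r'<<"([^"]+)">>', message)
    return {
        "service": service.group(1) if service else None,
        "nodes": nodes,
        "reason": reason.group(1) if reason else None,
    }


# A trimmed copy of the reason string from the post above.
err = ("Rebalance exited with reason {service_rebalance_failed,cbas, "
       "{badmatch, {error, {bad_nodes,cbas,get_agent, "
       "[{'ns_1@ydd-cbs-0002.ydd-cbs.default.svc', {exit, {{{case_clause, "
       "{error, {unknown_error, <<\"failed_to_cancel\">>}}}, ...")
print(summarize_rebalance_error(err))
# {'service': 'cbas', 'nodes': ['ns_1@ydd-cbs-0002.ydd-cbs.default.svc'], 'reason': 'failed_to_cancel'}
```

Here it points at the Analytics service on ydd-cbs-0002 failing to cancel its tasks, which matches the pod the poster chose to delete.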

Based on the error, I deleted the ydd-cbs-0002 pod, and Kubernetes recreated it after 2 minutes. When I clicked Rebalance in Couchbase Server, the web console UI displayed this error:

Unexpected server error, request logged.

The Couchbase Server logs are below:

[ns_server:error,2019-03-09T03:38:10.081Z,ns_1@ydd-cbs-0000.ydd-cbs.default.svc:<0.22129.7>:menelaus_web:loop:143]Server error during processing: ["web request failed",
{path,"/controller/rebalance"},
{method,'POST'},
{type,exit},
{what,
{timeout,
{gen_fsm,sync_send_all_state_event,
[{via,leader_registry,ns_orchestrator},
{maybe_start_rebalance,
['ns_1@ydd-cbs-0000.ydd-cbs.default.svc',
'ns_1@ydd-cbs-0001.ydd-cbs.default.svc',
'ns_1@ydd-cbs-0002.ydd-cbs.default.svc',
'ns_1@ydd-cbs-0003.ydd-cbs.default.svc'],
,all}]}}},
{trace,
[{gen_fsm,sync_send_all_state_event,2,
[{file,"gen_fsm.erl"},{line,232}]},
{menelaus_web_cluster,do_handle_rebalance,
4,
[{file,"src/menelaus_web_cluster.erl"},
{line,737}]},
{request_throttler,do_request,3,
[{file,"src/request_throttler.erl"},
{line,59}]},
{menelaus_web,loop,2,
[{file,"src/menelaus_web.erl"},
{line,121}]},
{mochiweb_http,headers,5,
[{file,
"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
{line,94}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}]
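The failing call in this log is the `POST /controller/rebalance` that the web console sends, which timed out waiting on `ns_orchestrator`. The same rebalance can be started from a script through the Couchbase REST API (`knownNodes` lists every node in the cluster, `ejectedNodes` the ones to remove), which at least makes retries and timeouts explicit. A minimal sketch, assuming the default admin port 8091 and HTTP Basic auth; the host name and credentials are placeholders, and the request is only built here, not sent:

```python
from __future__ import annotations

import base64
import urllib.parse
import urllib.request


def build_rebalance_request(host: str, known_nodes: list[str], user: str,
                            password: str,
                            ejected_nodes: list[str] | None = None,
                            port: int = 8091) -> urllib.request.Request:
    """Build (but do not send) a POST /controller/rebalance request."""
    body = urllib.parse.urlencode({
        # Every node currently in the cluster (including any being ejected),
        # as otpNode names of the form ns_1@host.
        "knownNodes": ",".join(known_nodes),
        # Subset of knownNodes to remove; an empty string means none.
        "ejectedNodes": ",".join(ejected_nodes or []),
    }).encode()
    # Admin endpoints are assumed to accept HTTP Basic auth.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"http://{host}:{port}/controller/rebalance",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": f"Basic {token}",
        },
    )


# Node names copied from the maybe_start_rebalance entry in the log.
nodes = [f"ns_1@ydd-cbs-000{i}.ydd-cbs.default.svc" for i in range(4)]
req = build_rebalance_request("ydd-cbs-0000.ydd-cbs.default.svc", nodes,
                              user="Administrator", password="password")  # placeholders
# Sending needs a live cluster, so it is left commented out:
# urllib.request.urlopen(req, timeout=60)
```

This does not fix an orchestrator that is stuck, but a scripted call with an explicit timeout makes it easier to tell a hung request apart from a rebalance that started and then failed.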

This problem occurs frequently.

How can I resolve this issue?

Thanks!
angular


#2

Hi @angular,

Thanks for reporting this issue. We need some more information to debug it.

Thanks


#3

Hi @anil,

Thanks for your reply.

Where are you running the K8s cluster: on physical machines, VMs, or a public cloud?

  • Public cloud.

Did you follow the steps here for deploying a Sync Gateway cluster?

  • Yes, I followed the official setup step by step, and Sync Gateway (SG) is operating normally.

Can you run the cbopinfo tool to collect logs and attach them to JIRA issue K8S-906?

  • Sorry. When the error occurred, I tried to fail over, but it did not succeed. The project is in the development stage, so I deleted the Couchbase cluster and rebuilt it.
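For reference when failing over by hand: Couchbase Server exposes hard failover through its REST API as `POST /controller/failOver` with an `otpNode` parameter, and `GET /pools/default` lists the cluster's nodes. The sketch below assumes each node entry in that response carries `otpNode`, `status` and `clusterMembership` fields. It picks active members that are not reporting healthy and builds, without sending, one failover request per node. The host name and the sample JSON are hypothetical, and credentials still have to be added before sending:

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request


def unhealthy_active_nodes(pools_default: dict) -> list[str]:
    """otpNode names of active cluster members whose status is not 'healthy'."""
    return [
        node["otpNode"]
        for node in pools_default.get("nodes", [])
        if node.get("clusterMembership") == "active"
        and node.get("status") != "healthy"
    ]


def build_failover_request(host: str, otp_node: str,
                           port: int = 8091) -> urllib.request.Request:
    """Build (but do not send) a hard-failover request for one node.

    Authentication is deliberately left out; add a Basic auth header first.
    """
    body = urllib.parse.urlencode({"otpNode": otp_node}).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/controller/failOver",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical, trimmed-down GET /pools/default response.
sample = json.loads("""
{"nodes": [
  {"otpNode": "ns_1@ydd-cbs-0000.ydd-cbs.default.svc",
   "status": "healthy", "clusterMembership": "active"},
  {"otpNode": "ns_1@ydd-cbs-0001.ydd-cbs.default.svc",
   "status": "healthy", "clusterMembership": "active"},
  {"otpNode": "ns_1@ydd-cbs-0002.ydd-cbs.default.svc",
   "status": "unhealthy", "clusterMembership": "active"}
]}
""")

for otp in unhealthy_active_nodes(sample):
    req = build_failover_request("ydd-cbs-0000.ydd-cbs.default.svc", otp)
    print(req.full_url, req.data.decode())
```

Filtering on `clusterMembership == "active"` skips nodes that are already failed over, so re-running the script does not try to fail the same node over twice.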