Understand index behavior after recovering failed node


#1

I am using Couchbase 5.1 Enterprise with the indexer-level setting num_replica = 2.

Considering this, if an index node fails over and is subsequently recovered, what needs to be done to bring all index replicas on that node back online?
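For anyone reading later: replicas can be requested either through the indexer-wide setting I mentioned or per index at creation time, via a `WITH` clause on `CREATE INDEX`. A sketch through the query service REST endpoint (host, credentials, bucket, and field names are placeholders, not my actual schema):

```shell
# Ask the planner to maintain two extra copies of this index,
# spread across different index nodes.
curl -u Administrator:password http://localhost:8093/query/service \
  -d 'statement=CREATE INDEX idx_city ON `travel-sample`(city) WITH {"num_replica": 2}'
```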

What I observed:

  1. Only some of the index replicas came back online on the recovered node.
  2. Four indexes were lost entirely from all nodes.

I referred to the documentation, which says:

“Once an index has been created with a given number of replicas, if the number of index nodes in a cluster goes below the number of nodes required then new replicas will be created on any incoming index nodes, until the desired number of replicas exist for a given index.”

Is there anything additional I need to do in the event of a node failover, or should everything come back up as it was prior to the failover?


#2

@chetan, there is no user action required to recover indexes when a failed over node is recovered. Were the lost indexes created before upgrading to 5.1? Also, did the rebalance succeed after the recovered node was added back to the cluster?
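To verify which indexes and replicas came back after the rebalance, you can also query the index status REST endpoint on any cluster node (a sketch; host and credentials are placeholders):

```shell
# Lists every index, including replicas, with its current state
# (e.g. "Ready" or "Warmup") and which node hosts it.
curl -s -u Administrator:password http://localhost:8091/indexStatus
```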


#3

Thanks, Deepakaran.

The indexes were created after the upgrade, and the rebalance was successful after adding the node back.

The failover occurred because index maintenance ran longer than expected. I have now configured the maintenance job to abort if it exceeds a certain time.

Yes, the rebalance was successful. One issue I know about with this cluster (a development environment) is a lack of resources: the data seems to have grown beyond capacity. I suspect this is the larger problem I need to address.

One of the error messages in the log before the failover:

Service ‘indexer’ exited with status 137. Restarting. Messages:
2018-08-13T22:00:17.367-04:00 [Info] ServiceMgr::GetTaskList returns &{[0 0 0 0 0 0 0 3] []}
2018-08-13T22:00:17.555-04:00 [Info] ServiceMgr::GetTaskList [0 0 0 0 0 0 0 3]
2018-08-13T22:00:17.522-04:00 [Info] ServiceMgr::GetCurrentTopology [0 0 0 0 0 0 0 3]
2018-08-13T22:00:17.984-04:00 [Info] Indexer::getCurrentKVTs Time Taken 7.588163965s
2018-08-13T22:00:18.066-04:00 [Info] Indexer::getCurrentKVTs Time Taken 79.902958ms
2018-08-13T22:00:18.077-04:00 [Info] Indexer::getCurrentKVTs Time Taken 10.565385ms
2018-08-13T22:00:18.141-04:00 [Info] Indexer::getCurrentKVTs Time Taken 64.503614ms
2018-08-13T22:00:18.229-04:00 [Info] ClustMgr:handleStats &{92 false}
2018-08-13T22:00:18.247-04:00 [Info] cpuCollector: cpu percent 51.11111111111111 for pid 18288
2018-08-13T22:00:18.370-04:00 [Info] Plasma: Adaptive memory quota tuning minFreePercent:10, freePercent:4.14366309844021, currentQuota=1073741824
2018-08-13T22:00:20.349-04:00 [Info] janitor: running cleanup.
[goport(/opt/couchbase/bin/indexer)] 2018/08/13 22:00:22 child process exited with status 137
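For what it's worth, exit status 137 is 128 + 9, meaning the process died from signal 9 (SIGKILL), which is the signal the kernel OOM killer sends. This can be demonstrated in any POSIX shell:

```shell
# A process killed by SIGKILL reports exit status 128 + 9 = 137.
sh -c 'kill -9 $$'
echo $?   # prints 137

# On the affected node, OOM kills can be confirmed with:
#   dmesg | grep -i 'killed process'
```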


#4

Yes, it looks like you need to allocate more RAM quota to the index service. The indexer process is being killed by the OOM killer (exit status 137 indicates SIGKILL). If the process is repeatedly killed, the index information from that node may not be displayed in the UI.
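Assuming memory pressure is the cause, the index service quota can be raised through the cluster REST API (a sketch; the host, credentials, and the 4096 MB value are placeholders to adjust for your hardware):

```shell
# Raise the index service memory quota; the value is in MB
# and must fit within the node's available RAM.
curl -u Administrator:password -X POST \
  http://localhost:8091/pools/default \
  -d 'indexMemoryQuota=4096'
```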