A cluster of three nodes spread across three different AZs, all part of the same network (VPC).
Two Auto Scaling groups are used: one for the master (initial node) and one for the worker nodes (2 worker nodes).
This is an automated setup and runs on VMs in AWS, each machine being an m4.2xlarge.
Kong is running as an API gateway, and we access the UI by pointing Kong to a record set in a private hosted zone in Route 53.
The master node runs the index, query, and data services; the worker nodes run only the data service.
Initial setup is automated:
- A node is deployed and initialized.
- Worker nodes are then started and join the initial node.
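To make the join step concrete: one way a worker can be added is through the cluster manager's REST endpoint `/controller/addNode` on the initial node, with the node's services passed as `kv` (data), `index`, and `n1ql` (query). The following is only a minimal sketch, not the actual automation used here. The helper name `add_node_form`, the host name, and the credentials are hypothetical, and the function only builds the form body rather than sending it.

```python
from urllib.parse import urlencode

# Service names as used by the REST API: kv = data, index = index, n1ql = query.
MASTER_SERVICES = ["kv", "index", "n1ql"]
WORKER_SERVICES = ["kv"]


def add_node_form(hostname, user, password, services):
    """Build the form body for a POST to /controller/addNode.

    The request itself would be sent to a node that is already part of
    the cluster (here, the initial node). All argument values are
    placeholders supplied by the caller.
    """
    return urlencode({
        "hostname": hostname,            # node that should join
        "user": user,                    # admin user of the joining node
        "password": password,
        "services": ",".join(services),  # comma-separated service list
    })


# Hypothetical worker joining with the data service only.
worker_body = add_node_form("worker-1.example.internal", "Administrator",
                            "secret", WORKER_SERVICES)
```

A body like `worker_body` could then be posted with any HTTP client, using the cluster administrator's credentials.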
However, we are facing two issues:
- When the initial node goes down, the server comes to a halt. According to the documentation (https://developer.couchbase.com/documentation/server/4.0/architecture/high-availability-replication-architecture.html), if the master goes down, the remaining nodes are responsible for selecting a new leader and continuing. This does not seem to be the case at the moment.
We have a private hosted zone in Route 53 that we pointed to the master’s private IP. We tried the scenario above, and once the master went down we were unable to reach Couchbase.
We then updated our DNS entries to point to all nodes in the cluster so that, if the master node goes down, we should still be able to reach the other nodes. That fails as well. Is our understanding of leader election incorrect, or is this perhaps a configuration issue?
Also, should our DNS point to a single node in the cluster or to all three nodes, given that the SDKs are topology-aware and the vBuckets and cluster map would take care of getting the desired data from the cluster?
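To illustrate the multi-node option in the DNS question, here is a minimal sketch that probes a list of seed nodes on the admin port (8091) and builds a `couchbase://` connection string from all of them. It does not use the Couchbase SDK. The function names and host names are hypothetical, and a successful TCP connection only shows that a port is open, not that the node is healthy.

```python
import socket


def reachable_seeds(hosts, port=8091, timeout=2.0):
    """Return the hosts that accept a TCP connection on the given port."""
    alive = []
    for host in hosts:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=timeout):
                alive.append(host)
        except OSError:
            # Refused, timed out, or unresolvable: treat as unreachable.
            pass
    return alive


def connection_string(hosts):
    """Join several seed nodes into one couchbase:// connection string."""
    return "couchbase://" + ",".join(hosts)


# Hypothetical private DNS names for the three nodes.
SEEDS = ["cb-1.example.internal", "cb-2.example.internal", "cb-3.example.internal"]
bootstrap = connection_string(SEEDS)
```

Calling `reachable_seeds(SEEDS)` from a client machine while the master is stopped would show whether the two workers are reachable at the network level at all, which helps separate a DNS or routing problem from a cluster-side one.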
- After reviving the original master node, we tried adding the existing worker nodes back. When we did, Couchbase warned us that adding a node would remove all data on that node, and it did indeed wipe the data. Could this be happening because the master node is reinitialized, since it is part of an Auto Scaling group and its configuration runs via Ansible pull? If the original server that was used for setup goes down, can we just add another node to the cluster and rebalance to become operational again?
Being unable to reattach the existing nodes without losing data seems problematic.
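For reference, the "fail over the dead node, then rebalance" path asked about above corresponds to two REST calls on a surviving node: `/controller/failOver` and `/controller/rebalance`. The sketch below only constructs the requests and does not send them. The helper names and IP addresses are hypothetical, authentication is omitted, and the `ns_1@<host>` node naming is assumed here; on a real cluster the `otpNode` values should be read from `/pools/default`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def otp_name(host):
    """Assumed internal node name; verify against /pools/default."""
    return "ns_1@" + host


def failover_request(admin_host, failed_host):
    """Build a POST to /controller/failOver for the failed node."""
    data = urlencode({"otpNode": otp_name(failed_host)}).encode()
    url = f"http://{admin_host}:8091/controller/failOver"
    return Request(url, data=data, method="POST")


def rebalance_request(admin_host, known_hosts, ejected_hosts):
    """Build a POST to /controller/rebalance.

    known_hosts is expected to list every node the cluster currently
    knows about; ejected_hosts lists the nodes to remove.
    """
    data = urlencode({
        "knownNodes": ",".join(otp_name(h) for h in known_hosts),
        "ejectedNodes": ",".join(otp_name(h) for h in ejected_hosts),
    }).encode()
    url = f"http://{admin_host}:8091/controller/rebalance"
    return Request(url, data=data, method="POST")


# Hypothetical addresses: 10.0.0.1 is the lost master, 10.0.0.2 a surviving worker.
fo = failover_request("10.0.0.2", "10.0.0.1")
rb = rebalance_request("10.0.0.2", ["10.0.0.1", "10.0.0.2", "10.0.0.3"], ["10.0.0.1"])
```

Both requests are addressed to a worker rather than to the original master, which is the scenario in question: whether the remaining nodes can carry the cluster forward on their own.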
Any suggestions please?
Couchbase version: 4.6 Enterprise
Couchbase SDK: 2.3.7