Here is a proposal for a best practice for handling unscheduled VM shutdowns on Azure with a Couchbase cluster of 2 or more nodes.
Azure virtual machines are rebooted without a schedule when the host OS gets software updates. This can happen as often as once per month. The update workflow is described in the Windows Azure Host OS Updates excerpt quoted below.
Azure SLAs don't apply to individual VMs. To qualify for the SLA, Azure recommends using their load balancing system with public endpoints and multiple VMs, and using the 'availability sets' feature to make sure some subset of your VMs survives a host-OS upgrade. However, load balancing seems a bad match for Couchbase because:
- We want to use Azure's VPN/VLAN to keep Couchbase on a private network, with low latency between Couchbase and our application servers on the same VLAN. Those servers will be load balanced, but Couchbase will not.
- Couchbase smart clients (like the C# client) will surely get confused if they are accessing a load balancer with 1 public IP address.
- Couchbase already does its own load balancing by design.
(Windows Azure Host OS Updates) Each virtual machine hosting a Web or Worker Role receives a Stopping event, whereas VM Roles receive a standard Windows shutdown event. Worker, Web, and Virtual machine roles are allowed five minutes to respond to the stopping and shutdown event before they are forcibly stopped.
The proposed idea, which I am going to test out as soon as I have a chance, is:
- Use the Azure availability sets feature to ensure at least one node stays up during any host-OS upgrade.
- Five minutes is not enough time to remove a node and complete a rebalance, so failover is the only alternative for us.
- On each of the Couchbase nodes, install a shell script in /etc/init.d that responds to OS shutdown events within the 5 minute time frame allowed. Call it couchbase-failover-azure-hosting.
- couchbase-failover-azure-hosting will use the Couchbase CLI tools to check how many non-failed-over nodes are in the cluster. If there is at least one active node other than itself, it will fail over the current node immediately. If there is not at least one other active node, well, that should never happen in this scenario.
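A rough, untested sketch of what couchbase-failover-azure-hosting could look like. Several things here are assumptions on my part. The variables CLUSTER, CB_USER, CB_PASS and SELF are placeholders I made up. I am assuming `couchbase-cli server-list` prints one node per line as `otpNode host:port status membership`, and that `failover` accepts `--server-failover=HOST:PORT`. Both should be checked against the installed Couchbase version, since newer CLI releases may require an extra flag for a hard failover.

```shell
#!/bin/sh
# couchbase-failover-azure-hosting (sketch, not tested against a real cluster)
# All settings below are hypothetical placeholders; adjust for your cluster.
CLI=${CLI:-/opt/couchbase/bin/couchbase-cli}
CLUSTER=${CLUSTER:-127.0.0.1:8091}   # any reachable node of the cluster
CB_USER=${CB_USER:-Administrator}
CB_PASS=${CB_PASS:-password}
SELF=${SELF:-10.0.0.4:8091}          # this node as host:port, as server-list shows it

# Reads server-list output on stdin and prints how many nodes other than $1
# are healthy and active. Assumed line format:
#   ns_1@10.0.0.5 10.0.0.5:8091 healthy active
count_other_active() {
    awk -v self="$1" \
        '$2 != self && $3 == "healthy" && $4 == "active" { n++ } END { print n + 0 }'
}

# Fails over this node only if at least one other active node remains.
failover_self() {
    others=$("$CLI" server-list -c "$CLUSTER" -u "$CB_USER" -p "$CB_PASS" \
        | count_other_active "$SELF")
    if [ "${others:-0}" -ge 1 ]; then
        "$CLI" failover -c "$CLUSTER" -u "$CB_USER" -p "$CB_PASS" \
            --server-failover="$SELF"
    else
        echo "no other active node found; not failing over $SELF" >&2
        return 1
    fi
}

# init scripts are called with "stop" at shutdown; do nothing otherwise.
case "${1:-}" in
    stop) failover_self ;;
esac
```

For this to help at shutdown, the script's stop action has to run before couchbase-server is stopped and before networking goes down, so the ordering of the stop links in the runlevel directories matters.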
Any feedback or suggestions welcome!