Azure Getting Started - Planning for HA
If you are planning a highly available deployment, the following considerations are key.
Availability Sets
Availability Sets make use of two key concepts - Fault Domains and Update Domains. At its core, Azure consists of racks upon racks of servers. Each rack can host any number of virtual machines. When creating a highly available pairing, you want to be sure that there is no single point of failure, so that your workload will still be served by one virtual machine if the other is down or under maintenance. Unfortunately, if you do not specify otherwise, there is no guarantee that your VMs will not be placed on the same rack, or the same 'Fault Domain'. In essence, a fault domain can be considered a rack within Azure. Every VM on the rack is subject to that rack's power and network connections. A rack-wide failure or a rack-wide maintenance window will take down all VMs hosted on that rack. When Azure refers to a fault domain, consider each fault domain a single point of failure.
An Availability Set distributes highly available workloads across multiple Fault Domains, thereby eliminating any single point of failure. Unless the entire data center is down, your workload will keep running. In essence, your workload is split between two or more racks, each with its own redundant power supplies, network switches, and so on.
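To make the placement idea concrete, here is a minimal Python sketch. It is a simulation only and does not call any Azure API. The function names `place_round_robin` and `survivors_after_failure`, the VM names, and the round-robin placement itself are assumptions made for illustration.

```python
from collections import defaultdict

def place_round_robin(vm_names, fault_domain_count):
    """Assign each VM to a fault domain (rack) in round-robin order.
    Round-robin is an illustrative assumption, not Azure's documented algorithm."""
    placement = defaultdict(list)
    for index, vm in enumerate(vm_names):
        placement[index % fault_domain_count].append(vm)
    return dict(placement)

def survivors_after_failure(placement, failed_domain):
    """Return the VMs still running when one fault domain goes down."""
    return sorted(
        vm
        for domain, vms in placement.items()
        if domain != failed_domain
        for vm in vms
    )

vms = ["softnas-a", "softnas-b"]  # hypothetical VM names

# Without an Availability Set, both VMs may land on the same rack.
same_rack = {0: ["softnas-a", "softnas-b"]}
print(survivors_after_failure(same_rack, failed_domain=0))  # []

# With two fault domains, a single rack failure leaves one VM running.
spread = place_round_robin(vms, fault_domain_count=2)
print(survivors_after_failure(spread, failed_domain=0))  # ['softnas-b']
```

The first case shows the single point of failure described above: one rack outage leaves no VM to serve the workload. The second shows the same outage with the pair spread over two fault domains.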
Grouping VMs in an Availability Set also gives the Windows Azure Fabric Controller (FC) the information it needs to intelligently update the host operating systems that your guest VMs are running on. Without Availability Sets, the FC would have no idea that two machines were serving the same purpose and could reasonably take them both down for host OS updates at the same time.
An Availability Set also makes use of Update Domains, which limit how many VMs are down for planned maintenance at any given time. The VMs in the set are divided among the update domains, and Azure services only one update domain at a time. In the image below, we see an Availability Set with 16 virtual machines and four update domains, so each update domain holds four VMs. This means that a maximum of four VMs can be down for maintenance at a given time, allowing the other 12 to carry the load. Once the first four return to service, the next group becomes available for maintenance. In conjunction with Fault Domains, this allows an Availability Set to ensure that undue burden is not placed on any single rack.
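The arithmetic behind the 16-VM example can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Azure's actual scheduler: the helper names `assign_update_domains` and `max_vms_down` are hypothetical, and VMs are assumed to be spread evenly in round-robin order.

```python
def assign_update_domains(vm_count, update_domain_count):
    """Map each VM index to an update domain in round-robin order
    (an illustrative assumption about how VMs are spread)."""
    domains = [[] for _ in range(update_domain_count)]
    for vm in range(vm_count):
        domains[vm % update_domain_count].append(vm)
    return domains

def max_vms_down(domains):
    """Only one update domain is serviced at a time, so the largest
    domain bounds how many VMs are offline together."""
    return max(len(group) for group in domains)

domains = assign_update_domains(vm_count=16, update_domain_count=4)
down = max_vms_down(domains)
print(f"VMs per update domain: {[len(g) for g in domains]}")
print(f"Max down at once: {down}, still serving: {16 - down}")
```

With 16 VMs and four update domains, each domain holds four VMs, so at most four are offline together and 12 keep serving. Raising the update domain count shrinks each group and therefore the share of capacity lost during maintenance.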
When considering your use case, including the number of VMs you want to create and the number of Availability Sets you will need, remember that as a rule, you want one Availability Set per workload. A workload is any group of virtual machines working together towards a single common purpose. Two highly available SoftNAS VMs performing a single function would therefore constitute one workload.
Availability Zones
Availability Zones are a high-availability offering that protects your applications and data from datacenter failures. Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more datacenters equipped with independent power, cooling, and networking. To ensure resiliency, there’s a minimum of three separate zones in all enabled regions, and this physical separation is what protects applications and data when a single datacenter fails. Zone-redundant services replicate your applications and data across Availability Zones to protect against single points of failure. With Availability Zones, Azure offers an industry-best 99.99% VM uptime SLA. The full Azure SLA explains the guaranteed availability of Azure as a whole.
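As a rough illustration of what a 99.99% uptime figure implies, the short Python sketch below converts the percentage into a downtime budget. The function name `allowed_downtime_minutes` is hypothetical, and the 30-day month and 365-day year are simplifying assumptions; the actual SLA terms define how uptime is measured.

```python
def allowed_downtime_minutes(sla_percent, period_days):
    """Downtime budget, in minutes, implied by an uptime percentage.
    Assumes the whole period counts toward the measurement."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

print(round(allowed_downtime_minutes(99.99, 30), 2))   # 4.32
print(round(allowed_downtime_minutes(99.99, 365), 2))  # 52.56
```

Under these assumptions, 99.99% leaves roughly 4.3 minutes of downtime in a 30-day month, or just under an hour per year.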
An Availability Zone in an Azure region is a combination of a fault domain and an update domain. For example, if you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across three fault domains and three update domains. The Azure platform recognizes this distribution across update domains to make sure that VMs in different zones are not updated at the same time.
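The following Python sketch illustrates the point above: when VMs are spread across three zones, a failure or an update in one zone leaves the VMs in the other zones running. It is a simulation only. The helper names, VM names, zone labels, and round-robin placement are assumptions for illustration, not Azure behavior or API calls.

```python
def spread_across_zones(vm_names, zones=("1", "2", "3")):
    """Place VMs in zones round-robin; each zone then acts as both a
    fault domain and an update domain for the VMs inside it."""
    return {vm: zones[i % len(zones)] for i, vm in enumerate(vm_names)}

def running_during_zone_event(placement, affected_zone):
    """VMs unaffected when one zone fails or is being updated."""
    return [vm for vm, zone in placement.items() if zone != affected_zone]

# Hypothetical VM names; four VMs over three zones.
placement = spread_across_zones(["vm-0", "vm-1", "vm-2", "vm-3"])
for zone in ("1", "2", "3"):
    print(zone, running_during_zone_event(placement, zone))
```

Whichever zone is affected, at least two of the four VMs remain available, because no single zone holds the whole workload.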
An Availability Zone offers protection in case of a failure in any zone, but does not protect against a regional failure.