AZ-900: High Availability

High Availability

Now that we have an understanding of elasticity and scaling from the AZ-900 Series Part 2: Scalability and Elasticity post, let’s talk about another benefit that cloud computing provides: high availability.

As part of the AZ-900: Azure Fundamentals exam, you are expected to understand the term high availability (HA). The name itself implies what HA is: how we make workloads highly available. To understand that, we first need to look at the various reasons outages happen and what mechanisms are in place to help reduce the impact on business applications.

Outage Examples:

 
 
  1. Server Hardware Issue
    Consider that you have deployed an operating system (e.g. Windows or Linux) via a virtual machine (VM). This VM consumes CPU and memory from the underlying physical host, which is racked in a datacenter. If that physical host has a catastrophic issue, the VM could come to a complete hard stop. Any users who were accessing the system would suddenly lose access until someone fixes the system and restarts the VM.

  2. Network Outage
    All applications at some point require network connectivity, whether that is an incoming web request over the public internet or internal networking used to communicate with a database server. A network outage restricts the ability of servers to communicate and therefore impacts users accessing the application.

  3. Application Failure - Configuration Changes or Software Bug
    In this scenario, someone has made a change to the system that results in an outage for the end user, or a software bug went undiscovered and becomes a problem down the road. The change could be a new software update, a software patch, or a change to other settings. The end result is an impact on the users accessing the system until the application failure can be rectified.

  4. External System
    In some cases, you do not control all aspects of your application. Applications often connect to third-party systems for data sets or other servers. If that third-party system is unavailable, it will have a knock-on impact on your application.

  5. Regional Outage – Disaster
    Regional outages, or even local datacenter outages, represent a complete disaster. While we hope the worst never happens, it’s still best to plan for it. The cause could be a smaller catastrophe, such as power or cooling issues that make the datacenter unusable, or a major disaster in the area (e.g. an earthquake or other natural event). In this scenario, having another datacenter available somewhere else in the world is essential. Disaster recovery will be covered in more detail under Business Continuity and Disaster Recovery (BCDR), but it is still important to think about it in the context of application availability.

Service Level Agreements – SLAs


With the potential for outages in mind, we then need to consider our goal for the application. Not all applications need to be available all the time, and in some cases the occasional outage is not going to impact the business. This is where SLAs come into play. The definition of an SLA is as follows: “A Service Level Agreement (SLA) is an agreement with the business and application teams on the expected performance and availability of a specific service.” (Source - Microsoft)

SLAs are measured in percentage (%) availability and are often described as a number of “9s.” For example, an application that has “5 nines availability” will be available 99.999% of the time. This still means that in a year, the application can be expected to be unavailable for about 5.26 minutes (0.001% of the 525,600 minutes in a 365-day year).

As you change the availability target, you can see the impact on the business in the chart below. It indicates the downtime you could expect under various SLAs.

[Chart: availability.PNG]
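
The downtime figures for a given availability target follow from simple arithmetic. The sketch below is one way to reproduce them; it assumes a 365-day year and a 30-day month, and the function names and the list of SLA values are illustrative choices rather than anything taken from the original chart.

```python
# Sketch: convert an availability percentage into allowed downtime.
# Assumes a 365-day year and a 30-day month; a real SLA defines its own
# measurement window, so treat these numbers as approximations.

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: int) -> float:
    """Return the minutes of downtime permitted over `days` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * MINUTES_PER_DAY


def format_duration(minutes: float) -> str:
    """Render a number of minutes in the largest unit that keeps it readable."""
    if minutes >= MINUTES_PER_DAY:
        return f"{minutes / MINUTES_PER_DAY:.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.2f} hours"
    return f"{minutes:.2f} minutes"


if __name__ == "__main__":
    print(f"{'SLA':>8}  {'per month':>14}  {'per year':>14}")
    # Illustrative availability targets, from "two nines" to "five nines".
    for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
        per_month = format_duration(allowed_downtime_minutes(sla, 30))
        per_year = format_duration(allowed_downtime_minutes(sla, 365))
        print(f"{sla:>7}%  {per_month:>14}  {per_year:>14}")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why moving from 99.9% to 99.99% is a much bigger commitment than the numbers suggest at first glance.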

Failures and Recovery


Having an SLA does not mean everyone is going to be satisfied when an outage occurs. There are a few additional metrics that you need to consider for when outages actually happen.

Mean Time to Recovery (MTTR) – This is the average time it takes to recover when an outage occurs; the lower, the better. You want your teams to be able to recover the service in the shortest possible amount of time.

Mean Time Between Failures (MTBF) – This is the average time that passes between one failure and the next, so it tells you how often failures occur. You want this metric to be as high as possible so that failures are not happening frequently.

Recovery Point Objective (RPO) – RPO refers to the amount of data loss that we are allowed to sustain, measured in time. For example, a 5-minute RPO means that in the event of a failure, up to 5 minutes of data could be lost.

Recovery Time Objective (RTO) – RTO refers to the expected time to recover the system in the event of an outage. Make sure you understand how it differs from MTTR. MTTR is the time recovery actually takes, which you measure over time. RTO is the business objective for that application.
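
To make the difference between measured metrics and objectives concrete, here is a minimal sketch that derives MTTR and MTBF from an incident log and then checks each outage against an RTO. The `Incident` class, the sample timestamps, and the 1-hour RTO are all hypothetical. MTBF is computed here as the average uptime between one recovery and the next failure, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage: when service was lost and when it was restored."""
    started: datetime
    resolved: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: the average outage duration."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean time between failures: average uptime between consecutive outages.

    Needs at least two incidents, since it averages the gaps between them.
    """
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def rto_breaches(incidents: list[Incident], rto: timedelta) -> list[Incident]:
    """Return the outages that took longer to recover than the objective allows."""
    return [i for i in incidents if i.resolved - i.started > rto]


# Hypothetical incident log: outages of 30, 90, and 30 minutes.
INCIDENTS = [
    Incident(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 30)),
    Incident(datetime(2024, 2, 9, 9, 30), datetime(2024, 2, 9, 11, 0)),
    Incident(datetime(2024, 3, 10, 11, 0), datetime(2024, 3, 10, 11, 30)),
]

if __name__ == "__main__":
    rto = timedelta(hours=1)  # hypothetical business objective
    print("MTTR:", mttr(INCIDENTS))
    print("MTBF:", mtbf(INCIDENTS))
    print("Outages over RTO:", len(rto_breaches(INCIDENTS, rto)))
```

In this sample the average recovery time sits under the 1-hour objective, yet one individual outage still exceeded it. That is the reason to track each outage against the RTO rather than relying on the average alone.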

Making Applications Highly Available

Now that you have an understanding of all the key terms and reasons for an outage, what are some of the mechanisms you can use to make applications more highly available?

Cloud computing helps increase availability through a number of mechanisms, which vary by the services used. Here are some examples that relate to the outage scenarios mentioned earlier:

VM Availability – Azure manages the underlying hosts that run the VMs. In the event a physical host has a failure, the VM will be automatically restarted on another host server. You can also choose to deploy multiple machines in a way that the VMs live on separate hosts, or in separate datacenters.

Regional Outages – Azure provides regions across the globe. Consider deploying workloads in more than one region to ensure availability.

Configuration/Software Changes – Consider using DevOps methodologies and CI/CD pipelines to deliver higher quality software and streamline deployments with less chance of human error.
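
A rough way to reason about why redundancy helps, and why dependencies hurt, is to combine availability figures. The sketch below uses two standard approximations: components that must all be up multiply their availabilities, and identical redundant copies fail together only if every copy fails. It assumes failures are independent, which is optimistic when copies share a host or datacenter. The function names and the example percentages are illustrative and are not Azure's published SLAs.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a chain in which every component must be up.

    Example: an application that needs both its web tier and a database.
    """
    return prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """Availability when at least one of `copies` identical instances must be up.

    Assumes independent failures; shared hosts, racks, or regions weaken this.
    """
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Illustrative figures only.
    single_vm = 0.999
    print(f"One VM:            {single_vm:.4%}")
    print(f"Two redundant VMs: {redundant_availability(single_vm, 2):.4%}")

    web_tier, database = 0.9995, 0.9999
    combined = serial_availability([web_tier, database])
    print(f"Web tier + database in series: {combined:.4%}")
```

The series result is always lower than the weakest component, which is why an unavailable third-party system drags the whole application down, while spreading copies across separate hosts, datacenters, or regions pushes the combined figure up.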

In closing
As you learn more about specific Azure services, you will quickly find that every Azure service has its own SLA based on various configurations. For example, you can choose database servers that have a higher SLA when needed, if the business is willing to pay the extra cost for the additional guaranteed uptime.

Planning to build a system around the SLA, RPO, and RTO goals is essential, and metrics like MTTR and MTBF can help you understand how well you are performing against those goals.

In the next post in this AZ-900 series, you will learn more about Fault Tolerance and Disaster Recovery.

 
 
