The Five Nines: How to Measure High Availability Uptime
Whether you are a seasoned professional or new to networking, you will inevitably need to pay attention to uptime and availability. These are some of the most critical metrics that customers, vendors, and competitors track. Unfortunately, these statistics are often misunderstood, and incorrect assumptions in these areas can cost quite a bit of money.
Looking at metrics in the wrong way can result in the dreaded "watermelon effect," where a service provider is meeting the identified thresholds (green on the outside) but providing a level of service lower than what the customer wants (red inside).
You might have heard of something called the Five Nines. This refers to a certain level of availability or uptime versus periods of unavailability or downtime. Let's dig into the definition of each of these metrics, how they're calculated, and what kind of uptime and availability your business might need.
What is the Difference Between Availability and Uptime?
Although uptime and availability are often used interchangeably, they refer to distinctly different concepts. Uptime is a measurement of system reliability and is typically expressed as the percentage of time the computer, server, or system has been working or ready for use. Availability, by contrast, is the probability that a system will work as required when needed during a mission period. Availability is even more important when most of your team is working remotely.
Uptime is a backward-looking metric. It accurately records how reliable the system has been over a certain period, which is typically a year, although it can vary. A sysadmin can reasonably infer that uptime is an indicator of availability, but it is by no means a guarantee. This is critical when evaluating service level agreements (SLAs) with a service provider. Guaranteed uptime is simply an affirmation of past performance, but it isn't an ironclad assurance of what will happen in the future.
Operational availability is calculated according to the following formula:
Operational availability = (OT + ST) / (OT + ST + TPM + TCM + ALDT)
In the equation above, the terms mean these things:
OT = Operating time per calendar year
ST = Standby time per calendar year
TPM = Total preventive maintenance time per calendar year
TCM = Total corrective maintenance time per calendar year
ALDT = Administrative and logistics delay time spent waiting for parts, maintenance personnel, or transportation per calendar year
Operating time (when a system is actively being employed) and standby time (when a system is ready but not actively being accessed) together equal the total amount of time the system is available for use. That sum is divided by the same sum plus the time lost to preventive maintenance, corrective maintenance, and administrative and logistics delays. The lower TPM, TCM, and ALDT are, the closer a system comes to 100% operational availability.
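The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample hours are assumptions made up for the example, not figures from a real system.

```python
def operational_availability(ot, st, tpm, tcm, aldt):
    """Return operational availability as a fraction between 0 and 1.

    All arguments are hours per calendar year:
    ot = operating time, st = standby time,
    tpm = preventive maintenance, tcm = corrective maintenance,
    aldt = administrative and logistics delay time.
    """
    available = ot + st
    total = available + tpm + tcm + aldt
    if total <= 0:
        raise ValueError("total time must be positive")
    return available / total


# Hypothetical year of 8,760 hours, 24 of them lost to maintenance and delays.
example = operational_availability(ot=6000, st=2736, tpm=12, tcm=8, aldt=4)
print(f"{example:.4%}")  # prints 99.7260%
```

Even a single day of combined maintenance and delay time per year leaves this hypothetical system short of three nines.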
What are the Five Nines?
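There can be infinite levels of availability, often described as The Nines. The most commonly cited is the Five Nines level, or 99.999%. Here are a few of the most frequently encountered nine levels, with the downtime each allows over a 365-day year:
99% (two nines) = 3.65 days
99.9% (three nines) = 8.76 hours
99.99% (four nines) = 52.56 minutes
99.999% (five nines) = 5.26 minutes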
“Five nines uptime” means that, over a year, a system has been operational or in standby 99.999% of the time, or all but about 5 minutes and 15 seconds within the full 365 days. This chart shows what each level means and how much downtime has been encountered over the previous year.
It’s easy to see how critical those numbers to the right of the decimal place become. A conventional server with a 99% uptime rate is still down for nearly 88 hours each year. Because the average cost to a business is $163,674 per hour of downtime, that can add up fast.
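The downtime behind each level is simple arithmetic, and a short Python sketch makes it easy to check any figure. The function names here are illustrative assumptions, the year is assumed to be 365 days, and the hourly cost is the figure quoted above; substitute your own estimate.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year
COST_PER_HOUR = 163_674    # hourly downtime cost quoted in the article


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year allowed at a given availability level."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


def format_duration(hours):
    """Render a number of hours as days, hours, minutes, and whole seconds."""
    total_seconds = round(hours * 3600)
    days, rem = divmod(total_seconds, 86400)
    hrs, rem = divmod(rem, 3600)
    mins, secs = divmod(rem, 60)
    return f"{days}d {hrs}h {mins}m {secs}s"


for level in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(level)
    cost = hours * COST_PER_HOUR
    print(f"{level}% -> {format_duration(hours)}, about ${cost:,.0f}")
```

At the quoted hourly cost, the 87.6 hours of downtime allowed by 99% availability works out to roughly $14.3 million per year.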
Why are the Five Nines Important?
This metric is most important when evaluating the level of SLA you need to sign with a service provider. The higher the number of nines, the less downtime risk you run; however, ensuring this level of availability requires more resources (e.g., a service provider having a staff member on standby 24/7/365), and the cost of those can quickly add up. If you instead sign an SLA with a guaranteed availability level that's too low, you'll save cash in the short term, but it could cost you a substantial amount in the long run.
A four-hour SLA response window is one you'll commonly encounter. However, this doesn't mean that an issue causing a website crash will be fixed in four hours; it means that the service provider must begin troubleshooting within that period. How long the problem takes to resolve is not guaranteed. You should also be aware that promised availability often refers only to equipment-driven malfunctions; downtime that results from human error, planned outages, or maintenance-related problems typically isn't covered by an SLA.
To add a final level of complication into the mix, SLAs often report uptime and availability statistics per device. If your company's ISP experiences a storage switch failure that prevents you from accessing the storage area network (SAN), the SAN's SLA guarantee isn't affected. From the service provider's perspective, the SAN is still operational, even if it isn't available.
At the end of the day, SLAs are really only great if they work. If they don’t, there are seldom real repercussions for the provider while the customer feels the pain. If customers don’t have backup or disaster recovery plans in place, any outage could have a disproportionate impact on their bottom line. Remuneration guaranteed in an SLA will rarely cover the loss an extended downtime period can entail.
How to Achieve the Five Nines
The two basic approaches to achieving the Five Nines revolve around equipment and personnel. A well-designed system uses load balancing, which spreads a workload over multiple machines with enough excess computing and processing power that the system can absorb a single failed component. If you have four servers operating at 25 percent capacity and one goes down, the system will automatically redistribute the load among the other three, so they'll run at about 33 percent capacity. High availability is therefore more of a design feature than a process: a system with good load balancing and spare capacity will deliver it.
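The four-server example can be checked with a minimal Python sketch. It assumes identical servers and an even split of the failed server's load, which is a simplification; real load balancers use a variety of distribution strategies, and the function names are hypothetical.

```python
def redistribute(loads, failed_index):
    """Spread the failed server's load evenly across the surviving servers.

    loads is a list of per-server utilization figures, in percent of capacity.
    Returns the new utilization of each surviving server.
    """
    survivors = [load for i, load in enumerate(loads) if i != failed_index]
    if not survivors:
        raise ValueError("no servers left to take the load")
    extra = loads[failed_index] / len(survivors)
    return [load + extra for load in survivors]


def has_headroom(loads, limit=100.0):
    """True if every server is still at or below the utilization limit."""
    return all(load <= limit for load in loads)


after_failure = redistribute([25.0, 25.0, 25.0, 25.0], failed_index=0)
print([round(load, 1) for load in after_failure])  # prints [33.3, 33.3, 33.3]
print(has_headroom(after_failure))                 # prints True
```

The same check shows why spare capacity matters: four servers running at 80 percent would each need to carry about 107 percent after one failure, so the system could not absorb the loss.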
Incident response management is more of a policy-based, personnel-centric concept. While uptime and availability are primarily determined by equipment not going down, they're also impacted by how quick the response is when components inevitably fail. Having an incident response plan is critical: when a system goes down in the middle of the night, the last thing you want to do is start figuring out who is responsible for what while you're losing money every hour.
You should ask your team and service providers the following questions:
Who is responsible for deciding to fail over to the backup server or system?
How will that person be notified? Do you have systems in place for phone, email, or text?
What will that person do if they need to ask others for help when responding to an incident?
How will they notify others on the team, the company, and any impacted vendors or providers?
What if the incident occurs during off-hours when IT staff are not near their desks?
Using automated tools can drastically increase the speed with which you can respond to an event. Although numerous paid tools are available on the market, you can also find several free options to cover system basics. If you have a business that requires an absolute minimum level of downtime, be sure to find the specific tool you need. If the system is relatively low-impact, however, free tools might be sufficient.
SolarWinds ipMonitor Free Edition is a scaled-down version of the company's paid, more powerful automated monitoring software. You can view a dashboard, review each device's status in the system, generate reports, centrally configure the system, and access THWACK, a user community for IT professionals dealing with SolarWinds software.
Zabbix is another free uptime monitoring tool. It's open-source and designed to monitor servers, applications, the network, and cloud services. The monitoring application can be customized for numerous industries, including retail, telecommunications, IT, marketing, and education. It conducts regular network scans and is one of the best all-in-one tools.
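At its simplest, an uptime figure comes from counting how many periodic health checks succeeded. The Python sketch below illustrates that idea only; it is not how ipMonitor or Zabbix work internally, and the function names and the one-minute check interval are assumptions.

```python
def uptime_percent(samples):
    """Percentage of health checks that succeeded.

    samples is a list of booleans, one per check, taken at a fixed interval.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    return 100 * sum(samples) / len(samples)


def estimated_downtime_minutes(samples, interval_minutes=1):
    """Rough downtime estimate: each failed check counts as one full interval."""
    return samples.count(False) * interval_minutes


# Hypothetical day of one-minute checks with a 12-minute outage.
day = [True] * 1428 + [False] * 12
print(f"{uptime_percent(day):.3f}%")                 # prints 99.167%
print(estimated_downtime_minutes(day), "minutes down")  # prints 12 minutes down
```

The shorter the check interval, the closer this estimate gets to the real downtime, since an outage that starts and ends between two checks goes unrecorded.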
How Can an IT Professional Measure High Availability Uptime?
Uptime reflects past performance and is a valuable indicator of future availability, but it isn't a guarantee. You now have the operational availability formula; the more accurately you can project maintenance requirements and logistical delays, the more precisely you can forecast downtime, which matters most when a cloud service you depend on goes down. This gives you an excellent idea of where to start with the load balancing you need to support high availability.
Use these metrics to determine what level of availability you should require of service providers in SLAs. Develop a comprehensive incident response plan both within your company and with external parties that impact your systems and networks to ensure you’re hitting the right targets. Putting all of this together intentionally and proactively sets your company up for success.