AWS Disaster Recovery: Pilot Light, Warm Standby, Multi-site
Anyone who works in IT is familiar with Murphy's Law, which states: "Anything that can go wrong, will go wrong." Whether a hardware failure, human error, or system breach, failure is a fact of life in IT. However, downtime can be mitigated by using AWS.
AWS enables companies to switch critical operations to the cloud, thus reducing downtime. Throughout this article, we will discuss four different disaster recovery procedures. But first, let's talk about two metrics used to judge whether a disaster recovery plan is viable: RPO and RTO.
RPO vs. RTO
The four types of disaster recovery (DR) outlined by Amazon are Backup and Restore, Pilot Light, Warm Standby, and Multi-Site. Deciding which one to apply can be daunting. A good starting point is to settle on an acceptable RTO and RPO, i.e., an acceptable level of business continuity and an acceptable level of data recoverability, respectively.
RTO stands for Recovery Time Objective and RPO stands for Recovery Point Objective. (For those of you vying for the AWS Solutions Architect certification, pay attention, because these will be on the exam.) RPO is the point in time to which data must be recoverable; in other words, the maximum acceptable age of the most recent recoverable data, and therefore the most data the business is willing to lose.
For instance, does the business need the system's data as of 10 hours ago, 10 minutes ago, or 10 seconds ago? If the business says, "In the case of a disaster, we would like to have all data recovered as recently as one hour ago," then your RPO is one hour. Put another way, the business accepts that it may lose data produced in the final hour before the disaster, but nothing older than that.
RTO, on the other hand, refers to how long it should take for the backup system to become fully functional. Similar to our last example, if the business said it would accept one hour of downtime between the disaster striking and the recovery system taking over, that would be an RTO of one hour.
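To make the two definitions concrete, here is a minimal Python sketch that checks a single incident against an RPO and an RTO. The function names and the timestamps are made up for illustration; they are not part of any AWS API.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, disaster: datetime, rpo: timedelta) -> bool:
    """True if the data at risk (time since the last good backup) fits the RPO."""
    return disaster - last_backup <= rpo


def meets_rto(disaster: datetime, restored: datetime, rto: timedelta) -> bool:
    """True if the downtime (disaster until service is restored) fits the RTO."""
    return restored - disaster <= rto


# Hypothetical incident timeline.
disaster = datetime(2024, 1, 1, 12, 0)
last_backup = datetime(2024, 1, 1, 11, 15)  # 45 minutes of data at risk
restored = datetime(2024, 1, 1, 13, 30)     # 90 minutes of downtime

print(meets_rpo(last_backup, disaster, timedelta(hours=1)))  # True
print(meets_rto(disaster, restored, timedelta(hours=1)))     # False
```

In this made-up incident the one-hour RPO is met, because only 45 minutes of data is lost, but the one-hour RTO is missed, because recovery took 90 minutes.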
As you might expect, the shorter the RTO and the shorter the RPO, the more complex and expensive the backup system will be. So, before deciding on an appropriate DR approach, it is important to conduct a Business Impact Analysis. A Business Impact Analysis is simply a study on how important a particular system is to your day-to-day business, and how its downtime would affect the company.
It should answer questions such as how much money you would lose from downtime, whether or not the company's reputation would suffer, and how much capital should be invested in the system's DR operation.
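The money side of that analysis can be roughed out in a few lines. The sketch below is only a back-of-the-envelope comparison under assumed figures (the hourly loss, outage hours, and DR cost are all hypothetical), and it deliberately ignores harder-to-quantify factors such as reputation.

```python
def annual_loss(revenue_loss_per_hour: float, outage_hours_per_year: float) -> float:
    """Expected yearly revenue lost to downtime."""
    return revenue_loss_per_hour * outage_hours_per_year


def dr_is_worth_it(
    revenue_loss_per_hour: float,
    outage_hours_without_dr: float,
    outage_hours_with_dr: float,
    annual_dr_cost: float,
) -> bool:
    """True if the loss avoided by the DR setup exceeds what the setup costs."""
    avoided = annual_loss(revenue_loss_per_hour, outage_hours_without_dr) - annual_loss(
        revenue_loss_per_hour, outage_hours_with_dr
    )
    return avoided > annual_dr_cost


# Hypothetical system: $5,000 lost per hour, 24 hours of outage per year
# without DR, 1 hour with DR.
print(dr_is_worth_it(5_000, 24, 1, annual_dr_cost=60_000))   # True
print(dr_is_worth_it(5_000, 24, 1, annual_dr_cost=150_000))  # False
```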
Once the analysis is complete, you will be able to accurately determine which solution is best for your company. Let's take a look at the first method, Backup and Restore.
Backup and Restore
Backup and Restore is the first, and least expensive, disaster recovery method on the list. Of all the DR solutions, it has the most in common with traditional tape-based backup. The biggest difference is that instead of streaming data to tape, you stream it to an S3 bucket using a low-cost storage class such as S3 Standard-IA (Infrequent Access) or S3 Glacier. In terms of reliability, S3 is a safe bet for your most valuable data, because it is designed for "eleven 9's" (99.999999999%) of durability, with objects stored redundantly across multiple Availability Zones within a region. In other words, the chance of this data becoming irrecoverable is vanishingly small.
If your system goes down under this method of DR, the system administrator needs to restore the data onto the system from the S3 buckets. What if your company is currently on tape drives, and doesn't have any data in S3? This is where AWS Snowball comes into play.
AWS Snowball is a data transport service offered by AWS. It is a rugged physical storage appliance that is shipped to you and connected to your local network. After you have copied your data onto it (very large datasets, up to petabytes, can be spread across multiple devices), it is shipped back to an AWS facility and the data is loaded into S3 buckets. This is a great way to prime your system for a Backup and Restore DR solution.
In addition to AWS Snowball, Backup and Restore cannot be mentioned without bringing up AWS Storage Gateway. Storage Gateway is an excellent hybrid cloud tool that gives on-premises data storage solutions access to the cloud.
For example, let's say the employees of your company commonly place information on a shared drive. AWS Storage Gateway can be configured so that all of those files are stored as objects in an S3 bucket. That way, in the case of a disaster, all of that data will already have been transferred to the cloud.
Backup and Restore is a great option for those who are weaning themselves off of on-premises storage and hardware. It allows users to easily retrieve their data in case of a hardware failure. However, AWS provides far more substantial DR methods. Let's take a look at the next one, Pilot Light.
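As a rough illustration of the backup and restore steps, the sketch below uses the boto3 S3 client's `upload_file` and `download_file` calls. The bucket name, key, and file paths are placeholders, and the client is passed in rather than created here so the functions can be exercised without AWS credentials.

```python
def backup_file(s3_client, local_path: str, bucket: str, key: str) -> None:
    """Upload one file to S3 using the Standard-IA storage class."""
    s3_client.upload_file(
        local_path,
        bucket,
        key,
        ExtraArgs={"StorageClass": "STANDARD_IA"},
    )


def restore_file(s3_client, bucket: str, key: str, local_path: str) -> None:
    """Download one file from S3 back onto the recovering system."""
    s3_client.download_file(bucket, key, local_path)


# Example usage (assumes boto3 is installed and credentials are configured;
# the bucket and paths are hypothetical):
#
#   import boto3
#   s3 = boto3.client("s3")
#   backup_file(s3, "/data/orders.db", "example-dr-backups", "db/orders.db")
#   restore_file(s3, "example-dr-backups", "db/orders.db", "/data/orders.db")
```

Note that objects archived in a Glacier storage class generally have to be restored to a retrievable state before they can be downloaded, which adds to the recovery time.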
RTO/RPO: Up to 24 hours.
Pilot Light
Pilot Light strikes an excellent balance between affordability and reliability. One of the key differences between Backup and Restore and Pilot Light is that Pilot Light will always have core functionality running in the cloud. For instance, in Backup and Restore your data will be synced in an S3 bucket for retrieval in the event of a disaster. With Pilot Light, the data is synced with a database replica that is always on and ready to go.
Additionally, other core services would be provisioned and at the ready, such as an EC2 instance with all of the required software installed on it. These EC2 instances would sit in an Auto Scaling group with a scaling policy, so that once started they scale out to meet your production needs.
Another great feature of Pilot Light, and of the subsequent DR methods, is that it can be largely automated. In the AWS console, an administrator can create a health check that verifies the accessibility of a particular URL, for example, the homepage of the website you intend to back up. Then, in the event that this health check fails, the Pilot Light environment is switched on.
One quick way to do this is to have the failed health check trigger an SNS (Simple Notification Service) message, for example through a CloudWatch alarm, to which a Lambda function is subscribed. If the SNS message describes a health check failure, the Lambda function sends a command to start the Pilot Light EC2 instance.
So, the Pilot Light DR environment will have a stopped EC2 instance (or instances) that replicates your system's core functionality. One caveat is that this requires more maintenance than Backup and Restore, because the administrator has to keep the backup AMIs up to date with the latest software and specifications.
RTO/RPO: Tens of minutes.
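The same health check can be created programmatically. The sketch below assumes the check is an Amazon Route 53 health check and only builds the request for boto3's `create_health_check`; the domain, interval, and threshold are illustrative values rather than recommendations.

```python
import uuid


def health_check_request(domain: str, path: str = "/", failure_threshold: int = 3) -> dict:
    """Build keyword arguments for the Route 53 create_health_check call."""
    return {
        # Any unique string; it guards against creating the same check twice.
        "CallerReference": str(uuid.uuid4()),
        "HealthCheckConfig": {
            "Type": "HTTPS",
            "FullyQualifiedDomainName": domain,
            "Port": 443,
            "ResourcePath": path,
            "RequestInterval": 30,  # seconds between checks
            "FailureThreshold": failure_threshold,  # consecutive failures before unhealthy
        },
    }


# Example usage (assumes boto3 and credentials; the domain is hypothetical):
#
#   import boto3
#   route53 = boto3.client("route53")
#   route53.create_health_check(**health_check_request("www.example.com"))
```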
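A minimal sketch of such a Lambda function follows. It assumes the SNS message is the JSON body of a CloudWatch alarm notification (which carries a `NewStateValue` field) and that the instance IDs are known in advance; the ID shown is a placeholder. The optional `ec2_client` parameter is not part of the Lambda calling convention and is there only so the handler can be tried out locally.

```python
import json

# Hypothetical IDs of the stopped Pilot Light instances.
PILOT_LIGHT_INSTANCE_IDS = ["i-0123456789abcdef0"]


def is_failure(event: dict) -> bool:
    """True if any SNS record in the event reports an alarm in the ALARM state."""
    for record in event.get("Records", []):
        try:
            message = json.loads(record["Sns"]["Message"])
        except (KeyError, ValueError):
            continue  # not an SNS record, or the message is not JSON
        if isinstance(message, dict) and message.get("NewStateValue") == "ALARM":
            return True
    return False


def lambda_handler(event, context, ec2_client=None):
    """Start the Pilot Light instances when the health check alarm fires."""
    if not is_failure(event):
        return {"started": []}
    if ec2_client is None:
        import boto3  # available in the AWS Lambda Python runtime

        ec2_client = boto3.client("ec2")
    ec2_client.start_instances(InstanceIds=PILOT_LIGHT_INSTANCE_IDS)
    return {"started": PILOT_LIGHT_INSTANCE_IDS}
```

In a real deployment the function's execution role would also need permission to call `ec2:StartInstances`.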
Warm Standby
Warm Standby can be considered the older brother of Pilot Light. Warm Standby includes all functionality required for the system, as opposed to just the core services. With Warm Standby, a production-ready environment is always locked and loaded, but is scaled down significantly. This configuration saves on cost, but recovery takes longer than with a full-size duplicate, because the environment has to scale out before it can carry the production load.
In a Pilot Light scenario, only the data layer (for example, a database replica or a DynamoDB table) may be live, with the EC2 instances stopped until they are needed. In Warm Standby, however, everything is running, just at a much smaller capacity. This means the load balancer, gateways, databases, all subnets, and everything else are ready to go at a moment's notice.
Warm Standby is an excellent choice for business-critical solutions that require a rapid RTO, when the business does not want to sink a lot of money into the solution. If cost is no object, however, a Multi-Site solution may be the better option.
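When disaster strikes, the scaled-down Warm Standby fleet has to be grown to production size. One way to sketch that step, assuming the instances sit in an EC2 Auto Scaling group, is to build the request for boto3's `update_auto_scaling_group`. The group name, capacity, and headroom factor below are hypothetical.

```python
def scale_out_request(group_name: str, production_capacity: int, headroom: int = 2) -> dict:
    """Build keyword arguments that raise an Auto Scaling group to production size."""
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": production_capacity,
        "DesiredCapacity": production_capacity,
        # Leave room for scaling policies to add instances under load.
        "MaxSize": production_capacity * headroom,
    }


# Example usage (assumes boto3 and credentials; the group name is hypothetical):
#
#   import boto3
#   autoscaling = boto3.client("autoscaling")
#   autoscaling.update_auto_scaling_group(**scale_out_request("dr-web-asg", 10))
```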
Multi-Site
A Multi-Site operation, also known as Hot Standby, is a one-for-one replication of your production environment. It is a truly fault-tolerant system. As you can imagine, this is a very, very expensive operation. Let's take a look at a couple of reasons why.
One reason is that the EC2 instances are constantly scaled out and constantly on. This means that the company is paying for resources that it is not actually using. In addition, all Lambda functions would have to operate in real time for both environments, which would get expensive. Lastly, Multi-Site requires constant testing and configuration to ensure the operation is seamless in real-life situations.
All of this takes a lot of money, but some would argue that peace of mind is priceless. In the event of a disaster, all the administrator would have to do is switch the DNS records over to the second site and call it a day.
RTO/RPO: Real Time.
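The DNS switch itself can be scripted. The sketch below assumes the domain is hosted in Amazon Route 53 and builds the arguments for boto3's `change_resource_record_sets`; the hosted zone ID and host names are placeholders.

```python
def dns_failover_request(hosted_zone_id: str, record_name: str, dr_target: str, ttl: int = 60) -> dict:
    """Build keyword arguments that repoint a CNAME record at the DR site."""
    return {
        "HostedZoneId": hosted_zone_id,
        "ChangeBatch": {
            "Comment": "Fail over to the disaster recovery site",
            "Changes": [
                {
                    "Action": "UPSERT",  # create the record, or overwrite it if it exists
                    "ResourceRecordSet": {
                        "Name": record_name,
                        "Type": "CNAME",
                        "TTL": ttl,
                        "ResourceRecords": [{"Value": dr_target}],
                    },
                }
            ],
        },
    }


# Example usage (assumes boto3 and credentials; all values are hypothetical):
#
#   import boto3
#   route53 = boto3.client("route53")
#   route53.change_resource_record_sets(
#       **dns_failover_request("Z0000000EXAMPLE", "www.example.com", "dr.example.com")
#   )
```

A short TTL keeps the cutover quick, since resolvers may cache the old answer until it expires.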
AWS offers a DR solution to fit the needs of any company, whether it is a mom and pop shop or a Fortune 100 corporation. First, determine what your business considers an acceptable RPO/RTO, and then choose a DR solution from there.
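That selection step can be sketched as a simple lookup. The cut-off values below are illustrative assumptions loosely based on the figures quoted in this article, not official AWS thresholds, so adjust them to your own measurements.

```python
from datetime import timedelta


def suggest_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Suggest the cheapest DR approach likely to meet both objectives.

    The thresholds are illustrative assumptions, not AWS guidance.
    """
    tightest = min(rto, rpo)
    if tightest < timedelta(minutes=1):
        return "Multi-Site"
    if tightest < timedelta(minutes=30):
        return "Warm Standby"
    if tightest < timedelta(hours=24):
        return "Pilot Light"
    return "Backup and Restore"


print(suggest_strategy(timedelta(hours=1), timedelta(hours=1)))      # Pilot Light
print(suggest_strategy(timedelta(hours=48), timedelta(hours=24)))    # Backup and Restore
print(suggest_strategy(timedelta(minutes=5), timedelta(minutes=5)))  # Warm Standby
```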
To take full advantage of the cloud, at least use Pilot Light. If your company is just getting its feet wet with AWS, then Backup and Restore may be your best bet. Either way, you'll need to do your research to make an informed decision about disaster recovery procedures for your organization.