By David Brown
How to Properly Do Disaster Recovery Testing
You never know what terrible things might happen. In 2011, the Centers for Disease Control and Prevention (CDC) ran a preparedness campaign built around, of all things, a zombie apocalypse. While it's doubtful that anyone at the CDC feared a real zombie apocalypse, the campaign made a serious point: we should be ready for anything.
An IT department is integral to disaster preparedness. Most of a company's business processes depend on its IT infrastructure, which is why evaluating components, architecture, and personnel is a major part of disaster recovery testing.
What's clear is that you can't rely wholly on the machines. How an IT professional acts and reacts before, during, and after a disaster is as important as any machine testing.
Tabletop Testing for Disaster Recovery
Short of an all-out disaster recovery drill, one of the best places to test for disaster preparedness is in the boardroom. This kind of walkthrough of potential disaster scenarios is called a tabletop test. It brings together the people who would be involved if a real disaster occurred. The purpose is to talk about what might happen and how each person or department would respond.
In a video explaining tabletop testing, the disaster recovery firm Databarracks compares this role-playing activity to a game of Dungeons & Dragons. For those who missed out, D&D is a game of heroic fantasy where players assume the roles of their daring characters. Its similarity to tabletop testing is in the way participants engage with each other through a series of challenging scenarios.
The players in the tabletop testing "game" may include a facilitator, department heads, vendor representatives, a note-taker, and sometimes emergency services personnel or interested observers. Because of the critical nature of IT processes, a CIO or IT manager might be the best person for the facilitator job. And it would be a good idea to bring in experts who handle sensitive databases or applications.
For any given tabletop test, you will need the right people, but not necessarily the "best" people. Most organizations have experts and long-serving staff whose institutional knowledge can get you through any emergency. But what happens if those people are not there? Sometimes pulling in less experienced people can make your test more challenging and realistic.
To conduct your tabletop test, you will need a good setup. Find a meeting room big enough for everyone and their things. Decide on a scenario, such as a power outage or a weather event. Define the scope of the disaster. (One server? One room? One data center?) Finally, get your "battle box" together, which should include a number of essentials:
- Relevant documents and building plans
- Contact lists
- Inventory of data and applications
- IT asset management list
- Methods of procedure for potential IT actions
- Authorization codes
- Encryption keys
- Your business continuity/disaster recovery (BC/DR) plan
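Before the meeting, it can help to confirm that nothing is missing from the battle box. The following is only a minimal sketch of that check: the `TabletopTest` class, its field names, and the short item labels are hypothetical names chosen for illustration, and the sample scenario and scope are made up.

```python
from dataclasses import dataclass, field

# Essentials from the list above. The short labels are our own shorthand.
BATTLE_BOX_ESSENTIALS = [
    "documents_and_building_plans",
    "contact_lists",
    "data_and_application_inventory",
    "it_asset_list",
    "methods_of_procedure",
    "authorization_codes",
    "keys",
    "bcdr_plan",
]


@dataclass
class TabletopTest:
    """Hypothetical record of one tabletop test setup."""
    scenario: str                      # e.g. "power outage"
    scope: str                         # e.g. "one server", "one data center"
    battle_box: set = field(default_factory=set)

    def missing_items(self):
        """Return the essentials not yet in the battle box, in list order."""
        return [item for item in BATTLE_BOX_ESSENTIALS
                if item not in self.battle_box]


plan = TabletopTest(
    scenario="power outage",
    scope="one data center",
    battle_box={"contact_lists", "bcdr_plan"},
)
print(plan.missing_items())
```

An empty result means the box is complete; anything else is a to-do list for the facilitator.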
As you and your team walk through your chosen scenarios, you are playing a game of pretend. What would each department do, and in what order? Does the current BC/DR plan cover everything it needs to? What else can you do to make the plan more effective? A tabletop test is an excellent way to assess disaster readiness without ever touching a piece of hardware or software.
Simulation Testing: Taking Your Recovery Plan for a Spin
Of course, there's nothing like putting your disaster recovery plan through its paces with an active simulation test. This involves more than talking about disaster. In simulation testing, you turn systems on or off, introduce problems, and stress your IT infrastructure, all without affecting live production environments.
In planning your simulation testing, you will again need to determine the scope. Are you going to test a single server or application, or do you want to assess everything? Do you want to involve the whole company, a single department, or just a few people in your test?
Another consideration is the scenario. We all know about IT-related issues that could be a disaster for your company, such as a DDoS attack, a database failure, or a connection loss. But what about all the other things that could (God forbid) threaten your IT architecture?
You'll want to include a lot of possible scenarios in your disaster recovery plan. And you'll want to simulate them with testing. Here are a few examples from ready.gov:
- Active shooter
- Hazardous materials incidents
- Power outage
Running a simulation is like playing a game of pretend. Ask yourself, "What would happen if...?" Like any preparedness drill, it should be taken seriously. When a real disaster strikes, you want to be ready.
Assessing Results: RTO, RPO
Once you've completed your testing, you'll want to know how well you and your team did. Did everyone do what they were supposed to do? Did the equipment and software behave as expected? What were the major and minor problems that you encountered during your test? What went well? What are the areas for improvement?
Of course, you can do more than make subjective statements. In the mature field of business continuity, there are sophisticated measurements that you can use to determine levels of success. Let's have a look at a couple of those metrics.
Atlantic.net gives us two terms used in DR testing measurement, and we'll quote them here:
- Recovery Time Objective (RTO): "the maximum tolerable time allowed to recover client systems after a disaster scenario has been declared."
- Recovery Point Objective (RPO): "the measure of the maximum acceptable data loss recorded by time, or the maximum allowed age of the data when recovering a client's system."
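It may be difficult to grasp these definitions in one reading. But those involved in DR testing and measurement need to understand them thoroughly and measure test results against them. For example, with an RTO of four hours and an RPO of 15 minutes, a test passes only if systems are restored within four hours of the disaster being declared and no more than 15 minutes of data is lost. Any disaster recovery plan should meet the organization's RTO/RPO requirements, and other key performance indicators (KPIs) will usually come into play as well.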
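To make the two objectives concrete, here is a rough sketch of how the timestamps recorded during a test could be compared against them. The function name, its parameters, and every time and target below are hypothetical. It also assumes that the data-loss window runs from the last usable backup to the moment of failure, which is one common reading of RPO rather than the only one.

```python
from datetime import datetime, timedelta


def evaluate_recovery(failed_at, declared_at, restored_at, last_backup_at,
                      rto, rpo):
    """Compare one test run against RTO and RPO targets.

    Recovery time is measured from the disaster declaration, following
    the RTO definition quoted above. The data-loss window is the age of
    the newest recoverable data at the moment of failure (an assumption).
    """
    recovery_time = restored_at - declared_at
    data_loss_window = failed_at - last_backup_at
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical timeline of a simulated database failure.
result = evaluate_recovery(
    failed_at=datetime(2024, 1, 15, 9, 0),
    declared_at=datetime(2024, 1, 15, 9, 10),
    restored_at=datetime(2024, 1, 15, 12, 40),
    last_backup_at=datetime(2024, 1, 15, 8, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=15),
)
for name, value in result.items():
    print(f"{name}: {value}")
```

In this made-up run, recovery took three and a half hours, inside the four-hour RTO, but the last backup was 30 minutes old at the time of failure, so the 15-minute RPO was missed.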
Part of the measurement process may be a checklist. If there are 100 required actions on a disaster recovery checklist and your team completes all but two, your score is 98%. This is only one of many ways to measure DR testing success.
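The checklist arithmetic is simple enough to sketch in a few lines. The function name `checklist_score` is our own, and a real scoring scheme might weight critical actions more heavily than this equal-weight version does.

```python
def checklist_score(completed, required):
    """Return the percentage of required checklist actions completed."""
    if required <= 0:
        raise ValueError("required must be a positive number of actions")
    if not 0 <= completed <= required:
        raise ValueError("completed must be between 0 and required")
    return 100.0 * completed / required


# 100 required actions, all but two completed.
score = checklist_score(completed=98, required=100)
print(f"{score:.0f}%")
```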
Final Thoughts: A Complete Continuity Plan
Any disaster recovery test should take into consideration timing, impact, and people. Evaluate your team's response to the loss of data, applications, or connectivity. What is the impact on SLAs and license agreements? How quickly can your IT engineers reboot systems and restore them to full function? The whole exercise should be an integral part of the organization's business continuity plan.
There's always a chance for things to go terribly wrong, and DR testing is a way to prepare for it. The alternatives to careful disaster planning and preparation can be devastating. DR testing can minimize the impact of a disaster and potentially save your company from financial or reputational ruin.