Incident Response: Setting Up an On-Call Plan
If you work in an organization with mission-critical systems, then you know the importance of setting up an incident response on-call plan. This is vital for limiting the impact of system outages such as network failures, server crashes, and anything else that could affect operations.
Having an on-call plan means that you have a pool of employees who are on-call and are scheduled to respond to issues after-hours. Typically, on-call people include IT admins, developers, cybersecurity specialists, and other teams or individuals.
There are challenges when trying to set up on-call teams, but none of them are insurmountable. We will take a look at some best practices that you should consider when setting up an on-call plan of your own.
Integrated Support Systems: Your On-Call Foundation
Your first line of defense in the battle against downtime is a good ticketing system. At the very least, it should be integrated into your helpdesk software. If you are running a helpdesk after hours, then your support staff will usually have to manually route tickets to on-call staff when there is an emergency.
Some systems don't require human intervention and can route tickets to the appropriate teams and departments and then message or robocall the engineer on standby. Whichever approach your system takes, all of your teams must have visibility across the system to access tickets generated by the helpdesk.
Another thing to consider when working with multiple teams is a working agreement. A working agreement outlines each team's responsibilities and clarifies who is responsible for each task. You can think of working agreements as a set of rules that determine how each team will act towards one another, how long they will take to accomplish certain tasks, and how to proceed when things don't go according to plan.
Working agreements are in place to ensure uniform behavior from all teams, which results in consistency from team members. Teams learn more about each other's processes, products, and procedures, resulting in higher levels of collaboration and synergy between teams.
Finally, you want to go over your on-call strategy with a fine-toothed comb, looking for any holes in your support workflows. If you find any shortcomings or gaps in your on-call map, you must be able to account for them, or better yet, fix them.
Mapping Your On-Call Setup
To get an effective on-call solution, you really need your incident managers to get involved. They need to consider how each piece of the puzzle fits in with the rest of the plan. Each team needs to be included when discussing what kinds of on-call support it will be offering.
Severity thresholds need to be established before your on-call staff is ever called on to take action. You must distinguish between emergencies and calls that can wait until the next business day. To do this, you will need to implement a severity structure within your ticketing and helpdesk software and train your staff so that they know what constitutes an emergency and what doesn't.
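A severity structure of this kind can be sketched in a few lines of Python. This is only an illustration: the four levels, the `PAGE_THRESHOLD` cutoff, and the `Ticket` class are assumptions made for the example, not features of any particular helpdesk product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; a lower number means more urgent."""
    SEV1 = 1  # critical outage: page on-call staff immediately
    SEV2 = 2  # major degradation: page on-call staff immediately
    SEV3 = 3  # minor issue: can wait until the next business day
    SEV4 = 4  # informational: can wait


# Assumed cutoff: SEV2 and anything more urgent counts as an emergency.
PAGE_THRESHOLD = Severity.SEV2


@dataclass
class Ticket:
    summary: str
    severity: Severity


def requires_page(ticket: Ticket) -> bool:
    """Return True if the ticket is urgent enough to contact on-call staff."""
    return ticket.severity <= PAGE_THRESHOLD
```

With this cutoff, `requires_page(Ticket("database unreachable", Severity.SEV1))` is true, while a SEV3 ticket is left for the next business day. Where the line sits is a decision for your incident managers, not for the code.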
Monitoring Tools for Incident Response
You also need to consider specialized tools to help you. Monitoring software is the cornerstone of any IT operation that requires a real-time response, so if you don't have anything in place, then you really should. There are tons of options out there with price points that vary from pricey to free and open source.
The real trick is to consolidate all of the data from your different monitoring solutions into a unified platform that sends out automatic alerts in emergencies and logs tickets with your support desk. The alerts can also go directly to your after-hours on-call support members so that everyone who needs to know about serious issues gets the message.
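As a rough sketch of what consolidation means in practice, the Python below converts payloads from two monitoring tools into one common alert shape and then decides what to do with each alert. The tool names, payload keys, and action names are all hypothetical; real monitoring products each define their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common shape that every monitoring source is converted into."""
    source: str
    host: str
    message: str
    critical: bool


def from_tool_a(payload: dict) -> Alert:
    # Hypothetical tool A reports severity as a string.
    return Alert("tool_a", payload["hostname"], payload["text"],
                 payload["level"] == "critical")


def from_tool_b(payload: dict) -> Alert:
    # Hypothetical tool B reports a numeric priority, with 1 the most urgent.
    return Alert("tool_b", payload["node"], payload["summary"],
                 payload["priority"] == 1)


def dispatch(alert: Alert) -> list[str]:
    """Return the actions to take: every alert is ticketed, critical ones also page."""
    actions = ["log_ticket"]
    if alert.critical:
        actions.append("page_on_call")
    return actions
```

Once every source is reduced to the same `Alert` shape, the ticketing and paging logic only has to be written once.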
To do this, you need to look at paging software. Again, you have many options, including email, text/SMS messages, direct calls to a cellphone, or even a robo-caller that gives details via voice. Most of these solutions have a scheduling system built into them, so you can manage and arrange your on-call teams from a single location. The less complexity, the better.
Obviously, prevention is better than cure, so you should also review the current state of your infrastructure and hardware. Do your servers have redundancy options like multiple power supplies or mirrored data stores?
How to Develop Effective Incident Response Processes
Any good on-call plan needs to have processes in place if it is to be effective. Think of these as the default instructions that you and your teams need to follow for different scenarios. Detailing each step in a document format is fine, but also think about adding value. Help your team process information more effectively by adding flow charts, diagrams, and organizational maps.
These should show things like dependencies of departments, services, and network details. Consider including anything else that will help on-call teams quickly assess the impact that an issue could have. Concentrate on your customers, the business, as well as your IT operations.
This means that a lot of team preparation and training needs to happen. Everyone needs to understand both the processes and the procedures required of them in an on-call situation. When onboarding new staff into the organization, they should also be given some exposure to what it means to be on-call, and what they can expect when they eventually help with those responsibilities.
Things like schedules, rosters, shifts, and rotations must all be available to your teams so that everybody knows when they are expected to be available after business hours. Life happens, so you need to think about structuring your system to accommodate unforeseen shift changes, shift swaps, and unavailable teams and staff. These things should be easy for the teams to administer for themselves and be authorized by their managers, with little to no intervention from senior management unless it is needed for compliance (for example, DoD 8140) or other regulatory requirements.
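To make the idea of rotations and swaps concrete, here is a minimal Python sketch of a weekly rotation with per-day overrides. The function name, the weekly cadence, and the `overrides` mapping are assumptions for illustration; most paging tools ship their own scheduler.

```python
from datetime import date


def on_call_for(day: date, roster: list[str], start: date,
                overrides: dict[date, str]) -> str:
    """Return who is on call on `day` for a weekly rotation with per-day swaps."""
    # An approved swap for this specific day wins over the regular rotation.
    if day in overrides:
        return overrides[day]
    # Otherwise rotate through the roster one week at a time from `start`.
    weeks_elapsed = (day - start).days // 7
    return roster[weeks_elapsed % len(roster)]


# Made-up example data: three engineers rotating weekly from a Monday.
roster = ["ana", "ben", "chi"]
start = date(2024, 1, 1)
swaps = {date(2024, 1, 10): "chi"}  # ben swapped this day with chi
```

Keeping swaps as a separate layer on top of the base rotation means a shift change never requires rewriting the whole schedule.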
Incident Response: Rules, Tools, and Protocols
Next, you should think about the correct escalation protocols for each department. Think about what happens when the standby engineer is not contactable. Who do you call next? Is there a backup scheduled, and are they available? If they are not reachable, is their manager available? Who do you contact next if nobody from that team is available? These things need to be discussed and agreed upon to make sure that you always have a clear escalation path for your incidents.
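The questions above amount to an ordered escalation chain. The Python sketch below walks such a chain until someone acknowledges. The `try_contact` callback is a stand-in for whatever paging mechanism you use, and the role names are hypothetical.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str],
             try_contact: Callable[[str], bool]) -> Optional[str]:
    """Walk the chain in order and return the first contact who acknowledges."""
    for person in chain:
        if try_contact(person):
            return person
    # Nobody was reachable; the plan should define what happens in this case.
    return None


# Hypothetical chain: standby engineer, scheduled backup, then managers.
chain = ["standby_engineer", "backup_engineer", "team_manager", "duty_manager"]
```

The `None` case deserves as much discussion as the chain itself, because it is the one situation the chain cannot resolve.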
Every alert and notification should follow a predefined set of rules. This ties back to the thresholds that we spoke about earlier. When a service stops running, or a host fails to respond to a ping, you should have a rule in place that defines how long the system should wait before it alerts your on-call staff. These rules should all be applied automatically through scripting so that they require no human intervention.
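A wait-before-alerting rule can be sketched as a small state machine. In the Python below, the `DelayRule` class and its fields are hypothetical; it fires once when a check has been failing continuously for `wait_seconds` and resets when the check recovers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayRule:
    wait_seconds: float  # how long a check must keep failing before alerting
    failing_since: Optional[float] = None
    alerted: bool = False

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one check result; return True when an alert should fire."""
        if healthy:
            # Recovery clears the failure window and re-arms the rule.
            self.failing_since = None
            self.alerted = False
            return False
        if self.failing_since is None:
            self.failing_since = now
        if not self.alerted and now - self.failing_since >= self.wait_seconds:
            self.alerted = True  # fire only once per continuous failure
            return True
        return False
```

Passing the timestamp in as `now` rather than reading the clock inside the method keeps the rule easy to test. A brief blip that recovers before the wait elapses never pages anyone, which is the point of the delay.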
Communication is another huge part of the on-call puzzle. In order to communicate effectively, you need to choose a common platform and stick to it. Many companies make the mistake of using too many different messaging apps, which can lead to missed messages and miscommunications. That's not to say that you should put all your eggs in one basket, though, and you should also have a backup in the event that your preferred platform should fail.
Issues that come up often and have similar solutions and outcomes should be documented and added to a runbook. A runbook is a series of scenarios, each giving a breakdown of a problem's symptoms, possible fixes, and ways to verify the fix once it has been implemented. There also needs to be effective reporting that shows your teams' stats and how well your current solutions are working. You also need to document as much as possible. This is crucial when staff are moved to different departments or move on to other roles. With adequate documentation, there is a continuity of processes and procedures via knowledge transfers.
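The runbook described above maps naturally onto a simple data structure. The Python sketch below is one possible shape, with a lookup by symptom. The field names and the matching logic are assumptions for illustration, and a wiki or document works just as well.

```python
from dataclasses import dataclass


@dataclass
class RunbookEntry:
    title: str
    symptoms: list[str]      # what the problem looks like
    fixes: list[str]         # possible fixes to try
    verification: list[str]  # how to confirm the fix worked


def find_entries(runbook: list[RunbookEntry], observed: str) -> list[RunbookEntry]:
    """Return entries with a symptom containing the observed text (case-insensitive)."""
    needle = observed.lower()
    return [entry for entry in runbook
            if any(needle in symptom.lower() for symptom in entry.symptoms)]


# Made-up example entry.
runbook = [
    RunbookEntry(
        title="Web server disk full",
        symptoms=["HTTP 500 errors", "Disk usage above 95%"],
        fixes=["Rotate and compress old logs"],
        verification=["Disk usage back under 80%", "Site returns HTTP 200"],
    ),
]
```

Whatever the format, every entry should carry the same three parts: symptoms, fixes, and verification.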
After an incident, you should have all of your teams sit down for a post-incident review. The focus should be on what went right, what went wrong, and how the problem can be prevented in the future. If the failure is unavoidable, then it needs to be looked at so that the response is effective and downtime is minimized. These reviews form part of the ongoing investigation and improvement work that keeps everything you have built up getting better.
What Should Be Monitored?
One of the most important things to address first is what your company needs to monitor. Monitoring everything is pointless because it creates too much noise, making it impossible to see urgent alerts when they come through. Instead, you need to monitor the most critical infrastructure, applications, and services within your organization. Think about an application or data source that you simply cannot operate without and start there. Next, think about the public-facing services you have, such as websites, login portals, and email. Without customer interactions, your company could be in serious trouble, so definitely consider investing time and effort into monitoring those systems.
Also, think about those services that are out of your control, like the messaging platforms we mentioned earlier. Developing relationships with these providers is beneficial because you can receive enhanced communications when services are down, or scheduled to go down for maintenance, and many other scenarios where outages are possible. This gives you time to manage your own customers' expectations and prepare them for any downtime or unavailability. Better yet, maybe you can set up an alternative before an outage occurs, making for a much better customer experience.
Measuring the Success of an On-Call Plan
Anything that is important to running a business must be measured, and your on-call plan is no exception. To prove value and efficacy, you need to gather data about the systems you have in place. The most obvious metric is downtime: compare the numbers from before your system was put in place with those from the months after it was implemented. This is a clear indicator of how things are or aren't working. You need to speak with management and set clear definitions about what will be measured, how it will be measured, and over what time frames the stats will be measured.
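The before-and-after downtime comparison is simple arithmetic once outages are recorded as start and end times. The Python sketch below uses made-up outage windows purely to show the calculation; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def total_downtime(outages: list[tuple[datetime, datetime]]) -> timedelta:
    """Sum the duration of each (start, end) outage window."""
    return sum((end - start for start, end in outages), timedelta())


def reduction(before: timedelta, after: timedelta) -> float:
    """Fractional reduction in downtime; 0.75 means 75% less downtime."""
    return (before - after) / before


# Made-up data for two comparable periods (not real measurements).
before = [(datetime(2024, 1, 5, 1, 0), datetime(2024, 1, 5, 3, 0))]
after = [(datetime(2024, 4, 9, 2, 0), datetime(2024, 4, 9, 2, 30))]
```

For the comparison to be fair, both periods need to be the same length and use the same definition of an outage, which is why the definitions agreed with management matter.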
Next, you have to figure out how to implement the system with minimal impact to the business and show which metrics you will be collecting and monitoring ahead of time. Once you have ironed out all of those details, then you can start with your planning phase.
Before you can start implementing the systems in a live environment, you need to make sure that everything will work as intended once you set everything up. There is only one way to iron out potential bugs, and that is by testing out all the elements of your on-call plan. This can happen during office hours so that all the teams that are supporting the plan are reachable and available to make any tweaks or changes while you go.
Once you are satisfied that tickets are routing correctly, notifications are being sent out, and all the other steps are working, then you can move to a proper after-hours test. You need to see how the on-call plan works on your network.
At this stage, it is important to get as much feedback from your team members as possible. They are going to be the ones relying on the system, so you need to get them to a place where they are confident that they won't miss any alerts and that the system will not be too intrusive when they are on-call. You will also need feedback from the executive team, your marketing and sales teams, and even your product managers and owners. You need to have a company-wide consensus about how the system works, what it can do for each department and team, and how it will ultimately improve productivity and stability for the business.
Once you are satisfied that everything is working as it should, then you can start to roll out your on-call plan!
So far, we have learned quite a few things about how to set up a proper on-call solution. In order to have a notification system that alerts your on-call staff, you need an integrated ticketing system that can receive alerts from your systems.
This means that you need to have a monitoring solution in place to check on system health and any other important metrics. There are many moving parts to an on-call solution, so you really need to do your homework in order to succeed.
It is important to map out your on-call plan and drill down into the issues you are trying to solve. Once you have everyone's input and approval, then you can really start with your preparations and get started with the process of implementation.
The more you prepare, the more likely you are to set the company up for success, even if disaster strikes.