Incident Response: Responsibilities When On-Call
A fundamental understanding of network security is important for CompTIA Security+, AWS Security, and CISSP certification. Network security incidents can indicate that an organization’s systems or data have been compromised. In other cases, critical outages that render the business inoperable and cause extreme revenue loss may also warrant an incident response.
An IT professional's job may entail preventing, responding to, and recovering from these types of incidents. It pays to have a comprehensive incident response plan in place before a security incident occurs. Setting clear expectations for the employees as well as clear procedures to follow is incredibly valuable.
An incident response plan designates which employees will be on-call. When you are on-call, you may be contacted to investigate and fix issues as they arise. On-call duty can be organized in several ways. If incidents are a regular occurrence, an on-call rotation may be set up to avoid burnout among on-call employees. In other cases, a simple escalation matrix may be enough.
In this blog post, we will discuss best practices for setting up on-call responsibilities.
What is an Incident Response Plan?
An incident response plan addresses many issues that threaten daily work. These issues can include service outages, data loss, and cybersecurity incidents, just to name a few. The plan will usually specify the criteria for enacting it. For example, an end user having printing problems may not warrant an incident response. However, if that end user happens to be the CFO working late and pulling financials for a board meeting, it may. An incident usually has to have a large business impact to qualify.
What Does Being On-Call Mean?
Being on-call means that you may be contacted outside of your regular working hours if necessary. This could be at any time as incidents do not wait for convenient times. Your employer may also set an expectation that you respond and engage within a specific time frame. Once engaged, you will be requested to investigate and possibly resolve the issue. This may entail pulling team members into a conference bridge if necessary. It is usually expected that you are engaged for the issue's duration, even if you bring on others to assist.
In most environments, the expectation is to triage the issue so that critical business functions can continue safely. This allows teams to meet during regular business hours to discuss next steps and more permanent solutions.
What are the Benefits of Being On-Call?
Some organizations have a 24/7 business need, but unfortunately, it is not economical to fully staff 24/7. In other cases, the business need does not quite justify full-time after-hours staff. IT professionals will need to discover what their business needs are. On-call staffing is not unique to IT, either: it also exists in hospitals, electrical service companies, and plumbing businesses.
Having an appropriately sized and qualified on-call rotation allows quick resolutions without burning out the team. Larger organizations may split ownership of various systems across groups, such as network teams, server teams, and operating system teams, each with its own on-call rotation. On the other hand, smaller organizations may have employees who wear many hats in a much smaller rotation. Either way, it is important to ensure that whoever is on-call is appropriately trained to deal with most issues.
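A rotation like the one described above can be sketched as a simple round-robin schedule. This is only an illustration: the function name, the roster names, the start date, and the week-long shift length are all assumptions, and a real schedule would also need to handle swaps, holidays, and time zones.

```python
from datetime import date

def on_call_for(day, roster, rotation_start, shift_days=7):
    """Return who is on call on `day` in a simple round-robin rotation."""
    if not roster:
        raise ValueError("roster must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count how many full shifts have passed, then wrap around the roster.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return roster[shifts_elapsed % len(roster)]

# Hypothetical three-person roster with week-long shifts.
roster = ["alice", "bob", "carol"]
start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), roster, start))
```

With three people and week-long shifts, each person is on call one week in three. A smaller roster shortens the gap between shifts, which is where burnout becomes a concern.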
Some organizations will pay overtime (OT) for on-call to financially reward the team for after-hours work. This helps motivate them to continue to respond quickly. If on-call is frequent, it can be easy to burn out after a while, and the financial reward can help offset that. Some employees look for opportunities to increase their compensation, which may lead them to volunteer for on-call more often and lets those who prefer more free time avoid it.
In a post-mortem discussion of an incident, the team can review what happened, discuss best practices and improvements, and work out how to prevent the issue going forward. This is an important step so that issues do not recur.
Who are On-Call Responders?
The particular incident may determine the team of on-call responders that it is routed to. In many cases, some level of Tier 1 Customer Service is 24/7. If not, there may be an after-hours voice mailbox that is checked. These are usually the front-line workers who determine whether an incident warrants an on-call response and to which team it should be directed.
When we think of incident response, it is usually a DevOps, IT, or cybersecurity incident with a somewhat basic break/fix type issue. In rare cases, though, code could have been deployed that has logic issues, and either needs to be rolled back or pushed forward with correct code. In that case, a development or engineering team member may be engaged. In other cases, it could be a public relations incident or security breach, and the marketing/public relations team needs to be engaged to manage the incident publicly.
If the incident is serious enough, someone on the executive team may need to be engaged so that they are aware and able to make critical decisions that individual team members may not have the authority to make.
Product management team members may get pulled in if there are multiple outages. They can help minimize the impact and determine which customers or product lines should be prioritized when restoring services.
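The routing described in this section can be sketched as a small lookup. The category names, team names, and the fallback to IT operations for unknown categories are assumptions made for illustration; your incident response plan would define the real ones.

```python
# Hypothetical mapping from incident category to the team that responds first.
FIRST_RESPONDERS = {
    "break_fix": "it_operations",
    "bad_deploy": "engineering",
    "security_breach": "cybersecurity",
    "public_relations": "public_relations",
}

def teams_to_engage(category, serious=False, multiple_outages=False):
    """Return the teams to engage, in the order they would be contacted."""
    # Assumption: unknown categories fall back to IT operations.
    teams = [FIRST_RESPONDERS.get(category, "it_operations")]
    if category == "security_breach":
        # A breach may also need to be managed publicly.
        teams.append("public_relations")
    if serious:
        teams.append("executive")
    if multiple_outages:
        teams.append("product_management")
    return teams

print(teams_to_engage("security_breach", serious=True))
```

Keeping the routing in one table makes it easier for Tier 1 staff to follow and for the team to review during post-mortems.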
Who are On-Call Stakeholders?
On-call stakeholders are typically parts of the business that are not engaged in the team's active corrective measures but need to be kept apprised of the activities. Stakeholders also have to sign off on any internal incident response training programs.
Typically there will be an on-call member of the executive team. If there is a widespread business impact, they can help bring other departments together and mitigate high profile client complaints while the incident is being responded to. Legal may need to be involved in on-call in cases of data breaches or corporate sabotage.
If there is a large financial impact on the business, someone on that team may need to be alerted. This could be an issue with billing, accounting, or even an outage that results in financial loss or penalties paid to clients. For very large or important clients, the sales representative for those accounts may want to be aware to help manage expectations.
In cases of disaster, the incident may need to be escalated to HR and/or facilities staff. An example is a hurricane that brings down a primary location and makes it unsafe for employees to enter.
What are On-Call Responsibilities?
The responsibilities of those that are on-call can vary from organization to organization. Typically it does not entail a full resolution but just a triage of the issue in cases where there is a severe and unexpected business impact that needs to be addressed. This helps to allow business continuity.
Everyone who has on-call responsibilities is vital to the incident response plan. There are several steps that can help ensure an incident response plan will be successful.
On-Call Preparation: The Right Tools for the Job
When scheduled to be on call, it is vital to prepare. In most cases, the minimum requirement is to be accessible via cell phone so that you can be contacted should an incident occur. Most on-call roles also require a working internet connection and the necessary software and tools loaded onto the machine you will use to connect to the environment. This may mean having VPN software installed, VPN access granted, and your logins readily available. You also need working knowledge of the environment. Knowing which systems to connect to and how to access them is critical.
If there is a schedule, it is important to know when you will be on the on-call rotation so that you can plan your own schedule accordingly. If other teams have their own on-call rotations, be aware of those as well, so that you reach out to the right on-call person first when you need an escalation point.
On-Call Priorities: Triaging the Right Way
The incident response plan, and the training that goes with it, will help determine the first steps. There are several questions to consider when receiving an alert. Is this alert urgent, or is it informational and only requires monitoring? Can I address this incident by myself? Which team members do I need to keep updated on the progress?
If it does need to be acted upon, the affected systems and clients should be identified. This can aid in the determination of urgency. For example, if it only affects a minor reporting system, it may be able to wait until business hours and be snoozed until then. On the other hand, if it affects a large client's business function or critical internal business function, it likely needs to be addressed immediately.
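The triage questions above can be sketched as a small decision function. The `Alert` fields and the three outcome labels are hypothetical names chosen for this example, not part of any particular alerting tool.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    summary: str
    actionable: bool  # False means the alert is informational only
    affects_critical_function: bool = False
    affects_large_client: bool = False

def triage(alert):
    """Return a suggested next step for an incoming alert."""
    if not alert.actionable:
        return "monitor"
    if alert.affects_critical_function or alert.affects_large_client:
        return "respond_now"
    # Minor systems can usually wait, as long as the issue is tracked.
    return "snooze_until_business_hours"

print(triage(Alert("Minor reporting job failed", actionable=True)))
print(triage(Alert("Billing API down", actionable=True, affects_critical_function=True)))
```

The same alert can land in a different bucket depending on who or what is affected, which is why identifying the affected systems and clients comes first.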
If you attempt to address it but are unable to correct it, or unable to correct it quickly, you may need to escalate. This escalation may involve a higher tier of your department or a separate department more focused and trained on that particular system.
Communication is a key soft skill during these types of incidents. You could be addressing the issue very diligently, but other teams may not realize you have engaged the issue if no communication happens. That may cause an escalation or cause other teams to start addressing the incident. As part of the incident response plan, know which teams should be communicated with and how frequently.
On-Call Solutions: Empowerment to Fix the Problem
If you are on the on-call team and know the issue well, jump in and address it as soon as possible. Communicate that you are doing so. This can keep a team member who is less trained on the system from jumping in and possibly making the issue worse. In return, this type of attitude toward responding should encourage other team members to do the same when an incident affects their specialty area. This helps the issue get resolved correctly and much more quickly.
Know when to ask for help if you are unable to completely resolve the issue. It is best not to churn for hours on an issue you cannot resolve without notifying the team or escalation points. On the other hand, do not escalate too quickly, or your team will get frustrated with you. A proper incident response plan will help determine escalation timelines. In the absence of one, a simple conversation as part of the training can also help set expectations, but it is best to have them documented in some form.
An example of a time to ask for help is when you are troubleshooting a database issue that you are not familiar with. You spend 20 minutes researching but haven't found anything concrete that resolves it. Or, after 20 minutes, you may have found a solution but do not feel you understand the procedure well enough to carry it out, or you don't completely understand the solution's ramifications. This is a perfect time to escalate the issue if an escalation point is available.
For those scenarios that can wait, ensure you still track the issue. Nothing is worse than snoozing an issue, only to forget about it entirely and end up with a critical business impact. Typically, tracking involves logging the issue in a ticketing system if it is not already there and sharing it in team communication channels. Increasingly, those channels are chat applications rather than email.
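The escalation example above can be expressed as a time-boxed check. The 20-minute default mirrors the example and is illustrative only; the function and parameter names are assumptions, and the real threshold should come from your incident response plan.

```python
def should_escalate(minutes_spent, fix_found, confident_in_fix,
                    time_box_minutes=20, escalation_available=True):
    """Decide whether to escalate once the time box has run out."""
    if not escalation_available:
        return False
    if fix_found and confident_in_fix:
        # You understand the fix and its ramifications, so proceed.
        return False
    # No fix yet, or a fix you are not comfortable applying.
    return minutes_spent >= time_box_minutes

print(should_escalate(20, fix_found=False, confident_in_fix=False))
print(should_escalate(20, fix_found=True, confident_in_fix=False))
print(should_escalate(5, fix_found=False, confident_in_fix=False))
```

A time box guards against both failure modes mentioned above: churning for hours without telling anyone, and escalating before you have made a reasonable attempt.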
On-Call Reviews: Investigation and Post-Mortems
Post-mortems are extremely important. It may feel like over-discussing the issue, but holding them during business hours, after everyone has had time to reflect, can help prevent similar issues from occurring. Team members who were not on-call at the time may be able to join and add expertise that was originally unavailable. If the issue is unavoidable, a documented fix can be stored in a knowledge base for future reference.
Aside from that, it is important to discuss the good, the bad, and how to do better. The purpose of this meeting is not to assign blame but to be open and honest in order to ease the pain of a similar situation in the future.
On-Call Support: Continuous Improvement
Sometimes an incident may span multiple on-call rotations. In that event, it is important to hand over promptly and brief the incoming team on the details. The incident may be resolved, but customer response teams may still be fielding client questions in the next shift. It is extremely important to have continuity in the incident response between teams.
In the event that the incident is still ongoing, the next team needs to be brought up to speed. People have physical limits as to their ability to stay awake, and at a certain point, it is expected they should get relieved if the incident is ongoing through a few shifts. In an ideal environment with customer-facing support teams buffering the support issues, they will be provided regular updates so that customers can be in the know and have expectations set based on the latest information.
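One way to keep handovers consistent is to check that a handover note is complete before the outgoing responder signs off. This is a minimal sketch: the field names and the sample incident ID are hypothetical, and your plan may require different details.

```python
# Hypothetical fields an incoming responder would need to get up to speed.
REQUIRED_FIELDS = (
    "incident_id",
    "status",
    "actions_taken",
    "next_steps",
    "stakeholders_to_update",
)

def missing_handover_fields(note):
    """Return the required fields that are absent or empty in the note."""
    return [field for field in REQUIRED_FIELDS if not note.get(field)]

note = {
    "incident_id": "INC-0001",  # placeholder ID for illustration
    "status": "mitigated, monitoring",
    "actions_taken": ["failed over to secondary database"],
    "next_steps": [],
    "stakeholders_to_update": ["customer support"],
}
print(missing_handover_fields(note))
```

An empty result means the note covers the basics. Anything listed should be filled in before the outgoing responder logs off.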
What Are Not On-Call Responsibilities?
All good incident response plans have a backup. Personal emergencies happen from time to time. If you have to make a trip to the ER, you certainly cannot be expected to answer an on-call page while you are there. Conversely, if you are the backup responder and are contacted before anyone has tried the primary, make sure an attempt is made to reach the primary first.
While the ER scenario is extreme, there can be less extreme situations where "life happens," and you're unable to respond quickly. A good incident response plan accounts for that and has procedures in place to address it.
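The primary-then-backup order can be sketched as a loop that pages each responder in turn. Here `try_page` is a hypothetical callable standing in for whatever paging tool you use; it is assumed to return True when the person acknowledges the page in time.

```python
def page_in_order(responders, try_page):
    """Page responders in order and return the first who acknowledges, else None."""
    for person in responders:
        if try_page(person):
            return person
    return None

# Simulated paging: only the backup acknowledges in this example.
attempts = []

def fake_page(person):
    attempts.append(person)
    return person == "backup"

responder = page_in_order(["primary", "backup"], fake_page)
print(responder, attempts)
```

Because the list is ordered, the primary is always attempted before the backup. A `None` result means nobody acknowledged and the plan's next escalation step should apply.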
In conclusion, an incident response plan is a vital tool in your cybersecurity toolkit for maintaining business continuity in unexpected scenarios. At first glance, it can seem overly complicated and difficult to train for. However, it is crucial to train and to continuously improve the process. When an incident does happen, your team will be fully capable of following the protocols.
This ultimately leads to happy employees and happy customers. Customers are happy because your company addressed the issue with precision. Additionally, your employees will be happy because they know what to do in an unexpected situation.