Technology | Networking - Graeme Messina
5 Types of Network Failures and How to Prevent Them
In many ways, your network is like a stack of dominoes. Services, hardware, and other components are carefully connected and can be vulnerable to failure. As is often the case, it pays to be aware of what could happen and plan accordingly. Depending on how critical your IT component is, you need to build in redundancy to guard against failure, outage, or disaster.
Networks are complicated, and the number of things that can go wrong is staggering. We have compiled a list of 5 network failure types that you might encounter, what causes them, how to fix them, how they impact operations, and (hopefully) how to prevent them.
Type 1: Resource Failures
What is a resource failure? You can think of anything related to your network's infrastructure as a network resource. This includes hardware like routers and switches, services like DHCP and DNS servers, or anything else that keeps the data flowing on your network. A failure of any of these is a resource failure, and so is the loss of the data links or the electrical supply to the data center.
This is quite a broad area that can be affected, and the number of events that can act upon your network is also very high. Let's look at some possible resource failures you should definitely prepare for in your Business Continuity Plan.
Example: Power Grid Blackout
If you live in an area where the power supply is stable, it is hard to imagine going without power for an extended period. But unexpected events happen from time to time that can seriously impact the electricity supply. A power substation or transformer that fails can sometimes cause unexpected power grid blackouts, taking the entire area's power offline.
When such events do happen, there are knock-on effects other than your own business's power problems. If you have an upstream service provider that doesn't have sufficient backup power like a generator or inverter set up, then some of your online services might not work as expected. If the blackout continues for a few days, you will find that many companies simply cannot continue operating in that area.
Preventing such impacts can take many different forms. The first and most obvious is to have long-term power generation capacity. Most data centers have diesel generators that can run for long periods, with no impact on the business when power is cut. To bridge the gap between the moment the main power goes out and the moment the generator fires up, there is also a battery backup connected to an inverter. You can think of this as a much bigger version of a UPS.
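As a rough sizing check for that battery buffer, the sketch below estimates how long a battery bank could carry a given load and compares it with the generator's start-up time. The function names, the 0.9 inverter efficiency, and the 2x safety factor are illustrative assumptions, not vendor figures. Real runtime also depends on battery age, temperature, and discharge rate.

```python
def battery_runtime_minutes(capacity_wh: float, load_w: float,
                            inverter_efficiency: float = 0.9) -> float:
    """Estimate battery runtime in minutes using a simplified linear model.

    capacity_wh: usable battery capacity in watt-hours (assumed value).
    load_w: power drawn by the protected equipment in watts.
    inverter_efficiency: assumed conversion efficiency of the inverter.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    usable_wh = capacity_wh * inverter_efficiency
    return usable_wh / load_w * 60


def covers_generator_start(capacity_wh: float, load_w: float,
                           generator_start_minutes: float,
                           safety_factor: float = 2.0) -> bool:
    """Return True if the battery can bridge the generator start-up time
    with the given (assumed) safety margin."""
    runtime = battery_runtime_minutes(capacity_wh, load_w)
    return runtime >= generator_start_minutes * safety_factor
```

For example, `covers_generator_start(5000, 3000, 5)` asks whether a 5 kWh battery bank can carry a 3 kW load for at least twice a five-minute generator start.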
Alternative power has also become quite popular, with solar panels being the most commonly used option. These solar cells can store energy in a battery backup. In some cases, they can provide real-time power to supplement the utility company's power supply, reducing costs.
Example: Accidents and Natural Disasters
While seemingly rare, accidents and natural disasters do happen. Whether it is a plane crash, a fire, flood, or earthquake, you cannot predict when such things will (or won't) occur. If an airplane crashes into your data center or business, then you can expect catastrophic damage to your network. The same is true of any other type of disaster, such as an earthquake or fire. The damage could be structural and cause the building to collapse, or a main conduit could be severed, cutting all of your data and power connections. All possibilities should be addressed in your disaster recovery testing.
There is no way to predict the damage from natural disasters or accidents, but you can prepare for them. The most common safety measure you can put in place for your network and data center is a fire suppression system. There are many different types, but most modern systems are FM-200 derivatives. This is a gaseous system that is non-toxic and, unlike water sprinklers, does not damage electronics or buildings.
Most server rooms and data centers operate on raised flooring, preferably above the ground floor, to guard against flooding. This makes it far less likely that rising water will creep up into your infrastructure and cause damage. Other natural disasters like earthquakes are not easy to prepare for. Your buildings should be constructed to the specifications of your area's building codes so that they can withstand the general forces of mild earthquakes and possibly heavy ones too.
Pests are seemingly less dangerous, but they can be just as damaging to your company's productivity. When pests such as rodents get into your networking trenches, there is the potential for severed communication and electrical cables. You need to ensure that you have the correct pest control measures in place, in line with your local ordinances and governmental regulations.
Sometimes you can experience downtime due to construction work or even a poorly placed hole being dug on the sidewalk. Fiber cables are fragile, and if any strands are severed or damaged, it could spell disaster for your business if you only have one line to the outside world.
Finally, a lack of capacity and planning can also create unforeseen downtime and slow network speeds. Suppose your company is still operating off old hardware that cannot offer gigabit speeds on your local network. In that case, you might find your internal operations running at inefficient speeds. This causes network congestion, packet loss, and even network downtime. You need to ensure that your systems align with the modern requirements of data-heavy networks to have an efficient workflow within your organization.
Cloud storage, offsite backups, and site duplication are all measures that you can put in place to prevent permanent loss of data and operational capabilities. All of these considerations should be a part of your disaster recovery procedures so that when the worst-case scenario hits, you are still able to operate, even if it is in a reduced capacity. Always have redundancy for all your services in place, including internet and telephone services. An emergency high-speed LTE connection can make a huge difference if all the landlines in your area are severed, so having your eggs in more than one technological basket is always a good idea.
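The idea of keeping more than one connection to the outside world can be sketched as a simple priority-based failover choice. This is a minimal illustration only: the `Uplink` type, the link names, and the `healthy` flag are hypothetical, and in practice the health status would come from your router or monitoring system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Uplink:
    name: str        # hypothetical label, e.g. "fiber" or "lte-backup"
    priority: int    # lower number = preferred link
    healthy: bool    # assumed to be supplied by a separate health check


def choose_uplink(uplinks: list[Uplink]) -> Optional[Uplink]:
    """Pick the most preferred healthy uplink, or None if every link is down."""
    candidates = [u for u in uplinks if u.healthy]
    return min(candidates, key=lambda u: u.priority, default=None)


links = [
    Uplink("fiber", priority=1, healthy=False),      # primary line severed
    Uplink("lte-backup", priority=2, healthy=True),  # emergency LTE link
]
active = choose_uplink(links)
```

With only one entry in the list, a single severed cable leaves `choose_uplink` with nothing to return, which is the situation the redundancy is meant to avoid.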
Type 2: Hardware Failures
Hardware failures occur when parts of your IT infrastructure stop working. Hardware failure can happen for many different reasons, such as voltage spikes on the power grid, water damage, or even electrical component failures from age or a lack of maintenance. Without adequate planning, these can have an especially devastating impact on operations.
Example: Server Hardware Failures
This is one of the most common failure types within an organization. If a server goes offline due to a hardware failure, the service can only be restored once the failed hardware component has been replaced. Components that can fail are power supplies, motherboards, and hard drives. These components can become faulty for many different reasons, as we outlined above.
Thus, it is crucial to make sure that your server room has clean and reliable power connected to a UPS or power inverter. These provide power backup in case of power failures, but they also condition the power feeding into your server racks. This helps to prolong the life of your server hardware.
It is essential to consider that most server infrastructure is virtualized. For one virtual host, there can be multiple virtual servers operating inside of it. This means that if your virtual host goes offline, then you will also potentially have multiple servers going offline if you do not have appropriate redundancy in place.
To prevent these kinds of failures from causing prolonged and unnecessary downtime, you need to make sure that a few things are put in place. First, you need to make sure that all of your production hardware is under warranty, or that you have a service provider that can supply the parts for your hardware at a moment's notice. You can also keep a selection of spares such as power supplies and hard drives on-site to help clear errors and ensure that small failures don't become big problems further down the line.
Most servers have multiple hard drives and power supplies that can operate if one of them fails. These parts can also usually be replaced without the server needing to be powered down. This is where the term "hot-swappable" comes from.
Type 3: Hard Drive Failures
Although we mentioned hard drives earlier, it is essential to understand a few critical things about how your company's servers and their hard drives work together. There are usually multiple hard drives on a traditional server that connect directly to the server's motherboard. To help with performance and redundancy, they are typically configured in a RAID array. With a redundant RAID level (such as RAID 1, 5, or 6), if one drive fails, the system can continue to run without data loss. When a replacement drive is installed, the data is rebuilt onto that drive.
With virtual machines, there is usually a unit that stores all of the virtual host's hard drives for it, generally on a fiber network. This is known as a Storage Area Network (SAN), and it also uses RAID to help with performance and redundancy. Hard drive failures on a SAN can be disastrous if data cannot be recreated on replacement drives because the impact can potentially affect many different servers, services, and applications.
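To make the rebuild step less abstract, here is a simplified sketch of how parity-based redundancy can recreate a lost drive's contents. It assumes a single-parity scheme in the spirit of RAID 5, with one small block per drive. The article does not say which RAID level is in use, and real arrays stripe data and rotate parity across drives, so treat this purely as an illustration of the principle.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) groups the bytes at each position across all blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives, each holding one 4-byte block (made-up contents).
data_drives = [b"ACCT", b"MAIL", b"LOGS"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_drives)

# Simulate losing the second drive, then rebuild it from the survivors
# plus the parity block.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR cancels itself out, combining the surviving blocks with the parity block leaves exactly the missing block. This only works while a single drive is missing, which is why a second failure during a rebuild can be so damaging.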
Example: Outdated Equipment
A balance between cost-effective computer infrastructure and technology improvements can be tricky to navigate. Because organization-wide upgrades are expensive, it is not uncommon to see some hardware staying in service beyond its warranty period. This often happens with legacy systems that are no longer supported, where building a replacement system would take a lot of resources.
When outdated equipment fails, it has two effects. The first is a scarcity of skill and supply. If your hardware has not been manufactured for some time, then the odds of quickly finding replacements are not great. This also applies to the skills and expertise required to install replacement hardware on failed legacy systems. The second is cost: this scarcity drives up the price of repairs and makes maintaining such equipment a very costly exercise.
Whenever systems start to reach the end of their life, you need to start planning suitable replacements before you run into more significant problems further down the line.
Example: Failed/Incompatible Firmware Upgrades or Patches
To keep your hardware running effectively, manufacturers will release firmware upgrades and patches from time to time. Firmware is the low-level code that tells your hardware how its components interact and what information is available to the operating system.
If firmware becomes corrupt on a device, then the result is usually that the device becomes "bricked." This means that it won't even be able to power on correctly, or, if it can power on, then it can't do much else. Failed firmware patching can occur when a device loses power mid-way through the process or if a communications cable is accidentally removed mid-process. Sometimes a corrupt firmware file is flashed to a device anyway, or a device accepts firmware that is incompatible with it, making it inoperable. There are usually extensive checks done by the flashing software before starting, but errors can sometimes occur.
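One check you can run yourself before flashing is to compare the firmware file against a checksum published by the manufacturer. The sketch below assumes the vendor publishes a SHA-256 digest, which not every vendor does. The function names and file path are hypothetical. A matching checksum only shows the file is intact, not that it is the right firmware for your device.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as firmware:
        for chunk in iter(lambda: firmware.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def firmware_is_intact(path: str, expected_sha256: str) -> bool:
    """Return True if the file's digest matches the published one.

    expected_sha256 is assumed to be the hex digest copied from the
    vendor's download page; whitespace and letter case are normalized.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

If `firmware_is_intact` returns False, download the file again rather than flashing it.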
Software patches to an operating system also have the potential to break certain services on a server. Microsoft has had many examples of hotfixes needing to be released to fix bugs introduced from Windows Updates, although this has gotten considerably better over the years.
The best way to prevent bad software patches from affecting your systems is to test them in isolation on a test network before deploying them to a live system. Any bugs or performance issues should be noted before you begin rolling the patches out to the rest of your network.
Example: Overheating
Overheating can cause severe damage to your server infrastructure, and it can also introduce some strange errors. Your data center needs to have proper cooling, especially if you have many populated server racks generating heat. Most server rooms will have a dedicated hot and cold aisle system that directs cool air into the servers and exhausts hot air into a channel that carries it out of the room.
If these systems are not running correctly, then the hardware in that room will be affected by heat, which has short- and long-term effects on both the systems' operation and lifespan. The key takeaway from this is that your cooling systems always need to be running efficiently. To accomplish this, you must have set service intervals for your server room's cooling components.
Maintenance of fans, ducts, and filters needs to be carried out on all of your servers and rack-mounted equipment on a maintenance schedule as well. Over time, dirt will block the airflow for any device with a fan (or multiple fans) installed. This airflow needs to be kept clean for the best cooling performance.
Type 4: Software Failures
Software failures can occur for many reasons. A license can expire, a configuration file can go missing or become corrupted, a bad software update can cause issues, or a software bug can surface; the list is almost endless.
Another factor to consider is whether the failed software is an off-the-shelf solution purchased from a vendor or if the application has been developed in house. The time it takes to get your software up and running again depends on who designed the non-functioning software and if they are available to provide support.
The result is usually the same: the software is not working — and many business functions are impacted. Your application might be performing some essential network functions that might now be impacting everything else, which will cause more issues until it is fixed.
If you are still manually rolling out your updates, then you will find that a manual fix will need to be implemented and rolled out across the business. This is avoidable with automation tools. Therefore, it is vital for all companies that rely heavily on software solutions to think about how they can better leverage their resources by automating as many processes as possible.
Perhaps a software update has been introduced to your environment but not validated, causing bugs and errors. You will need to visit each workstation or node where the update has been applied and either 1.) apply a patch or fix, or in some cases, 2.) roll back to the known good version of the software. It is always an excellent idea to have some kind of rollback plan when implementing software updates to your systems.
To prevent these kinds of failures, you need to ensure that you follow a proper testing and validation plan for all of your software updates. Most companies have a testing environment that mimics most of the mission-critical systems. Any updates that are applied can be closely watched and documented. When the rollout happens in your live environment, you won't be met with any surprises.
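A rollback plan of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions rather than a real deployment tool: `nodes` is a hypothetical inventory mapping node names to installed versions, and `is_healthy` stands in for whatever validation you run after an update.

```python
from typing import Callable


def roll_out(nodes: dict[str, str], new_version: str,
             is_healthy: Callable[[str, str], bool]) -> bool:
    """Update nodes one at a time and roll everything back on failure.

    nodes maps a node name to its installed version and is modified in
    place. is_healthy(node, version) is an assumed post-update check.
    Returns True if every node was updated, False if a rollback happened.
    """
    previous = dict(nodes)      # snapshot of the known good versions
    updated: list[str] = []

    for name in nodes:
        nodes[name] = new_version
        updated.append(name)
        if not is_healthy(name, new_version):
            # Restore every node touched so far to its known good version.
            for done in updated:
                nodes[done] = previous[done]
            return False
    return True
```

Updating one node at a time and checking it before moving on keeps a bad update from reaching the whole environment, which is the same reasoning behind validating in a test environment first.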
Type 5: Human Failures
Unfortunately, it is still possible for human errors to cause issues on your network. Anything from accidental hardware or cable damage to poorly configured network devices can cause downtime on your network.
If your network is not maintained regularly, then you can also expect network failures to occur. Whenever a device is removed or added to your network cabinet, cable management must be adhered to, and documentation must be updated. Sometimes a network failure can occur because cables are not correctly labeled, creating problems when they are accidentally removed.
Even worse, if an unlabeled cable is removed, it can take that much longer for the fault to be found and then successfully troubleshot. It is for this reason that network maintenance must be carried out at regular intervals.
Another major cause is bad changes being made to the network environment. Changes such as VLAN configurations, routing, and IP address configurations that are not tested before being deployed have the potential for unexpected results. Network changes and network maintenance must not be treated as an afterthought and must instead be scheduled and carried out regularly.
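Some configuration mistakes can be caught before a change is deployed. As one example, the sketch below uses Python's standard `ipaddress` module to check a planned VLAN-to-subnet mapping for overlapping address ranges. The VLAN IDs and subnets are made up, and a check like this complements testing the change on a lab network rather than replacing it.

```python
import ipaddress
from itertools import combinations


def find_overlaps(vlan_subnets: dict[int, str]) -> list[tuple[int, int]]:
    """Return pairs of VLAN IDs whose planned subnets overlap.

    vlan_subnets maps a VLAN ID to a subnet in CIDR notation. A subnet
    with host bits set (for example "10.0.10.5/24") raises ValueError,
    which also catches a common typo.
    """
    networks = {vlan: ipaddress.ip_network(cidr)
                for vlan, cidr in vlan_subnets.items()}
    ordered = sorted(networks.items(), key=lambda item: item[0])
    return [(vlan_a, vlan_b)
            for (vlan_a, net_a), (vlan_b, net_b) in combinations(ordered, 2)
            if net_a.overlaps(net_b)]


# Hypothetical change plan: VLAN 30 accidentally reuses part of VLAN 20.
planned = {10: "10.0.10.0/24", 20: "10.0.20.0/24", 30: "10.0.20.128/25"}
conflicts = find_overlaps(planned)
```

An empty result means no two planned subnets collide. Anything else should be resolved before the change goes live.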
Type 6: Security Failures
Security could be thought of as an extension of human errors, but many other variables also play a role. Security should be considered an active measure that must be implemented from the very start of a project and maintained throughout its life cycle. If you don't actively work to protect your environment, you can open yourself to unnecessary risks.
DDoS (Distributed Denial of Service) attacks have become a standard method of attack used by cybercriminals. The method involves using thousands of hosts that send requests to a website or server. The unexpected load can sometimes take such a service offline, meaning that the business cannot continue to operate until it is brought back up. There are ways to mitigate this. Modern solutions can detect DDoS attacks and then reroute that traffic to another data center or network appliance where the data packets cannot reach the intended target. Some internet service providers and internet hosts offer this kind of protection, so it is a good idea to find out if you can integrate such a solution into your online services.
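Detection usually starts with noticing abnormal request rates. The sketch below counts requests per source within a sliding time window. It is a deliberately simple illustration with a hypothetical class name and thresholds. A truly distributed attack spreads its load over many sources, so per-source counting is only one signal among the many that commercial mitigation services use.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flag sources that send too many requests within a time window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests      # assumed per-source limit
        self.window_seconds = window_seconds  # length of the sliding window
        self._seen: dict[str, deque] = defaultdict(deque)

    def record(self, source: str, timestamp: float) -> bool:
        """Record one request and return True if the source is over the limit."""
        times = self._seen[source]
        times.append(timestamp)
        # Drop requests that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests


monitor = RateMonitor(max_requests=3, window_seconds=10.0)
# 203.0.113.7 is a documentation-range address used here as a placeholder.
flags = [monitor.record("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

Here the fourth request inside ten seconds pushes the source over the limit of three. Once the window has passed, the same source is treated as normal again.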
Another point of entry is through malware and viruses. Antivirus protection has become commonplace in many organizations, but you need to have an acceptable use policy in place to use this technology effectively. You can have the best security software packages in the world, but if your users are not following the rules, they will not be effective at all.
A lack of user awareness and training around social-engineering attacks and phishing scams can also lead to enormous security risks within an organization. Teaching your employees how to identify and avoid such scams and attacks will protect your company from losing valuable data.
Data Loss Prevention (DLP) is something that most companies are starting to employ to retain intellectual property and sensitive data. DLP solutions can scan all outgoing data, such as emails and attachments, for specific keywords and phrases that relate to the protected data that you are trying to keep from being exfiltrated from your organization.
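The keyword scanning described above can be illustrated with a few lines of Python. The term list and function name are hypothetical, and commercial DLP products go well beyond this (file parsing, fingerprinting, policy engines), so this is only a sketch of the basic idea.

```python
import re


def find_protected_terms(text: str, protected_terms: list[str]) -> list[str]:
    """Return the protected terms that appear in the text.

    Matching is case-insensitive and limited to whole words or phrases,
    so "falcon" does not match "falconry".
    """
    hits = []
    for term in protected_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


# Hypothetical watch list and outgoing message.
watch_list = ["project falcon", "source code"]
outgoing = "Attached is the Project Falcon pricing sheet."
matches = find_protected_terms(outgoing, watch_list)
```

A non-empty result would typically hold the message for review rather than block it outright, since keyword matching alone produces false positives.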
More advanced solutions can search within the metadata of files that are injected with proprietary data, making it easier to identify those responsible for trying to send out your data. Proper security needs to be implemented at all of your sites to ensure that hardware and any other IT asset is not removed without being authorized.
The Impact of Network Failures
Network failures mean possible network downtime. Depending on the company, this could equal the loss of thousands or tens of thousands of dollars per hour. In highly competitive markets with external-facing internet services, you can also lose customers, who will turn to a competitor while your company is offline.
The Business Costs of Network Failures
Many different factors can cause a network failure, and each has its own potential cost. Some of these costs are financial, but there are other things to consider, such as reputational damage or the tarnishing of your company's brand.
If your company experiences a catastrophe that destroys a data center, then you are probably looking at millions of dollars' worth of damage. But if you have off-site backups for all of your data, configuration files for all of your devices, and a proper disaster recovery plan in place, you are not out of the race just yet.
You will have to rely on your teams to execute the Disaster Recovery Plan to get your services back up and running, even if it is in a minimized form. If you don't have a plan to recover from such an unthinkable failure, then you might not be able to recover from that kind of worst-case scenario disaster.
Hardware costs are the most obvious concern for many people, and it is understandable. Servers, switches, routers, uninterruptible power supplies, and anything else required to run your IT operations are very costly.
You wouldn't think of software as carrying a cost when your network fails, but there are a few things to consider. If your company has its own proprietary software and systems, then recovering from a complete failure will have its own unique set of challenges. You will need to protect all of your development resources, like source code and repositories. That way, if your production environment is compromised, you can spin up another instance and restore data to it.
Your customers are vital to your operations, and so is their data. Losing critical data either through software or hardware failures can be challenging to recover from. Backups can help you to make sure that your customers experience as little frustration as possible. If your customers are unable to use your services, or if you are not able to help them to recover their data, then you risk damaging your company's reputation.
Intellectual property loss can be a considerable cost factor, especially if that data is exposed online through a cyber attack or exfiltration. You need to make sure that sensitive intellectual property is stored in a safe offsite location that you can access in a network failure event.
If your company operates in a space where compliance and reporting penalties could be a factor, you will need to have safeguards to minimize the impact that a network failure could have on your organization.
Brand damage can occur if your customers are negatively affected by a network or system failure. Your competitors will take advantage of your downtime, and once a customer decides to jump ship, you will almost always struggle to win them back, even after your systems have recovered.
The Human Costs of Network Failures
While your staff is battling to get your network back up and running, there are naturally going to be other areas that suffer. These interruptions take your team away from productive work and create backlogs in other business areas. They also have the unintended consequence that your support staff suddenly have to respond to customer complaints.
This is difficult because your technical staff doesn't necessarily have the same customer-focused skills as customer-facing staff, which can lead to miscommunication. In most cases, your marketing and sales staff will need to reach out and let your customers know about the interruptions so that they understand what is happening and how the recovery is progressing. Again, this takes your staff away from their core business roles and creates more backlogs throughout the business.
If overtime is needed to get your systems back up and running, you will incur additional labor costs during these types of failures. If you don't have the necessary in-house skills, then getting a service provider or subcontractor to assist will also incur additional costs.
Your teams will also experience increased stress levels while the problems are ongoing. Remember to give them some time off after fixing all of the issues as fatigue and stress can be real productivity killers afterward.
How to Prevent Network Failures
Now that we know what causes network and system failures, we can start to prepare for them. Nobody likes to fixate on only the negatives, but you have to plan for the worst when it comes to network failure prevention. Your first port of call should be preparing a disaster recovery plan. We could write an entire article on what such a plan needs to consider, but we will touch on a few basics here.
Your staff needs to be trained with a series of test drills to make sure that everybody knows what they need to do in the event of an emergency. These test runs need to be carried out often so that your teams are ready to spring into action. Part of this preparation means that you need to document your disaster recovery procedures. This documentation can take the form of playbooks, step-by-step guides, and any other resources that will help your teams get the job done.
To detect issues before they turn into a massive network outage, you need to be monitoring your systems continuously. This can be as elaborate as a fully integrated network operations center or as simple as a single workstation with monitoring software, as long as your support staff can see a problem before it becomes an outage.
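At its core, monitoring of this kind compares readings against thresholds and raises a flag early. The sketch below shows that idea with made-up check names and limits. How the readings are collected (SNMP, agents, vendor APIs) is left out and would depend on your environment.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str           # hypothetical metric name, e.g. "rack_temp_c"
    warn_at: float      # value at which staff should take a look
    critical_at: float  # value at which an outage is likely


def classify(check: Check, value: float) -> str:
    """Map a single reading to "ok", "warning", or "critical"."""
    if value >= check.critical_at:
        return "critical"
    if value >= check.warn_at:
        return "warning"
    return "ok"


def evaluate(checks: list[Check], readings: dict[str, float]) -> dict[str, str]:
    """Classify every check; a missing reading is reported as "unknown",
    because a silent sensor is itself worth investigating."""
    results = {}
    for check in checks:
        if check.name in readings:
            results[check.name] = classify(check, readings[check.name])
        else:
            results[check.name] = "unknown"
    return results


# Illustrative thresholds only; use values suited to your own equipment.
checks = [
    Check("rack_temp_c", warn_at=30.0, critical_at=38.0),
    Check("disk_used_pct", warn_at=80.0, critical_at=95.0),
    Check("link_util_pct", warn_at=70.0, critical_at=90.0),
]
status = evaluate(checks, {"rack_temp_c": 32.5, "disk_used_pct": 96.0})
```

The "warning" level is where the value lies: it gives your support staff time to act before the reading becomes "critical".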
We've gone over a few different solutions that can help you to prevent these outages, like Uninterruptible Power Supplies and inverters, backup solutions, and fire or flooding protection. Your critical servers must be a part of high availability groups for redundancy. Combined with virtualization, you can minimize your downtime with little to no interruptions to service. Other things to look at include a comprehensive security training program and defensive security systems.
We've covered many different areas in this article and looked at many essential practices that should be a part of your day-to-day operations. The main takeaway is this: planning is only half of the battle. Implementing your plan is just as tricky. To accomplish this, you need to create and document your plan and make it accessible to everyone who needs to know what is in it.
Once you have all of the details figured out, you need to practice it at set intervals to make sure that when the time comes, your team is ready to act at a moment's notice. Cybersecurity training and workforce security awareness are also paramount.