How to Avoid Single Points of Failure with Redundancy
You would never get in the car without a spare tire in the back. And along those lines, you probably use overdraft protection for your bank account and stock extra toilet paper in your bathroom.
There are plenty of examples in daily life where we make sure to have extra resources in case we need them. This is what redundancy is all about. And if you want to keep your IT infrastructure in continuous operation, you will need at least two of just about everything.
What Is Redundancy?
The philosophy behind redundancy is that things in the universe have a tendency to break down and fail us just when we need them most. Call it the Second Law of Thermodynamics, entropy, or Murphy's Law: the idea is much the same. Redundancy is a counter to Murphy and his annoying Law, and it's meant to give us some peace of mind that we can weather any storm.
Redundancy is when you have more than you actually need at the time. The dictionary gives a more precise definition for redundancy as it relates to the field of engineering:
The inclusion of extra components which are not strictly necessary to functioning, in case of failure in other components.
There's a fascinating 2009 research paper on the subject by John Downer, titled "When Failure is an Option: Redundancy, reliability and regulation in complex technical systems". It begins with an aviation tale about a British Airways flight over the Indian Ocean. After all four engines shut down because of volcanic ash, things looked grim. Passengers began writing letters home while the crew tried to keep up their spirits.
A four-engine jetliner can stay in the air on only one engine. The reason to have four is redundancy, which normally works. But in this case, three backup engines were not enough, because the same cloud of ash knocked out all four at once. It just shows that the forces of entropy are strong, and even our best efforts may not be enough. Luckily, the engines finally restarted, but not before the flight made the Guinness Book for the longest non-powered flight.
Under normal circumstances, a single backup server may be enough. But the intricacies of redundancy that Downer explores in his paper, and the chance that several components fail together, may mean you need more than one spare, ideally with at least one in another data center location. Depending on how critical your IT component is (whether an application, database, server, or connection), you need to build in enough redundancy to guard against failure, outage, or disaster.
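To put rough numbers on "one spare or more than one", here is a minimal Python sketch of the textbook formula for redundant units in parallel. The function name and the 99% figure are illustrative assumptions, not measurements. It also assumes failures are independent, which is exactly the assumption the volcanic ash story breaks, so treat the result as a best case.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units when any one of them is enough.

    Assumption: units fail independently of each other. A shared cause
    (one power feed, one data center, one cloud of ash) makes the real
    figure worse than this estimate.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical server that is up 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{parallel_availability(0.99, n):.6f}")
```

Under these assumptions, a second 99% server lifts the estimate to roughly 99.99%. Putting that second server in another location is what makes the independence assumption more believable.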
Avoid Single Points of Failure
At the heart of the matter is the fact that any one component can fail at any time. Preventive maintenance and proactive monitoring can help to mitigate potential disaster, but depending on a single unit for a critical service is risky business. A single point of failure (SPOF) is a component that will bring down the whole system when it fails. Few things are more dangerous to an IT infrastructure.
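One way to make the SPOF definition concrete is to model the infrastructure as a graph and ask which single component, if removed, cuts users off from the service. The Python sketch below does that with a plain breadth-first search. The topology and every name in it are hypothetical, and a real audit would also need to cover power, network paths, and people.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search: can `start` still reach `goal` without `removed`?"""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Return components whose loss alone cuts every path from start to goal."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(
        n for n in candidates if not reachable(links, start, goal, removed=n)
    )


# Hypothetical topology: two web servers, but one load balancer and one database.
TOPOLOGY = {
    "users": ["load_balancer"],
    "load_balancer": ["web_1", "web_2"],
    "web_1": ["database"],
    "web_2": ["database"],
    "database": ["storage"],
}

if __name__ == "__main__":
    print(single_points_of_failure(TOPOLOGY, "users", "storage"))
```

In this made-up layout the web tier is redundant, but the load balancer and the database are not, so those are the two components the check reports.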
It goes without saying that, given time, every IT element will fail. Hard drives will become corrupt, databases will lose their integrity, and connections will become unreliable. The problem is that we just don't know when that will happen.
HAL: "My F.P.C. shows an impending failure of the antenna orientation unit…. The unit is still operational, Dave, but it will fail within seventy-two hours."
If you don't have a HAL 9000 unit to precisely predict your IT element failures, as Dave did in the movie "2001: A Space Odyssey", you may want to think about redundancy. If HAL and Dave had only had sufficient redundancy, there would have been no need for the subsequent EVA and all the havoc that ensued.
Having SPOFs throughout your network is a sure ticket to failure. On the other hand, as Techopedia tells us:
"Highly reliable systems are designed without SPOFs. This means that failure of a component, system or site does not halt system or operational functions."
Levels of Redundancy
Redundancy is a concept that can be applied at many levels across many technologies. It all depends on the scope of the system. A simple database in a small office should either be backed up or replicated, preferably offsite. An email service should be replicated across more than one server and backed up regularly. A small business with a single internet connection will suffer if it loses its only link to critical data.
But we can think of redundancy in many different ways. If you depend on a single cloud provider, for instance, you are at risk of database or regional outages. A second cloud provider lets you continue operations when AWS, Google, or Azure goes down. (It may be rare, but it happens.) Vendor-level redundancy can also apply to SaaS providers or other cloud-based services that you rely upon every day.
Data center redundancy is a good idea even if you don't have your services in the cloud. Keeping all your IT infrastructure in one data center is asking for trouble. All it takes is a power outage or some other event that affects everything in the data center, and you have no access to your data. It's a good idea to put redundant servers, databases, and applications in separate data centers. You never know when Murphy will start to act up.
Having redundant connections means you're prepared if one goes down. Whether it's the loss of a transatlantic cable or a failure of a DSL link, you can't afford to be without connectivity. DNS redundancy belongs here too. If you don't have a second DNS provider configured into your network, you may very well have an active network link but still be unable to access your applications.
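The same "have a second one ready" idea can be applied inside an application. The Python sketch below tries a list of endpoints in order and uses the first one that answers. The host names are placeholders, and the injectable `connect` parameter is an assumption added so the logic can be tested without a network. DNS redundancy itself is normally set up with your DNS providers and resolvers, not in application code, so this is only an analogue of the principle.

```python
import socket


def connect_with_fallback(endpoints, timeout=3.0, connect=socket.create_connection):
    """Try each (host, port) in order; return (endpoint, connection) for the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint, timeout)
        except OSError as exc:
            # Covers refused connections, timeouts, and failed name lookups.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Placeholder hosts; replace with your own primary and backup links.
    candidates = [("primary.example.com", 443), ("backup.example.com", 443)]
    try:
        used, conn = connect_with_fallback(candidates)
        print("connected via", used)
        conn.close()
    except ConnectionError as exc:
        print(exc)
```

Ordering the list by preference keeps the primary in use while it is healthy, and the collected error messages make it clear afterwards which links were down.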
Failover for Disaster Recovery
In the high-stakes business world, time is money. Every minute that an internet service is down can cost a company in terms of service level agreement (SLA) penalties or lost income. That's why it pays to have an effective service failover system. Failover is what happens when a primary system fails and a secondary system takes over. The best failover systems don't require human intervention.
Automation continues to improve the reliability and performance of IT networks and systems. The traditional approach was to monitor and detect failures and either replace parts as soon as possible or manually switch over to redundant IT elements. With the power of artificial intelligence and automatic processes, these failover actions often occur immediately in today's advanced IT environments.
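As a rough illustration of failover without human intervention, the Python sketch below routes traffic to a primary until it fails a set number of health checks in a row, then switches to the secondary, and switches back once the primary recovers. The class name, the threshold, and the `check` callable are all assumptions made for the example. Real failover products also verify the secondary, guard against flapping, and handle data synchronization, none of which is shown here.

```python
class FailoverRouter:
    """Use the primary until it fails `threshold` health checks in a row."""

    def __init__(self, primary, secondary, check, threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.check = check          # callable: target -> bool (True means healthy)
        self.threshold = threshold
        self.failures = 0           # consecutive failed checks on the primary
        self.active = primary

    def poll(self):
        """Run one health check on the primary and return the target to use."""
        if self.check(self.primary):
            self.failures = 0
            self.active = self.primary      # fail back once the primary recovers
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = self.secondary
        return self.active


if __name__ == "__main__":
    # Simulated health state for two hypothetical servers.
    health = {"server-a": True, "server-b": True}
    router = FailoverRouter("server-a", "server-b", lambda t: health[t], threshold=2)

    health["server-a"] = False
    print(router.poll())   # first failure, still below the threshold
    print(router.poll())   # second failure, switches to the secondary
    health["server-a"] = True
    print(router.poll())   # primary is healthy again, switches back
```

Requiring several consecutive failures is a deliberate trade-off: it avoids switching on a single dropped check, at the cost of a slightly slower reaction to a real outage.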
When a disaster occurs, an untold number of network or system elements may be disabled or inaccessible. The trick is to get the service up and running as soon as possible and worry about the defective elements later. Business continuity, of which disaster recovery is a part, is all about keeping business processes running with minimal disruption.
Well-engineered IT architectures are almost self-healing these days. A good business continuity plan will take advantage of the latest technological innovations to restore service to users so that any outage is barely noticeable, even if everything seems to be falling apart at one location.
Redundancy is an essential part of IT, and it's not possible to be successful without it. When a disaster occurs, that's when your company will really find out how well redundancy has been engineered into your IT infrastructure. And if it is lacking, it will likely be quickly visible to your company's employees, customers, and unfortunately to the public at large.