What is Fault Tolerance?

Published on November 17, 2023

Quick Definition: Fault tolerance is the ability of a network or system to continue operating even when physical components break or software fails. The key features of a fault-tolerant system are load balancing, clustering, redundancy, replication, and failover. 

Each business has its own tolerance for times when the network is slow due to heavy traffic, out of action because of failures, or delayed in returning to full service. In other words, each business decides how fault-tolerant its network needs to be.

The CBT Nuggets certification preparation course, CompTIA Network+ (N10-008), covers network design considerations that network administrators should understand, such as fault tolerance, high availability, quality of service, and traffic shaping. In this article – the first of three – we’ll discuss fault tolerance, including what it is and key techniques you should know.

What is Fault Tolerance?

Fault tolerance refers to the ability of a system to continue to function even when things break. The reality is no system component remains perfect forever. Physical components degrade over time and can break. Even though software doesn't wear out, it’s impossible to test for every situation. The system’s going to break eventually.

So, what happens when something goes wrong? The answer is: It depends! It depends on the organization’s tolerance for unexpected downtime and the impact of particular applications on business functions.

For example, outages on a bank’s email or employee records systems have less financial impact than those on its ATM network or branch teller system. For the critical systems, you can justify investment in the additional control systems and redundant components required to ensure operational continuity.

Fault-tolerant systems continue to operate even in the event of component failures. In fault-tolerant environments, the various subsystems are continuously monitored. Upon detection of degrading performance or outage, backup elements are automatically brought online so the overall system continues to operate.

For some enterprises, seconds of outage can result in millions of dollars in lost profit. In others – think hospitals, air traffic control, or nuclear power plants – downtime might result in deaths!

Another term to know is high availability. Cloud service providers, for example, might say their service will be available 99.999% of the time. That translates to about 5 minutes of downtime in a year. That’s high availability, but still, 5 minutes of downtime may be too much for some critical applications.

And, if service availability is only 99.99%, that would mean almost an hour of downtime throughout the year. If that hour happens at the wrong time, it could have disastrous consequences. 
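
The downtime figures above follow from simple arithmetic. Here is a minimal sketch of the calculation, assuming a 365-day year; the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for pct in (99.999, 99.99):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} minutes per year")
# 99.999% -> 5.3 minutes per year
# 99.99% -> 52.6 minutes per year
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the jump from 99.99% to 99.999% matters so much for critical applications.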

In comparison, a fault-tolerant system is designed to have zero downtime. In most cases, fault-tolerant systems also have high-availability components. So, a fault-tolerant cloud-based system might have a parallel high availability component system in separate cloud data centers – the redundant one ready to take over from the primary system as soon as required.

Key Techniques of Fault Tolerance

Systems requiring fault tolerance use principles like redundancy, replication, and failover, along with high availability techniques such as load balancing and clustering. The key concept of fault tolerance is monitoring system performance and availability so the system can act automatically to continue operations when problems occur.

Let’s quickly look at each technique:

Load Balancing

Networks will experience surges in traffic volumes that could result in poor performance, like slow response, time-outs, etc. Load balancing allows traffic to be distributed evenly among a pool of resources or servers so that users experience consistent responses.

Load balancers monitor their pool of target resources and divert traffic away from any that are offline or under stress. The load balancers themselves can also be deployed in high availability pairs or clusters, so they don't become a single point of failure.
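
To make the idea concrete, here is a simplified sketch of a round-robin load balancer that skips servers marked as unhealthy. The class, method, and server names are hypothetical, and a real load balancer would run its own health checks rather than being told which servers are down.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per request before giving up.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
balancer.mark_down("web-2")  # simulate a failed health check
picks = [balancer.next_server() for _ in range(4)]
print(picks)  # ['web-1', 'web-3', 'web-1', 'web-3']
```

Traffic keeps flowing to the two remaining servers while web-2 is out of the pool, and calling mark_up("web-2") returns it to the rotation.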

Clustering

Clustering is similar to load balancing except that multiple resources are designed to operate as a single unit. Most resources can be clustered – disks, network ports, apps, and data servers.

Think of a cluster as having a built-in load balancer. For additional resilience, it’s possible to have load balancing between cluster resources.

Redundancy

A dictionary meaning of redundant is “unnecessary.” After all, a second network path is unnecessary when the first one works fine! But what happens if the first one fails? That’s when you’d want to switch traffic to a redundant – but now necessary – path.

So redundancy means keeping a duplicate of a resource – network path, server, power supply, etc. – on standby, ready to be brought online if the first instance falters. In some cases, you might plan for the outage of an entire data center by having one or more redundant standby sites ready to take over immediately (a hot site) or soon after the main site goes down.

Replication

We’ve talked about how network traffic is handled for high availability and fault tolerance, but what about the data? That’s where replication comes into play. The data from each transaction is stored in multiple storage units so each store has a duplicate of the others.

The other(s) will remain available if one storage device goes offline. In some instances of replication, only the data is replicated; in other cases, you might have replicated app servers that mirror the primary. If the main instance fails, the mirrored app server is primed, up-to-date, and ready to run!
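
Here is a toy illustration of data replication, assuming synchronous writes to every online replica. The in-memory dictionaries stand in for real storage units, and all names are hypothetical.

```python
class ReplicatedStore:
    """Write each record to every online replica; read from any online one."""

    def __init__(self, replica_names):
        self.replicas = {name: {} for name in replica_names}
        self.online = set(replica_names)

    def take_offline(self, name):
        self.online.discard(name)

    def write(self, key, value):
        # Synchronous replication: every online replica gets a copy.
        for name in self.online:
            self.replicas[name][key] = value

    def read(self, key):
        # Return the first copy found on a replica that is still online.
        for name, data in self.replicas.items():
            if name in self.online and key in data:
                return data[key]
        raise KeyError(key)


store = ReplicatedStore(["disk-a", "disk-b"])
store.write("txn-1001", {"amount": 250})
store.take_offline("disk-a")    # simulate a storage failure
result = store.read("txn-1001")
print(result)  # {'amount': 250}
```

This sketch leaves out the hard part: a replica that misses writes while offline has to be resynchronized before it can safely serve reads again.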

Failover

Once the monitoring mechanism flags a problem in the primary system, failover technologies automatically bring the backup platform online. Recovery processes, such as rollback and checkpointing, ensure the system's state is current before the backup is activated.
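
The monitoring-then-failover flow can be sketched as a small controller that promotes the standby after a set number of consecutive failed health checks. The threshold of three and the node names are assumptions for illustration only, and the sketch omits the state recovery step.

```python
class FailoverController:
    """Promote the standby after several consecutive failed health checks."""

    def __init__(self, primary, standby, max_missed=3):
        self.active = primary
        self.standby = standby
        self.max_missed = max_missed
        self.missed = 0

    def record_health_check(self, healthy: bool) -> str:
        """Record one check of the active node; return the node now active."""
        if healthy:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed and self.standby is not None:
                # Fail over: the standby becomes active, no spare remains.
                self.active, self.standby = self.standby, None
                self.missed = 0
        return self.active


controller = FailoverController("fw-primary", "fw-standby", max_missed=3)
checks = [True, False, False, False]
history = [controller.record_health_check(ok) for ok in checks]
print(history)  # ['fw-primary', 'fw-primary', 'fw-primary', 'fw-standby']
```

Requiring several missed checks in a row is a common way to avoid failing over on a single dropped probe, at the cost of a slightly slower reaction to a real outage.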

NetAdmin Role in Fault Tolerance

The decision to implement fault tolerance will be driven by executives, and the design handled by system and network architects. Network administrators get involved when the design is implemented. They are responsible for setting up the parameters for load balancing, duplicate sites, gateways, and routers.

Net admins implement the active/passive arrangements on devices such as firewalls and multiple paths to standby ports, servers, and disk arrays. Net admins are also responsible for regularly backing up device configurations to support fast and accurate action to achieve recovery time objectives (RTO).

Wrapping Up

Nowadays, networks underpin almost every IT system and application that keeps organizations running. Certified network administrators are always in demand to help keep these networks running efficiently to meet business expectations.

One of the best ways to snag an entry-level position in this field is to earn CompTIA’s Network+, the leading vendor-neutral certification for net admins.

Begin your networking journey with entry-level CompTIA Network+ online training with CBT Nuggets. Our expert trainer, Keith Barker, will equip you with the knowledge and skills to confidently tackle the N10-008 certification exam.

Not a CBT Nuggets subscriber? Sign up today for a 7-day free trial of our CompTIA Network+ certification training.


© 2024 CBT Nuggets. All rights reserved.