Fault Tolerance
Everything you ever needed to know about fault tolerance
What is fault tolerance?
Fault tolerance is the ability of a system or piece of equipment to keep operating in the presence of a fault.
Systems and equipment with high fault tolerance, depending upon the adopted fault tolerance mechanism, are able to completely or partially sustain their operation upon the occurrence of a fault. For this to work in practice, such systems can’t have a single point of failure (SPOF).
The essence of fault-tolerant designs
The development of fault-tolerant design requires careful consideration of failures that can be manifested throughout the equipment life cycle, along with their probable causes and consequences.
However, the design engineers must also consider the cost and resource factors needed to achieve the required level of tolerance, reliability, and dependability of the equipment.
A common misconception is that a fault-tolerant design should provide complete tolerance to all types of faults. This is not true. A good design matches the degree of tolerance to the criticality of each fault, so that cost and resource use are optimized overall.
For example, it might not be cost-effective to spend money on product redesign, just to address a fault that has an extremely low chance of occurring.
Characteristics of fault-tolerant systems
To create a fault-tolerant system, efforts are required at every stage of the equipment life cycle. This includes but is not limited to the specification and design phase (incorporating fault detection controls in the design), validation and verification (V&V), maintenance and operation (using OEM-approved replacement parts and guidelines for routine maintenance), and even the disposal stage.
Each stage may adopt combinations of the techniques listed below to develop new designs or improve current ones to enhance their level of fault tolerance:
- Fault detection and display
- Fault diagnosis and containment
- Fault masking and compensation
1) Fault detection and display
Fault detection refers to the capability of the system or equipment to sense a fault and display it. It is the fundamental aspect of any fault-tolerant system, and all other aspects depend on how effective the detection process is. If the system is not designed to detect its faults, or detects them incorrectly, the remaining aspects will be ineffective as well.
For example, a simple air pressure sensor in a car tire pressure monitoring system (TPMS) can detect the air overfill and notify the driver via the car dashboard.
A representation of TPMS activation
In this case, detection and display are the only tolerance measures needed for this fault event. The driver can safely disengage the air hose before rupturing the tire.
If the pressure detection is inaccurate, the driver may disengage the hose too soon or too late and experience tire failure while driving. Since there is no automatic correction of air pressure, the tolerance aspect for this fault is restricted to just detection and display.
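As a rough illustration of detection and display, the sketch below compares a tire pressure reading against two limits and returns the message a dashboard might show. The limit values, the kPa unit, and the names `PressureLimits` and `detect_pressure_fault` are assumptions made for this example and do not come from any real TPMS.

```python
from dataclasses import dataclass


@dataclass
class PressureLimits:
    low_kpa: float   # below this, report under-inflation (assumed value)
    high_kpa: float  # above this, report over-inflation (assumed value)


def detect_pressure_fault(reading_kpa: float, limits: PressureLimits) -> str:
    """Return the message a dashboard could display for one reading."""
    if reading_kpa < limits.low_kpa:
        return "LOW PRESSURE"
    if reading_kpa > limits.high_kpa:
        return "HIGH PRESSURE"
    return "OK"


# Hypothetical limits for a passenger car tire.
limits = PressureLimits(low_kpa=180.0, high_kpa=260.0)
for reading in (170.0, 220.0, 275.0):
    print(reading, detect_pressure_fault(reading, limits))
```

Nothing here corrects the pressure. The code only detects and reports, which mirrors the tolerance level described for this fault.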
2) Fault diagnosis and containment
In more sophisticated systems, additional layers are often added in the product design stage. Their purpose is to diagnose and perform containment on top of detection and display. These additional layers are warranted due to the criticality of the system or because of various safety concerns.
For example, a Distributed Control System (DCS), a control system for process plants, not only monitors critical process parameters through a set of sensors but also runs a diagnosis to locate the fault and carries out the necessary containment.
A representation of the DCS system
For instance, in the case of overpressure of petroleum products in a vessel, the system is triggered by the relevant pressure sensors. It opens the pressure safety valve and vents the vapors to the flare stack.
In this example, the containment is carried out by diverting the high-pressure flammable vapor to the exhaust stack, protecting the system from fire or explosion.
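The diagnose-then-contain sequence can be sketched in a few lines. This is a simplified illustration and not how any particular DCS is programmed. The sensor names, the pressure limit, and the function `diagnose_and_contain` are hypothetical.

```python
def diagnose_and_contain(pressures_bar: dict, limit_bar: float) -> dict:
    """Locate over-pressure readings and decide on a containment action.

    pressures_bar maps a sensor location to its reading in bar.
    """
    # Diagnosis: which locations exceed the limit?
    fault_locations = sorted(
        name for name, pressure in pressures_bar.items() if pressure > limit_bar
    )
    # Containment: request the relief valve if any location is over the limit.
    return {
        "fault_locations": fault_locations,
        "open_relief_valve": bool(fault_locations),
    }


# Hypothetical readings from three vessels.
readings = {"vessel_A": 11.2, "vessel_B": 14.8, "vessel_C": 9.7}
result = diagnose_and_contain(readings, limit_bar=12.0)
print(result)
```

A real system would also handle sensor failures, alarms, and operator overrides. The sketch only shows how diagnosis (where is the fault?) feeds containment (what action isolates it?).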
3) Fault masking and compensation
Another effective approach to fault tolerance is to mask the faulty state. It is very effective for equipment that can be monitored and controlled through Internet of Things (IoT) technology.
With such equipment, one of the most significant challenges comes in the form of cybersecurity threats. These types of threats can attempt to induce the fault by altering the state of the equipment through the injection of false equipment data into the server.
With incorrect equipment state records, the very control and monitoring system originally intended to protect can instead cause the failure of the asset. Alternatively, it can be “tricked” into thinking the asset is in good condition when it is actually not – letting the deterioration lead to failure without triggering any alerts.
By incorporating fault-masking, the system is designed in a way that it can recognize and mask those incorrect values.
For example, in the electricity grids, the circuit breakers are often controlled and monitored through Supervisory Control and Data Acquisition (SCADA).
A representation of the SCADA system
Such a system closely monitors the voltage and frequency parameters of the electrical equipment and commands the circuit breakers to open or close to maintain power network stability.
An incoming cyberattack could alter the voltage and frequency limits on the equipment. Consequences? The system could cause a power breakdown instead of preventing it.
Fault masking is often carried out through algorithms that detect anomalous data streams and inject substitute data to mask the values that represent the faulty state of the equipment. This keeps the bad data from spreading the fault and further degrading the grid’s reliability.
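One very simple way to picture masking is a plausibility filter that replaces suspicious readings with the last trusted value. The sketch below is only one possible interpretation of the idea, not the algorithm used by any SCADA product. The function name `mask_anomalies`, the frequency limits, and the maximum step size are all assumptions.

```python
def mask_anomalies(readings, low, high, max_step):
    """Replace implausible readings with the last trusted value.

    A reading is trusted if it lies within [low, high] and does not jump
    by more than max_step from the previous trusted reading.
    """
    masked = []
    last_good = None
    for value in readings:
        in_range = low <= value <= high
        smooth = last_good is None or abs(value - last_good) <= max_step
        if in_range and smooth:
            last_good = value
            masked.append(value)
        elif last_good is not None:
            masked.append(last_good)  # mask the suspicious value
        else:
            masked.append(None)       # nothing trusted to substitute yet
    return masked


# Hypothetical grid frequency stream in Hz with one injected outlier.
stream = [50.0, 50.1, 62.0, 49.9]
print(mask_anomalies(stream, low=49.0, high=51.0, max_step=0.5))
```

A filter like this can also hide a genuine sudden change, so a practical design would raise an alert whenever it masks a value instead of discarding it silently.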
Improving fault tolerance through redundant designs
One of the simplest ways to increase fault tolerance is to incorporate redundancies in the design. Redundancy simply means the presence of an alternate system or solution that can take over the intended function should the primary system fail.
While redundancy improves fault tolerance, haphazardly adding systems should not be the objective, as the cost of adding any new system can significantly outweigh the attainable reliability benefit.
For physical equipment, redundancies can be broadly classified as either active or passive.
Active redundancies
Active redundancies can be established when multiple pieces of equipment are operated simultaneously. In this configuration, each piece of equipment contributes its share towards attaining the intended function while still acting as redundancy for each other.
A simple active redundancy is the parallel operation of two pumps at half of their rated capacities. Both pumps jointly operate to achieve the desired discharge pressure. If one pump fails, the other can be boosted to its rated capacity to attain the intended discharge pressure on its own. To make designs more economical, reliability engineers have come up with more elaborate ways to achieve active redundancy, such as K of N redundancies and graceful degradation.
In a K of N redundancy, the system contains N units and needs at least K of them working to perform its function. The remaining units stay on hot standby and can join the operation when some of the operating units fail. With a larger number of smaller pumps, such an arrangement can offer greater reliability than the simple parallel operation of two pumps.
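The reliability of a K of N arrangement can be estimated with the binomial formula, provided the units are identical and fail independently of each other. Both conditions are assumptions of this sketch and often do not hold exactly in practice. The function name and the example unit reliability of 0.9 are illustrative.

```python
from math import comb


def k_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n units work.

    Assumes identical units, each working with probability p,
    and independent failures.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Example with a hypothetical unit reliability of 0.9.
print(k_of_n_reliability(1, 2, 0.9))  # two units, one is enough
print(k_of_n_reliability(2, 3, 0.9))  # three units, two are needed
```

With these inputs, the 1 of 2 case works out to about 0.99 and the 2 of 3 case to about 0.972. The required K therefore matters as much as the number of installed units.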
Graceful degradation is an alternative to adding costly identical and parallel systems. It ensures that the features or functionality of the overall equipment degrade in proportion to the number of failed components. To achieve such scalable degradation, all possible failures within all components should be examined, and their impact on the overall system’s performance should be analyzed and documented.
Such techniques provide tolerance to partial failures and enable the system to continue its function at a degraded capacity.
Passive redundancies
Passive redundancy is the standby redundancy where the alternate equipment is present – but it can only take over the intended function upon failure of the primary equipment.
We can differentiate two types of passive redundancies:
- Operating passive redundancies
- Non-operating passive redundancies
Operating passive redundancies are the ones where the alternative equipment is present as a hot spare. The standby equipment is hot because it could be operating under no-load conditions. In some cases, it may be serving a function that is outside the definition of primary equipment’s function.
Upon failure of the primary equipment, the operating standby equipment can be automatically transitioned into performing the function of primary equipment.
An example of operating passive redundancies can be a secondary alternator that operates under no-load conditions and meets all other paralleling conditions such as the same terminal voltage, frequency, and phase sequence. Upon failure of the primary alternator, the secondary alternator can be automatically synchronized with the system and take over the load.
In the case of non-operating passive redundancies, the standby equipment is powered down. Upon failure of primary equipment, the standby equipment can be automatically or manually set to operating conditions and take over the functionality of primary equipment.
A good example of non-operating passive redundancy is a standby municipal water pump which can be started and operated manually to deliver water to residents if the primary water pump malfunctions. Since the restoration of operation is not critical, an operator can go and start the pump (and synchronize it with the system later, as needed).
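The difference between the two kinds of passive redundancy shows up in what happens at the moment of failover. The sketch below is a minimal illustration with hypothetical names (`Unit`, `select_active`). A hot spare is already running when it takes over, while a cold spare has to be started first.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    healthy: bool = True
    running: bool = False


def select_active(primary: Unit, standby: Unit) -> Unit:
    """Return the unit that should carry the load."""
    if primary.healthy:
        return primary
    if not standby.healthy:
        raise RuntimeError("primary and standby have both failed")
    if not standby.running:
        # Non-operating (cold) standby: it must be started before taking over.
        standby.running = True
    return standby


primary = Unit("primary pump", healthy=False, running=False)
cold_spare = Unit("standby pump", healthy=True, running=False)
active = select_active(primary, cold_spare)
print(active.name, active.running)
```

In a real installation the start step for a cold spare may be manual and take time, which is why this option suits functions where a short interruption is acceptable.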
Reliability techniques for analyzing fault tolerance
Fault tolerance is a part of reliability engineering efforts and requires careful examination of all possible failures that can happen within the equipment. Failure Mode and Effects Analysis (FMEA) and Fault Tree Analysis (FTA) are two well-known techniques for analyzing system design, using a bottom-up and a top-down approach respectively.
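To give a feel for the top-down approach, a small fault tree can be evaluated by combining basic event probabilities through AND and OR gates. The sketch assumes the basic events are independent, and the tree structure and probabilities are made up for illustration.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs (independent inputs)."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical tree: pumping is lost if both pumps fail OR the power supply fails.
both_pumps_fail = and_gate([0.1, 0.1])
loss_of_pumping = or_gate([both_pumps_fail, 0.01])
print(loss_of_pumping)
```

With these numbers the top event probability is about 0.0199. The power supply contributes as much as the pump pair, which flags it as a remaining single point of failure.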
To better understand tolerance, the sequence of failures and the dependencies between them must be analyzed. A particularly useful technique for this is the Markov model, in which the probability of moving to the next state (for example, from working to failed) depends only on the current state of the system.
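A minimal Markov model of a single repairable unit has two states, up and down. In the sketch below, the per-step failure and repair probabilities are assumed values, and the long-run share of time spent in each state is found by repeatedly applying the transition matrix.

```python
def step(dist, matrix):
    """Advance a state distribution by one time step."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def long_run_distribution(matrix, start, steps=1000):
    """Apply the transition matrix repeatedly to approximate the steady state."""
    dist = list(start)
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


f = 0.01  # assumed probability that a working unit fails in one step
r = 0.10  # assumed probability that a failed unit is repaired in one step

# Rows are the current state (up, down); columns are the next state.
transition = [
    [1 - f, f],
    [r, 1 - r],
]
up, down = long_run_distribution(transition, start=[1.0, 0.0])
print(up, down)
```

For a two-state chain like this, the long-run availability approaches r / (f + r), which is roughly 0.909 for the values used here.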
Another powerful technique is Monte Carlo simulation, which can be used to model how uncertainty in failure events affects system performance.
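As a small example, the simulation below estimates the chance that a set of parallel units survives a mission when the system needs only one working unit. It assumes exponentially distributed lifetimes and independent failures. The failure rate, mission time, and function name are illustrative.

```python
import math
import random


def simulate_parallel_survival(failure_rate, mission_time, n_units, trials, seed=0):
    """Estimate the probability that at least one unit outlives the mission."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    survived = 0
    for _ in range(trials):
        lifetimes = [rng.expovariate(failure_rate) for _ in range(n_units)]
        if max(lifetimes) > mission_time:
            survived += 1
    return survived / trials


# Hypothetical case: two pumps, mean life 1000 hours each, 500-hour mission.
estimate = simulate_parallel_survival(
    failure_rate=1 / 1000, mission_time=500, n_units=2, trials=20000
)
# Closed-form value for this simple case, used as a sanity check.
exact = 1 - (1 - math.exp(-0.5)) ** 2
print(estimate, exact)
```

The closed-form answer exists only because this case is simple. The appeal of simulation is that the same loop still works when lifetimes follow other distributions or when repair and switching delays are added.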
Fault tolerance and maintenance operations
Do fault-tolerant systems need less maintenance? Well, yes and no.
Because of redundancies and the other characteristics discussed earlier, such systems can usually absorb more faults before their functionality is compromised. However, if the issues aren’t addressed, the accumulation of faults will eventually lead to a system or equipment breakdown. Therefore, maintenance teams should use CMMS software to make sure corrective maintenance actions are taken in due time.
In some sense, fault tolerance gives maintenance and support teams more breathing room. They still need to deal with the problem, but maybe not right away.
While fault-tolerant designs bring challenges in the form of increased cost and complexity, they make up for them with improved equipment reliability.