Wouldn’t it be amazing if you had the power to look into the future and identify failures in your system before they happen? What a superpower that would be! Lucky for you, you don’t need superpowers. You have Fault Tree Analysis.
In this article, we’ll give you a look at the history behind Fault Tree Analysis and provide context around when to use it. Soon, you’ll have a solid understanding of the different types, symbols, and approaches, as well as helpful software solutions to set you up for success.
What is Fault Tree Analysis?
Fault Tree Analysis (FTA) is a tool to analyze the potential for system or machine failure by graphically and mathematically representing the system itself. It is a top-down approach that reverse-engineers the root causes of a potential failure through the root cause analysis process.
In other words, if you ask yourself, “how likely is it that this machine will break down,” Fault Tree Analysis will help you answer that question.
FTA replicates how failure moves through a system. It creates a graphical model of how component failures lead to system-wide failures. These models help reliability engineers create well-defined systems with the proper redundancies that prevent component failures from cascading into system-wide failures – in other words, create more fault-tolerant systems.
Even if the process sounds like rocket science, the terms used in FTA are pretty straightforward.
The analytical graphs used in FTA look like trees, so (unsurprisingly) they are called fault trees. The fault tree diagram will help you understand how one or more small failure events lead to a catastrophic failure. This will help you choose the right corrective and preventive measures in the future.
The history behind Fault Tree Analysis
In 1962, Bell Telephone Laboratories designed safeguards for the US Air Force's intercontinental ballistic missile (ICBM) system, called the Minuteman System. Safety was vital for such a complex and dangerous technology. To improve their reliability analysis, Bell Laboratories created the fault tree analysis method.
This new methodology added a graphical element that helped visualize the concepts of Failure Modes and Effects Analysis (FMEA), a different but closely related method of preventing failure. Later on, Boeing adopted FTA, making it a popular analysis method that is widely used today to analyze the failure potential of critical systems.
This rigorous analysis ensures that complex systems operate safely and reliably, keeping the planes flying, the cars driving, and the world around us running as efficiently as it should be.
When to use Fault Tree Analysis
Fault tree analysis can be done at the time of design of the system or during operation (to anticipate potential failures and take preventive actions). The goal is to boost the subsystems and components that are highly likely to fail or cause a major incident before it actually happens.
It can be implemented alone or as a complement to FMEA.
Who uses FTA and why?
In general, fault tree analysis helps prevent future failures and identify critical areas of concern for new workflows, products, and services. That is why various industries use FTA as a method for safety analysis and risk mitigation, including:
Aerospace, aeronautical, and defense operations
Power generation and system safety
Cybersecurity system analysis
Specialty chemical manufacturing
Healthcare and pharmaceuticals
Environmental study and disaster management
Notice a theme here? These are industries that could have significant impacts on people’s lives if something goes wrong. When a plane goes down, or a healthcare device doesn’t work as it should, the risk of lives lost or other tragic events is high. FTA is what those industries use to keep those high-risk activities safe.
Why Fault Tree Analysis is worth the effort
FTA can be a technical topic with a lot of math and problem-solving. But there are some outstanding benefits to getting to know it and putting it to work in your business. It:
Assists in analyzing, understanding, and improving your systems
Lets you address one fault at a time in a very systematic way
Runs an assessment of several systems and their relationship with one another
Focuses on the root cause of the failure, not just the repair
Prioritizes your repairs based on failure rates and issues that lead to catastrophic failures
Helps design and plan maintenance in line with the failure probability of each system
Takes human error into account
With all of these benefits, it just makes sense to bring FTA into your analysis toolbox. With it, you have the power to see the future and predict things.
Fault Tree Analysis symbols and structure
FTA is performed by building fault trees. Fault trees have a standard set of symbols and naming rules used across plants and industries.
The fault tree is a directed acyclic graph (DAG) (meaning, you will read it in one direction from start to finish) which shows the flow and relationship between a series of activities. The activities are categorized as either events or gates.
Events happen in a system or process and can cause or contribute to a failure, such as the breakdown of an individual component. We’ve described events that come up in fault trees below. Event symbols will have at most one input and one output.
Here’s a short description of the meaning of each event:
Top event (TE): This is the event at the top of the fault tree and is the subject of the analysis. It is often the catastrophic event that causes a system-wide outage. A rectangle represents the top event. It has an input but no output because it is the ultimate culmination or end of the series of events in the tree.
Basic events (BE): These are the root cause events that spread up the chain of the system to cause the top event. A BE is represented by a circle that does not have any input. It sits at the opposite end of the fault tree from the top event.
Intermediate events: These are the events caused by one or more other events. BEs cause intermediate events, which eventually cause TE. Intermediate events are represented by rectangles that have both an input and an output.
Transfer events: A transfer event can be created when a fault tree is too large to fit on a single page. This way, we can replace one big part of the fault tree with a single symbol and elaborate on what comes next on a separate diagram. Triangles represent transfer events. The transfer-out event will have a triangle with output to the right of the triangle. Transfer-in events will have input at the top of the triangle.
Undeveloped events: Sometimes, events happen that are not basic, but there is not enough information to develop a subtree. These events are marked as undeveloped events. Undeveloped events are represented by the diamond or rhombus symbol.
Conditional events: Conditional events are the ones that act as a condition for an INHIBIT gate which is mentioned later. An oval symbol represents conditional events.
House events: An external event that is normally expected to occur. These events can either happen or not happen, so they carry the probability of 1 or 0, respectively.
Gates, sometimes called logic gates, represent how failures spread through the system. Occasionally, a single event can result in a top-level event (i.e., catastrophic failure). Other times, a combination of two or more different events can cause the top event. This is where the concept of boolean logic comes in.
Gates represent the boolean logic operators (AND, OR, XOR, NOT, etc.) and show how events combine to cause failure. Each gate will have only one output event but can have one or more input events.
The most used gates in drawing fault trees are described below:
AND gate: This gate can have any number of input events. The output event it is connected to will only occur if all the input events happen. AND gate has a rounded top out of which comes the output, as shown in the image.
Priority AND gate: An output event will only occur if all input events happen in a specific sequence. It looks very similar to AND gates, just with an added line at the bottom.
OR gate: An output event will occur if any one or more of the input events occur. The symbol for the OR gate will have a pointed top end, where the output emerges. The other end is curved and is connected to the inputs, looking somewhat like a rocket.
XOR gate: An output will occur only if exactly one input event occurs. The symbol looks like a standard OR gate with a triangle drawn inside it.
k/N or VOTING gate: For this gate, there will be an ‘N’ number of input events and one output event. The output event will occur if at least ‘k’ of the ‘N’ input events occur. It looks similar to the OR gate with a ‘k/N’ written at the bottom end.
INHIBIT gate: Similar to AND gate, an output event will occur when input events occur, and a conditional event also occurs. The symbol for the INHIBIT gate is a hexagon. The input event is connected directly below the gate, and the conditional event is connected to the right of the gate. At the top is the output like in all other symbols.
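As a rough illustration of the gate logic described above, the gates can be sketched as small Python functions over boolean inputs, where True means the input event has occurred. The function names are hypothetical, and modelling the Priority AND gate with occurrence timestamps is an assumption made for this sketch rather than a standard convention.

```python
from typing import Optional, Sequence


def and_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs only if every input event occurs."""
    return all(inputs)


def or_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs if one or more input events occur."""
    return any(inputs)


def xor_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs only if exactly one input event occurs."""
    return sum(inputs) == 1


def voting_gate(k: int, inputs: Sequence[bool]) -> bool:
    """k/N gate: output occurs if at least k of the N inputs occur."""
    return sum(inputs) >= k


def inhibit_gate(input_event: bool, condition: bool) -> bool:
    """Output occurs if the input event occurs while the condition holds."""
    return input_event and condition


def priority_and_gate(occurrence_times: Sequence[Optional[float]]) -> bool:
    """Output occurs if all inputs occurred in left-to-right order.

    Assumption for this sketch: each input is given as the time it
    occurred, or None if it has not occurred.
    """
    if any(t is None for t in occurrence_times):
        return False
    return all(a < b for a, b in zip(occurrence_times, occurrence_times[1:]))
```

For instance, voting_gate(2, [True, False, True]) evaluates to True because two of the three inputs have occurred, while xor_gate([True, True]) evaluates to False.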
Types of Fault Tree Analysis
Standard Fault Tree Analysis isn’t the only method available. Other extensions of FTA have been developed for specific use cases and industries. The extensions would be capable of visualizing features that are not easily expressed by standard fault trees. Some of them are:
Dynamic FTA: Dynamic Fault Trees (DFT) extend standard fault trees by modeling complex system components’ behaviors and interactions.
Repairable FTA: Repairable Fault Trees (RFT) enhance the FTA model by introducing the possibility to describe complex dependent repairs of system components.
Extended FTA: Takes multi-state components and random probabilities into consideration.
Fuzzy FTA: Takes unreliable factors that are difficult to predict (like the wind or weather) into account with a complex mathematical concept called fuzzy set theory.
State-event FTA: State-event fault trees (SEFT) are used to analyze dynamic behavior that ordinary fault trees cannot model.
Generally speaking, FTAs fall into two categories: qualitative and quantitative.
Qualitative analysis is performed every time, while quantitative analysis can be done as an add-on in situations when you know the probabilities of the events in your fault tree. Let’s take a deeper look at each of them.
Qualitative FTA is used to gain insight into the structure of fault trees to analyze the vulnerabilities of a system. There are many different ways to conduct qualitative fault tree analysis, such as:
Minimal cut sets (MCS) help identify the vulnerabilities of a system. A cut set is a combination of basic events that, occurring together, cause the top event; it is minimal if removing any event from it means the top event no longer occurs. If a minimal cut set contains only a small number of components, or components with a high likelihood of failure, the system would be deemed unreliable. MCS analysis identifies these sets of elements in a fault tree. If you can reduce the probability of failure of some components or add redundancies, you will improve the system’s reliability.
Minimal path sets (MPS) will help you determine the robustness of a system. It tries to identify the minimum set of components that can keep the system functional. After those elements are identified, you can spend time working to lower the chance of them failing. This increases the overall reliability of the system.
Common cause failures (CCF) determine if multiple failures can be caused by a single element. The components identified through CCF are considered critical components. Your team needs to make sure these components are routinely inspected and replaced (as necessary). A computerized maintenance management system (CMMS) like Limble can plan and schedule maintenance of these critical components.
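To make minimal cut sets and minimal path sets more concrete, here is a minimal Python sketch for a fault tree built only from AND and OR gates. The nested-tuple tree format, the function names, and the example tree are all assumptions made for illustration. Path sets are obtained here by swapping AND and OR gates (the dual tree), which works for trees without NOT-style gates.

```python
from itertools import product


def cut_sets(node):
    """Return every cut set of a node as a set of frozensets."""
    if isinstance(node, str):  # a basic event is its own cut set
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any single child's cut set is enough to trigger the output.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unsupported gate: {gate}")


def minimal(sets):
    """Drop every set that strictly contains another set."""
    return {s for s in sets if not any(other < s for other in sets)}


def dual(node):
    """Swap AND and OR gates; cut sets of the dual are path sets."""
    if isinstance(node, str):
        return node
    gate, children = node
    return ("AND" if gate == "OR" else "OR", [dual(c) for c in children])


# Hypothetical tree: TOP = D OR (A AND B) OR (A AND B AND C)
tree = ("OR", ["D", ("AND", ["A", "B"]), ("AND", ["A", "B", "C"])])

mcs = minimal(cut_sets(tree))        # {D} and {A, B}
mps = minimal(cut_sets(dual(tree)))  # {A, D} and {B, D}
```

In this made-up tree, the cut set {A, B, C} is dropped because {A, B} alone already causes the top event. The path sets say the system keeps working as long as D and A both work, or D and B both work.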
Quantitative FTA can be used to calculate the actual probability of the failure you are analyzing. Assigning that numerical probability of failure will help you better understand and prioritize your risk.
The result of quantitative FTA can be in the form of stochastic or importance measures:
Stochastic measures give you the probability of failure for the system.
Importance measures indicate how important a cut set or path set is to the reliability of the whole system.
When you know the probability of your basic events, you can easily calculate the probabilities of your intermediate events based on the gates that connect them. The most common gates are AND gates and OR gates. Here’s a simple example.
An example of quantitative FTA method
Here, A, B, C, and D are basic events. E is an intermediate event and TE is the top event. The intermediate event E is connected to the basic events A, B, and C using an AND gate, so A, B, and C all have to fail for E to happen. The probabilities of failure for A, B, and C are known. Therefore, assuming the basic events are independent: P(E) = P(A) × P(B) × P(C).
The top event failure TE is reached by connecting E and D through an OR gate. E in itself is a failure event and the probability of occurrence of the basic event D is known, so under the same independence assumption: P(TE) = P(E) + P(D) − P(E) × P(D).
The probability of top event failures can be calculated like this using the quantitative FTA method.
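The arithmetic for this example can be sketched in a few lines of Python. The numeric probabilities below are made up purely for illustration, the helper names are hypothetical, and the basic events are assumed to be independent of each other.

```python
from math import prod


def and_probability(probabilities):
    """AND gate: all independent inputs must fail."""
    return prod(probabilities)


def or_probability(probabilities):
    """OR gate: at least one independent input fails."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical failure probabilities for the basic events.
p = {"A": 0.1, "B": 0.2, "C": 0.05, "D": 0.01}

# E = A AND B AND C
p_e = and_probability([p["A"], p["B"], p["C"]])   # roughly 0.001

# TE = E OR D
p_te = or_probability([p_e, p["D"]])              # roughly 0.011
```

With these numbers, the redundant branch E contributes very little to the top event; almost all of the risk comes from the single basic event D.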
Steps you can follow when conducting Fault Tree Analysis
We’ve mapped out the general steps you should take to complete your Fault Tree Analysis.
Step 1: Build a diverse team
When dealing with complex systems, you want different voices in the room.
Experienced professionals in the field will be able to reference past experiences from their professional life. They will also be aware of the technical aspects of the system that impact them the most. Other team members with less technical knowledge can contribute by pitching out-of-the-box ideas and other helpful information.
Brainstorming sessions and meetings need a leader, someone who has experience in conducting FTA. Engineers of respective fields, industrial engineers, and system design specialists are required for any FTA team.
Step 2: Identify failure causes
FTA works from the top down. Start with the top event, then try to identify the various failures that could cause or contribute to it. If you keep digging to build off of each event, it will eventually lead you to the root causes (now that’s what we call getting your hands dirty!). You will be left with a beautiful fault tree.
Potential failures, their characteristics, duration, and different impacts of the failure have to be defined to start and complete the process. Take fire doors in a high traffic area or factory as an example.
These doors are held open until the power fails or the fire alarm is triggered. If the fire alarm is faulty, there is an issue with the wiring, the backup batteries have run low, or someone has tampered with it, the alarm will trigger the doors to close when they are not supposed to. The result is a low-level failure, but one that can cause massive frustration and interrupt the entire organization.
Step 3: Understand the inner workings of the system
The team performing FTA needs to have a deep understanding of the inner workings of the system. The engineers working at the system level will have a good idea of how everything works and what failures you will want to avoid. Other team members can then raise questions that result in an expanded list of failure causes worth exploring.
Someone with knowledge and expertise of the system should be in charge of guiding the discussion. The goal is to get a good grasp of the system’s requirements, connections, and dependencies.
Your team should collect the schematics of the system, specifications of different components, and other available manufacturer information. If you’re using Limble CMMS, these asset specifications are available at the touch of a button. Studying these materials should build an understanding of how each sub-system and component are connected to each other.
Step 4: Draw the FTA diagram
Once the team understands the system’s inner workings, the next step is to graphically present a functional map of the system using boolean logic. Using the fault tree symbols and structure above, your team can draw the graphical representation of the system and how its components are connected.
Step 5: Identify MCS, MPS, or CCF
After the fault trees are complete, your team can identify MCS, MPS, or CCF based on what they want to accomplish.
MCS or minimal cut sets are identified to know the most vulnerable parts of the system.
MPS or minimal path sets are determined to identify the core components and subsystems required to remain operational.
CCF identifies the components that cause the maximum number of failures.
Your reason for performing FTA in the first place will determine whether the team needs to find MCS, MPS, CCF, or a combination of the three.
Optional step: Assess the probability of failure
More often than not, you’ll find multiple pathways that can lead to the same failure event. For an extensive system, it would be nearly impossible to address all failure causes at once.
To prioritize which events to address first, the team can calculate the probabilities of each failure for different critical sets. The critical set with the highest chance of failure should be given top priority.
This is an optional but valuable step. If you know the probability of each failure, it will be worth the time to use them!
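One way to sketch this prioritization in Python is to multiply the basic event probabilities inside each minimal cut set and sort the results. The cut sets and probabilities below are hypothetical, and the events are assumed to be independent.

```python
from math import prod

# Hypothetical basic event probabilities and minimal cut sets.
basic_event_probability = {"A": 0.1, "B": 0.2, "C": 0.05, "D": 0.01}
minimal_cut_sets = [{"A", "B"}, {"C"}, {"B", "D"}]


def cut_set_probability(cut_set, probabilities):
    """Probability that every event in the cut set occurs (independence assumed)."""
    return prod(probabilities[event] for event in cut_set)


# Most likely cut set first.
ranked = sorted(
    minimal_cut_sets,
    key=lambda cs: cut_set_probability(cs, basic_event_probability),
    reverse=True,
)

for cs in ranked:
    print(sorted(cs), cut_set_probability(cs, basic_event_probability))
```

With these made-up numbers, the single-event cut set {C} comes first, followed by {A, B} and then {B, D}, so C would be the first candidate for mitigation.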
Step 6: Develop risk mitigation strategies
Now it is time to use your Fault Tree Analysis to minimize your risk of failure.
High priority has to be given to protect MPS (the minimum set of components to keep the system operational).
Strict maintenance schedules have to be maintained for CCFs as they can cause a multitude of issues.
A CMMS like Limble can help you ensure adherence to required maintenance schedules. This includes following the best practices for spare parts management, so the maintenance team always has replacement components in stock. This effort has to be put in to minimize the probability of failure.
Fault Tree Analysis examples
Here are two different examples of Fault Tree Analysis to help paint the picture of how the process works.
The car won’t start
FTA example for a car that will not start
*The explanation we give below doesn’t directly match the FTA shown above. We wanted to give a more practical explanation than “remove your foot from the brake” to start the car 🙂
You wake up one morning and get ready for work. You hop into your car, turn the key, and — nothing. Your car won’t start. It’s not even turning over.
Knowing a thing or two about cars, you hop out, pop the hood and check the battery. Next, you check the gas gauge to make sure you are not out of gas before getting back into the car to ensure that the lights were not left on overnight.
In this example, the car not starting is the failure or Top Event (TE). The three options as to why the car won’t start are all connected by an OR gate, meaning any one or a combination of the three could cause the vehicle not to start.
Taking it one step further, when you check the battery, you have a few things that could cause the failure. The battery is old and needs to be replaced, or the battery is flat and needs a jump. The next question to ask would be why the battery is flat. If the headlights were left on, your next task is to determine how to avoid that in the future: make sure to check them before getting out of the car.
Suppose you want to calculate the probability of failure. In that case, you need to assign a number representing the probability of occurrence to the events and then use the quantitative FTA method to calculate the top event failure.
Server experiences a catastrophic failure
This example is more technical than the last one. Let’s say you have a server that stores critical data, and it experiences a catastrophic failure.
Fault Tree Analysis example for a server failure
Here are quick explanations for certain elements:
B is a non-redundant system bus.
PS is the power supply to the server.
C1 and C2 are two redundant central processing units (CPUs) for the server, meaning one of the two CPUs can fail without causing total system failure.
M1, M2, and M3 are memory components that can be shared between both CPUs.
This fault tree maps out the path, cut sets, and probabilities of the top event (system failure) happening.
Failure spreads from the basic events to the top event through the gates G1 – G6. Gate G1 is an INHIBIT gate with the condition that the system failure will happen only when the system is in use (U). This means that faults can be repaired during scheduled downtime allocated for maintenance. Gate G2 fails on either basic event B or the subsystem failure propagated up through G3. Gate G3 fails only when both the CPU subsystems (with C1 and C2) fail.
Each CPU subsystem consists of the power supply (PS), CPU (C1 or C2), and memory component propagated through G6. Each CPU subsystem will fail if either the power supply, CPUs, or the memory component fails. Failure at a level above will happen only if both the CPU subsystems fail. G6 is a voting gate, and for failure to propagate, at least two of the three memory components must fail.
The boolean expressions for the system are as below (∩ stands for the boolean AND, or intersection, where both events must occur; ∪ stands for the boolean OR, or union, where either event is enough):
G1 = U ∩ G2
G2 = B ∪ G3
Combining the two gets us:
G1 = U ∩ (B ∪ G3)
G1 = (U ∩ B) ∪ (U ∩ G3)
You can continue in this manner until all the intermediate events are eliminated, and only basic events remain to get you to the minimal cut sets. This is the top-down approach.
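The same top-down expansion can be sketched in Python for the server example. This is only an illustration under stated assumptions: the nested-tuple format and function names are hypothetical, G4 and G5 are assumed to be the two CPU subsystem gates (the text only names G1 to G6), G2 is read as an OR gate because it fails on either B or G3, and the INHIBIT gate G1 is treated as an AND with the in-use condition U.

```python
from itertools import combinations, product


def expand_voting(k, children):
    """Rewrite a k/N gate as an OR over every AND of k inputs."""
    return ("OR", [("AND", list(combo)) for combo in combinations(children, k)])


def cut_sets(node):
    """Return every cut set of a node as a set of frozensets."""
    if isinstance(node, str):  # basic event
        return {frozenset([node])}
    if node[0] == "VOTE":
        _, k, children = node
        return cut_sets(expand_voting(k, children))
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        return set().union(*child_sets)
    if gate == "AND":
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unsupported gate: {gate}")


def minimal(sets):
    """Drop every set that strictly contains another set."""
    return {s for s in sets if not any(other < s for other in sets)}


G6 = ("VOTE", 2, ["M1", "M2", "M3"])  # at least 2 of 3 memories fail
G4 = ("OR", ["PS", "C1", G6])         # CPU subsystem 1 (assumed gate name)
G5 = ("OR", ["PS", "C2", G6])         # CPU subsystem 2 (assumed gate name)
G3 = ("AND", [G4, G5])                # both CPU subsystems fail
G2 = ("OR", ["B", G3])                # bus fails or both subsystems fail
G1 = ("AND", ["U", G2])               # INHIBIT modelled as AND with U

server_mcs = minimal(cut_sets(G1))
for cs in sorted(server_mcs, key=lambda s: (len(s), sorted(s))):
    print(sorted(cs))
```

Under these assumptions the sketch yields six minimal cut sets: {U, B}, {U, PS}, {U, C1, C2}, {U, M1, M2}, {U, M1, M3}, and {U, M2, M3}. The bus and the power supply stand out as single points of failure whenever the system is in use.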
Since the probabilities of the basic events are not stated, you can’t perform a quantitative analysis.
Fault Tree Analysis compared to other analytical methods
FTA is not the only analytical methodology out there. Let’s take a look at a few others to see how they compare.
While FTA uses a top-down method to assess points of failure, Failure Modes and Effect Analysis or FMEA uses a bottom-up approach. It questions what could go wrong at each step that may cause failure instead of looking at the failure first.
Also, FMEA does not look at the relationship between different events or conditional events the way FTA does. Therefore, FTA is a more complex but thorough analysis.
Failure mode effects and criticality analysis (FMECA) is easy to grasp. It is like FMEA, but it adds a criticality analysis or ranked list. Where FMEA looks at a long list of “what-ifs,” FMECA allows you to rank failures so you can better plan and prioritize your work.
Event tree analysis focuses on specific questions and answering them in a very straightforward way. Moreover, it doesn’t have the general use that fault tree analysis does. It is generally used in financial industries.
Streamlining the process with FTA software
Fault trees for large and complex systems can quickly become so large that they can’t be drawn on a single page or a whiteboard. You can work around this by using tried and true transfer elements. However, even with them, the diagram can become too large to handle, read, and comprehend. Fault tree analysis software is an excellent solution for this type of problem.
In addition to simplifying graphical representation, some applications have algorithms that can automatically identify qualitative aspects of FTA like MCS, MPS, and CCF. If you know your probability of failure for your basic events, the probabilities for top events and subsystem failures can be calculated with the click of a button.
There are many FTA software solutions out there, with features suited to different uses. Shop around to find the right product for you based on your specific purpose and industry.
As you can tell, a lot of research and expertise has gone into developing the Fault Tree Analysis process. If you would like to dive deeper into this subject, check out these additional resources:
FTA lecture on YouTube by the Department of Industrial and Systems Engineering at IIT Kharagpur
Another FTA lecture on YouTube by xSeriCon, an engineering consultancy and safety training firm.
Wrapping it up
Fault Tree Analysis can certainly be complex. If you get the right team together and practice it enough, you’ll start to feel that you can look into the future and anticipate failures and their causes. You’ll be the wizard that plans fault repair into scheduled maintenance downtime and keeps your team working proactively more than they work reactively.
At Limble, we’re here to support you every step of the way. Our CMMS will house all of the information you and your team need to effectively build FTAs, manage activities to mitigate risk, and so much more. It’s our mission to make your job as easy and streamlined as possible. Reach out to us with questions or to see how our CMMS can support you.