How to do a root cause analysis
Root Cause Analysis can be accomplished using many different tools and techniques. Even though those processes may look different, they all arrive at the same end goal: fixing the root cause of an issue.
To do a root cause analysis the right way, you should follow four basic steps.
Step 1: Define the problem
Start with the obvious: What is the problem? By defining the problem, the symptoms, and what you can see happening, you set the scope and direction of the analysis.
Without a specific problem statement, it’s hard to create a path to a solution. A well-defined problem statement also helps determine the scale and scope of the potential solution to be implemented. When you’re writing your problem statement, keep these three pieces in mind:
- How would you describe the problem at hand?
- What do you see happening?
- What are the specific symptoms?
Step 2: Collect the data
Collect all available data related to the incident. Ask yourself, “What proof is there? How long has this problem existed? What is the impact of the problem?” Be sure to record any other data you think might help you determine the issue.
Take, for example, machine failure in a manufacturing plant. These are examples of types of information you’ll want to document.
- The age of the machine
- Time of continuous operation
- Operating patterns
- Maintenance schedule
- Operators handling the machine
- Specifications of the machine
- Schematic of the plant infrastructure
- Distinguishing characteristics of the machine
- Characteristics of the operating environment
Inspecting the machine in person also provides information that could be beneficial for root cause analysis. Facilities that run predictive maintenance will find it easier to collate this data quickly.
Step 3: Map out the events
Establish a timeline of events. This will help you determine which factors among the data collected are worth investigating. RCA needs data points that potentially lead to the root cause. Putting events and data in chronological order helps to differentiate causal events from non-causal events.
From the data collected, you can identify correlations between various events, their timing, and other data collected. Remember that correlation does not mean causation.
Questions to ask yourself when looking for correlations:
- What sequence of events allowed this to happen?
- What conditions were present that allowed this to happen?
- What other problems surround the occurrence of the main problem?
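The timeline step can be sketched in a few lines of Python. This is only an illustration: the event records, timestamps, and the helper names `build_timeline` and `events_before` are hypothetical, not part of any particular tool. The sketch sorts events chronologically and keeps only those that happened before the failure, since later events cannot have caused it.

```python
from datetime import datetime

# Hypothetical incident records: (ISO timestamp, description).
events = [
    ("2024-03-04T14:10", "Machine stops with overheating alarm"),
    ("2024-03-01T08:00", "Scheduled lubrication skipped"),
    ("2024-03-04T13:40", "Operator reports unusual vibration"),
    ("2024-03-03T22:15", "Bearing temperature rises above normal range"),
]

def build_timeline(records):
    """Return the records sorted from earliest to latest."""
    return sorted(records, key=lambda r: datetime.fromisoformat(r[0]))

def events_before(records, failure_time):
    """Keep only events that happened before the failure.

    Events after the failure cannot have caused it, so they are dropped.
    """
    cutoff = datetime.fromisoformat(failure_time)
    return [
        r for r in build_timeline(records)
        if datetime.fromisoformat(r[0]) < cutoff
    ]

timeline = build_timeline(events)
for when, what in timeline:
    print(when, "-", what)

# Candidate causal events for the failure logged at 14:10 on March 4.
candidates = events_before(events, "2024-03-04T14:10")
```

Ordering alone does not prove causation. It only narrows the list of events worth investigating further.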
The next step is to map out a causal graph. These graphs are used to represent the relationship between events that happened and the data collected.
But it’s important to not stop investigating when you find a correlation between events. Correlation means there is a link between two events, but it doesn’t automatically mean that one event caused the other. That’s why it’s essential to continue your sleuthing until you find a causal relationship. Find out what event caused another event. This will help you find the actual root cause.
From the data collected, chronological sequencing, and clustering, you should be able to create a causal graph (or use one of the root cause analysis tools we discuss later). You can use this graph to represent the relationship between the various events that occurred and the data collected. Each path through the graph is given a probability weight, so the graph can serve as a visual tool to track down the root cause.
Example of a causal graph. Source: Adam Kelleher on Medium
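To make the idea of weighted paths concrete, here is a minimal Python sketch. It assumes the graph has no cycles and that your team has estimated a probability for each cause-and-effect link. The graph contents, the weights, and the function name `most_likely_path` are all hypothetical. Multiplying the weights along a path is a simplification that treats the estimates as independent.

```python
# Hypothetical causal graph: each effect maps to its candidate causes,
# with an estimated probability that the cause explains the effect.
causal_graph = {
    "machine stopped": [("motor overheated", 0.7), ("power dip", 0.3)],
    "motor overheated": [("bearing worn", 0.8), ("blocked air vent", 0.2)],
    "bearing worn": [("lubrication skipped", 0.9), ("defective part", 0.1)],
}

def most_likely_path(graph, node):
    """Return (probability, path) for the most probable chain of causes.

    The path starts at `node` and ends at a root cause, meaning a node
    with no listed causes. Assumes the graph has no cycles.
    """
    causes = graph.get(node, [])
    if not causes:
        return 1.0, [node]
    best_prob, best_path = 0.0, [node]
    for cause, weight in causes:
        prob, path = most_likely_path(graph, cause)
        if weight * prob > best_prob:
            best_prob, best_path = weight * prob, [node] + path
    return best_prob, best_path

prob, path = most_likely_path(causal_graph, "machine stopped")
print(" <- ".join(path), f"(estimated probability {prob:.2f})")
```

The highest-weighted path is only the first one to investigate. It still has to be confirmed with evidence before you treat its end point as the root cause.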
Step 4: Solve the root of the problem
Once you’ve identified the root cause, you can quickly determine the best solution to fix it. You can then map it against the scope defined in your initial problem statement. If the solution works with your available resources, it can be implemented.
Fixing the root cause should eliminate the issues. If the symptoms occur again, it’s time to return to the drawing board and conduct RCA again.
Once the problem is solved, you will need to take proactive steps to ensure it doesn’t happen again. There can be multiple solutions applied to solve a single issue.
For example, the root cause could be the wear of a bearing, which happened much earlier than expected. In this case, the maintenance procedure has to be adjusted so the bearing is replaced earlier. Other ways to prevent the fault from recurring include changing the maintenance schedule, switching to a different mode of maintenance, changing the design, or choosing a different OEM vendor.
The implemented solution will have to be in line with the available resources. So, if the root cause is pushing the machine too hard, the obvious answer is to shorten the machine run time. However, if the production schedule doesn’t allow for shortened runtimes, another solution might be scheduling more preventive maintenance.
Tried-and-true RCA tools and techniques
There are many tried and trusted frameworks available to execute RCA. None of these methods are foolproof, but they provide a solid base for how to go about root problem investigation. Each method has its own list of benefits and shortfalls. Some methods are more suitable for different industries and types of problems.
You and your company should have your own unique protocol when conducting RCA. In some instances, external consultants might be brought in to conduct RCA. In such cases, the consultants will generally have their own preferred technique or a combination of techniques they use. This is one of the reasons why it is hard to create a universal template for RCA that everyone can follow.
Let’s look at the different forms of root cause analyses.
The 3 Rs of Root Cause Analysis
No doubt you’ve heard these 3 Rs: “reduce, reuse, recycle” or maybe even “reading, writing, arithmetic.” But RCA also has its own system of 3 Rs: Recognize, Rectify, Replicate.
Recognize
The actual cause of a problem is not always apparent, and simple cosmetic fixes usually don’t do much to correct the underlying fault. Even though RCA can be an elaborate time-consuming exercise, we do it to pinpoint the actual cause so we can take corrective actions that will eliminate future issues. As mentioned earlier, RCA can also be done to identify the reason for an unexpected positive outcome.
This first step is when you notice something’s not working quite right. The machine is leaking fluid, making a weird sound, or not running as productively as it usually does. This is when it’s time to put on your detective cap and find out what’s going on.
Rectify
Once you’ve recognized the root cause, it’s time to start a corrective course of action. If the root cause is addressed, the same problem should not be cropping up again. If the same problem reappears, it’s likely because the cause you identified was not actually the root cause.
In this case, you might have to go through the RCA process again to make sure that you get to the actual root cause.
For example, you notice the machine is leaking fluid, so you patch the hole in the metal. If you stop seeing fluid on the ground under the machine, you’ve solved the problem, and you’ve taken care of the root issue. But if a leak crops up again in a week, it’s time to run another RCA to find out if there are other holes in the metal or if gaskets are failing.
Replicate
Once you’ve identified and rectified the root cause, your next step is to ensure it will not happen again at any point during the process or system. Sometimes you’ll want to do an RCA to get to the bottom of an unexpectedly good outcome. In that case, you will test whether the same factors can be replicated in other scenarios and environments.
Suppose faulty parts were coming off the line, and you’ve since applied a fix. The next step is to recreate the conditions that produced the faulty parts and check whether the problem returns.
If the parts come out fine under those same conditions, you can be more confident that you got to the bottom of the issue.
5 Why analysis
5 Whys is the original technique developed by Sakichi Toyoda for root cause analysis at Toyota factories. The idea is to question everything with a ‘why’, just like a curious child. Keep asking ‘why’ until asking again yields no further answer. At that point, you should have reached the root cause of the problem.
As a rule of thumb, asking and answering five successive ‘whys’ should be enough to reveal the root cause of most problems. Hence the name ‘5 Whys’ analysis.
Benefits of the 5 Whys:
- Helps identify the root cause of a problem
- Offers an understanding of how one process can cause a chain of problems
- Helps determine the relationship between different root causes
- Highly effective without complicated evaluation techniques
When to use the 5 Whys:
- For simple to moderately complex problems
- More complex issues may need this method in conjunction with another
- Any time human error is involved in the issue
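The questioning chain can be sketched as a simple loop. This Python example is only an illustration: the answers are hypothetical, and in practice they come from your team’s investigation rather than a lookup table. The function name `five_whys` and its `max_depth` parameter are assumptions made for this sketch.

```python
# Hypothetical answers gathered by the team:
# each statement maps to the answer to "Why did this happen?"
why_answers = {
    "Machine stopped": "The motor overheated",
    "The motor overheated": "The bearing seized",
    "The bearing seized": "It was not lubricated",
    "It was not lubricated": "The lubrication task was missing from the PM schedule",
}

def five_whys(problem, answers, max_depth=5):
    """Follow the chain of 'why' answers.

    Stops when no further answer exists or after max_depth whys.
    Returns the full chain, starting with the problem itself.
    """
    chain = [problem]
    while len(chain) <= max_depth and chain[-1] in answers:
        chain.append(answers[chain[-1]])
    return chain

chain = five_whys("Machine stopped", why_answers)
for depth, statement in enumerate(chain):
    label = "Problem:" if depth == 0 else f"Why #{depth}:"
    print(label, statement)

# The last statement in the chain is the candidate root cause.
root_cause = chain[-1]
```

Here the chain ends after four whys, which shows that five is a rule of thumb rather than a fixed requirement.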
Fishbone diagram (a.k.a. Ishikawa diagram)
The Ishikawa method for root cause analysis emerged from quality control techniques employed in the Japanese shipbuilding industry by Kaoru Ishikawa. The shape of the resulting diagram looks like a fishbone, which is why it is called a fishbone diagram. This diagram is built on the idea that multiple factors can lead to a failure/event/effect.
The 5 M framework (shown above) from the Toyota Production System uses RCA with the Ishikawa method. The 5 Ms are:
- Man/mind power
- Machines
- Measurement
- Methods
- Material
The problem or fault is written down at the far right end, where the fish head would be. The cause of the problem is represented along the horizontal line. Further effects and their respective causes are written down along the fish bones representing each of the 5 Ms. This process continues until the team is convinced that the root cause is identified.
Benefits of the fishbone diagram:
- A good way to brainstorm within a defined structure
- Helps to visually diagram a problem or condition’s root cause
- Helps to show bottlenecks in the process
- Helps to find ways to improve the process
When to use a fishbone diagram:
- To analyze a complex problem with many causes
- When you need a different view of the issue
- To identify root causes
- To identify bottlenecks and identify issues where a process doesn’t work
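If you want to capture a fishbone diagram as data before drawing it, a nested dictionary is one simple option. The sketch below uses the 5 Ms as categories. The effect, the listed causes, and the helper names `render_fishbone` and `unexplored_categories` are hypothetical.

```python
# Categories follow the 5 Ms; the effect and causes are hypothetical.
fishbone = {
    "effect": "Parts out of tolerance",
    "bones": {
        "Man/mind power": ["New operator not trained on setup"],
        "Machines": ["Spindle bearing worn", "Coolant pump weak"],
        "Measurement": ["Caliper out of calibration"],
        "Methods": [],
        "Material": ["Bar stock hardness varies by batch"],
    },
}

def render_fishbone(diagram):
    """Return a plain-text outline: the effect, then each bone and its causes."""
    lines = [f"Effect: {diagram['effect']}"]
    for category, causes in diagram["bones"].items():
        lines.append(f"  {category}")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)

def unexplored_categories(diagram):
    """List categories with no causes yet, as a prompt for more brainstorming."""
    return [cat for cat, causes in diagram["bones"].items() if not causes]

print(render_fishbone(fishbone))
print("Still to brainstorm:", unexplored_categories(fishbone))
```

Listing the empty categories is a small design choice that supports the structured brainstorming the diagram is meant for: an empty bone is a reminder that the team has not looked there yet.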
Failure mode and effects analysis (FMEA)
FMEA is a proactive approach to root cause analysis that aims to prevent potential failures of a machine or system. It is a combination of reliability engineering, safety engineering, and quality control efforts. It tries to predict future failures and defects by analyzing past data.
A diverse cross-functional team is essential when using FMEA. You will need to clearly define and communicate the scope of the analysis to your team members. Each subsystem, design, and process is closely reviewed. The purpose, need, and function of each system are questioned. Potential failure modes are brainstormed. Failure of similar processes and products in the past can also be analyzed.
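The potential effects and disruptions that could be caused by each of the identified failure modes are assessed and used to calculate its risk priority number (RPN). The RPN is conventionally the product of three ratings: the severity of the effect, how likely the failure is to occur, and how hard it is to detect.
If the failure mode has a higher RPN than a company is comfortable with, you can address this by changing one or more of the factors outlined in the image above, for example by making the failure less likely to occur or easier to detect.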
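Here is a small worked example of the calculation in Python. It assumes the conventional definition of the risk priority number (RPN) as severity × occurrence × detection, with each factor rated from 1 to 10. The failure modes, their ratings, and the threshold of 100 are hypothetical. Each company sets its own scales and limits.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (hazardous)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (almost always detected) to 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")
        return self.severity * self.occurrence * self.detection

# Hypothetical failure modes for a conveyor drive.
modes = [
    FailureMode("Belt slips", severity=4, occurrence=6, detection=3),
    FailureMode("Motor bearing seizes", severity=8, occurrence=3, detection=7),
    FailureMode("Guard sensor fails", severity=9, occurrence=2, detection=5),
]

RPN_THRESHOLD = 100  # hypothetical limit; each company sets its own

# Review the riskiest failure modes first.
for mode in sorted(modes, key=lambda m: m.rpn, reverse=True):
    flag = "ACTION NEEDED" if mode.rpn > RPN_THRESHOLD else "acceptable"
    print(f"{mode.name}: RPN {mode.rpn} ({flag})")
```

In this example, the bearing failure scores 8 × 3 × 7 = 168. Adding a vibration sensor that lowers the detection rating from 7 to 2 would bring it down to 48 without changing the other two factors.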
Benefits of FMEA:
- Enables early identification of a failure point
- Captures the collective knowledge of a team
- Improves the quality, reliability, and safety of the process
- A logical, structured approach for identifying process areas of concern
- Reduces process development time and cost
- Documents and tracks risk reduction activities
When to use the FMEA methodologies:
- When designing a new product, process, or service (DFMEA)
- When you’re going to update a current way of doing things
- When you have a plan for quality improvement
- When you need to understand the failures in a process and improve upon them (PFMEA)
Fault tree analysis (FTA)
Fault tree analysis is a method for root cause analysis that uses Boolean logic (AND, OR, and NOT) to figure out the cause of a failure. It was developed at Bell Laboratories to evaluate an intercontinental ballistic missile (ICBM) launch control system for the U.S. Air Force.
Fault tree analysis example. Source: Six Sigma Study Guide
Fault tree analysis tries to map the logical relationships between faults and the subsystems of a machine. The fault you are analyzing is placed at the top of the chart. If two causes have a logical OR combination causing effect, they are combined with a logical OR operator. For example, if a machine can fail while in operation or while under maintenance, it is a logical OR relationship.
If two causes need to occur simultaneously for the fault to happen, it is represented with logical AND. For example, if a machine only fails when the operator pushes the wrong button AND relay fails to activate, it is a logical AND relationship. It is represented using the boolean AND symbol. In the image above, AND is the blue symbol, and OR is the purple symbol.
The fault tree created for a failure is analyzed for possible improvements and risk management. This is an effective tool to conduct RCA for automated machines and systems.
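The AND/OR logic described above can be sketched as a small recursive evaluator in Python. The representation is an assumption made for this example: a node is either the name of a basic event or a tuple starting with a gate name. The tree combines the two examples from the text, but its layout and the `evaluate` function are hypothetical.

```python
def evaluate(gate, basic_events):
    """Evaluate a fault tree node and return True if the fault occurs.

    A node is either the name of a basic event (a string) or a tuple:
    ("AND", child, child, ...), ("OR", child, child, ...), or ("NOT", child).
    """
    if isinstance(gate, str):
        return basic_events[gate]
    op, *children = gate
    results = [evaluate(child, basic_events) for child in children]
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    if op == "NOT":
        return not results[0]
    raise ValueError(f"unknown gate: {op}")

# Top event (hypothetical layout): the machine fails if the operator pushes
# the wrong button AND the relay fails to activate, OR if a fault occurs
# during maintenance.
machine_fails = (
    "OR",
    ("AND", "wrong button pushed", "relay fails to activate"),
    "fault during maintenance",
)

state = {
    "wrong button pushed": True,
    "relay fails to activate": False,
    "fault during maintenance": False,
}
print(evaluate(machine_fails, state))  # False: the relay still worked

state["relay fails to activate"] = True
print(evaluate(machine_fails, state))  # True: both AND inputs are present
```

Trying different combinations of basic events this way shows which ones are enough to trigger the top fault, and those are the combinations worth protecting against first.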
Benefits of using a fault tree analysis:
- Uses deduction to find the causes of each event, like the 5 Whys
- Highlights the critical elements related to system failure
- Creates a visual representation for analysis
- Can focus on one area of failure at a time
- Exposes system behavior and possible interactions
- Accounts for human error
- Promotes effective communication
When to use a fault tree analysis:
- When the effect of a failure is known — to find out how it might be caused by a combination of other factors
- When designing a solution — to identify ways it may fail in order to make the solution more robust
- To identify risks in a system
- To find failures that can cause the failure of all parts of a “fault-tolerant system”
Pareto charts
A Pareto chart indicates the frequency of defects and their cumulative effects. Italian economist Vilfredo Pareto recognized a common theme in almost all the frequency distributions he observed: a vast imbalance between causes and the effects they produce.
He proposed that in any system, 80% of the results (or failures) are caused by 20% of all potential causes.
The principle is dubbed the Pareto principle (some know it as the 80-20 rule). This skew between cause and effect is evident in many different distributions, from wealth distribution among people to failures in a machine.
Pareto chart for shirt defects. Source: Tulip.co
With the 80-20 principle in mind, you can use Pareto analysis to dig into failures and possible causes. To start, draw a bar graph that includes the frequency of faults and causes. With this graph, it’s easier to see the skew between causes and failures. Usually, you’ll see how a small percentage of factors cause the majority of faults.
Next, you’ll analyze the causes that contribute to the largest number of faults and take corrective action to eliminate the most common defects.
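The numbers behind a Pareto chart are easy to compute before you draw anything. The Python sketch below counts each defect type, sorts from most to least frequent, and adds the cumulative percentage. The defect log and the helper names `pareto_table` and `vital_few` are hypothetical, and the 80% cutoff is a rule of thumb rather than a fixed threshold.

```python
from collections import Counter

# Hypothetical defect log for a shirt production line.
defect_log = (
    ["loose thread"] * 42 + ["missing button"] * 28 + ["stain"] * 12
    + ["crooked seam"] * 8 + ["wrong size label"] * 6 + ["torn fabric"] * 4
)

def pareto_table(observations):
    """Return rows of (cause, count, cumulative percent), most frequent first."""
    counts = Counter(observations).most_common()
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for cause, count in counts:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows

def vital_few(rows, cutoff=80.0):
    """Return the top causes whose cumulative share first reaches the cutoff."""
    selected = []
    for cause, _, cumulative in rows:
        selected.append(cause)
        if cumulative >= cutoff:
            break
    return selected

rows = pareto_table(defect_log)
for cause, count, cumulative in rows:
    print(f"{cause:<18}{count:>4}{cumulative:>7.1f}%")
print("Focus on:", vital_few(rows))
```

With this sample data, three of the six defect types account for 82% of all defects, so those three are the ones to investigate first.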
Benefits of using pareto charts:
- Defects are ranked by how often they occur (or by their cost), with the most significant handled first
- Can determine the cumulative impact of the defect
- Offers a better explanation of defects that need to be resolved first
When to use a pareto chart:
- To analyze problems or causes in a process that involves the frequency of occurrence, time, or cost
- To narrow down a list of problems to find the most significant
- To analyze a problem with a broad list of causes to identify specific components
Pareto charts work great for deciding where to start with root cause analysis. According to the Pareto principle, eliminating the 20% most common failure causes can reduce the overall number of malfunctions by 80%. Pareto charts will indicate the top failure causes to be further investigated and addressed, according to the criticality of the machine, the impact of failure of a specific part, or a combination of the two.
Honorable mentions
Root cause analysis is very open-ended, and many tools are widely used across various industries. We covered the major ones in the sections above, but these systems also deserve some recognition. A few honorable mentions:
- Cause and effect diagrams. The Fishbone diagram is an example of cause and effect diagrams. Many similar tools try to map the relationship between causes and effects in a system.
- Kaizen is another tool from the stable of Japanese process improvements. It is a continuous process improvement method. Root cause analysis is embedded within the structure of Kaizen.
- Barrier analysis is an RCA technique commonly used for safety incidents. It is based on the idea that a barrier between personnel and potential hazards can prevent most safety incidents.
- Change analysis is used when a potential incident occurs due to a single element or factor change.
- A scatter diagram is a statistical tool that plots the relationship between two variables in a two-dimensional chart. It can also be used as an RCA tool.
CMMS to save the day
If you’re feeling overwhelmed by all the different methods, metrics, and charts, not to worry, we’ve got your back. A computerized maintenance management system, or CMMS, can help you easily create, record, and track data used in root cause analysis.
With Limble, you can also create your own “5 Why’s” template and save it for use in the future. This makes it easy for anyone to quickly start a 5 Why RCA, repeating the same steps for consistent results.
To create your own 5-Whys template in our CMMS, you create a work order template, your space to record what happened. Below that, you can add child instructions asking the “why”. Your first “Why” can be to run a test to determine if the fault was a fluke, or if something is actually broken. You can also use custom tags to pull reports on just those specific work order templates. This gives you a clean, well-documented approach to RCA.
Additional RCA Resources
Root cause analysis is a vast umbrella term that cannot be exhaustively explained in a single article. Here are some additional resources to learn more about RCA, its tools and techniques:
- This 70-minute video from the consulting firm Kepner-Tregoe (KT) is an excellent place to start building a broad understanding of RCA and its major techniques.
- Six Sigma US is an accredited provider of Lean Six Sigma certifications. They have extensive material on root cause analysis and also provide online courses and certifications for it. You have the option to choose between classes with different structures that can accommodate your schedule.
- Root cause analysis course from the University System of Georgia is available on Coursera. You can enroll in the course for free and receive certification for a minimal fee. Coursera courses are widely recognized.
- The textbook Root Cause Analysis by Matthew A. Barsalou is an excellent guide to choosing the right RCA tool for the proper context.
- “Root Cause Analysis: The Core of Problem Solving and Corrective Action” by Duke Okes is another comprehensive and authoritative resource on root cause analysis.
Now is not the time to cut corners
Root cause analysis is complex and should not be done on a whim. Your team might decide to cut corners to save on time and speed up the process. But if you want to get to the bottom of any complex event, rushing the process can be detrimental to the whole project. When you have a good reason to conduct RCA, it is in your best interest to create an environment where the process can be executed successfully.
If you want to know how a CMMS could make your job less stressful, get started with Limble on a free trial, or set up a demo with our team.
Is there a list of RCA examples for IT environments, such as application, database, server, network device, and network? Also, please recommend insightful RCA resources for IT.
Hey Kenny, I do not know of any off the top of my head. Most guides we came across while writing this piece, even those aimed at specific industries, still focused on explaining the general concept.
I would have to google it the same as yourself. Good luck with the search!
These tools are not easy to use, especially for complex problems, but the explanation is good.