Sherlock Holmes, Thomas Magnum … and you. Yep, maintenance professionals are right up there with the great Magnum P.I. (and his ’stache) when it comes to solving the trickiest cases.
But what if there was an even easier way to solve those complex problems? It’s a little technique we call Root Cause Analysis (RCA).
In this post, we will show you how to get to the bottom of why things break, and all the different ways you can fix them. Get out your detective notebook because we are going to take a deep dive into all things Root Cause Analysis.
What is root cause analysis?
By definition, root cause analysis is the process of finding the underlying cause for an effect we observe or experience. In the context of failure analysis, RCA is used to find the root cause of frequent machine malfunctions or a significant machine breakdown.
Like those meddling kids in Mystery Inc., you’ll use your detective skills to determine:
- what happened
- why it happened
- how to prevent it from happening again
RCA is a reactive process, meaning it’s performed after the event occurs. But once a root cause analysis is done, its findings work as a proactive mechanism, since they help you anticipate and prevent the same problems before they occur.
If you fix a symptom of the problem, but you don’t fix the actual cause of the problem, there’s a high chance the failure will happen again.
For example, suppose you replace the broken belt but don’t change the misaligned part causing the belt to overheat and break. In that case, you could bet your paycheck that the belt is going to fail again. RCA tries to follow the chain of cause and effects to pinpoint the problem that will make all the other faults disappear when finally eliminated.
The RCA process and outcomes
Conducting root cause analysis can be very complicated. It involves a vast amount of data collection and review. The result of a root cause analysis isn’t always black and white. It can’t always tell you if the problem you identified is the root cause.
You will often get only a strong correlation between cause and effect and not the exact cause. From there, you’ll have to use your experience and professional knowledge to judge whether to investigate further or not.
RCA is a craft that requires specialized knowledge and in-the-field experience, which means you’re likely the best person for the job here. Without that expertise, any fixes implemented will likely be just a cosmetic solution to the problem. In the worst-case scenario, the changes made could actually make the situation worse.
Despite these limitations, RCA is still a powerful tool for understanding and improving the fundamental nature of systems and procedures.
Over the years, RCA has evolved to work within various fields, each with its own unique needs and approach. The most apparent use of RCA is in the medical field. The TV show House is an excellent example of RCA in action.
In the show, a complex and bizarre medical case usually shows up at the hospital. The doctors are stumped! That is until the unconventional wildcard Dr. House jumps in and saves the day with his crazy theories and methods.
The good doctor uses root cause analysis to dig into an issue and keeps digging until the real cause of the patient’s symptoms is finally revealed. A happy ending for all!
Aside from the healthcare field, many other industries use root cause analysis regularly. Some of them are:
- manufacturing (machine failure analysis)
- industrial engineering and robotics
- industrial process control and quality control
- information technology (software testing, incident management, cybersecurity analysis)
- complex event processing
- disaster management and accident analysis
- pharmaceutical research
- change management
- risk and safety management
These industries will generally use one specific type of root cause analysis that fits their situation best. Below are some examples of different types of RCA methodologies used by various fields and industries.
Different types of RCA
RCA comes in different forms depending on the problem you’re trying to solve. Here’s what they look like:
- Safety-based RCA comes from the field of occupational safety and health, as well as accident analysis. This type of root cause analysis is used to determine why an accident happened at work (e.g., why someone cut themselves or why a worker at height accidentally dropped a part).
- Production-based RCA is used in the field of manufacturing to ensure quality control. You might use this to find out why the injection-molded plastic parts are coming off the line warped.
- Process-based RCA is used in business and manufacturing to determine the fault in a process or a system. This might be used in accounting to determine why vendors aren’t getting paid on time.
- Failure-based RCA is used in engineering and maintenance to determine the root cause of any type of equipment failure.
- Systems-based RCA originated as a combination of some of the root cause analysis techniques listed above. This methodology is an approach that combines two or more methods of RCA. It can be used in a wide variety of fields/applications.
When to perform a root cause analysis
When you’re doing an RCA to determine the source of a fault, you’ll usually find 3 basic types of problems:
- physical causes
- human causes
- organizational causes
You can also do a root cause analysis if you want to drill down and find out exactly why a process or procedure is producing better-than-average results. By identifying the cause of a positive event, you could presumably replicate it and see those results elsewhere. Even if it’s time-intensive, one round of RCA can mean a lot of bang for your buck.
Keep in mind that RCA requires a significant investment of time, manpower, and money. And it will likely cause further disruption in the specific production line or the system you’re working on. So bearing that in mind, you don’t need to (and you shouldn’t) do RCA for every single fault.
Unfortunately, there is no cut-and-dried rule for when to run an RCA and when not to. As the expert and the experienced professional, you’re generally the best person to determine whether or not to run a root cause analysis.
If the same fault occurs over and over, it’s worth investigating. If the same defect is repeatedly happening, you can assume that it won’t be cleared simply by fixing the visible problem. There is an underlying reason for the recurring faults. These types of incidents need to be investigated with RCA.
To determine if a failure is critical, you can look at the cost to the plant or the total downtime due to the particular failure. When a critical failure occurs, it needs to be investigated to identify the root cause to help avoid this situation in the future. Explosions at an oil rig and airplane crashes are examples of critical failures that need to be investigated.
There are critical machines and critical subprocesses in any system. A failure of these types of machines will halt the entire operation because there may not be a backup or mitigation plan for that particular machine. In this case, how critical the machine is will determine whether or not to do RCA.
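The three triggers above (recurring faults, critical failures, and critical machines) can be folded into a rough screening check. The sketch below is only an illustration: the class, the field names, and every threshold are assumptions, so swap in your own plant's numbers.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    """One failure as it might be logged in a CMMS (field names are hypothetical)."""
    asset: str
    repeats_last_90_days: int   # how many times the same fault has recurred
    downtime_hours: float       # total downtime attributed to this failure
    cost: float                 # estimated cost to the plant
    asset_has_backup: bool      # False for a critical machine with no fallback


def rca_triggers(event, max_repeats=2, max_downtime_hours=8.0, max_cost=10_000.0):
    """Return the reasons (if any) this failure deserves a full RCA.

    All thresholds are placeholder assumptions, not industry standards.
    """
    reasons = []
    if event.repeats_last_90_days > max_repeats:
        reasons.append("recurring fault")
    if event.downtime_hours > max_downtime_hours or event.cost > max_cost:
        reasons.append("critical failure")
    if not event.asset_has_backup:
        reasons.append("critical machine")
    return reasons


belt = FailureEvent("Conveyor 3", repeats_last_90_days=4,
                    downtime_hours=1.5, cost=300.0, asset_has_backup=True)
print(rca_triggers(belt))  # ['recurring fault']
```

An empty list means none of the triggers fired. Your own judgment still has the final say on whether to investigate.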
The 3 Rs of Root Cause Analysis
No doubt you’ve heard these 3 Rs: “reduce, reuse, recycle” or maybe even “reading, writing, arithmetic.” But RCA also has its own system of 3 Rs: Recognize, Rectify, Replicate.
The actual cause of a problem is not always apparent, and simple cosmetic fixes usually don’t do much to correct the underlying fault. Even though RCA can be an elaborate time-consuming exercise, we do it to pinpoint the actual cause so we can take corrective actions that will eliminate future issues. As mentioned earlier, RCA can also be done to identify the reason for an unexpected positive outcome.
This first step is when you notice something’s not working quite right. The machine is leaking fluid, making a weird sound, or not running as productively as it usually does. This is when it’s time to put on your detective cap and find out what’s going on.
Once you’ve recognized the root cause, it’s time to start a corrective course of action. If the root cause is addressed, the same problem should not be cropping up again. If the same problem reappears, it’s likely because the cause you identified was not actually the root cause.
In this case, you might have to go through the RCA process again to make sure that you get to the actual root cause.
For example, you notice the machine is leaking fluid, so you patch the hole in the metal. If you stop seeing fluid on the ground under the machine, you’ve solved the problem, and you’ve taken care of the root issue. But if a leak crops up again in a week, it’s time to run another RCA to find out if there are other holes in the metal or if gaskets are failing.
Once you’ve identified and rectified the root cause, your next step is to ensure it will not happen again at any point during the process or system. Sometimes you’ll want to do an RCA to get to the bottom of an unexpectedly good outcome. In that case, you will test whether the same factors can be replicated in other scenarios and environments.
Suppose there were issues with faulty parts coming off the line, but you’ve since fixed the issue. The next step would be to replicate the problem to test whether you actually fixed the root issue.
In that case, you’d need to replicate what happened during this period to ensure that you got to the bottom of the issue.
RCA is about solving problems. But one of the most significant benefits for you is that being skilled at RCA makes you look good. When you’re good at what you do, you can get management on your side (which usually means an easier time getting the budget you need). And it can even make a big enough impression that it can change your career trajectory for the better. And we’ve got your back with our CMMS program. Limble has actually helped people get promotions because it makes them better at their jobs!
How to do a root cause analysis
RCA can be accomplished using many different tools and techniques. And even though those processes may look different, they all arrive at the same end goal: fixing the root cause of the issue.
To do a root cause analysis the right way, you should follow four basic steps.
Step 1: Define the problem
Start with the obvious: What is the problem? By defining the problem, the symptoms, and what you can see happening, you set the scope and direction of the analysis.
Without a specific problem statement, it’s hard to create a path to a solution. A well-defined problem statement also helps determine the scale and scope of the potential solution to be implemented. When you’re writing your problem statement, keep these three pieces in mind:
- How would you describe the problem at hand?
- What do you see happening?
- What are the specific symptoms?
Step 2: Collect the data
Collect all available data related to the incident. Ask yourself, “What proof is there? How long has this problem existed? What is the impact of the problem?” Be sure to record any other data you think might help you determine the issue.
Take, for example, machine failure in a manufacturing plant. These are examples of types of information you’ll want to document.
- the age of the machine
- time of continuous operation
- operating patterns
- maintenance schedule
- operators handling the machine
- specifications of the machine
- schematic of the plant infrastructure
- operating characteristics of the machine
- characteristics of the operating environment
Inspecting the machine in person also provides information that could be beneficial for root cause analysis. It will be easy for facilities that run predictive maintenance to collate data quickly.
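If you want the evidence from Step 2 in one consistent shape, a small record type helps. This is a minimal sketch that covers only a handful of the items listed above. The class and field names are made up for illustration and do not come from any particular CMMS.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class MachineIncidentRecord:
    """Evidence gathered for one incident (illustrative field names)."""
    machine_id: str
    install_date: date                  # used to derive the age of the machine
    continuous_run_hours: float         # time of continuous operation before the fault
    operators: List[str] = field(default_factory=list)
    last_maintenance: Optional[date] = None
    environment_notes: str = ""

    def age_years(self, on: date) -> float:
        """Approximate machine age in years on the given date."""
        return (on - self.install_date).days / 365.25

    def missing_fields(self) -> List[str]:
        """List the evidence that still has to be collected."""
        missing = []
        if not self.operators:
            missing.append("operators")
        if self.last_maintenance is None:
            missing.append("last_maintenance")
        if not self.environment_notes:
            missing.append("environment_notes")
        return missing


record = MachineIncidentRecord("IM-07", date(2014, 1, 1), continuous_run_hours=72.0)
print(record.missing_fields())
```

Checking `missing_fields()` before moving on to Step 3 is a cheap way to spot gaps in the data while the incident is still fresh.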
Step 3: Map out the events
Establish a timeline of events. This will help you determine which factors among the data collected are worth investigating. RCA needs data points that potentially lead to the root cause. Putting events and data in chronological order helps to differentiate causal events from non-causal events.
From the data collected, you can identify correlations between various events, their timing, and other data collected. Remember that correlation does not mean causation.
Questions to ask yourself when looking for correlations:
- What sequence of events allowed this to happen?
- What conditions are present/allowed this to happen?
- What other problems surround the occurrence of the main problem?
The next step is to map out a causal graph. These graphs are used to represent the relationship between events that happened and the data collected.
But it’s important to not stop investigating when you find a correlation between events. Correlation means there is a link between two events, but it doesn’t automatically mean that one event caused the other. That’s why it’s essential to continue your sleuthing until you find a causal relationship. Find out what event caused another event. This will help you find the actual root cause.
From the data collected, chronological sequencing, and clustering, we should be able to create a causal graph (or use one of the root cause analysis tools we discuss later). You can use this graph to represent the relationship between various events that occurred and the data collected. The different paths are given different probability weights. They can serve as a visual tool to track down the root cause.
Example of a causal graph. Source: Adam Kelleher on Medium
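To make the idea of weighted paths concrete, here is a minimal sketch of a causal graph as a plain dictionary. It reuses the broken-belt scenario from earlier in this post. The extra candidate causes and all of the probability weights are invented for illustration, and the code assumes the graph has no cycles.

```python
# effect -> list of (candidate cause, estimated probability that it is the cause)
# Weights are made-up estimates, not measured values.
causal_graph = {
    "belt breaks": [("belt overheats", 0.8), ("belt past service life", 0.2)],
    "belt overheats": [("pulley misaligned", 0.7), ("line overloaded", 0.3)],
}


def most_likely_root(graph, effect, probability=1.0):
    """Follow the graph from an effect back to the likeliest root cause.

    Returns (root_cause, path_probability). A node with no listed causes
    is treated as a root. Assumes the graph is acyclic.
    """
    causes = graph.get(effect, [])
    if not causes:
        return effect, probability
    candidates = (
        most_likely_root(graph, cause, probability * weight)
        for cause, weight in causes
    )
    return max(candidates, key=lambda result: result[1])


root, weight = most_likely_root(causal_graph, "belt breaks")
print(f"{root} (path weight {weight:.2f})")
```

The highest-weighted path only tells you where to look first. As noted above, a strong link is still a correlation until you confirm it on the machine.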
Step 4: Solve the root of the problem
Once you’ve identified the root cause, you can quickly determine the best solution to fix it. You can then map it against the scope defined in your initial problem statement. If the solution works with your available resources, it can be implemented.
Fixing the root cause should eliminate the issues. If the symptoms occur again, it’s time to return to the drawing board and conduct RCA again.
Once the problem is solved, you will need to take proactive steps to ensure it doesn’t happen again. There can be multiple solutions applied to solve a single issue.
For example, the root cause could be the wear of a bearing, which happened much earlier than expected. In this case, the procedure has to be adjusted to change the bearing at an earlier time. Similar steps to avoid recurrence of fault can be changes in the maintenance schedule, different modes of maintenance, changes in design, different OEM vendors, etc.
The implemented solution will have to be in line with the available resources. So, if the root cause is pushing the machine too hard, the obvious answer is to shorten the machine run time. However, if the production schedule doesn’t allow for shortened runtimes, another solution might be scheduling more preventive maintenance.
Tried-and-true RCA tools and techniques
There are many tried and trusted frameworks available to execute RCA. None of these methods are foolproof, but they provide a solid base for how to go about root problem investigation. Each method has its own list of benefits and shortfalls. Some methods are more suitable for different industries and types of problems.
You and your company should have your own unique protocol when conducting RCA. In some instances, external consultants might be brought in to conduct RCA. In such cases, the consultants will generally have their own preferred technique or a combination of techniques they use. This is one of the reasons why it is hard to create a universal template for RCA that everyone can follow.
Let’s look at the different forms of root cause analyses.
5 Why analysis
5 Whys is the original technique developed by Sakichi Toyoda for root cause analysis at Toyota factories. The idea is to meet every answer with another ‘why’, just like a curious child. Keep asking until you reach a stage where there is no need to ask ‘why’ again. At that point, you should have reached the root cause of the problem.
As a rule of thumb, asking and finding answers to 5 subsequent ‘why’s’ should be more than enough to reveal the root cause of most problems. Hence the name ‘5 why’ analysis.
Benefits of the 5 Whys:
- helps identify the root cause of a problem
- offers an understanding of how one process can cause a chain of problems
- helps determine the relationship between different root causes
- highly effective without complicated evaluation techniques
When to use the 5 Whys:
- for simple to moderately complex problems
- more complex issues may need this method in conjunction with another
- any time human error is involved in the issue
Fishbone diagram (a.k.a. Ishikawa diagram)
The Ishikawa method for root cause analysis emerged from quality control techniques employed in the Japanese shipbuilding industry by Kaoru Ishikawa. The shape of the resulting diagram looks like a fishbone, which is why it is called a fishbone diagram. This diagram is built on the idea that multiple factors can lead to a failure/event/effect.
The 5 M framework from the Toyota Production System uses RCA with the Ishikawa method. The 5 Ms are:
- man/mind power
- machine
- material
- method
- measurement
The problem or fault is written down at the far right end, where the fish head would be. The cause of the problem is represented along the horizontal line. Further effects and their respective causes are written down along the fish bones representing each of the 5 Ms. This process continues until the team is convinced that the root cause is identified.
Benefits of the fishbone diagram:
- a good way to brainstorm within a defined structure
- helps to visually diagram a problem or condition’s root cause
- helps to show bottlenecks in the process
- helps to find ways to improve the process
When to use a fishbone diagram:
- to analyze a complex problem with many causes
- when you need a different view of the issue
- to identify root causes
- to identify bottlenecks and identify issues where a process doesn’t work
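A fishbone diagram is easy to capture as data before anyone draws it. The sketch below groups brainstormed causes under the conventional 5 M labels (manpower, machine, material, method, measurement). The helper names and the sample causes are hypothetical, and you can swap in whatever categories your team uses.

```python
# Conventional 5 M categories, used here as an assumption.
CATEGORIES = ("manpower", "machine", "material", "method", "measurement")


def build_fishbone(problem, causes):
    """Group (category, cause) pairs under the bones of a fishbone diagram."""
    bones = {category: [] for category in CATEGORIES}
    for category, cause in causes:
        if category not in bones:
            raise ValueError(f"unknown category: {category}")
        bones[category].append(cause)
    return {"problem": problem, "bones": bones}


def render(fishbone):
    """Return a plain-text outline: the problem (fish head), then each bone."""
    lines = [f"PROBLEM: {fishbone['problem']}"]
    for category, causes in fishbone["bones"].items():
        lines.append(f"  {category}")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)


# Sample brainstorm output (invented for illustration).
diagram = build_fishbone(
    "Molded parts are warped",
    [
        ("machine", "uneven mold cooling"),
        ("material", "resin moisture too high"),
        ("method", "cycle time shortened last month"),
    ],
)
print(render(diagram))
```

Empty bones are worth a second look during the session, since a category with no candidate causes may simply not have been discussed yet.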
Failure mode and effects analysis (FMEA)
FMEA is a proactive approach to root cause analysis, preventing potential failures of a machine or system. It is a combination of reliability engineering, safety engineering, and quality control efforts. It tries to predict future failures and defects by analyzing past data.
A diverse cross-functional team is essential when using FMEA. You will need to clearly define and communicate the scope of the analysis to your team members. Each subsystem, design, and process is closely reviewed. The purpose, need, and function of each system are questioned. Potential failure modes are brainstormed. Failure of similar processes and products in the past can also be analyzed.
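The potential effects and disruptions that could be caused by each of the identified failure modes are assessed and used to calculate its risk priority number (RPN). Conventionally, the RPN is the product of three ratings: how severe the effect is, how often the failure is likely to occur, and how hard it is to detect.
If the failure mode has a higher RPN than a company is comfortable with, you can address this by improving one or more of those three factors.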
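As a quick worked example, the RPN (risk priority number) is conventionally calculated as severity × occurrence × detection, with each factor rated from 1 to 10. The sketch below assumes that convention. The failure modes, their ratings, and the threshold of 100 are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int     # 1 = negligible effect, 10 = hazardous
    occurrence: int   # 1 = very unlikely, 10 = almost certain
    detection: int    # 1 = almost always caught, 10 = almost never caught

    def __post_init__(self):
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")

    @property
    def rpn(self):
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def needs_action(modes, threshold):
    """Failure modes whose RPN exceeds the threshold, worst first."""
    flagged = [mode for mode in modes if mode.rpn > threshold]
    return sorted(flagged, key=lambda mode: mode.rpn, reverse=True)


modes = [
    FailureMode("bearing seizure", severity=8, occurrence=4, detection=6),  # RPN 192
    FailureMode("belt slip", severity=5, occurrence=6, detection=2),        # RPN 60
    FailureMode("sensor drift", severity=6, occurrence=3, detection=8),     # RPN 144
]
for mode in needs_action(modes, threshold=100):
    print(mode.name, mode.rpn)
```

Lowering any one rating, for example by adding an inspection that makes a failure easier to detect, brings the RPN down.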
Benefits of FMEA:
- enables early identification of a failure point
- captures the collective knowledge of a team
- improves the quality, reliability, and safety of the process
- a logical, structured approach for identifying process areas of concern
- reduces process development time, cost
- documents and tracks risk reduction activities
When to use the FMEA methodologies:
- when designing a new product, process, or service (DFMEA)
- when you’re going to update a current way of doing things
- when you have a plan for quality improvement
- when you need to understand the failures in a process and improve upon them (PFMEA)
Fault tree analysis (FTA)
Fault tree analysis is a method for root cause analysis that uses Boolean logic (AND, OR, and NOT) to figure out the cause of failure. It was developed at Bell Laboratories to evaluate an intercontinental ballistic missile (ICBM) launch control system for the U.S. Air Force.
Fault tree analysis example. Source: Six Sigma Study Guide
Fault tree analysis tries to map the logical relationships between faults and the subsystems of a machine. The fault you are analyzing is placed at the top of the chart. If either of two causes is enough on its own to produce the effect, they are combined with a logical OR operator. For example, if a machine can fail while in operation or while under maintenance, it is a logical OR relationship.
If two causes need to occur simultaneously for the fault to happen, they are combined with a logical AND operator. For example, if a machine only fails when the operator pushes the wrong button AND a relay fails to activate, it is a logical AND relationship. In the image above, AND is the blue symbol, and OR is the purple symbol.
The fault tree created for a failure is analyzed for possible improvements and risk management. This is an effective tool to conduct RCA for automated machines and systems.
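The AND/OR logic described above maps directly onto code. Here is a minimal sketch that builds a tiny fault tree from the two examples in this section. Combining them into a single tree is an assumption made for illustration, and the helper names are hypothetical.

```python
def basic_event(name):
    """Leaf node: true when the named event is among the observed events."""
    return lambda observed: name in observed


def and_gate(*children):
    """True only when every child event occurs."""
    return lambda observed: all(child(observed) for child in children)


def or_gate(*children):
    """True when at least one child event occurs."""
    return lambda observed: any(child(observed) for child in children)


# The fault under analysis sits at the top of the tree.
failure_in_operation = and_gate(
    basic_event("wrong button pushed"),
    basic_event("relay fails to activate"),
)
machine_failure = or_gate(
    failure_in_operation,
    basic_event("fault during maintenance"),
)

print(machine_failure({"wrong button pushed"}))                             # False
print(machine_failure({"wrong button pushed", "relay fails to activate"}))  # True
print(machine_failure({"fault during maintenance"}))                        # True
```

Feeding different sets of events through the tree shows which combinations are enough to trigger the top-level fault.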
Benefits of using a fault tree analysis:
- uses deduction to find the causes of each event, like the 5 Whys
- highlights the critical elements related to system failure
- creates a visual representation for analysis
- can focus on one area of failure at a time
- exposes system behavior and possible interactions
- accounts for human error
- promotes effective communication
When to use a fault tree analysis:
- when the effect of a failure is known — to find out how it might be caused by a combination of other factors
- when designing a solution — to identify ways it may fail in order to make the solution more robust
- to identify risks in a system
- to find failures that can cause the failure of all parts of a “fault-tolerant system”
Pareto chart
A Pareto chart indicates the frequency of defects and their cumulative effects. Italian economist Vilfredo Pareto recognized a common theme in almost all the frequency distributions he observed: a vast imbalance between causes and the effects they produce.
He proposed that in any system, 80% of the results (or failures) are caused by 20% of all potential reasons.
The principle is dubbed the Pareto principle (some know it as the 80-20 rule). This skew between cause and effect is evident in many different distributions, from wealth distribution among people to failures in a machine.
Pareto chart for shirt defects. Source: Tulip.co
With the 80-20 principle in mind, you can use Pareto analysis to dig into failures and possible causes. To start, draw a bar graph that includes the frequency of faults and causes. With this graph, it’s easier to see the skew between causes and failures. Usually, you’ll see how a small percentage of factors cause the majority of faults.
Next, you’ll analyze the causes that contribute to the largest number of faults and take corrective action to eliminate the most common defects.
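The numbers behind a Pareto chart are simple to compute: count each cause, sort from most to least frequent, and keep a running cumulative percentage. The sketch below does that with the standard library. The fault log is made up for illustration, and the 80% cutoff is just the rule of thumb from the Pareto principle.

```python
from collections import Counter


def pareto_table(fault_log):
    """Return rows of (cause, count, cumulative %) sorted by frequency."""
    counts = Counter(fault_log).most_common()
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for cause, count in counts:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


def vital_few(rows, cutoff=80.0):
    """Smallest leading group of causes whose cumulative share reaches the cutoff."""
    selected = []
    for cause, _, cumulative in rows:
        selected.append(cause)
        if cumulative >= cutoff:
            break
    return selected


# Hypothetical log: one entry per recorded fault.
fault_log = (
    ["misalignment"] * 50
    + ["poor lubrication"] * 30
    + ["operator error"] * 10
    + ["sensor fault"] * 6
    + ["power dip"] * 4
)
rows = pareto_table(fault_log)
for cause, count, cumulative in rows:
    print(f"{cause:<18}{count:>4}{cumulative:>8}%")
print(vital_few(rows))  # ['misalignment', 'poor lubrication']
```

In this invented log, two of the five causes account for 80% of the faults, so those are the ones to investigate first.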
Benefits of using pareto charts:
- defects are ranked in order of severity, with the most severe handled first
- can determine the cumulative impact of the defect
- offers a better explanation of defects that need to be resolved first
When to use a pareto chart:
- to analyze problems or causes in a process that involves the frequency of occurrence, time, or cost
- to narrow down a list of problems to find the most significant
- to analyze a problem with a broad list of causes to identify specific components
Pareto charts work great for deciding which problems to take up for root cause analysis first. According to the Pareto principle, eliminating the top 20% of failure causes can reduce the overall number of malfunctions by roughly 80%. Pareto charts will indicate the top failure causes to be further investigated and addressed, according to the criticality of the machine, the impact of a specific part’s failure, or a combination of the two.
Root cause analysis is very open-ended, and many tools are widely used across industries. We covered the major ones in the sections above, but these also deserve some recognition. A few honorable mentions:
- Cause and effect diagrams. The Fishbone diagram is an example of cause and effect diagrams. Many similar tools try to map the relationship between causes and effects in a system.
- Kaizen is another tool from the stable of Japanese process improvements. It is a continuous process improvement method. Root cause analysis is embedded within the structure of Kaizen.
- Barrier analysis is an RCA technique commonly used for safety incidents. It is based on the idea that a barrier between personnel and potential hazards can prevent most safety incidents.
- Change analysis is used when a potential incident occurs due to a single element or factor change.
- A scatter diagram is a statistical tool that plots the relationship between two variables in a two-dimensional chart. It can also be used as an RCA tool.
CMMS to save the day
If you’re feeling overwhelmed by all the different methods, metrics, and charts, not to worry, we’ve got your back. A computerized maintenance management system, or CMMS, can help you easily create, record, and track data used in root cause analysis.
With Limble, you can also create your own “5 Why’s” template and save it for use in the future. This makes it easy for anyone to quickly start a 5 Why RCA, repeating the same steps for consistent results.
To create your own 5-Whys template in our CMMS, you create a work order template, your space to record what happened. Below that, you can add child instructions asking the “why”. Your first “Why” can be to run a test to determine if the fault was a fluke, or if something is actually broken. You can also use custom tags to pull reports on just those specific work order templates. This gives you a clean, well-documented approach to RCA. You can easily show management, look like the star detective, get a promotion, and a big fat raise.
OK, so maybe it won’t all happen in that exact order. But at the very least, it will make your life a lot easier when it comes to fixing issues.
Root cause analysis examples
RCA example #1: The case of the faulty parts
Injection molding machines are widely used around the world to create plastic in almost any shape or form. The part the machine produces should match specifications within the allowable tolerance.
Let’s say there is a high incidence rate of faulty products, and we need to get to the bottom of it.
First, the problem needs to be well defined. This includes explaining the exact defect the plastic output is having. By observing the output, we can determine if it is one of the four primary defects within injection molding. They are:
- gassing & venting
- part distortion
- short mold
Let’s presume that the defect is part distortion. First, write down the problem, including the number of defects occurring as a percentage. Once that is completed, collect all the available data. Pull any maintenance logs from your CMMS, review manuals from the injection molding machine manufacturer, etc.
Collect information on each defective product. From this, measure the deviation from specifications. Take the heat signature of the product once it comes out of the mold, then measure the temperature of molten plastic in the barrel.
We know that part distortion almost always occurs due to temperature problems. But we cannot be sure where the temperature problem is…is it in the barrel while heating or in the mold while cooling?
By analyzing the data you collected, you would be able to identify that. For this example, we’ll assume the heat signature of the finished product is different from the expected one.
This indicates that the problem is in the cooling process. Further investigation concludes that the root problem is the wrong spatial arrangement of cooling liquid conduits.
Changing the conduit arrangement to one that best fits the mold currently in use will solve the problem of part distortion.
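The temperature comparison in this case boils down to checking each stage's measured reading against its expected value. Here is a minimal sketch of that check. The stage names, the temperatures, and the 5 °C tolerance are all invented for illustration and are not real process settings.

```python
def stages_out_of_tolerance(readings, tolerance_c=5.0):
    """Return the stages whose measured temperature deviates too far.

    readings maps a stage name to (expected_c, measured_c).
    """
    return [
        stage
        for stage, (expected, measured) in readings.items()
        if abs(measured - expected) > tolerance_c
    ]


# Hypothetical readings in degrees Celsius.
readings = {
    "barrel (molten plastic)": (230.0, 231.5),
    "part leaving the mold": (60.0, 78.0),
}
print(stages_out_of_tolerance(readings))  # ['part leaving the mold']
```

With these made-up numbers the barrel is within tolerance and the part leaving the mold is not, which points the investigation toward cooling, just as in the case above.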
RCA example #2: The mystery of the blown fuse
Next, let’s say a machine stopped because it overloaded and the fuse blew.
Investigation shows that the machine is overloaded because it had a bearing that wasn’t being sufficiently lubricated.
Your investigation continues, and you find that the automatic lubrication mechanism had a pump that was not pumping sufficiently. A review of the pump shows that it has a worn shaft. Investigation of why the shaft was worn discovers that there isn’t an adequate mechanism in place to prevent metal scraps from getting into the pump. This enabled scraps to get into the pump and damage it.
The apparent root cause of the problem is metal scrap contaminating the lubrication system. Fixing this problem should prevent the whole sequence of events from happening again. The real root cause could be a design issue if no filter prevents the metal scrap from getting into the system. Or if it has a filter that was blocked due to a lack of routine maintenance, then the actual root cause is a maintenance issue.
Compare this with an investigation that does not find the causal factor: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. But there is a risk that the problem will simply reoccur until the root cause is dealt with. (This example originally appeared here).
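The blown-fuse case is a textbook 5 Whys chain, and writing it down as data makes the path from symptom to root cause easy to review. The sketch below records the chain exactly as described above. The function name and the output format are just one hypothetical way to lay it out.

```python
# Each entry answers "why?" for the entry before it (taken from the case above).
why_chain = [
    "The machine stopped because the fuse blew",
    "The machine was overloaded",
    "A bearing was not sufficiently lubricated",
    "The automatic lubrication pump was not pumping sufficiently",
    "The pump shaft was worn",
    "Metal scrap was getting into the pump",
]


def five_whys(chain):
    """Format a symptom and its successive causes as a numbered 'why' list.

    Returns (lines, root_cause), where the root cause is the last answer.
    """
    symptom, *causes = chain
    lines = [f"Problem: {symptom}"]
    for depth, cause in enumerate(causes, start=1):
        lines.append(f"{'  ' * depth}Why #{depth}? {cause}")
    root_cause = causes[-1] if causes else symptom
    return lines, root_cause


lines, root_cause = five_whys(why_chain)
print("\n".join(lines))
print("Apparent root cause:", root_cause)
```

As the case notes, a sixth "why" (is the filter missing, or was it blocked?) would separate a design issue from a maintenance issue, so the chain does not have to stop at exactly five.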
Nice work, detective.
Additional RCA Resources
Root cause analysis is a vast umbrella term that cannot be exhaustively explained in a single article. Here are some additional resources to learn more about RCA, its tools and techniques:
- This 70-minute video from the consulting firm Kepner-Tregoe (KT) is an excellent place to start building a broad understanding of RCA and its major techniques.
- Six Sigma US is an accredited provider of Lean Six Sigma certifications. They have extensive material on root cause analysis and also provide online courses and certifications for it. You have the option to choose between classes with different structures that can accommodate your schedule.
- Root cause analysis course from the University System of Georgia is available on Coursera. You can enroll in the course for free and receive certification for a minimal fee. Coursera courses are widely recognized.
- The textbook Root Cause Analysis by Matthew A. Barsalou is an excellent guide to choosing the right RCA tool for the right context.
- “Root Cause Analysis: The Core of Problem Solving and Corrective Action” by Duke Okes is another comprehensive and authoritative resource on root cause analysis.
Now is not the time to cut corners
Root cause analysis is complex and should not be done on a whim. Your team might decide to cut corners to save on time and speed up the process. But if you want to get to the bottom of any complex event, rushing the process can be detrimental to the whole project. When you have a good reason to conduct RCA, it is in your best interest to create an environment where the process can be executed successfully.
If you want to know how a CMMS could make your job less stressful, get started with Limble on a free trial, or set up a demo with our team.
Is there a list of RCA examples for IT environment such as Application, Database, Server, Network Device, Network. Also, please recommend insightful RCA resources for IT
Hey Kenny, I do not know any off the top of my head. Most guides we came across while writing this piece, even when they were for specific industries, still just focused on explaining the general concept.
I would have to google it the same as yourself. Good luck with the search!
These tools are not easy to use, especially in complex problems, but explanation is good.