Root cause analysis is the process of finding the basic underlying cause for an effect we observe or experience. In the context of failure analysis, RCA is used for finding the root cause of frequent machine malfunctions or a big machine breakdown.
In this post, we take an in-depth look at how to perform RCA: we outline the steps, describe common tools and techniques, and give a couple of practical examples.
Let us start with defining RCA first.
What is Root Cause Analysis (RCA)?
Root Cause Analysis is the process of tracing causes of an observable problem and identifying the basic underlying issue that was causing it. Fixing the identified basic problem should stop the recurrence of other problems that originated due to it.
If the problem fixed is not the underlying cause, there is no guarantee that the same fault will not occur again. RCA tries to follow the chain of cause and effects to pinpoint the problem that, when eliminated, makes all the other faults disappear.
RCA is not a process that guarantees an outcome. Conducting RCA can be very complicated and generally involves a vast amount of data collection and scrutiny. The result of an RCA is not always black and white. It is not a litmus test that can conclusively indicate whether the problem we identified is the root cause or not. More often than not, we will get only a strong correlation between cause and effect and not a causal relation. From there, an experienced professional has to judge whether to investigate further or not.
RCA is a craft that requires domain knowledge and experience. Otherwise, any fixes implemented will likely be just a cosmetic solution to the problem. In the worst case scenario, the changes we make can result in worsening the problem.
Despite this dose of uncertainty, RCA is still a very powerful tool for understanding and improving the fundamental nature of systems and procedures.
The origin of RCA
RCA has existed as an investigative tool for centuries, but for a long time it was never formalized. It was formally introduced to the world of engineering and technology by Sakichi Toyoda. He was the founder of Toyota Industries and is widely considered to be the father of Japanese industrialization. He introduced the technique of ‘5 Whys’ as a tool for RCA, which is still used today.
One could argue that process improvement methods such as Kaizen and other lean manufacturing practices from Japan, and later Six Sigma (developed at Motorola in the United States), can be attributed to the practice of finding the root cause of a problem and fixing it rather than settling for a cosmetic solution. All these process improvement techniques have helped to catapult the efficiency of manufacturing processes all over the world.
Why conduct Root Cause Analysis?
There are two broad ways in which RCA can be used:
To identify the root cause for problems (more common way).
To recognize the root causes of positive changes: sometimes the procedures we implement give results that are far better than expected. When the reason for the unusually good results cannot be easily explained, RCA can be used to identify it.
When to conduct Root Cause Analysis?
Conducting RCA requires a significant investment of time, manpower, and money. It will cause further disruption in the production line or the system in which RCA is to be conducted.
Therefore, RCA should not be done for every single fault. There is no cut and dried rule for when to conduct RCA. Here are some of the instances based on which experienced professionals can make an informed decision whether to conduct RCA or not:
Persistent faults – If the same fault occurs over and over, it is worth investigating. As the same fault is recurring, we can deduce that the fault will not be cleared by fixing the visible problem. There is some underlying reason for the recurring faults. Such incidents need to be investigated with RCA.
Critical failure – The degree of how critical a failure is can be measured using the cost to the plant, or the total downtime due to the particular failure. When such a failure occurs, it has to be investigated to identify the root cause of the failure. This will help avoid such occurrences in future. Explosions at an oil rig and aeroplane crashes are examples of critical failure that need to be investigated.
Failure impact – There are critical machines and critical subprocesses in any system. A failure of such machines will halt the entire operation, as there might not be a backup or mitigation plan for that particular machine. In essence, the criticality of the machine determines whether to conduct RCA for a failure or not.
RCA process is based on the 3 Rs
Recognize – The true cause of an effect we observe is not always obvious. Cosmetic fixes do not do much to correct the underlying fault. The elaborate exercise of RCA is conducted to pinpoint the true cause, so we can take corrective actions that will eliminate future issues. As mentioned earlier, RCA can also be done to identify the cause of an unexpected positive outcome.
Rectify – Once the root cause is recognized, corrective action has to be taken. If the root cause is addressed, the same problem should not crop up again. If the same problems reappear, it is highly likely that the cause identified was not the root cause. You will have to conclude that the previous RCA was not comprehensive, and that more investigation needs to be done.
Replicate – Once the root cause is figured out and rectified, you will have to ensure that the same fault will not occur again in the same system in another location, or at a later time. If the RCA was done to identify the reasons behind unexpectedly good outcomes, you will have to test whether the same factors can be replicated in other scenarios and environments.
In essence, root cause analysis is used to precisely figure out what happened, how it happened, and why it happened, for any incident that occurs.
RCA is applied across many different industries
RCA is in essence a knowledge tool for identifying the root cause of any event or fault. Faults and problems occur in almost every industry and RCA techniques can be used to investigate the underlying cause and contributing factors.
The most obvious and ubiquitous use we come across is in medical diagnosis. The same symptoms can be caused by a whole set of illnesses. It is the duty of the doctor to identify the underlying cause before a patient can be treated effectively. Almost all episodes of the popular TV show House M.D are exercises in root cause analysis, though in an unconventional manner.
There are many other industry verticals that use root cause analysis on a regular basis. Some of them are:
manufacturing (machine failure analysis)
industrial engineering and robotics
industrial process control and quality control
information technology (software testing, incident management, cybersecurity analysis)
Root cause analysis is a structured way of thinking and investigation for any type of incident. With that in mind, RCA is not just limited to the areas mentioned above. It can be implemented in any sector or industry where the root cause of a problem needs to be fished out.
Root cause analysis steps
RCA can be accomplished using many different tools and techniques. These different tools make use of different conceptual models to identify the problem at the root. Though all the tools differ at a cosmetic level, each of the techniques has to go through the conceptual steps to conclude the analysis.
Step 1: Problem statement
A problem statement and definition are essential for any form of analysis, not just RCA. The statement should give a clear description of the problem and the symptoms experienced. This sets the scope for the analysis.
Without a precise problem statement, RCA will be like a rudderless boat – unable to know which direction to sail in, and unable to change direction. A well-defined problem statement also helps to determine the scale and scope of the potential solution to be implemented.
Step 2: Data collection
All available data related to the incident has to be collected. Take for example machine failure in a manufacturing plant. Some of the pertinent information that needs to be collected is given below:
A timeline of events has to be established. This will help to determine which factors among the data collected are worth investigating. RCA needs data points that potentially lead to the root cause. Chronological sequencing of events and data is very helpful in deciphering causal events from non-causal events.
From the data collected, correlations can be found between various events, their timing, and other data collected. This can be used as an initial step to differentiate between causal and non-causal events. One important thing to remember is that correlation does not mean causation.
You cannot conclude the analysis as soon as a correlation is identified. Each correlation is only a lead, and the causal relation behind it still needs to be investigated.
From the data collected, chronological sequencing, and clustering, we should be able to create a causal graph (or use one of the root cause analysis tools we discuss later in the article). This graph can be used to represent the relationship between various events that occurred and the data collected. The different paths are given different probability weights, and can serve as a visual tool to track down the root cause.
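The weighted causal graph described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event names and probability weights are invented, the graph is assumed to contain no cycles, and multiplying weights along a path is just one simple way to rank candidate causal chains, not a prescribed RCA method.

```python
# Hypothetical causal graph: each effect maps to its candidate causes,
# with an assumed probability weight for each cause-effect link.
CAUSAL_GRAPH = {
    "machine stopped": {"motor overload": 0.7, "operator error": 0.3},
    "motor overload": {"dry bearing": 0.8, "voltage spike": 0.2},
    "dry bearing": {"weak lubrication pump": 0.9, "wrong lubricant": 0.1},
}


def most_likely_chain(graph, effect):
    """Return (weight, chain) for the highest-weight path from the observed
    effect down to a node with no recorded causes (assumes no cycles)."""
    causes = graph.get(effect)
    if not causes:
        # No recorded causes: treat this node as a candidate root cause.
        return 1.0, [effect]
    best_weight, best_chain = 0.0, [effect]
    for cause, weight in causes.items():
        sub_weight, sub_chain = most_likely_chain(graph, cause)
        if weight * sub_weight > best_weight:
            best_weight = weight * sub_weight
            best_chain = [effect] + sub_chain
    return best_weight, best_chain


probability, chain = most_likely_chain(CAUSAL_GRAPH, "machine stopped")
print(" <- ".join(chain), f"(weight {probability:.3f})")
```

The highest-weight chain is only a starting point for investigation. Each link still has to be confirmed with evidence before the last node is accepted as the root cause.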
Once the root cause is identified, the solution to fix it can be easily determined. It can be mapped against the scope defined in the problem statement. If the solution fits the scope, it is implemented.
Fixing the root cause should eliminate the recurrence of the symptoms. If the symptoms occur again, we would need to go back to the drawing board and conduct RCA again.
Once the problem is solved, steps have to be taken to avoid the recurrence of the same. There can be multiple solutions applied to solve a single problem. For example, the root cause could be the wear of a bearing, which happened much earlier than expected. In such a case the procedure has to be adjusted to change the bearing at an earlier time. Similar steps taken to avoid recurrence of fault can be changes in the maintenance schedule, different modes of maintenance, changes in design, etc.
The implemented solution will have to be in line with the available resources. So, if the root cause is pushing the machine too hard, the obvious solution is to shorten machine run time. However, when the production schedule doesn’t allow it, another solution might be to schedule preventive maintenance more frequently.
Root cause analysis tools and techniques
Root cause analysis is not a singular way to the answer. It is a conceptual framework for investigating the true reasons behind the events we observe. There are many frameworks available to execute RCA that have been tried and tested by experimenters. None of these methods are foolproof, but they provide a solid base for how to go about root problem investigation. Some of the prominent tools and techniques are discussed below.
Multiple tools are available to conduct root cause analysis. Each of them has its own merits, and certain methods are more suitable for different industries and types of problems that are being investigated.
Each company and its management team should have a protocol to adhere to when conducting RCA. Different companies might prefer different techniques. In some instances, external consultants might be brought in to conduct RCA. In such cases, the consultants will have a preferred technique or a combination of techniques they use to conduct RCA. This is one of the reasons why it is hard to create a universal template for RCA that everyone can follow.
Oftentimes, the company will have a preferred RCA technique. If that one does not give the needed answers, other techniques might be explored.
5 Whys analysis
5 Whys is the original technique developed by Sakichi Toyoda for root cause analysis at Toyota factories. It means questioning everything with a ‘why’, just like a curious child. When we ask why the visible problem occurred, we can trace its cause. Then the question ‘why’ can be asked about the cause we just identified.
This process can be continued until there is no need to ask ‘why’ any further. At that point, we should have reached the root cause of the problem. As a rule of thumb, asking and answering 5 successive ‘whys’ should be more than enough to unveil the root cause of most problems. Hence the name ‘5 Whys’ analysis.
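The loop of repeated ‘why’ questions can be sketched as a simple chain walk. The example below is a minimal illustration, not a formal tool. The cause chain is borrowed from the overloaded-machine example later in this article, and the function name `ask_whys` and the dictionary layout are assumptions made for the sketch.

```python
# Each observed problem maps to the answer found when asking "why?".
# The chain mirrors the overloaded-machine example in this article.
WHY_ANSWERS = {
    "fuse blew": "machine overloaded",
    "machine overloaded": "bearing was not sufficiently lubricated",
    "bearing was not sufficiently lubricated": "lubrication pump was not pumping sufficiently",
    "lubrication pump was not pumping sufficiently": "pump shaft was worn",
    "pump shaft was worn": "metal scrap got into the pump",
}


def ask_whys(problem, answers, max_whys=5):
    """Follow the chain of 'why' answers until none is left or the limit is hit."""
    chain = [problem]
    for _ in range(max_whys):
        cause = answers.get(chain[-1])
        if cause is None:
            break  # nobody can answer "why?" any further
        chain.append(cause)
    return chain


chain = ask_whys("fuse blew", WHY_ANSWERS)
for depth, step in enumerate(chain):
    print(f"{'  ' * depth}{step}")
root_cause = chain[-1]
```

In practice the answers come from investigation rather than a lookup table, and the limit of five is a rule of thumb, so the chain may stop earlier or need to go deeper.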
Fishbone diagram (a.k.a. Ishikawa diagram)
The Ishikawa method for root cause analysis emerged from quality control techniques that were employed in the Japanese shipbuilding industry by Kaoru Ishikawa. The shape of the resulting diagram looks like a fishbone, which is why it is also called a fishbone diagram. This diagram is predicated on the idea that multiple factors can lead to the failure/event/effect we are investigating.
The problem or fault is written down at the rightmost end, where the fish head is presumed to be. The cause for it is represented along the horizontal line. Further effects and their respective causes are written down along the fish bones that represent each of the 5 Ms (commonly listed as manpower, machine, material, method, and measurement). This process is continued until the team conducting it is convinced that the root cause is identified.
The Fishbone diagram serves as a visual aid for structured brainstorming sessions. The same technique is also used for product design, ergonomic design, and process improvement.
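For teams that want to capture a brainstorming session in a form that can be stored or shared, a fishbone diagram can be held as a small nested structure. The sketch below is only one possible layout. The category names follow one common reading of the 5 Ms, and the problem and causes listed are invented for illustration.

```python
# Hypothetical fishbone for a "part distortion" problem. The categories
# follow one common reading of the 5 Ms and the causes are invented.
fishbone = {
    "problem": "part distortion",
    "bones": {
        "Manpower": ["operator skipped warm-up"],
        "Machine": ["cooling channel partly blocked", "worn nozzle"],
        "Material": ["resin moisture too high"],
        "Method": ["cycle time shortened"],
        "Measurement": ["thermocouple out of calibration"],
    },
}


def render_fishbone(diagram):
    """Return a plain-text outline of the diagram, one bone per category."""
    lines = [f"Problem: {diagram['problem']}"]
    for category, causes in diagram["bones"].items():
        lines.append(f"  {category}")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)


outline = render_fishbone(fishbone)
print(outline)
```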
Failure mode and effects analysis (FMEA)
FMEA is a proactive approach to root cause analysis, preventing potential failures of a machine or system. It is a combined systematic approach of reliability engineering, safety engineering, and quality control efforts. It tries to predict future failures and defects by analyzing past data.
A diverse cross-functional team is essential to undertake FMEA. The scope of the analysis has to be well-defined and conveyed clearly to all the team members. Each subsystem, design and process is brought under the microscopic scrutiny of the cross-functional team. The purpose, need, and function of each system are questioned. Potential failure modes are brainstormed. Failure of similar processes and products in the past can also be analyzed to supplement the process.
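The potential effects and disruptions that could be caused by each of the identified failure modes are assessed and used to calculate its risk priority number (RPN). The RPN is conventionally the product of three ratings: the severity of the effect, the likelihood of occurrence, and the difficulty of detection.
If the failure mode has a higher RPN than a company is comfortable with, it needs to be addressed by improving one or more of the factors outlined in the image above (severity, occurrence, or detection).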
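As a rough illustration of the calculation, the RPN is conventionally the product of severity, occurrence, and detection ratings, each usually scored from 1 to 10. In the sketch below the failure modes, their ratings, and the threshold of 100 are all assumptions. Each company sets its own scales and cut-off.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (always detected) to 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


RPN_THRESHOLD = 100  # assumed cut-off, chosen only for this example

# Hypothetical failure modes with invented ratings.
failure_modes = [
    FailureMode("bearing seizure", severity=8, occurrence=4, detection=5),
    FailureMode("coolant leak", severity=5, occurrence=3, detection=2),
    FailureMode("sensor drift", severity=6, occurrence=6, detection=7),
]

# Keep only the modes above the threshold, highest RPN first.
to_address = sorted(
    (fm for fm in failure_modes if fm.rpn > RPN_THRESHOLD),
    key=lambda fm: fm.rpn,
    reverse=True,
)
for fm in to_address:
    print(f"{fm.name}: RPN {fm.rpn}")
```

Lowering any one of the three ratings, for example by adding an inspection step that improves detection, reduces the RPN of that failure mode.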
Fault tree analysis
Fault tree analysis is a method for root cause analysis that uses Boolean logic to figure out the cause of a failure. It was developed at Bell Laboratories to evaluate an intercontinental ballistic missile (ICBM) launch control system for the U.S. Air Force.
Fault tree analysis tries to map the logical relationships between faults and the subsystems of a machine. The fault we are analyzing is placed at the top of the chart. If either of two causes is enough on its own to produce the effect, they are combined with a logical OR operator. For example, if a machine can fail while in operation or while under maintenance, it is a logical OR relationship.
If two causes need to occur simultaneously for the fault to occur, the relationship is represented with a logical AND. For example, if a machine fails only when the operator pushes the wrong button AND the relay fails to activate, it is a logical AND relationship, drawn using the Boolean AND symbol. In the image above the AND symbol is the blue symbol, and OR is represented by the purple symbol.
The symbols used in the diagrams represent different kinds of events: a circle for a basic event, a pentagon for an external event, a rhombus for an undeveloped event, an ellipse for a conditioning event, a rectangle for an intermediate event, etc.
The fault tree created for a failure is analyzed for possible improvements and risk management. This is an effective tool to conduct RCA for automated machines and systems.
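The AND/OR logic of a fault tree maps directly onto Boolean functions. The sketch below is a toy example under assumptions. The wrong-button-and-relay branch comes from the text above, while the power supply branch and all function names are hypothetical additions made so that the tree has both gate types.

```python
from itertools import product


def and_gate(*inputs):
    """Output event occurs only if every input event occurs."""
    return all(inputs)


def or_gate(*inputs):
    """Output event occurs if at least one input event occurs."""
    return any(inputs)


def machine_fails(wrong_button_pressed, relay_fails, power_supply_fails):
    """Top event: (wrong button AND relay failure) OR power supply failure."""
    control_fault = and_gate(wrong_button_pressed, relay_fails)
    return or_gate(control_fault, power_supply_fails)


# Enumerate every combination of basic events that triggers the top event.
EVENT_NAMES = ("wrong_button_pressed", "relay_fails", "power_supply_fails")
failing_combinations = [
    combo
    for combo in product([False, True], repeat=len(EVENT_NAMES))
    if machine_fails(*combo)
]
for combo in failing_combinations:
    occurred = [name for name, flag in zip(EVENT_NAMES, combo) if flag]
    print(" + ".join(occurred))
```

Listing the failing combinations this way shows which single events can cause the fault on their own and which only matter in combination, which helps when deciding where to add safeguards.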
Pareto chart
Italian economist Vilfredo Pareto recognized a common theme in almost all the frequency distributions he could observe: there is a huge asymmetry between causes and the effects they produce.
As a rule of thumb, he indicated that, in any system, 80% of the results (or failures) are caused by 20% of all potential reasons.
The principle is dubbed the Pareto principle (some know it as the 80-20 rule). This skew between cause and effect is evident in many different distributions, from wealth distribution among people to failures in a machine.
With the Pareto principle in mind, failures and their possible causes are analyzed. A bar graph and line graph are drawn, with the frequency of faults and the causes for the faults. With this graph, we are able to observe the skew between causes and failures. Usually, we will discover how a small percentage of factors causes the majority of faults.
The causes that contribute to the largest number of faults are then analyzed further, and corrective actions are taken to eliminate the most common faults.
Pareto charts are excellent tools to determine the priority for taking up root cause analysis. According to the Pareto principle, eliminating the top 20% of failure causes can reduce the overall number of malfunctions by roughly 80%. Pareto charts will indicate the top failure causes to be further investigated and addressed, according to the criticality of the machine, the impact of the failure of a specific part, or a combination of the two.
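The numbers behind a Pareto chart are easy to compute: the bars are the fault counts per cause, sorted in descending order, and the line is the cumulative percentage. The following sketch uses an invented failure log and an assumed 80% cut-off to pick out the causes worth investigating first.

```python
from collections import Counter

# Hypothetical failure log: one entry per recorded fault, labeled by cause.
failure_log = (
    ["lubrication"] * 42
    + ["misalignment"] * 31
    + ["operator error"] * 9
    + ["electrical"] * 7
    + ["contamination"] * 6
    + ["other"] * 5
)


def pareto_table(log):
    """Return (cause, count, cumulative %) rows sorted by descending count."""
    counts = Counter(log).most_common()
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for cause, count in counts:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


rows = pareto_table(failure_log)

# Collect causes until the cumulative share reaches the assumed 80% cut-off.
vital_few = []
for cause, count, cumulative in rows:
    vital_few.append(cause)
    if cumulative >= 80:
        break

for cause, count, cumulative in rows:
    print(f"{cause:<15} {count:>3}  {cumulative:>5}%")
print("Investigate first:", ", ".join(vital_few))
```

Real data rarely splits exactly 80/20, so the cut-off is a judgment call rather than a fixed rule.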
Root cause analysis is very open-ended, and many tools for it are in wide use across various industries. The major ones were covered in the sections above. Still, there are more noteworthy tools for RCA. Here are a few honorable mentions:
Cause and effect diagrams – The Fishbone diagram is an example of cause and effect diagrams. There are many similar tools that try to map the relationship between causes and effects in a system.
Kaizen – It is another tool from the stable of Japanese process improvements. It is a continuous process improvement method. Root cause analysis is embedded within the structure of Kaizen.
Barrier analysis – It is an RCA technique commonly used for safety incidents. It is conducted on the premise that a barrier between personnel and potential hazards can prevent most safety incidents.
Change analysis – When a potential incident occurs due to a change in a single element or factor, change analysis is employed as the root cause analysis technique.
Scatter diagram – A scatter diagram is a statistical tool that plots the relationship between two variables in a two-dimensional chart. It can also be used as an RCA tool.
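The relationship shown in a scatter diagram is often summarized with a correlation coefficient. The sketch below computes the Pearson coefficient by hand for two invented series. The variable names and readings are assumptions chosen only to illustrate the calculation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical paired readings taken during the same production runs.
mould_temperature_c = [40, 45, 50, 55, 60, 65]
defect_rate_pct = [1.0, 1.4, 2.1, 2.9, 3.8, 4.6]

r = pearson(mould_temperature_c, defect_rate_pct)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 or -1 flags a pair of variables worth investigating. As noted earlier in the article, it shows correlation only and does not prove that one variable causes the other.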
Root cause analysis examples
RCA example #1
Injection moulding machines are widely used around the world to create plastic parts in almost any shape or form. Each part produced by the machine should match its specifications, within the allowable tolerance.
Let’s imagine there is a high incidence rate of faulty products and we need to get to the bottom of it.
First, the problem needs to be well defined. This includes explaining the precise defect the plastic output is having. By observing the output we can determine if it is one of the four main defects that could occur with injection moulding. They are:
gassing & venting
Let’s presume that the defect is part distortion. The problem has to be clearly written down, with the number of defects occurring as a percentage. Once that portion is completed, all the available data has to be collected. Maintenance logs can be pulled from a CMMS, manuals from the injection mould machine manufacturer can be reviewed, etc.
Information has to be collected on each defective product. From this, the deviation from specifications should be measured. The heat signature of the product is taken once it comes out of the mould. The temperature of molten plastic in the barrel is also measured.
We know that part distortion almost always occurs due to temperature problems. But we cannot be sure where the temperature problem is, in the barrel while heating or in the mould while cooling. From the data collected, we would be able to identify that. Let us assume the heat signature of the finished product is different from the expected one.
This indicates that the problem is in the cooling process. Further investigation concludes that the root problem is the wrong spatial arrangement of the cooling liquid conduits.
Changing the conduit arrangement to one that best fits the mould currently in use will solve the problem of part distortion.
RCA example #2
Imagine an investigation into a machine that stopped because it overloaded and the fuse blew. Investigation shows that the machine overloaded because it had a bearing that wasn’t being sufficiently lubricated. The investigation proceeds further, and finds that the automatic lubrication mechanism had a pump which was not pumping sufficiently, hence the lack of lubrication. Investigation of the pump shows that it has a worn shaft. Investigation of why the shaft was worn discovers that there isn’t an adequate mechanism in place to prevent metal scraps getting into the pump. This enabled scraps to get into the pump, and damage it.
The apparent root cause of the problem is metal scrap contaminating the lubrication system. Fixing this problem ought to prevent the whole sequence of events recurring. The real root cause could be a design issue if there is no filter to prevent the metal scrap getting into the system. Or if it has a filter that was blocked due to a lack of routine inspection, then the real root cause is a maintenance issue.
Compare this with an investigation that does not find the causal factor: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. But there is a risk that the problem will simply reoccur, until the root cause is dealt with.
Root cause analysis is a vast umbrella term that cannot be exhaustively explained in a single article. Here are some additional resources to learn more about RCA, its tools and techniques:
This 70-minute video from the consulting firm Kepner-Tregoe (KT) is a good place to start building a broad understanding of RCA and its major techniques.
Six Sigma US is an accredited provider of Lean Six Sigma certifications. They have extensive material on root cause analysis and also provide online courses and certifications for it. You have the option to choose between classes with different structures that can accommodate your schedule.
A root cause analysis course from the University System of Georgia is available on Coursera. You can enrol in the course for free and receive certification for a very small fee. Coursera courses are widely recognized.
The textbook Root Cause Analysis by Matthew A. Barsalou is an excellent guide to choosing the right RCA tool for the right context.
“Root Cause Analysis: The Core of Problem Solving and Corrective Action” by Duke Okes is another comprehensive and authoritative resource on root cause analysis.
Technical analysis should not be done by cutting corners
Root cause analysis is a complex methodology and should not be done on a whim. The team might decide to cut corners to save on time and speed up the process. If you want to get to the bottom of any complex event, rushing the process can be detrimental to the whole project. If you have a good reason to conduct RCA, then it is in your best interest to create an environment in which the process can be executed successfully.
If you are a maintenance professional dealing with complex, problematic assets, do not hesitate to reach out and check how Limble CMMS can help you alleviate all of your asset management problems.