FMEA & FMECA: How To Perform Failure Mode And Effects Analysis
Understanding failure modes and their effects is at the core of modern product and system designs. FMEA and FMECA analysis are necessary to identify and correct different flaws that can lead to item breakdowns.
Many businesses think they can get away without doing failure mode and effects analysis, which is why they end up on the CPSC’s list of recalled products. Probably the most famous example in recent times was the exploding Samsung Galaxy Note 7. Samsung said that they revised their safety testing process, but one could argue that this problem could have been prevented if the company designing the battery had performed the proper type of FMEA at different points in the product life cycle.
Before we jump into our step-by-step explanation of FMEA and FMECA, let’s take a short stop to discuss different FMEA types.
Introduction to FMEA types
FMEA was developed around 1950 and was first used by the US military to optimize the production of munitions. Since it proved to be a very effective analysis tool, it was quickly adopted by NASA, and a bit later by the automotive and aviation industries.
Today, it has a wide application in many different industries. We can split FMEA into a few different categories based on what we are analyzing:
DFMEA (design failure mode and effect analysis)
PFMEA (process failure mode and effect analysis)
FFMEA (functional failure mode and effect analysis)
Failure mode and effect analysis (FMEA) is a structured approach to identifying potential failure points in different systems like products (Design FMEA) and processes (Process FMEA). It consists of finding weak points in components, assemblies, and subsystems and reviewing how their failure would impact the rest of the system they are a part of.
The FMEA process represents one of the first systematic approaches to failure analysis and is a core task in a wide variety of reliability engineering, safety engineering, and quality engineering efforts.
Failure modes are identified by looking at issues similar products and processes had in the past, as well as by using common physics-of-failure logic (e.g. a component will fail at X degrees based on its building material). Based on how severe the effects of failure are, one can either make changes aimed at reducing the chance that this failure can happen or look to reduce the impact that the failure will have on the rest of the system.
The end goal of FMEA and FMECA analysis is to help you eliminate (or at least minimize) chances that the system will experience severe breakdowns. When you know potential weaknesses in your designs/processes and their effects on the whole system, it is much easier to prioritize which parts of the system need to be adjusted to improve system reliability.
What is FMECA?
Failure mode effects and criticality analysis (FMECA) is an extended version of FMEA that incorporates criticality analysis into the whole process.
FMEA in its basic form is a qualitative method that only explores “what-if” scenarios. By introducing the criticality component, we add a dose of quantitative measurement through which we assign a criticality rating to identified failure modes.
This way, the level of risk can be more accurately estimated – and we can better prioritize our corrective actions.
How to perform FMEA analysis
FMEA uses a systematic methodology to assess the failure mode of a system or component with the ultimate goal of ranking and prioritizing each mode according to its level of importance, effect, and probability of occurrence. The gathered information is used to complete the FMEA worksheet and guide important decisions about design, process, or system modifications that mitigate failure propagation at different interfaces and levels.
Below we take a closer look at the steps involved in FMEA to understand how its forward logic works.
Step #1: Decide which FMEA you’ll perform and gather the necessary information
As outlined in the intro, there are three types of FMEA.
DFMEA takes the entire life span of the component into consideration at the design stage. Material properties, interface between components, geometry, and engineering requirements are examples of focus points in DFMEA.
PFMEA considers all the steps (processes) involved to arrive at the final component and is popular in the manufacturing industry. Processing methods, machinery, maintenance strategies, and operational requirements are examples of assessment points in PFMEA.
Lastly, Functional FMEA moves from a part or component-level assessment to a system-level analysis. FFMEA aims to identify specific process issues within a system-wide context.
Once you’ve decided which type of FMEA to perform, the next phase is to gather as much information as possible to describe the product/process in detail. This can be done with the help of drawings, schematics, component lists, interface information, to name a few.
This stage aims to identify systems, subsystems, assemblies, or parts and to determine their respective functions and interactions. Breaking the system down into subsystems with a hierarchical tree or a block diagram makes it easier to trace problems and to see the interfaces, interrelationships, and interconnections between different systems.
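A hierarchical tree like this can be kept as a simple nested data structure. The sketch below is only an illustration: the `Item` class, the `leaf_paths` helper, and the bicycle brake breakdown are all hypothetical and not part of any FMEA standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One node in the breakdown: a system, subsystem, assembly, or part."""
    name: str
    function: str = ""
    children: List["Item"] = field(default_factory=list)


def leaf_paths(item, prefix=()):
    """Yield the full path (system > subsystem > part) to every lowest-level item."""
    path = prefix + (item.name,)
    if not item.children:
        yield " > ".join(path)
    for child in item.children:
        yield from leaf_paths(child, path)


# Hypothetical breakdown of a bicycle braking system, for illustration only.
brake_system = Item("Brake system", "Slow and stop the bicycle", [
    Item("Actuation", "Transmit hand force", [
        Item("Brake lever", "Receive hand force"),
        Item("Brake cable", "Carry force to the caliper"),
    ]),
    Item("Caliper assembly", "Apply friction to the rim", [
        Item("Brake pads", "Generate friction"),
    ]),
])

paths = list(leaf_paths(brake_system))
for p in paths:
    print(p)
```

Listing every part together with its full path gives each later worksheet row a traceable position in the system.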
Step #2: Identify potential failure modes
Information from the hierarchical tree and block diagram can be used to determine how a system, subsystem, or component might fail. Failure modes are not only part-specific but can propagate to various systems and induce other types of possible failures. Therefore, it’s essential at this stage to pay careful attention to system dependencies at a high level (primary level) and consider how lower levels (secondary/tertiary levels) of systems/components are affected.
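One way to reason about propagation is to record which items depend on which, and then walk that map from the failed item. The following is a minimal sketch under that assumption; the `DEPENDENTS` map and the `affected_by` function are hypothetical names used only for illustration.

```python
from collections import deque

# Hypothetical dependency map: each key lists the items that directly depend on it.
DEPENDENTS = {
    "brake lever": ["brake cable"],
    "brake cable": ["caliper"],
    "caliper": ["brake pads"],
    "brake pads": ["braking function"],
}


def affected_by(failed_item, dependents):
    """Return every item downstream of failed_item, in breadth-first order."""
    seen = []
    queue = deque([failed_item])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            # Skip items already recorded so that cyclic dependencies terminate.
            if nxt not in seen and nxt != failed_item:
                seen.append(nxt)
                queue.append(nxt)
    return seen


print(affected_by("brake cable", DEPENDENTS))
```

Items that appear early in the result are directly affected, while later ones are secondary or tertiary effects that are easy to overlook when a part is analyzed in isolation.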
When assessing failure modes, it is crucial to look at failure from different contexts and not superficially as just “not performing.” These different contexts can include:
part/system performing an unintended function
failure to perform the intended function
Each component/system should be analyzed one by one to ensure that all failure modes are detected and to uncover “hidden” failure modes that are not easy to spot.
Examples of failure modes include:
electrical short circuit
Step #3: Do a failure effect and cause analysis
A failure effect analysis aims to determine the consequences that the environment and internal and external customers will experience if the component/system fails to perform its intended function.
The hierarchical tree and block diagram will shed light on system failure effect traceability, making it easier to see how the different functions of each component/system/process fail.
The failure cause column of the FMEA worksheet should be completed by focusing on “why” a particular failure mode can occur. These causes can range from:
human error (like using the wrong type of material for load-bearing applications)
material defects (like low-quality material caused by poor manufacturing standards)
incorrect engineering requirements (like failing to follow engineering standards that stipulate minimum area requirements for a steel frame used as a support)
A root cause analysis (RCA) is an important technique that can be used at this point to uncover potential causes of function failure.
Step #4: Assign severity rankings
The severity ranking looks at the impact the failure will have on users, downstream operations, the environment, and any other person or setting that could be affected.
A ranking scale of 1 to 10 is used, where 10 means the effect is considered dangerous and requires immediate changes to mitigate risks. A ranking of 1 means there is no effect or the severity is insignificant; therefore, the effect does not require any specific intervention at the system or subsystem levels.
Step #5: Assign occurrence rankings
The occurrence ranking column of the FMEA worksheet uses a custom ranking list to classify the frequency of the cause of the failure. At this stage, it is important to note that the occurrence ranking is cause-focused; hence, the information provided in the cause column will have a direct influence on the probability of occurrence.
As shown in the table above, the occurrence ranking ranges from one to ten, and associated failure probabilities indicate the probability of occurrence. A rating of 1 means that the likelihood of occurrence is very low (nearly impossible), while a rating of 10 means that the probability of occurrence is extremely high.
Step #6: Evaluate and assign failure detection rankings
Detection rankings assess how likely the control methods currently in place are to detect each failure mode. These detection controls can be as simple as visible and audible indications that are triggered when there is a failure.
The control mechanism used should ideally detect failure before it results in a complete system or component failure. In other words, subpart detection controls (secondary controls) are essential for optimal operation of high-level systems where primary controls are located.
The detection rating is scored on a scale of 1 to 10. A rating of 1 means that it is certain that the failure will be detected with the current controls in place. A rating of 10 is used when it is uncertain that the controls will detect the failure.
If the likelihood of detection is low, then the overall system’s effects should be studied to identify how hazards and safety issues must be monitored or where a change in design or component could introduce an appropriate control mechanism. If no appropriate control mechanisms can be used, then periodic testing as a part of routine maintenance is the best way to avoid catastrophic failures that can otherwise go undetected.
Step #7: Calculate RPN
The risk priority number (RPN) is calculated by multiplying the three rankings for severity, occurrence, and detection together. The result gives another rank that is used to prioritize the decisions made to improve the design, process, or system.
The value that is considered critical is chosen by the team doing the FMEA and varies from industry to industry. Therefore, a clear decision has to be made in the early stages to establish which threshold is considered “critical”. That being said, RPNs that are not regarded as critical should still be assessed to see if there are interrelations with other systems that could result in propagated failure.
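The calculation itself is simple enough to sketch in a few lines. The example below assumes that each ranking is an integer between 1 and 10; the worksheet rows and the `CRITICAL_RPN` threshold of 100 are hypothetical values chosen for illustration, not recommended figures.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number: the product of the three rankings."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Hypothetical worksheet rows: (failure mode, severity, occurrence, detection).
rows = [
    ("cable snaps under load", 9, 3, 6),
    ("cable housing corrodes", 5, 4, 3),
    ("lever return spring weakens", 3, 2, 2),
]

CRITICAL_RPN = 100  # assumed threshold; each team sets its own

# Sort from highest to lowest RPN so the riskiest failure modes come first.
ranked = sorted(((rpn(s, o, d), mode) for mode, s, o, d in rows), reverse=True)
for score, mode in ranked:
    flag = "CRITICAL" if score >= CRITICAL_RPN else "review"
    print(f"{score:4d}  {flag:8s}  {mode}")
```

Rows below the threshold are labeled "review" rather than dropped, since non-critical failure modes may still interact with other systems.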
Step #8: Take action
Recommended actions involve lowering the RPN by adjusting one of the three factors that contribute to it: severity, occurrence, and detection. These adjustments will depend on the type of FMEA being conducted, whether changing the design, process, or system.
The occurrence ranking can be lowered by controlling the causes or, where possible, removing them altogether. The detection ranking can be improved using additional control mechanisms that either use visual or audible signals to highlight potential malfunction. For instance, one can improve failure detection by installing condition monitoring sensors that provide real-time information about the state of specific components.
Once the corrective actions have been taken, the next step is to recalculate the RPN. The design team should decide the RPN threshold below which no further intervention is necessary.
As a rule, the severity rating is the most difficult to adjust and should not be lowered unless there is a significant change in design or process that considerably lowers the severity.
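To see how a corrective action changes the result, the ratings can be compared before and after. This is a sketch with made-up numbers: the `Rating` class and all values are hypothetical, and severity is deliberately left unchanged because no design change is assumed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rating:
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self):
        return self.severity * self.occurrence * self.detection


# Hypothetical ratings for one failure mode before corrective action.
before = Rating(severity=8, occurrence=5, detection=7)

# Assumed actions: a cause is controlled (occurrence 5 -> 2) and a monitoring
# sensor is added (detection 7 -> 3). Severity stays the same.
after = replace(before, occurrence=2, detection=3)

reduction = 1 - after.rpn / before.rpn
print(f"RPN before: {before.rpn}, after: {after.rpn}, reduction: {reduction:.0%}")
```

Keeping the original rating alongside the new one makes it easy to document what each action achieved.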
In the FMEA example below, we look at how the methodology can be applied to a bicycle brake cable. The first steps involve understanding the system and assigning the failure modes and effects. After calculating the RPN we note possible concerns that need to be addressed to mitigate high-risk issues.
By looking at the highest RPN we see that temperature and loading can affect the material properties of nylon, which could cause an accident if the operator is unable to close the brake calipers. We know from this concern that we are looking at a design issue (material choice) and will need to study possible nylon substitutes that perform equal or better, but do not have the same safety issues in humid conditions.
Once we have made the changes (if possible) to the design, we recalculate the RPN to see how effective the design modifications were.
How to perform FMECA analysis
FMECA is a two-step process guided by the systematic methodology of FMEA and criticality analysis (CA). The CA – which is the highlight of the method – can be done either quantitatively or qualitatively. The CA aims to rank the significance of each failure mode after conducting a bottom-up or top-down analysis.
In the subsequent sections, we take a closer look at the FMECA, starting with a short recap of FMEA.
Step #1: Perform FMEA
Before FMECA can be performed, all the necessary groundwork outlined above for FMEA has to be completed. Relevant information from the FMEA worksheet is then transferred to the FMECA worksheet. This crucial first step of FMECA is essential to determine what needs to be corrected.
Whether changes are made to the design, process, or system during the FMECA analysis depends on whether a top-down approach or a bottom-up approach is used. The top-down approach (also known as the functional method) starts with the system or sub-system’s function and associated failure modes early in the design stage before the entire system is mapped out. The bottom-up approach (also known as the hardware method) is used when the entire system has been decided.
Step #2a: Determine the necessary parameters for a qualitative CA
To perform a qualitative CA, best judgment based on experience and knowledge about component failure modes for specific parts is used to arrive at failure rates for components/systems. Similar to FMEA, the aim is to assign occurrence and severity rankings that will be used to calculate the RPN.
Five common types of severity levels are described as follows:
Catastrophic: Impact of effect results in considerable losses to the environment, human life, and business operations.
Harmful: The system is compromised and cannot perform its intended function. Consequently, further operation with the component/system is harmful.
Marginal: The component/system has some level of degradation, but is not fully compromised and can still perform its functions.
Minor: There is some degree of failure, but the component/system continues to function optimally. Therefore, failure is not perceptible and does not dangerously compromise operations.
No impact: Failure has occurred, but it is difficult to be certain. This is usually the case without proper control mechanisms in place or a preventative maintenance plan that proactively checks for component/system deterioration.
In some cases, there are four severity levels for FMECA instead of five. The number of levels depends on the industry. FMECA is based on the MIL-STD-1629A standard that was specifically developed for US military operations and can be adapted according to industry requirements.
The occurrence ranking will be based on the likelihood of a particular failure rate. Examples of failure rates are shown in the table below:
Step #2b: Determine the necessary parameters for a quantitative CA
A criticality number is calculated within a quantitative CA based on known failure rates, failure modes, failure effect probabilities, and any other information that quantifies a component/system’s failure. The steps in a quantitative CA are more complicated, involving mathematical formulations with specific variables that are used to calculate the failure mode criticality number.
The failure mode criticality number represents how often a particular failure mode occurs. The item’s criticality number quantifies the effects (consequences) of failure. It is calculated by summing up all the failure mode criticality numbers.
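In MIL-STD-1629A, the failure mode criticality number is the product of the failure effect probability (beta), the failure mode ratio (alpha), the part failure rate (lambda_p), and the operating time (t). The sketch below applies that formula and sums the results per item as described above; the pump, its failure rate, and the (beta, alpha) pairs are hypothetical values.

```python
def mode_criticality(beta, alpha, failure_rate, hours):
    """C_m = beta * alpha * lambda_p * t for a single failure mode."""
    return beta * alpha * failure_rate * hours


def item_criticality(modes, failure_rate, hours):
    """C_r: the sum of C_m over the item's failure modes."""
    return sum(mode_criticality(b, a, failure_rate, hours) for b, a in modes)


# Hypothetical pump: failure rate in failures per hour over 1,000 operating hours.
FAILURE_RATE = 2e-5
HOURS = 1_000

# One (beta, alpha) pair per failure mode; the alpha values here sum to 1.
modes = [(1.0, 0.6), (0.5, 0.3), (0.1, 0.1)]

c_r = item_criticality(modes, FAILURE_RATE, HOURS)
print(f"Item criticality number: {c_r:.4f}")
```

Note that the standard groups the sum by severity classification, so a fuller implementation would keep one total per severity level.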
Step #3: Adjust failure rate for redundancy
Redundancies within a component/system ensure that there is an extra layer of protection against failure. It affects the failure rate and needs to be considered to accurately represent how failure can affect a particular system. Different formulations are used in various industries to consider redundancies. For reference, the MIL-STD-1629A standard can be used for suggestions.
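As one simplified illustration of why redundancy matters, the sketch below assumes identical, independent units in active parallel with a constant failure rate, where the function is lost only if every unit fails. This is a textbook reliability model, not the specific adjustment used in any given industry, and the numbers are hypothetical.

```python
import math


def failure_probability(failure_rate, hours, redundant_units=1):
    """Probability that all parallel units fail within the given time.

    Assumes independent, identical units with a constant failure rate.
    """
    unit_failure = 1.0 - math.exp(-failure_rate * hours)
    return unit_failure ** redundant_units


# Hypothetical pump with 1e-4 failures per hour over 1,000 operating hours.
single = failure_probability(1e-4, 1_000)
dual = failure_probability(1e-4, 1_000, redundant_units=2)

print(f"single unit: {single:.4f}")
print(f"two in parallel: {dual:.4f}")
```

Under these assumptions, adding a second unit lowers the probability of losing the function by roughly an order of magnitude.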
Step #4: Calculate criticality number or RPN
The ranking approach of a qualitative or quantitative CA differs by the way failure criticality is considered.
In a quantitative CA, a criticality number is calculated, as mentioned in Step #2b. For a qualitative CA, a risk priority number (RPN) is used.
In some cases, the RPN is calculated as we outlined earlier in the article. However, in other cases, only the product of severity and occurrence is considered. The choice of which RPN depends on the type of industry and whether detection methods are in place.
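Both variants can be covered by one small function in which the detection ranking is optional. The `risk_score` name and the sample rankings are hypothetical.

```python
from typing import Optional


def risk_score(severity: int, occurrence: int, detection: Optional[int] = None) -> int:
    """Return severity x occurrence, multiplied by detection when it is given."""
    score = severity * occurrence
    if detection is not None:
        score *= detection
    return score


print(risk_score(7, 4))     # severity x occurrence only
print(risk_score(7, 4, 5))  # full RPN with detection
```

Scores from the two variants sit on different scales, so they should not be compared against the same threshold.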
Step #5: Create a criticality matrix
The criticality number or RPN, calculated in Step 4, is used to rank failure modes and their respective frequencies to create a criticality matrix. A criticality matrix provides a graphical (visual) representation of failure modes by ranking their likelihood and severity.
The severity rank is plotted on the x-axis, while the probability (frequency) is plotted on the y-axis. Failure modes with low criticality are located toward the bottom right-hand side, and those with high criticality are located at the top left of the matrix. This latter group will require careful attention as they usually involve injuries, damage to the environment, and equipment loss.
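A text version of such a matrix can be built by grouping failure modes into cells. The sketch below follows the orientation described above, with the most severe column on the left and the most likely row at the top. The severity names are taken from the list in Step #2a, while the likelihood labels and the failure mode IDs are hypothetical.

```python
from collections import defaultdict

SEVERITY = ["Catastrophic", "Harmful", "Marginal", "Minor", "No impact"]   # x-axis
LIKELIHOOD = ["Frequent", "Probable", "Occasional", "Remote", "Unlikely"]  # y-axis


def build_matrix(failure_modes):
    """Group failure mode IDs by their (likelihood, severity) cell."""
    cells = defaultdict(list)
    for mode_id, severity, likelihood in failure_modes:
        if severity not in SEVERITY or likelihood not in LIKELIHOOD:
            raise ValueError(f"unknown level for {mode_id}")
        cells[(likelihood, severity)].append(mode_id)
    return cells


def render(cells):
    """Return the matrix as aligned text, one row per likelihood level."""
    lines = [" " * 12 + "".join(f"{s:<14}" for s in SEVERITY)]
    for likelihood in LIKELIHOOD:
        row = []
        for s in SEVERITY:
            ids = ",".join(cells.get((likelihood, s), [])) or "-"
            row.append(f"{ids:<14}")
        lines.append(f"{likelihood:<12}" + "".join(row))
    return "\n".join(lines)


# Hypothetical failure modes: (ID, severity, likelihood).
cells = build_matrix([
    ("FM1", "Catastrophic", "Frequent"),
    ("FM2", "Marginal", "Occasional"),
    ("FM3", "Minor", "Remote"),
    ("FM4", "Catastrophic", "Frequent"),
])
print(render(cells))
```

Failure modes that share a cell are listed together, which makes clusters in the top-left corner easy to spot.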
Step #6: Determine critical items and take appropriate action
The critical items that are located to the left of the criticality matrix diagonal involve some level of risk that is ranked according to specific requirements. Steps should be taken to modify the likelihood of those failure modes through:
design modifications that mitigate risks associated with critical components
process changes that reduce the likelihood of failure
creating redundancies that add an extra layer of security
replacing specific components for optimum performance
enhancing control techniques that detect failure
Making these adjustments as part of a structured action plan will influence both the quantitative and qualitative CA.
After the adjustments have been made, the criticality matrix is then updated. If necessary, further action is taken until risks have been mitigated as much as possible (to the desired level).
To put all the theory into some perspective, we look at a short example where FMECA is used in the oil and gas industry.
In the example below, the FMEA part has been transferred up to the causes of the failure modes step. The water supply (the primary system) is broken into different subsystems to assign respective criticality numbers.
Immediately, we can see that the CA performed is both quantitative and qualitative (variables we mentioned in Step #2a and Step #2b are used). Redundancies are accounted for in the example for all the subsystems and the associated failure effect probability, failure mode, and failure rate are shown. The failure rate with and without redundancies is calculated, which means adjustments to failure rates have already been performed.
Further to the right of the FMECA worksheet, the RPN is calculated. In this case, we notice that all RPN values are below 10. Does this mean that the failure mode is not critical? No, as discussed before, the threshold used to classify the RPN depends on the standard adopted by the company doing the study.
We can see from the example that the severity of the failure modes of the storage system and the heat exchanger is higher than that of the pumping system.
What can be done to remedy that? For the heat exchanger failure mode, better controls to detect scaling due to untreated water and a preventative maintenance plan are some of the ways to adjust the criticality and RPN values.
Performing FMECA and FMEA analysis is not something that should be done just for the sake of it. It is a fairly complex process that brings huge benefits, but only if it is performed right.
This piece aims to provide a comprehensive overview so you have a better idea of what to expect. For those who are doing it for the first time, it is recommended to first go through proper FMEA training.
At the end of the day, a strong failure modes and effects analysis has the potential to make the life of maintenance professionals significantly less stressful, which is why they shouldn’t shy away from participating in this process.