How do you evaluate the quality of the products you buy?
Traditional quality control in a factory will consist of performing predefined checks and tests. If the product satisfies set requirements, it is deemed good to go. However, you will never say that you bought a quality product if you had to go through the reclamation process two or more times before the warranty period expired.
Reliability and reliability engineering help determine a product’s quality by adding time to the quality equation. In other words, we no longer just want to know if a product can perform its intended function at the moment of purchase. Instead, we want to make sure that the product works without major malfunctions under normal conditions for as long as possible.
Reliability engineering does not only help organizations produce more reliable products, but it also informs maintenance teams on how to maintain them to increase MTBF (mean time between failures) and asset lifespan.
In this article, we will help you use reliability and reliability engineering by reviewing:
- the concept of reliability
- core principles of reliability engineering
- the basics of reliability assessment
- and ways in which reliability engineers can improve equipment reliability
What is reliability?
Reliability is a term used to describe the ability of a component or system to meet certain performance standards over a certain period of time, assuming normal operating conditions.
To put it in another perspective, if we have two systems that operate under the same conditions, the one that works longer with less major hiccups is the more reliable one.
Since no one can predict the future and guarantee that a product won’t fail for exactly X hours of use, calculating reliability comes with a dose of uncertainty that is expressed in the form of probability. Among other things, we can use reliability calculation to estimate what is the chance that a system will work properly after x hours or days of use. Naturally, the reliability of any system will be high in the beginning and decline over time.
Reliability is often confused with durability, quality, and availability. While the concepts are similar, they should not be used interchangeably. Here’s a short explanation for each.
Reliability vs. durability
Durability can be defined as the ability of a physical product to remain functional, without requiring excessive maintenance or repair, when faced with the challenges of normal operation over its design lifetime (definition stolen from Tim Cooper).
The main difference between reliability and durability is that durability is mostly concerned with how long a product can last despite the breakdowns it survives, while reliability is trying to reduce the overall number and frequency of those breakdowns.
Moreover, the durability component is used to describe a characteristic of physical items, while reliability can be used for virtual systems too.
Depending on the product and its field of application, durability can be expressed in hours of use, the number of operational cycles, or years of existence.
Reliability vs. quality
Quality is a concept that is hard to define. One popular way to describe it is by looking at the factors that affect product quality. This leads us to the concept of eight dimensions of quality.
This is actually an easy way to differentiate between reliability and quality as we can just consider reliability (and durability if you look closer) to be one dimension of quality.
If we take reliability as a standalone concept, another way to look at their relationship is by saying that a reliable system is one that keeps his quality over time.
Reliability vs. availability
Availability shows the percentage of time that a system is available (fully operational) to perform what it is designed to do.
The concept is very often used in IT to describe the availability of cloud infrastructure. Systems with the highest availability are in the 99.99% range (which means that a service/system is not available for only ~52 minutes out of the whole year; often just to perform scheduled maintenance).
Availability is impacted by reliability and maintainability. More reliable systems will experience fewer failures which will improve their availability. Similarly, the faster you perform scheduled maintenance, the less downtime you will have, which again leads to increased availability.
What is reliability engineering?
Reliability engineering refers to the systematic application of best engineering practices and techniques to make more reliable products in a cost-effective manner. Reliability engineering methodology can be applied across the product lifecycle: from design and manufacturing to operation and maintenance.
That being said, the main value of reliability engineering lies in the early detection of possible reliability issues. If we catch a reliability issue at an early stage of the product lifecycle like the design stage, we can greatly minimize future costs (i.e. by eliminating the need for a significant product redesign after it is already in the market). This idea is represented in the graph below.
The goals of reliability engineering are as follows:
- To use engineering knowledge and techniques to prevent certain failure modes and to reduce the likelihood and frequency of failures.
- To identify and correct the causes of failures that do occur, despite the efforts to prevent them.
- To determine ways of dealing with failures that do occur, if their causes have not been corrected.
- To apply methods for estimating the likely reliability of new designs and for analyzing reliability data.
If you look at the list more closely, you will see that the goals are ordered in a way that follows the natural progress of the application of different reliability methods. There is no sense in trying to add redundancies for all identified failures if some of them can be prevented with simple design changes. In other words, the above list represents steps that should be followed in sequential order to ensure reliability practices are applied cost-effectively.
The basics of reliability assessment
The end goal of reliability assessment is to have a robust set of qualitative and quantitative evidence that the use of our component/system will not come with an unacceptable level of risk. It is an integral part of reliability engineering.
In this context, risk can be defined as the combination of probability of failure (how likely it is that failure will happen) and failure severity (what is the fallout of the failure; can include safety risk, potential secondary damage, cost of spare parts and labor, production losses, etc.).
Understanding failure mechanisms and failure modes
It is not always easy to draw the line between cause and failure. If that wasn’t the case, there would be little need for reliability engineers and failure analysis.
To understand failure modes and failure mechanisms well enough to address them efficiently, complex systems need to be “broken down” into components. This way you can analyze them on an individual level, as well as based on how they interact with one another.
In addition to everything said, the way the system interacts with its user and the environment is another element to add to the list of things that need to be considered as both misuse and poor working conditions can reduce product reliability.
Common tasks and techniques used in reliability engineering
Depending on how complex the system is and the type of the system we are looking at, there are a variety of techniques and tasks that can be applied as a part of our reliability engineering efforts:
- Root cause analysis (RCA)
- Reliability centered maintenance (RCM)
- FMEA and FMECA
- Design FMEA and Process FMEA
- Physics of failure (PoF)
- Bulit-in self-test
- Reliability block analysis
- Field data analysis
- Fault tree analysis
- Eliminating single point of failure (SPOF)
- Human error analysis
- Operational hazard analysis
- Looking at maintenance history to analyze failure rates and collect failure data
- All kinds of data collection tests that measure how system/component perform under stress
By using all of these measures, we can find weak points of our system and see what are the chances that these weaknesses might result in malfunctions. If the perceived risk is high enough, we have to deal with them through corrective action. Common solutions come in the form of design changes (e.g., adding redundancy), detection control, maintenance guidelines, and user training.
As we mentioned in the intro of this article, reliability is often the game of chance (probability). Since you are dealing with percentages and statistical data to define risk, it is very important that the whole team is on the same page and agrees about the acceptable levels of risk that they are trying to achieve.
This is why it is very important to use precise language when describing problems and proposing solutions. Moreover, because of incomplete statistical data and other uncertainties, some reliability professionals recommend focusing on solutions rather than failure chances.
For part/system failures, reliability engineers should concentrate more on the “why and how”, rather than predicting “when”. Understanding “why” a failure has occurred (e.g. due to over-stressed components or manufacturing issues) is far more likely to lead to improvement in the designs and processes used than quantifying “when” a failure is likely to occur (e.g. via determining MTBF). To do this, first the reliability hazards relating to the part/system need to be classified and ordered (based on some form of qualitative and quantitative logic if possible) to allow for more efficient assessment and eventual improvement. O’Connor, Patrick D. T. (2002), Practical Reliability Engineering
How reliability engineers can improve equipment reliability
There are several ways in which reliability engineers can help to improve and optimize maintenance processes at their facility that will ultimately result in increased equipment reliability. We discuss a few of them below.
Helping with the design and development of spare parts
Wear and tear that comes with daily use doesn’t discriminate. Most assets will need to be fitted with spare parts on a regular basis to continue operating in an efficient manner.
Companies that have the right resources might opt in to use CNC machines or 3-D printing to create their own parts instead of constantly restocking their spare parts inventory. Furthermore, they might have an old machine with spare parts that are no longer sold or have to deal with a nasty breakdown that requires a custom part.
In these scenarios, reliability engineers can work closely with the maintenance team to design, test, and produce quality replacement parts that will improve the reliability of onsite assets.
Performing root cause analysis
One thing reliability engineers should be very good at is identifying and understanding failure causes. Because of that, they can be tasked with performing root cause analysis (RCA). They can examine OEM manuals, maintenance practices, equipment maintenance logs, and other documentation to find the reasons why specific machines are failing and suggest how to eliminate and/or mitigate each of the found failure causes.
One way to address potential causes is by applying RCM practices.
Making sure maintenance actions address the right failure modes
This is an extension of the previous point. Since the last point was concentrated on finding what you are not doing (which failure modes you are not addressing), let’s focus here on what you might be doing wrong.
Most companies will find themselves in a situation where they are performing regular maintenance on an asset, and that asset is still experiencing breakdowns. While there can be many reasons for that, one of them is that maintenance technicians are doing something wrong – like not addressing the right failure modes. This is where referring to RCA analysis can be very helpful.
Similarly, reliability engineers can occasionally check how different maintenance practices are executed and how they can be improved. They can check if the maintenance team is using outdated practices and doing preventive maintenance tasks that add value and address the right problems. All of these should be easily accessible in a good CMMS software.
To learn more about CMMS, you can check out our guide on what is a CMMS and how does it work.
Last but not least, reliability engineers can also help with choosing the right condition-monitoring sensors and equipment for the implementation of advanced maintenance strategies like Condition-based maintenance and Predictive maintenance.
Serious reliability engineering efforts bring serious results. With the right knowledge, reliability techniques can be implemented regardless of the size of your company.
Going forward, we hope that organizations will continue to invest in reliability as it helps everyone involved. Production companies benefit from producing better quality products, maintenance teams have less trouble maintaining them, and users have fewer performance issues over the lifespan of their products. It’s a win-win-win situation.