Root cause analysis (RCA) is the quality management process by which an organization searches for the root of a problem, issue or incident after it occurs.
Issues and mishaps are inevitable in any organization, even in the best of circumstances. While it could be tempting to simply address symptoms of the problem as they materialize, addressing symptoms is an inherently reactive process that all but guarantees a recurring—and often worsening—series of problems.
Ethical, proactive, well-run organizations and those with a reactive approach will both encounter problems, but the former will experience fewer and recover faster because they prioritize root cause analyses.
Root cause analysis helps organizations decipher the root cause of the problem, identify the appropriate corrective actions and develop a plan to prevent future occurrences. It aims to implement solutions to the underlying problem for more efficient operations overall.
Organizations perform root cause analyses when a problem arises or an incident occurs, but there are any number of issues that need an RCA. Triggers for a root cause analysis fall into three broad categories.
When real-world materials or equipment fails in some way (for example, a desktop computer stops working or a component from a third-party vendor delivers substandard performance).
When people make mistakes or fail to complete required tasks (for example, an employee fails to perform regular maintenance on a piece of equipment, causing it to break down).
A breakdown in a system, process or policy people use to make decisions (for example, a company fails to train team members on cybersecurity protocols, leaving the company vulnerable to cyberattacks).
Organizations can conduct root cause analyses for a range of reasons, from commonplace email service disruptions to catastrophic equipment failures. Regardless of the nature or scope of the issue, performing root cause analysis should include the same fundamental steps.
If you have decided to conduct a root cause analysis, your department or organization is likely experiencing some acute issue, or at least looking to make substantive improvements to a particular process. Therefore, the first step of the root cause analysis process should be identifying and defining the problem that you want to address. Without a clearly defined problem, it is impossible to correctly identify the root causes.
When the department has a clear idea of the problem, it’s time to draft a problem statement spelling out the issue for everyone who will help with the RCA.
Once the issue is identified and clearly articulated to all involved parties, leadership should create a project charter, which will assemble a team to complete the analysis. The team should include a facilitator to lead the team through the analysis and any team members with either personal or professional knowledge of the systems, processes and incidents that you will investigate.
Data collection is the foundation of the problem-solving process. It is vital, at this stage, to find every piece of information that can help you identify contributing factors and ultimately the root causes of the issue. This can include collecting photographs and incident reports, conducting interviews with affected parties and reviewing existing policies and procedures. Some questions that you may want to ask during data collection:
This is the most important step in the RCA process. At this point, the team has collected all necessary information and starts to brainstorm for causal factors. Effective root cause analyses require openness to all potential underlying causes of an issue, so everyone on the RCA team should enter the brainstorming stage with an open mind. Avoid attempts to determine root causes until every possibility is identified and vetted; starting the incident investigation process with preconceived notions may bias the results and make it more difficult to determine the real root cause.
Once the RCA team has an exhaustive list of possible causes and contributing factors, it is time to determine the root causes of the issue. Analyze every possible cause and examine the actual impact of each one to figure out which possibilities are the most problematic, which ones have similarities and which ones can be altogether eliminated. Be prepared for the possibility that there are multiple root causes to the issue.
After the team narrows the list of possibilities, rank the remaining potential root causes by their impact and the likelihood they are the root cause of the problem. Leadership will examine and analyze each possibility and collaborate with the RCA team to determine the actual root causes.
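The ranking step can be sketched in a few lines of Python. This is only an illustration: the 1-to-5 impact scale, the 0-to-1 likelihood estimate, the `impact * likelihood` score and the example causes are all assumptions made here, not a scoring scheme prescribed by any RCA standard.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A potential root cause that survived the narrowing step (hypothetical type)."""
    description: str
    impact: int        # assumed scale: 1 (minor) to 5 (severe)
    likelihood: float  # assumed scale: 0.0 to 1.0, the team's estimate


def rank_candidates(candidates):
    """Return candidates ordered from highest to lowest impact * likelihood."""
    return sorted(candidates, key=lambda c: c.impact * c.likelihood, reverse=True)


# Hypothetical example data, loosely based on the trigger categories above.
candidates = [
    Candidate("Maintenance schedule not followed", impact=4, likelihood=0.6),
    Candidate("Vendor component below spec", impact=5, likelihood=0.3),
    Candidate("No training on new procedure", impact=3, likelihood=0.9),
]

ranked = rank_candidates(candidates)
for position, c in enumerate(ranked, start=1):
    print(position, c.description)
```

A single multiplied score is the simplest possible choice; a team may prefer to keep impact and likelihood separate and discuss trade-offs with leadership rather than rely on one number.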
Once the team settles on root causes and has laid out all the details of the issue, they must start brainstorming solutions. The solution should directly address the root causes, with consideration for the logistics of executing the solution and any potential obstacles the team may encounter along the way. These elements will comprise the action plan that will help the team address the current problem and prevent recurrences.
While all RCAs will include the same basic steps, there are myriad root cause analysis methods that can help an organization collect data efficiently and effectively. Typically, a company will select a method and use root cause analysis tools, such as analysis templates and software, to complete the process.
The 5 Whys approach is rooted in the idea that asking five “Why?” questions can get you to the root cause of anything. 5 Whys urges problem solvers to avoid assumptions and continue to ask “why” until they identify the root cause of a problem. In the case of a formalized organizational root cause analysis, a team may only need to ask three whys to find the root cause, but they may also need to ask 50 or 60. The purpose of 5 Whys is to push the team to ask as many questions as are necessary to find the correct answers.
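As a rough illustration of the "keep asking until there is no further answer" idea, the sketch below follows a chain of recorded answers. The function name `ask_whys`, the dictionary of answers and the example incident are hypothetical; in practice the answers come from the team's investigation, not from a lookup table.

```python
def ask_whys(problem, answers, max_depth=100):
    """Follow 'why?' answers from a problem until no further answer is recorded.

    `answers` maps a statement to the reason the team gave for it.
    `max_depth` is a safety limit so circular answers cannot loop forever.
    """
    chain = [problem]
    current = problem
    while current in answers and len(chain) <= max_depth:
        current = answers[current]
        chain.append(current)
    return chain


# Hypothetical answers collected during an investigation.
answers = {
    "The server went offline": "The disk filled up",
    "The disk filled up": "Log files were never rotated",
    "Log files were never rotated": "Log rotation was not part of the deployment checklist",
}

chain = ask_whys("The server went offline", answers)
for depth, statement in enumerate(chain):
    print("  " * depth + statement)
```

Here the chain stops after three questions, not five. The stopping rule is that no further answer exists, which matches the point that the number of questions is not fixed.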
A failure mode and effects analysis (FMEA) is one of the most rigorous approaches to root cause analysis. Similar to a risk analysis, FMEA identifies every possibility for system or process failure and examines the potential impact of each hypothetical failure. The organization then addresses every root cause that is likely to result in failure.
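One common way to decide which failure modes are "likely to result in failure" is a risk priority number (RPN), the product of severity, occurrence and detection ratings. The article does not describe this scoring, so treat the sketch below as an assumption layered on top of it; the 1-to-10 scales, the threshold and the example failure modes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One hypothetical way a system or process could fail."""
    name: str
    severity: int    # assumed scale: 1 (negligible) to 10 (catastrophic)
    occurrence: int  # assumed scale: 1 (rare) to 10 (frequent)
    detection: int   # assumed scale: 1 (always caught) to 10 (rarely caught)

    @property
    def rpn(self):
        """Risk priority number: severity * occurrence * detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes, threshold):
    """Return failure modes at or above the threshold, highest RPN first."""
    selected = [m for m in modes if m.rpn >= threshold]
    return sorted(selected, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes for illustration only.
modes = [
    FailureMode("Pump seal leak", severity=7, occurrence=4, detection=3),
    FailureMode("Sensor drift", severity=5, occurrence=6, detection=7),
    FailureMode("Label misprint", severity=2, occurrence=3, detection=2),
]

for mode in prioritize(modes, threshold=80):
    print(mode.name, mode.rpn)
```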
Pareto charts combine the features of bar charts and line charts to understand the frequency of the organization’s most common root causes. The chart displays root causes in descending order of frequency, starting with the most common and probable. The team then addresses the root cause whose solution provides the most significant benefit to the organization.
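The data behind a Pareto chart is straightforward to compute: the counts per cause in descending order form the bars, and the cumulative percentage forms the line. The following is a minimal sketch using only the Python standard library; the function name `pareto_table` and the incident data are made up for illustration.

```python
from collections import Counter


def pareto_table(observations):
    """Return (cause, count, cumulative_percent) rows, most frequent cause first."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Hypothetical log of root causes assigned to ten past incidents.
incidents = (
    ["config error"] * 5
    + ["hardware fault"] * 3
    + ["operator error"] * 2
)

table = pareto_table(incidents)
for cause, count, cumulative in table:
    print(f"{cause:<16} {count:>3} {cumulative:>6}%")
```

Plotting the counts as bars and the cumulative column as a line on a secondary axis gives the familiar chart; any plotting library can do this from the same rows.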
An impact analysis allows an organization to assess both the positive and negative potential impacts of each possible root cause.
Change analyses are helpful in situations where a system or process’s performance changed significantly. When conducting this type of RCA, the department looks at how the circumstances surrounding the issue or incident have changed over time. Examining changes in personnel, information, infrastructure, or data, among other factors, can help the organization understand which factors caused the change in performance.
An event analysis is commonly used to identify the cause of a major, single-event problem, like an oil spill or building collapse. Event analyses rely on quick (but thorough) evidence-gathering processes to recreate the sequence of events that led to the incident. Once the timeline is established, the organization can more easily identify the causal and contributing factors.
Also known as a causal factor analysis, a causal factor tree analysis allows an organization to record and visually display—using a causal factor tree—every decision, event or action that led to a particular problem.
An Ishikawa diagram (or Fishbone diagram) is a cause-and-effect style diagram that visualizes the circumstances surrounding a problem. The diagram resembles a fish skeleton, with a long list of causes grouped into related subcategories.
DMAIC is an acronym for the Define, Measure, Analyze, Improve and Control process. This data-driven process improvement methodology serves as a part of an organization’s Six Sigma practices.
The Kepner-Tregoe method proposes finding the root cause of an issue by moving through a four-step problem-solving process. The process starts with situation analysis and continues with problem analysis and solution analysis, concluding with potential problem analysis.
A fault tree analysis (FTA) allows an organization to visually map potential causal relationships and identify root causes using Boolean logic.
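To make the Boolean-logic idea concrete, here is a small sketch of a fault tree built from AND and OR gates and evaluated against a set of basic events. The `Event` and `Gate` classes, the `occurs` function and the outage scenario are hypothetical names and data chosen for this illustration; real fault tree tools also support probabilities and further gate types.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Event:
    """A basic event (leaf of the tree), identified by name."""
    name: str


@dataclass
class Gate:
    """An intermediate or top event combining its children with AND or OR."""
    name: str
    kind: str  # "AND" or "OR"
    children: List[Union["Gate", Event]] = field(default_factory=list)


def occurs(node, active):
    """Return True if `node` occurs given the set of active basic event names."""
    if isinstance(node, Event):
        return node.name in active
    results = [occurs(child, active) for child in node.children]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"Unknown gate kind: {node.kind}")


# Hypothetical tree: the service goes down if power is lost OR the disk fails.
# Power is lost only if the mains AND the backup generator both fail.
power_loss = Gate("power loss", "AND", [
    Event("mains failure"),
    Event("backup generator failure"),
])
top = Gate("service outage", "OR", [power_loss, Event("disk failure")])

print(occurs(top, {"mains failure"}))
print(occurs(top, {"mains failure", "backup generator failure"}))
```

Trying different combinations of basic events against the tree shows which ones, alone or together, are sufficient to produce the top event.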
Barrier analyses are based on the idea that proper barriers can prevent problems and incidents. This type of RCA, often used in risk management, examines how the absence of appropriate barriers led to an issue and makes suggestions for installing barriers that prevent the issue from reoccurring.
Companies that use the RCA process want to put an end to “firefighting” and treating the symptoms of a problem. Instead, they want to optimize business operations, reduce risk and provide a better customer experience. Investing in the root cause analysis process provides a framework for better overall decision-making and allows an organization to benefit from:
Continuous improvement: Root cause analysis is an iterative process, seeking not only to address acute issues, but also to improve the entire system over time, starting with the underlying cause. The iterative nature of root cause analysis empowers organizations to prioritize continuous process improvement.
Increased productivity: Preventing downtime, delays, worker attrition and other production issues within an organization saves employees time, freeing up bandwidth to focus on other critical tasks.
Reduced costs: When equipment breaks down or software bugs cause delays, organizations lose money and workers get frustrated. Root cause analysis helps eliminate the cost of continually fixing a recurring issue, resulting in a more financially efficient operation overall.
Better defect detection: When companies fail to address underlying issues, they can inadvertently affect the quality of the end product. Addressing persistent problems before they snowball protects the organization from revenue and reputational losses that are associated with product defects down the line.
Reduced risks: Improving business processes and systems keeps equipment running safely and helps workers avoid safety hazards in the workplace.