The IT Infrastructure Library™ (ITIL®) describes the steps of the root cause analysis method called Kepner-Tregoe: Define and Describe the Problem, Establish possible causes, Test the most probable cause, and Verify the true cause.
ITIL mentions Kepner-Tregoe, but does not give enough detail to use it to solve difficult problems.
Simple as it sounds, most technicians and technical leads do not actually follow Kepner-Tregoe. They rely instead on preconceived ideas and often skip important steps. Then, without a plan and in desperation, they fall back on the good old "when in doubt, swap it out" technique.
Taking the time to use Kepner-Tregoe can result in dramatic improvements in troubleshooting, and deliver permanent fixes to prevent future problems as well.
Below I provide a template for using Kepner-Tregoe that problem managers and staff can use to accelerate root cause analysis.
The actual name is Kepner-Tregoe Problem Solving and Decision Making (PSDM). The part of PSDM that ITIL refers to is what Kepner-Tregoe calls Problem Analysis. Problem Analysis helps the practitioner make sound decisions. It provides a process to identify and sort all the issues surrounding a decision. As a troubleshooting tool, Problem Analysis helps prevent jumping to conclusions.
Immature troubleshooters use hunches, instinct, and intuition. These individual acts of heroism may seem brilliant, but they can also result in more problems since jumping to conclusions often compounds or expands problems instead of solving them.
Problem Analysis leverages the combined knowledge, experience, intuition, and judgment of a team, resulting in faster and better decisions. Using Problem Analysis to aid Problem Management not only brings the team together, but also helps identify root cause. Problem Analysis is a problem solving and decision making framework. Six Sigma, Lean Manufacturing and ITIL all describe Problem Analysis.
The Problem Analysis process divides decision-making into five steps:
1. Define the Problem
2. Describe the Problem
3. Establish possible causes
4. Test the most probable cause
5. Verify the true cause
Problem Analysis begins with defining the problem. The problem management team cannot afford to skip this critical step: failing to understand exactly what the issue is often wastes precious time. Many immature troubleshooters consider this step wasted effort because they already know what they are going to do, and that is the critical mistake. Preconceived notions often lengthen an outage, and poor judgment can even expand it.
Since problem management is inherently a team exercise, it is important to have a group understanding of the problem. Consider the following examples. A poor problem definition might appear as follows:
"The server crashed."
A better problem definition includes more information. A good model for clarifying statements of all sorts is the Goal Question Metric (GQM) method, which produces an unambiguous, easily understood statement with a clear Object, Purpose, Focus, Environment, and Viewpoint. A clarified problem definition might be:
"The e-mail system crashed after the 3rd shift support engineer applied hot-fix XYZ to Exchange Server 123."
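As a rough illustration of the difference between the two definitions, the sketch below checks a statement against the five facets named above. The `missing_facets` helper and the way the example sentence is split into facets are assumptions made for illustration; in particular, the purpose and viewpoint values are not stated in the example and are invented here.

```python
# Facet names come from the GQM-style list in the text. The checker is an
# illustrative helper, not part of Kepner-Tregoe or GQM.
FACETS = ("object", "purpose", "focus", "environment", "viewpoint")


def missing_facets(statement: dict) -> list:
    """Return the facets that are absent or blank in a problem statement."""
    return [name for name in FACETS if not statement.get(name, "").strip()]


# "The server crashed." names an object and a focus, nothing else.
poor = {"object": "the server", "focus": "crash"}

# One possible reading of the clarified definition.
better = {
    "object": "e-mail system (Exchange Server 123)",
    "purpose": "find the root cause of the outage",  # assumed value
    "focus": "crash after hot-fix XYZ was applied",
    "environment": "hot-fix applied by the 3rd shift support engineer",
    "viewpoint": "problem management team",  # assumed value
}

print(missing_facets(poor))    # ['purpose', 'environment', 'viewpoint']
print(missing_facets(better))  # []
```

A non-empty result is a prompt for the team to keep asking questions before moving on to the next step.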
When developing a problem definition always use the "5 Whys technique" to arrive at the point where there is no explanation for the problem. Using 5 Whys with Kepner-Tregoe only accelerates the process.
With a clear problem definition, the next step is to describe the problem in detail. The following chart provides a nice template for this activity. You can do this using a presentation board, paper, or common office software. Table 1 describes the basic worksheet used in the process.
The worksheet describes the four aspects of any problem: what it is, where it occurs, when it occurred, and the extent to which it occurred. The IS column provides space to describe specifics about the problem -- what the problem IS. The COULD BE but IS NOT column provides space to list related but excluded specifics -- what the problem COULD BE but IS NOT. These two columns aid in eliminating "intuitive but incorrect" assumptions about the problem. With columns one and two completed, the third column provides space to detail the differences between the IS and COULD BE but IS NOT. These differences form the basis of the troubleshooting. The last column provides space to list any changes made that could account for the differences.
ASPECT | IS | COULD BE but IS NOT | DIFFERENCES | CHANGES |
---|---|---|---|---|
WHAT | System failure | Similar systems/situations not failed | ? | ? |
WHERE | Failure location | Other locations that did not fail | ? | ? |
WHEN | Failure time | Other times where failure did not occur | ? | ? |
EXTENT | Other failed systems | Other systems without failure | ? | ? |
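For teams that prefer to keep the worksheet in a script rather than on a presentation board, Table 1 could be modelled along the following lines. This is only a sketch: the `Row` class, `blank_worksheet`, and `open_cells` are hypothetical names chosen here, and an empty string stands in for a "?" cell.

```python
from dataclasses import dataclass

# The four aspects (rows) of the worksheet.
ASPECTS = ("WHAT", "WHERE", "WHEN", "EXTENT")


@dataclass
class Row:
    """One worksheet row; empty strings mark cells not yet filled in."""
    is_: str = ""
    could_be_but_is_not: str = ""
    differences: str = ""
    changes: str = ""


def blank_worksheet() -> dict:
    """Create an empty worksheet with one row per aspect."""
    return {aspect: Row() for aspect in ASPECTS}


def open_cells(worksheet: dict) -> list:
    """List the (aspect, column) pairs the team still has to discuss."""
    return [
        (aspect, column)
        for aspect, row in worksheet.items()
        for column, value in vars(row).items()
        if not value.strip()
    ]


sheet = blank_worksheet()
sheet["WHAT"].is_ = "System failure"
sheet["WHAT"].could_be_but_is_not = "Similar systems/situations not failed"
print(len(open_cells(sheet)))  # 14 of the 16 cells are still open
```

Listing the open cells makes it harder to skip a column and jump straight to a favourite theory.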
Anyone who has spent time troubleshooting knows to ask "what has changed since it worked?" and to start by checking for changes. The problem is that many changes may have occurred, which complicates things. Problem Analysis helps here by describing what the problem is and what the problem could be, but is not. For example:
Problem: "The e-mail system crashed after the 3rd shift support engineer applied hot-fix XYZ to Exchange Server 123."
ASPECT | IS | COULD BE but IS NOT | DIFFERENCES | CHANGES |
---|---|---|---|---|
WHAT | Exchange Server 123 crashed upon application of hot-fix XYZ | Other Exchange Servers getting hot-fix XYZ | Different staff (3rd shift) applied this hot-fix | New patch procedure from vendor |
WHERE | 3rd floor production room without vendor/contractor support | Anywhere else with vendor/contractor support | Normally done by vendor | New procedure, first time 3rd shift applies hot-fixes |
WHEN | Last night, 1:35am | Any other time or location | None noted | |
EXTENT | Any Exchange Server on 3rd floor | Other servers | | |
History (and best practice) says that the root cause of a problem is probably some recent change.
With the completed worksheet, some new possible causes become apparent. From the worksheet above it becomes clear that the root cause is probably procedural: the vendor did not apply the hot-fix, but instead gave the company procedures for applying it.
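Turning the recorded changes into a list of candidate causes is mechanical enough to sketch in a few lines. The dictionary below copies the DIFFERENCES and CHANGES cells from the completed worksheet above; the `candidate_causes` function is a hypothetical helper, not something Kepner-Tregoe prescribes.

```python
# Cells copied from the completed worksheet; empty strings are cells the
# worksheet leaves blank.
worksheet = {
    "WHAT": {
        "differences": "Different staff (3rd shift) applied this hot-fix",
        "changes": "New patch procedure from vendor",
    },
    "WHERE": {
        "differences": "Normally done by vendor",
        "changes": "New procedure, first time 3rd shift applies hot-fixes",
    },
    "WHEN": {"differences": "None noted", "changes": ""},
    "EXTENT": {"differences": "", "changes": ""},
}


def candidate_causes(sheet: dict) -> list:
    """Collect each distinct non-empty CHANGES entry, in worksheet order."""
    seen = []
    for row in sheet.values():
        change = row["changes"].strip()
        if change and change not in seen:
            seen.append(change)
    return seen


for cause in candidate_causes(worksheet):
    print("-", cause)
```

Both entries that come out point at the new procedure, which matches the procedural reading of the worksheet.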
With a short list of possible causes (the recent changes, evaluated and turned into a list), the next step is to think through each possible cause. The following aid can help in this process. Ask the question:
"If ____ is the root cause of this problem, does it explain what the problem IS and what the problem COULD BE but IS NOT?"
If a potential cause is the root cause, then it has to "map to" or "fit into" all the aspects of the Problem Analysis worksheet (Table 2). Use a worksheet like that shown in Table 3 to help organize your thinking around the potential causes.
Potential root cause: | True if: | Probable root cause? |
---|---|---|
Exchange Server 123 has something wrong with it | Only Exchange Server 123 has this problem | Maybe |
Procedure incorrect | Same procedure crashes another server | Probably |
Technician error | Problem did not always reoccur | Probably not |
The next step is to compare the possible root causes (Table 3) against the problem description (Table 2). Eliminate possible causes that cannot explain the situation, and focus on the remaining items.
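The elimination step can be sketched as a simple filter and sort over the rows of Table 3. The numeric ranks assigned to the three verdicts are an assumption made here only so the rows can be ordered; the judgment behind each verdict still comes from the team.

```python
# Entries mirror Table 3. The numeric ranks are an assumption used only to
# sort the verdicts; they are not part of the method.
RANK = {"Probably": 2, "Maybe": 1, "Probably not": 0}

candidates = [
    ("Exchange Server 123 has something wrong with it",
     "Only Exchange Server 123 has this problem", "Maybe"),
    ("Procedure incorrect",
     "Same procedure crashes another server", "Probably"),
    ("Technician error",
     "Problem did not always reoccur", "Probably not"),
]


def shortlist(rows: list) -> list:
    """Drop causes judged 'Probably not' and put the most probable first."""
    kept = [row for row in rows if RANK[row[2]] > 0]
    return sorted(kept, key=lambda row: RANK[row[2]], reverse=True)


for cause, true_if, verdict in shortlist(candidates):
    print(f"{verdict}: {cause} (verify: {true_if})")
```

The "True if" text travels with each surviving cause so that it can serve as the check to run before any change is made.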
Before making any changes, verify that the proposed cause really is the root cause. Failure to verify the true cause invalidates the entire exercise and is no better than guessing. After verifying the true cause, you can propose the action required to repair the problem.
It is important here as well to think about how to prevent similar problems from occurring in the future. The Problem Manager should consider how the issue arose in the first place by asking some questions:
The goal is to try to eliminate future occurrences of the problem.
Kepner-Tregoe is a mature process with decades of proven capabilities. There are worksheets, training programs, and consulting firms all schooled in the process. You can take courses at many local colleges as well.
Kepner-Tregoe Problem Analysis was used by NASA to troubleshoot Apollo XIII – even though the technicians did not believe the results, they followed the process and saved the mission. The rest of the story, as they say, is history...
Even without a lot of time available, using Kepner-Tregoe Problem Analysis can result in the most efficient problem resolutions. Armed with tools like 5 Whys and Ishikawa diagramming, a Problem Manager can capture the combined experience and knowledge of a team. When used with Kepner-Tregoe Problem Analysis the result is amazing.