By Hank Marquis
UPDATED FEB. 11, 2006: ADDED LINK TO SOA
UPDATED MAR. 8, 2006: ADDED LINK TO TOP
If you are ITIL certified, you’ve heard of Fault Tree
Analysis, or FTA. But if you’re like most, you probably have no idea
how to actually perform or use FTA!
Simply put, FTA is a method for discovering the root causes of failures or
potential failures. FTA then helps you understand how to fix or
prevent the failure.
FTA is an analysis that starts with a top-level event, like a service
outage. You then work it downward to evaluate all the contributing
faults, and the causes of faults, that may ultimately lead (or have led) to
its occurrence. You use the fault tree diagram to identify
countermeasures to eliminate the causes of the failure.
FTA requires nothing more complex than paper, pencil, and an understanding
of the service at hand. To get the most value from FTA, you will also need accurate
contextual information about the Configuration Items (CIs) involved. The following six
simple steps can help you resolve tough design issues or problems quickly
and easily.
- Select a top-level event for analysis. Try
to be specific, for example, “Email server down for more than 4 hours.” Sources of top-level events include:
a. Problem/Known Error Records
b. Service Outage Analysis (SOA) [See ‘Service Outage Analysis in 7 Steps’ DITY Vol. 1 #7 for more on SOA]
c. Potential failures identified through brainstorming and a Technical Observation Post (TOP)
d. “What-if” scenarios based on Service Level Agreements, etc.
- Identify faults that could lead to the top-level
event. Continuing the above example, one possible fault leading to
an outage lasting more than 4 hours might be “loss of power”; another
might be “hardware failure.” List all the faults in boxes under the
top-level event, and draw lines connecting each fault box to the
top-level event box.
- For each fault, list as many causes as
possible in boxes below the related fault. Continuing the example above,
in the case of “loss of power”, some causes might be “electrical outage”,
“power supply failure”, and so on. Connect the boxes to the appropriate
fault box.
- Two logic operators, And and Or (also
known as logic gates), are used to represent how faults and causes
combine to produce an event. For example, “Email server down for more than 4
hours” could be caused by “loss of power”
Or “hardware failure”. Another example might be “loss of building power”
And “battery backup exhausted.” Group logically related items by
placing an And or an Or gate between events and faults, and between
faults and causes. Then redraw the lines so that they run from the top-level
event to logic gates to faults to logic gates to causes. The result
is a graphical fault tree diagram.
- Continue identifying causes for each fault until
you reach a root cause, or one that you can do something about. For
example, the root cause of “power supply failure” might be “filter clogged”; the root cause of “battery backup exhausted”
might be “battery backup too small”.
- A root cause is one you can do something about;
so now you need to think of the countermeasures you might apply to
each root cause. List countermeasures for each root cause in a box under
the root cause. For example, for “filter clogged” a
countermeasure might be “clean filter monthly.” Link the countermeasure
to the root cause by drawing a line.
And that's it! Now you have a fault tree! Fault trees show how an
event can occur, and what you can do about it from a design or change
perspective. For Problems, you also have a possible root cause and a
solution!
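For readers who prefer to keep the tree in a file rather than on paper, the steps above can be sketched in a few lines of Python. This is only an illustration, not part of the FTA method itself: the `Node` class and the `render` and `root_causes` helpers are hypothetical names, and the way the example's faults and causes are nested is an assumption based on the labels used in the steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One box in the fault tree: an event, a fault, or a cause."""
    label: str
    gate: str = "OR"                      # how the children combine: "AND" or "OR"
    children: List["Node"] = field(default_factory=list)
    countermeasure: Optional[str] = None  # only meaningful on root causes


def render(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented text lines, one box per line."""
    indent = "  " * depth
    # Show the gate only where there is more than one child to combine.
    gate = f" [{node.gate}]" if len(node.children) > 1 else ""
    lines = [f"{indent}{node.label}{gate}"]
    if node.countermeasure:
        lines.append(f"{indent}  countermeasure: {node.countermeasure}")
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def root_causes(node: Node) -> List[Node]:
    """Collect the boxes with nothing beneath them (the root causes so far)."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(root_causes(child))
    return found


# Example tree built from the article's labels; the exact nesting is an assumption.
tree = Node("Email server down for more than 4 hours", "OR", [
    Node("Loss of power", "OR", [
        Node("Electrical outage", "AND", [
            Node("Loss of building power"),
            Node("Battery backup exhausted", children=[
                Node("Battery backup too small"),
            ]),
        ]),
        Node("Power supply failure", children=[
            Node("Filter clogged", countermeasure="Clean filter monthly"),
        ]),
    ]),
    Node("Hardware failure"),
])

if __name__ == "__main__":
    print("\n".join(render(tree)))
    print("Root causes still lacking a countermeasure:")
    for cause in root_causes(tree):
        if cause.countermeasure is None:
            print(" -", cause.label)
```

Listing the root causes that have no countermeasure yet is a quick way to see which branches of the tree still need work.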
As you see, FTA is very simple. Don’t let its
simplicity fool you, however. If you want to get fancy, you can assign
probabilities to faults and causes to get even more precise, determining the
“chance” that a fault or cause could occur. Very precise calculations
are possible. But even if you don’t get fancy, you will
have taken a powerful step toward preventing problems in the first place, or
resolving tough problems. Often the act of creating a fault tree generates
excellent ideas and possible solutions where before there were none.
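As a rough sketch of the “fancy” option, the usual gate arithmetic can be written in a few lines of Python. It assumes the inputs to each gate are independent of one another, which real faults often are not. The probability figures below are invented purely for illustration, and the helper names `and_gate` and `or_gate` are hypothetical.

```python
from math import prod


def and_gate(*probabilities: float) -> float:
    """All inputs must occur: multiply the probabilities (independence assumed)."""
    return prod(probabilities)


def or_gate(*probabilities: float) -> float:
    """Any input suffices: 1 minus the chance that none of them occurs."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical yearly probabilities, invented for illustration only.
p_building_power_lost = 0.02
p_battery_exhausted = 0.10
p_power_supply_failure = 0.01
p_hardware_failure = 0.03

# Work upward from the causes to the top-level event, gate by gate.
p_electrical_outage = and_gate(p_building_power_lost, p_battery_exhausted)
p_loss_of_power = or_gate(p_electrical_outage, p_power_supply_failure)
p_top_event = or_gate(p_loss_of_power, p_hardware_failure)

print(f"Chance of the top-level event: {p_top_event:.4f}")  # prints 0.0416
```

With these made-up numbers, “hardware failure” contributes far more to the top-level event than the And branch does, which is the kind of insight that helps decide where a countermeasure is most worthwhile.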
FTA can be used by Technical Observation Post (TOP)
teams, Problem Managers, Availability Managers, and even IT Service
Continuity Management teams with a minimum of training.
[See ‘7 Steps to the TOP’ DITY Vol. 2 #10 for more on the TOP]
The graphical nature of FTA makes it easy to understand and easy to
maintain in the face of Changes.
All in all, FTA is a powerful tool if you are trying
to “Do IT Yourself”.