IT Experience.  Practical Solutions.

DITY™ Newsletter

Fault Tree Analysis Made Easy
Vol. 1.5, Dec. 14, 2005

Hank Marquis, 2006, CTO

By Hank Marquis    

UPDATED FEB. 11, 2006: ADDED LINK TO SOA
UPDATED MAR. 8, 2006: ADDED LINK TO TOP

If you are ITIL certified, you’ve heard of Fault Tree Analysis, or FTA.  But if you’re like most, you probably have no idea how to actually perform or use FTA!

Simply put, FTA is a method for discovering the root causes of failures or potential failures.  FTA then helps you understand how to fix or prevent the failure. 

FTA is an analysis that starts with a top-level event, like a service outage.  You then work downward from that event to evaluate all the contributing faults, and the causes of those faults, that may ultimately lead (or have led) to its occurrence.  You use the resulting fault tree diagram to identify countermeasures that eliminate the causes of the failure.

FTA requires nothing more complex than paper, pencil, and an understanding of the service at hand.  To get the most value from FTA, you will need accurate contextual information about the Configuration Items (CIs) involved.  The following six simple steps can help you resolve tough design issues or problems quickly and easily.
 

  1. Select a top level event for analysis.  Try to be specific, for example, “Email server down for more than 4 hours.”  Sources of top level events include:
    a.  Problem/Known Error Records
    b.  Service Outage Analysis 
    [See ‘Service Outage Analysis in 7 Steps’ DITY Vol. 1 #7 for more on SOA] 
    c.  Potential failures identified through brainstorming or by a Technical Observation Post (TOP)
    d.  “What-if” scenarios based on Service Level Agreements, etc.
     
  2. Identify faults that could lead to the top level event.  Continuing the above example, one possible fault leading to an outage lasting more than 4 hours might be “loss of power”; another might be “hardware failure.”  List all the faults in boxes under the top level event, and connect each fault box to the top level event box by drawing a line.  
     
  3. For each fault, list as many causes as possible in boxes below the related fault.  Continuing the example above, in the case of “loss of power”, some causes might be “electrical outage”, “power supply failure”, and so on.  Connect the boxes to the appropriate fault box.
     
  4. Two logic operators – And and Or, also known as logic gates – are used to represent how faults and causes combine.  Or means any one of the grouped items is enough to cause the item above it; And means all of them must occur together.  For example, “Email server down for more than 4 hours” could be caused by “loss of power” Or “hardware failure”.  Another example might be “loss of building power” And “battery backup exhausted.”  Group logically related items by placing an And or Or gate between the top level event and its faults, and between each fault and its causes.  Then re-draw the lines so they run from the top level event to logic gates to faults to logic gates to causes.  The result is a graphical fault tree diagram.

  5. Continue identifying causes for each fault until you reach a root cause, or one that you can do something about.  For example, the root cause of “power supply failure” might be “filter clogged”; the root cause of “battery backup exhausted” might be “battery backup too small”.
     
  6. A root cause is one you can do something about; so now you need to think of the countermeasures you might apply to each root cause.  List countermeasures for each root cause in a box under the root cause.  For example, for “filter clogged” a countermeasure might be “clean filter monthly.”  Link the countermeasure to the root cause by drawing a line.
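
The six steps above can also be sketched in code.  The following Python is only an illustration, not part of FTA itself: the class name Node, its fields, and the helper functions render and leaf_causes are hypothetical names chosen for this example.  The tree is one possible arrangement of the email server example used in the steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One box in the fault tree: a top level event, a fault, or a cause."""
    name: str
    gate: Optional[str] = None  # "AND" or "OR"; None when no gate is needed
    children: List["Node"] = field(default_factory=list)
    countermeasures: List[str] = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as indented text lines, one box per line."""
    label = node.name + (f" [{node.gate}]" if node.gate else "")
    lines = ["  " * depth + label]
    for measure in node.countermeasures:
        lines.append("  " * (depth + 1) + "-> countermeasure: " + measure)
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def leaf_causes(node):
    """Collect the boxes with nothing beneath them: root causes,
    or faults that have not been analyzed any further yet."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaf_causes(child))
    return found


# Steps 1-6 applied to the example (one possible arrangement).
tree = Node("Email server down for more than 4 hours", "OR", [
    Node("loss of power", "OR", [
        Node("electrical outage", "AND", [
            Node("loss of building power"),
            Node("battery backup exhausted", children=[
                Node("battery backup too small"),
            ]),
        ]),
        Node("power supply failure", children=[
            Node("filter clogged",
                 countermeasures=["clean filter monthly"]),
        ]),
    ]),
    Node("hardware failure"),
])

print("\n".join(render(tree)))
print("Leaves:", [n.name for n in leaf_causes(tree)])
```

Leaves without a countermeasure, such as “hardware failure” here, are the places where the analysis in steps 5 and 6 still has work to do.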

And that's it!  Now you have a fault tree!  Fault trees show how an event can occur, and what you can do about it from a design or change perspective.  For Problems, you also have a possible root cause and a solution!

As you see, FTA is very simple.  Don’t let its simplicity fool you, however.  If you want to get fancy, you can use probability statistics to try to get even more precise by determining the “chance” that a fault or cause could occur.  Very precise calculations are possible.  But even if you don’t get fancy, you will have taken a powerful step toward preventing problems in the first place, or resolving tough problems.  Often the act of creating a fault tree generates excellent ideas and possible solutions where before there were none.
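
As a rough sketch of the probability calculation, assume every cause is statistically independent of the others.  Under that assumption an And gate multiplies the probabilities of its inputs, and an Or gate gives one minus the chance that none of its inputs occurs.  The function names and all probability values below are invented placeholders for illustration (think of them as the chance of each cause over some chosen period), not measured data.

```python
from math import prod


def and_gate(probabilities):
    """All inputs must occur: multiply (assumes independent inputs)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Any input is enough: 1 minus the chance that none occurs
    (assumes independent inputs)."""
    return 1 - prod(1 - p for p in probabilities)


# Invented placeholder probabilities for the email server example.
p_building_power_lost = 0.02
p_battery_exhausted = 0.10
p_power_supply_failure = 0.01
p_hardware_failure = 0.03

# "loss of building power" And "battery backup exhausted"
p_electrical_outage = and_gate([p_building_power_lost, p_battery_exhausted])

# "electrical outage" Or "power supply failure"
p_loss_of_power = or_gate([p_electrical_outage, p_power_supply_failure])

# "loss of power" Or "hardware failure"
p_top_event = or_gate([p_loss_of_power, p_hardware_failure])

print(f"Chance of the top level event: {p_top_event:.4f}")
```

If causes are not independent, for example when one power event takes out both the building supply and the battery backup, these simple formulas no longer hold and the numbers should be treated with caution.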

FTA can be used by Technical Observation Post (TOP) teams, Problem Managers, Availability Managers, and even IT Service Continuity Management teams with a minimum of training.  [See ‘7 Steps to the TOP’ DITY Vol. 2 #10 for more on the TOP]  The graphical nature of FTA makes it easy to understand and easy to maintain in the face of Changes.

All in all, FTA is a powerful tool if you are trying to “Do IT Yourself”. 

Entire Contents © 2006 itSM Solutions LLC.  All Rights Reserved.