By Hank Marquis
The ITIL refers to
Service or Systems Outage Analysis (SOA) as a method to improve
availability. Unfortunately, the ITIL does not explain how one actually
performs SOA! This article explains the benefits of SOA and gives you a
7-step guide to performing it.
The objective of SOA is
to reduce the frequency and duration of outages while improving Mean Time
To Repair (MTTR). The result of SOA is clear exposure of the risk of
future outages, as well as recommendations for improvement.
SOA is a powerful
technique that requires no major investment in tools or training. The
process is straightforward. Working with Problem Management and Customers,
you examine past outages and identify related Configuration Items (products,
people or process). Then you review the impact of the organization and
infrastructure on availability.
To get started, collect
outage data and assemble a team of people familiar with the outages. Then
guide them through the following 7 steps.
- Group related outages together;
create groupings by vendor, product, family, application, customer, etc.
Categorize each outage as “significant” or “less significant”. Focus
only on those labeled “significant”; monitor the “less significant” for
future outages. For each “significant” outage, review the root cause of
the unavailability, for example faulty hardware or software. This is
probably already known, since the outage has been resolved.
- Using a Pareto analysis (80/20 rule),
rank the related outages and their causes. You will see that the
majority of the outages result from a select few causes. Focus on the
“80%” of the outages caused by the “20%” of the causes.
- For each grouping of similar
outages, examine the reasons for the duration of the unavailability.
For example, the outage may have occurred because of faulty hardware or
software, but the duration of the unavailability might have been
extended by a lack of tools, training, spares, etc.
- Remember to consider the three
“P’s” – People, Product and Process, and review:
  - Existing procedures and support
  policies that were invoked or used during this outage.
  - The actions (or inactions) of
  staff members, customers and anyone else involved in the outage or
  restoration.
- Try to determine if anything might
have lessened the duration of the outage, or avoided it altogether. The
examination should locate a trend, or at least something in common with
similar outages. This is what you are looking for – the “smoking gun.”
An example might be the lack of a tool, process or similarly related
item.
- Quantify the avoidable outage
time. That is, if one hour of downtime resulted from trying to locate
the proper tool, then the avoidable outage time is 1 hour x the number
of outages so affected. Identifying the most preventable downtime
is your goal.
- Prepare a Request for Change
(RFC) to address the most significant generator of preventable downtime!
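
The Pareto ranking and the avoidable outage time from the steps above lend
themselves to a small calculation. The following is a minimal Python sketch,
not part of the ITIL or any tool: the outage records, cause names and hour
figures are invented for illustration, and the function names `pareto_rank`
and `avoidable_hours_by_cause` are hypothetical. It ranks causes by outage
count with a running share of all outages, then totals the avoidable hours
per cause.

```python
from collections import defaultdict

# Hypothetical outage records: (root cause, downtime hours, avoidable hours).
# Every value here is invented for illustration only.
OUTAGES = [
    ("faulty disk", 4.0, 1.0),
    ("faulty disk", 3.0, 1.0),
    ("faulty disk", 5.0, 1.0),
    ("bad patch", 2.0, 0.5),
    ("bad patch", 2.5, 0.5),
    ("power loss", 1.0, 0.0),
]


def pareto_rank(outages):
    """Return (cause, outage count, cumulative share) sorted by count, descending."""
    counts = defaultdict(int)
    for cause, _downtime, _avoidable in outages:
        counts[cause] += 1
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    result, running = [], 0
    for cause, count in ranked:
        running += count
        result.append((cause, count, running / total))
    return result


def avoidable_hours_by_cause(outages):
    """Sum the avoidable outage time for each cause."""
    totals = defaultdict(float)
    for cause, _downtime, avoidable in outages:
        totals[cause] += avoidable
    return dict(totals)


if __name__ == "__main__":
    for cause, count, share in pareto_rank(OUTAGES):
        print(f"{cause}: {count} outages, cumulative share {share:.0%}")
    print(avoidable_hours_by_cause(OUTAGES))
```

With this sample data, the top-ranked cause accounts for half of the outages
and 3 avoidable hours, which would make it the first candidate for an RFC.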
The SOA ends with the creation of a
report that summarizes the number of outages analyzed and the report
timeframe, lists the avoidable outage time, and offers suggestions for
reducing or avoiding the outages.
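
As a rough illustration of what such a report could contain, here is a
minimal Python sketch. The function name `soa_report`, its parameters and
the plain-text layout are assumptions made for this example, not a format
prescribed by the ITIL.

```python
from datetime import date


def soa_report(outage_dates, avoidable_hours, suggestions):
    """Build a plain-text SOA summary from data that has already been analyzed.

    outage_dates    -- list of datetime.date, one per outage analyzed
    avoidable_hours -- dict mapping cause to avoidable outage hours
    suggestions     -- list of improvement suggestions (strings)
    """
    start, end = min(outage_dates), max(outage_dates)
    lines = [
        f"Outages analyzed: {len(outage_dates)}",
        f"Timeframe: {start.isoformat()} to {end.isoformat()}",
        f"Avoidable outage time: {sum(avoidable_hours.values()):.1f} hours",
        "Suggestions:",
    ]
    lines += [f"  - {item}" for item in suggestions]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical inputs, invented for illustration.
    print(soa_report(
        [date(2024, 1, 5), date(2024, 3, 20)],
        {"faulty disk": 3.0, "bad patch": 1.0},
        ["Stock spare disks on site"],
    ))
```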
When you are done, you will have a
documented business case justifying a Change! Most importantly, SOA
provides you with a clear roadmap that shows exactly how to remove a significant
source of downtime from your infrastructure.