Preview of Crisis Management Foundation
The lifecycle of a crisis, which includes a disaster, outage or Major Incident. The process of dealing with that lifecycle, and all of its aspects, is covered.
Reducing Product Development Risk with Reliability Engineering Methods (Wilde Analysis Ltd.)
Overview of how reliability engineering methodology and software tools can help companies manage risk during product development and improve performance.
Presented at the Interplas'2011 exhibition and conference at the NEC on 27th October 2011 by Mike McCarthy.
This presentation looks at how ‘Reliability Engineering’ tools and methods are used to reduce risk in a typical product development lifecycle involving both plastic and metallic components. These tools range in complexity from simple approaches to managing product reliability data to the application of sophisticated simulation methods on large systems with complex duty cycles. Three examples are:
- Failure Mode Effects (and Criticality) Analysis (FMECA) to identify, manage and reuse information on what could go wrong with a design or manufacturing process and how to avoid it
- Design of Experiments for optimising performance through a structured and efficient study of parameters that affect the product or manufacturing process (e.g. injection moulding)
- Accelerated Life Testing to identify potential long term failure modes of products released to market within a shortened development time.
We will explore how gathering enough of the right kind of data and applying it in an intelligent way can reduce risk, not only in plastic product design and manufacture, but also in managing the associated supply chain and in the ‘Whole Life Management’ of products (including warranties). Furthermore, we will show how ‘sparse’ data gathered from previous or similar products, such as field/warranty reports, engineering testing data and supplier data sheets, as well as FEA, CFD and injection moulding/extrusion simulation, can inform and positively influence new product design processes from concept stage onwards.
A Practical Approach to Incident Management for SaaS/PaaS (Michael Weber)
Starting with Incident Management for a SaaS/PaaS from scratch is a daunting task---so many choices! This presentation highlights the things to consider, and aims to provide a (hard-won) starting point for a simple process that can be implemented with limited tooling in a couple of days.
How to Build an Invincible Incident Management Plan (DevOps.com)
We all know that service degradation and outages are going to happen, especially as organizations increase their system complexity and their pace of change. It’s not a matter of if your organization will face this threat, but when.
However, total disaster is not inevitable. With a robust incident management plan in place, your team can recover from downtime quickly to mitigate revenue loss, customer churn, brand backlash and employee burnout. The answer is not to slow down the business, it’s to respond more effectively when incidents occur.
Join Splunk + VictorOps' Director of Product Marketing, Bill Emmett, for a live webinar on Thursday, June 27th at 1pm EDT to learn:
The essential components of an effective incident management plan
How to instill key downtime recovery principles in a team of any size or level
Tools to reduce MTTA/MTTR and power continuous improvement with greater automation, transparency and collaboration
A short presentation that explains what the As Late As Possible constraint is and what it does, and aims to clarify some of the misconceptions surrounding its use. Please send your feedback if the information provided was useful.
Advanced Regulatory Control (ARC) Control Scheme Examples: Handling volatile loads using PID controllers and custom programming within the control system. Includes: Sulfur recovery units, Boilers/Steam distribution, Hydrogen plants, Syngas plants, Air compressors, Acid gas handling, Waste gas incinerators and wastewater treatment.
http://www.aiche.org/ccps/conferences/global-congress-on-process-safety/2015
The Perfect STOrm in nature is a weather phenomenon where three systems converge on each other over the ocean to create havoc, making it nearly impossible for ocean-going vessels to navigate. Since our primary focus is on navigating complex, risky and expensive projects, it’s only fitting that we use this concept to demonstrate how to avoid, or manage STO Events in the petrochemical, oil & gas, and mining sectors.
Introduction of FMEA; Definition, Activities, important terms, factors, RPN; Process of FMEA; Steps of FMEA
Types of FMEA; FMEA Application; FMEA Related Tools:
Root Cause Analysis, Pareto Chart, Cause Effect Diagram
Authors: (i) Prashanth Lakshmi Narasimhan,
(ii) Mukesh Ravichandran
Industry: Automobile - Auto Ancillary Equipment (Turbocharger)
This was presented after the completion of our two-month internship at Turbo Energy Limited during our 3rd Year summer holidays (2013).
Definition, types of corrective maintenance, steps and cycle;
Measures of corrective maintenance are: Mean Corrective Maintenance Time, Median Active Corrective Maintenance Time, Maximum Active Corrective Maintenance Time.
Then different models: a system that can either be in up (operating) or down (failed) state; a system that can either be operating normally or failed in two mutually exclusive failure modes; a system that can either be operating normally, operating in degradation mode, or failed completely; a two identical-unit redundant (parallel) system, where at least one unit must operate normally for system success.
5 forces incident problem mgmt presentation (Anna Sadokhina)
Academic course: Business Process Management and Modelling
Process improvements and new process design for the IT department of Leroy Merlin Italia, with a focus on Incident and Problem management.
Activities: analysis and organisation of process improvements for the existing incident management process, and proposed strategies for smooth implementation of those improvements. Analysis, description, mapping, design and modelling (using the ARIS platform) of Problem Management processes from scratch.
The ultimate guide on constructing a FMEA process for Manufacturing, Maintenance, Services and Design.
The presentation includes step-by-step guidance on how to determine the failure modes and failure effects, assign severity, occurrence and detection, calculate risk priority numbers and prioritise the RPNs for action, with some examples and illustrations.
Presentation contents:
1. Determining failure modes, effects and causes.
2. FMEA team & team leader.
3. Brainstorming.
4. The basic steps of FMEA.
5. Examples.
Control systems - Your operators are too busy to take care of this... (Laurentide Controls)
A presentation of best practices and the ISA 18.2 standard, and of tools that reduce the number of alarms and allow operators to be more effective.
World-Class Incident Response Management (Keith Smith)
Taken from principles learned over many years at several companies, including Microsoft, this presentation describes the process of creating a strongly defined and repeatable Incident Response Management pipeline. The goal of this presentation is to increase companies' ability to maintain healthy cloud services throughout the entire application lifecycle. It describes how companies should identify, respond to, and manage incidents, on-call procedures, and organisational implementations that reduce incident fatigue and keep services consistently reliable and available.
NSA advisory about state-sponsored cybersecurity threats (Ronald Bartels)
Chinese State-Sponsored Actors Exploit Publicly Known Vulnerabilities. This advisory provides Common Vulnerabilities and Exposures (CVEs) known to be recently leveraged, or scanned-for, by Chinese state-sponsored cyber actors to enable successful hacking operations against a multitude of victim networks.
Problem management foundation - Introduction (Ronald Bartels)
Problem management is typically defined as an aggregated process that analyses issues within an organisation and provides causation for adverse events and situations.
A key element is how a major incident is handled, as this is one of the most crucial processes for an enterprise. A major incident, which is one with significant negative business consequences, needs to be handled with a well-defined process, something that is not clearly defined in existing methodologies.
This course addresses how an enterprise, with a focus on IT, needs to handle the major incident process, which includes those outages and failures that are on the immediate horizon of any enterprise.
It also deals with the aspects of handling problems within an organisation in a generic fashion, including supporting methodologies and processes.
An overview of crisis management
What is crisis management
Entities involved in crisis management
Incidents, problems and Major incidents (in an ITIL context)
Vital Business Functions
The causes of a major incident are a problem
Other problems are highlighted by the manner in which the major incident is handled
Refer to the Major Incident Classification Tool in the Appendix
The tool is used to ensure the correct classification of a Major Incident and that all details are captured
Pilots are trained on simulators because they cannot afford to deal with life-threatening events in the air by way of experimentation
The diligence applied in the aviation industry is seldom duplicated, with Information Technology being a case in point
Simulation is crucial to the successful resolution of a crisis
A disaster recovery test is an example of a simulation involving crisis management
The simulation exercises should cover
Media communications
Being able to avoid inconsistent communications
Social media interactions
Desktop exercises
Full blown scenario simulations (replay of known errors)
Co-ordination of all stakeholders
Deming wheel: Made popular by Dr W. Edwards Deming, based on work by Shewhart.
Concepts originate from scientific method and the works of Bacon.
Plan to improve service management by determining what is going wrong (that is, identify the problems), and then suggest resolutions.
Do changes designed to solve the problems on a small and incremental scale first. This minimizes disruption to the Live environment while testing whether the changes are workable.
Problem management foundation - Communications (Ronald Bartels)
- Understand the importance of communications during a major incident
- Identify and describe the various communications channels available
- Notifications
- Escalations
Problem management foundation - Control points (Ronald Bartels)
A Crisis Management control point is any physical location that is used during a crisis
These control points perform separate and distinct functions and aggregating them into a single entity is disruptive
Examples are:
- WAR rooms
- Surveillance control room
- TOP
- CMOC
- NOC or SOC
- Mission control
These locations are often overlooked and not built as part of normal operations
7. Problem Management Foundation
Recording of time
• When working with problems, time is the most crucial attribute to record.
• The time an event happens and the time between events provide the most significant clues to a problem's source.
• As an example, it is important to know when the event occurred as opposed to when it was detected. The two might not necessarily have occurred at the same time, and that in itself could be a problem.
8. Problem Management Foundation
Why record time?
An analysis of times may assist in clarifying the following:
• When was the business impacted by major incidents?
• Is it at recognised stages like month-end?
• Is the return to service being prioritised?
• Are we detecting incidents quickly?
• Are the systems being suitably managed or monitored?
• Are the incidents correctly diagnosed?
• Is this diagnosis performed within expected time parameters?
• Are investigators and technicians suitably trained?
9. Problem Management Foundation
Why record time (cont.)
• Are repair processes initiated within suitable time limits after diagnosis?
• Is there a logistics issue?
• Are service restore times for the client adequate?
• Is there an issue around continuity or outdated technology?
• Does the system start processing in an acceptable time period after being restored?
• Are there cumbersome system interface issues?
10. Problem Management Foundation
Timelines (dates and times) in the expanded incident lifecycle:
• Time when incident started (actual - something has happened to a CI or a risk event has occurred): <dd/mm/yy> <hh:mm>
• Time when incident was detected (incident is detected either by monitoring tools, IT personnel or, worst case, the user/customer): <dd/mm/yy> <hh:mm>
• Time of diagnosis (underlying cause - we know what happened): <dd/mm/yy> <hh:mm>
• Time of repair (process to fix failure started or corrective action initiated): <dd/mm/yy> <hh:mm>
• Time of recovery (component recovered - the CI is back in production - business ready to be resumed): <dd/mm/yy> <hh:mm>
• Time of restoration (normal operations resume - the service is back in production): <dd/mm/yy> <hh:mm>
• Time of workaround (service is back in production with a workaround): <dd/mm/yy> <hh:mm>
• Time of escalation (to the problem management team): <dd/mm/yy> <hh:mm>
• Time period service was unavailable (SLA measure): <minutes>
• Time period service was degraded (SLA measure): <minutes>
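To make the lifecycle timestamps above concrete, here is a minimal sketch (not part of the original deck) of how they could be recorded and turned into stage durations. The field names, example dates and report layout are illustrative assumptions.

```python
# Sketch: record the expanded incident lifecycle timestamps and derive
# the duration (in minutes) of each stage. All names and values are examples.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IncidentTimeline:
    started: datetime            # something happened to a CI / risk event occurred
    detected: datetime           # picked up by monitoring, IT staff or the customer
    diagnosed: datetime          # underlying cause understood
    repair_started: datetime     # corrective action initiated
    recovered: datetime          # CI back in production
    restored: datetime           # service back in production
    workaround: Optional[datetime] = None   # service back with a workaround
    escalated: Optional[datetime] = None    # handed to the problem management team

    def minutes(self, start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 60

    def report(self) -> dict:
        """Durations (in minutes) for each stage of the expanded lifecycle."""
        return {
            "detection": self.minutes(self.started, self.detected),
            "diagnosis": self.minutes(self.detected, self.diagnosed),
            "repair": self.minutes(self.diagnosed, self.repair_started),
            "recovery": self.minutes(self.repair_started, self.recovered),
            "restoration": self.minutes(self.recovered, self.restored),
            "unavailable": self.minutes(self.started, self.restored),
        }


timeline = IncidentTimeline(
    started=datetime(2015, 3, 2, 8, 5),
    detected=datetime(2015, 3, 2, 8, 40),
    diagnosed=datetime(2015, 3, 2, 9, 30),
    repair_started=datetime(2015, 3, 2, 9, 45),
    recovered=datetime(2015, 3, 2, 11, 0),
    restored=datetime(2015, 3, 2, 11, 20),
)
print(timeline.report())
```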
11. Problem Management Foundation
Measuring time
How do you improve? Understand the different time periods from outage to full resolution and which ones are not optimal.
• Detection time - between when the outage occurred and when it was known (does the monitoring tool work? Do you detect HD RAID failures? Do you detect redundant network path failures?)
• Diagnostic time - working out what went wrong. How good are your troubleshooting skills? Have you identified the correct causes?
• Ready to repair - being able to gather all required resources to fix what is broken. (Are the parts available?)
12. Problem Management Foundation
Measuring time (cont.)
• Recovery time - the failed components have been fixed and are ready to be placed back in production.
• Restoration time - the system is back in production.
• Notification times - clients and users of the system are informed, e.g. do they know they can transact?
• Risk profile completion time - time to gather and analyse risk associated with the incident.
• Counter measures implementation - time that relevant counter measures are implemented to reduce identified threats.
13. Problem Management Foundation
Representing time
• Understand where the problem is by using graphs.
• Useful to aggregate these statistics over multiple Major Incidents to understand trends.
• Extrapolate statistics that will define and set appropriate SLA times.
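As a rough illustration of aggregating these statistics over multiple Major Incidents, the sketch below computes mean, median and maximum stage durations from hypothetical data; the stage names and figures are assumptions, not measurements from the deck.

```python
# Sketch: aggregate stage durations over several major incidents so trends
# become visible and realistic SLA targets can be derived. Data is made up.
from statistics import mean, median

# Minutes spent in each stage, one dict per major incident (hypothetical).
incidents = [
    {"detection": 35, "diagnosis": 50, "repair": 90, "restoration": 20},
    {"detection": 5,  "diagnosis": 120, "repair": 60, "restoration": 15},
    {"detection": 60, "diagnosis": 30, "repair": 45, "restoration": 25},
]

for stage in ("detection", "diagnosis", "repair", "restoration"):
    values = [incident[stage] for incident in incidents]
    print(f"{stage:12s} mean={mean(values):6.1f} min "
          f"median={median(values):6.1f} min max={max(values)} min")
```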
15. Problem Management Foundation
Measurements
The typical values of availability are expressed as 9s (from two 9s to five 9s). Here is an example:
• 99% availability: 5,256 minutes (87.6 hours) of downtime per year
• 99.5% availability: 2,628 minutes (43.8 hours) of downtime per year
• 99.9% availability: 528 minutes (8.8 hours) of downtime per year
• 99.99% availability: 53 minutes of downtime per year
• 99.999% availability: 5 minutes of downtime per year
These values are mapped to the following terms by Gartner:
• Normal system availability is 99.5%
• High system availability is 99.9%
• Fault resilience is 99.99%
• Fault tolerance is 99.999%
• Continuous processing is as close to 100% as possible.
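The downtime figures above follow directly from the availability percentage; the short worked example below shows the arithmetic. A 365-day year is assumed, so the computed figures differ slightly from the rounded values above.

```python
# Worked example: yearly downtime allowed by a given availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.0, 99.5, 99.9, 99.99, 99.999):
    downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability:7.3f}% availability -> "
          f"{downtime:8.1f} minutes ({downtime / 60:6.1f} hours) downtime per year")
```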
17. Problem Management Foundation
Detection
• When a disaster has occurred, it is important to record the events - numerous mechanisms are possible, dependent on the outage.
• It is possible to use video surveillance or even smartphone cameras to take pictures of what has occurred.
• This might help later, as diagnosis and root causation could be expedited by a review of the material.
• Another source of detection is logs, typically SYSLOG or the logs from applications such as web servers (use ELK to create a mission control dashboard!).
• Tools like NetFlow can assist in providing the precise time of outages and can also be a primary tool for root causation.
• Often it will assist to have screen scraping or enforced logging of access (such as log files when using SSH access and PuTTY).
• A disproportionate number of incidents being logged at the Service Desk is a potential indicator of a major incident.
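As one illustration of using logs for detection, the sketch below scans a classic syslog-format file for the first occurrence of an error signature to pin down when an outage actually started, as opposed to when it was detected. The log format, file path and signature are assumptions.

```python
# Sketch: find the timestamp of the first log line matching an error signature.
import re
from datetime import datetime
from typing import Optional

LOG_LINE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) (.*)$")

def first_occurrence(path: str, signature: str, year: int) -> Optional[datetime]:
    """Return the timestamp of the first log line containing the signature."""
    with open(path) as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if match and signature in match.group(3):
                # Classic syslog omits the year, so it must be supplied.
                return datetime.strptime(f"{year} {match.group(1)}",
                                         "%Y %b %d %H:%M:%S")
    return None

# Hypothetical usage: when did the first "link down" appear in the router log?
# print(first_occurrence("/var/log/syslog", "link down", 2015))
```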
18. Problem Management Foundation
Tools and Retrofit
• When an outage happens it is not possible to retrofit a detection tool.
• Surveillance of IT needs to be in place.
• Gathering SNMP metrics can provide a guideline for usage and congestion.
• ICMP provides a means of detecting failures and degradation (latency).
• A great poller for ICMP and SNMP is Opmantek's NMIS.
• Refer to the section on tools in this course.
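A minimal sketch of the ICMP idea above (this is not NMIS, just an illustration using the Linux ping command and hypothetical addresses): poll a list of hosts and record reachability and average round-trip latency.

```python
# Sketch: simple ICMP poll to detect failures (unreachable) and degradation (latency).
import re
import subprocess
from datetime import datetime
from typing import Optional

HOSTS = ["192.0.2.1", "192.0.2.2"]   # hypothetical device addresses

def ping(host: str) -> Optional[float]:
    """Return the average round-trip time in ms, or None if the host is unreachable."""
    result = subprocess.run(["ping", "-c", "3", "-W", "2", host],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None
    # Summary line looks like: rtt min/avg/max/mdev = 0.4/0.6/0.9/0.2 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", result.stdout)
    return float(match.group(1)) if match else None

for host in HOSTS:
    rtt = ping(host)
    status = "unreachable" if rtt is None else f"{rtt:.1f} ms"
    print(f"{datetime.now().isoformat()} {host} {status}")
```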
19. Problem Management Foundation
IS / IS NOT detection tool
For each description below, record both an IS (observation) and an IS NOT (observation):
• What is the defect?
• Which processes are impacted?
• Where in the processes has the failure occurred?
• Who is affected?
• When did it happen?
• How frequently did it happen?
• Is there a pattern?
• How much is it costing?
20. Problem Management Foundation
Alternative means
• Detection from the Service Desk - display call centre queues from the Service Desk to detect increased call volumes, which can be an indication of problems.
• Use social media such as TweetDeck to view notifications from the company's own clients; utilities such as power and water; local news or traffic.
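A hedged sketch of the Service Desk idea above: flag a possible major incident when the call volume in the current interval rises well above its recent baseline. The interval length, threshold factor and counts are assumptions.

```python
# Sketch: detect a call-volume spike relative to a rolling baseline.
from statistics import mean

def spike_detected(history: list[int], current: int, factor: float = 3.0) -> bool:
    """True when the current interval's call count exceeds the baseline by `factor`."""
    baseline = mean(history) if history else 0
    return baseline > 0 and current > factor * baseline

# Calls logged per 15-minute interval over the last few hours (hypothetical).
recent_intervals = [12, 9, 14, 11, 10, 13]
print(spike_detected(recent_intervals, current=55))   # True -> investigate
```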
22. Problem Management Foundation
Diagnose
• One of the primary triggers for an outage is a change in the environment.
• The first step should be to determine if there has been a change.
• The importance of recording precise times in the major incident lifecycle is now highlighted, as these are used to correlate the outage with when the last known change was made.
• Unauthorised changes also need to be investigated by reviewing anomalies, preferably in dashboards.
• A key part of diagnosis is referring to the system documentation to see what should have happened.
• Put eyes on the problem as soon as possible.
• As part of the diagnosis process, it is important to refer to previous major incident reports to assess whether the issue has occurred previously and whether the same actions can be followed to solve it.
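Since a change is one of the primary triggers for an outage, a first diagnostic step can be automated as sketched below: list the changes implemented in a window before the incident started. The change records, field names and window size are hypothetical.

```python
# Sketch: correlate the incident start time with recently implemented changes.
from datetime import datetime, timedelta

changes = [   # entries as they might appear in a change log (hypothetical)
    {"id": "CHG-1041", "implemented": datetime(2015, 3, 2, 2, 15), "ci": "core-router-1"},
    {"id": "CHG-1042", "implemented": datetime(2015, 3, 2, 7, 50), "ci": "billing-db"},
]

incident_started = datetime(2015, 3, 2, 8, 5)
window = timedelta(hours=24)

suspects = [c for c in changes
            if incident_started - window <= c["implemented"] <= incident_started]
for change in sorted(suspects, key=lambda c: c["implemented"], reverse=True):
    print(f"{change['id']} on {change['ci']} at {change['implemented']}")
```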
25. Problem Management Foundation
The predecessor of the Flying Fortress: the birth of the checklist
The Air Corps faced arguments that the aircraft was too big to handle. The Air Corps, however, properly recognised that the limiting factor here was human memory, not the aircraft's size or complexity. To avoid another accident, Air Corps personnel developed checklists the crew would follow for take-off, flight, before landing, and after landing. The idea was so simple, and so effective, that the checklist was to become the future norm for aircraft operations. The basic concept had already been around for decades, and was in scattered use in aviation worldwide, but it took the Model 299 crash to institutionalize its use.
"The Checklist," Air Force Magazine
26. Problem Management Foundation
Checklists
• Execute checklists to diagnose failures and outages.
• Checklists can evolve to include items from lessons learnt.
• The most common and often diagnosed checks should be prioritized and executed first.
• A mechanism to transfer skill and knowledge (the checklist should reflect the knowledge base).
• Ability to improve time for diagnosis.
• Examples of areas for checklists include networks, data centres and information security.
• Refer to the Appendix for a Network Troubleshooting checklist.
(Image: the original checklist)
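Purely as an illustration of a checklist that evolves with lessons learnt, the sketch below runs diagnostic checks in order of how often they have identified the fault before; the checks and counts are made up.

```python
# Sketch: a diagnostic checklist where the most frequently confirmed checks run first.
checks = [   # (description, times this check has identified the fault before)
    ("Was there a recent change to the affected CI?", 42),
    ("Is the device reachable via ICMP?", 35),
    ("Are interface error counters increasing?", 12),
    ("Is the UPS/power supply healthy?", 7),
]

results = {}
for description, _count in sorted(checks, key=lambda item: item[1], reverse=True):
    answer = input(f"{description} [y/n/skip] ").strip().lower()
    results[description] = answer
    if answer == "y":
        print("Possible cause identified - record it in the incident timeline.")
        break

print(results)
```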
27. Problem Management Foundation
Atul Gawande: How to Make Doctors Better
Surgeon and author Atul Gawande says the very vastness of our knowledge gets in the way: doctors make errors because they simply can't remember it all. The solution isn't fancier technology or more training. It's as simple as an old-fashioned checklist, like those used by pilots, restaurateurs and construction engineers. When his research team introduced a checklist in eight hospitals in 2008, major surgery complications dropped 36% and deaths plunged 47%.
from Time magazine
28. The New England Journal of Medicine supports the use of checklists during a surgical emergency for better safety performance results.
In a study of 100 Michigan hospitals, 30% of the time surgical teams skipped one of these five essential steps:
• washing hands
• cleaning the site
• draping the patient
• applying a sterile dressing
• donning surgical mask, gloves and gown
But after 15 months of using a simple checklist, the hospitals cut their infection rate from 4 percent of cases to zero, saving 1,500 lives and nearly $200 million.
29. Problem Management Foundation
Put eyes on the problem
• The process followed to solve a murder is no different to the process followed when solving a crisis.
• The location where the problem has occurred needs to be investigated.
• It is preferable to secure the area and gather all evidence and log it, just like a crime scene.
• This principle is also used in production and manufacturing environments.
30. Problem Management Foundation
Crime scene (location of problem)
Taiichi Ohno, who refined the Toyota Production System (TPS), would take new managers and engineers to the factory and draw a chalk circle on the floor. The subordinate would be told to stand in the circle and to observe and note down what he saw. When Ohno returned he would check - if the person in the circle had not seen enough he would be asked to keep observing. Ohno was trying to imprint upon his future managers and engineers that the only way to truly understand what happens in the factory was to go there. It was here that value was added and here that waste could be observed. This was known as Genchi Genbutsu and is a primary method used for solving problems. If the problem exists in the factory then it needs to be understood and solved in the factory and not on the top floors of some office block or city skyscraper.
31. Problem Management Foundation
Genchi Genbutsu 現地現物 - go see
• Genchi Genbutsu sets out the expectation that it is a requirement to personally evaluate operations so that a first-hand understanding of situations and problems is derived.
• Genchi Genbutsu means "go and see" and it is a key principle of the Toyota Production System. It suggests that in order to truly understand a situation one needs to go to gemba (現場), the 'real place', where work is done.
32. Problem Management Foundation
Recording the event
• An investigator will record the observations of eye witnesses.
• These records serve as a basis for review.
• What seems insignificant now might be crucial when more becomes known about the problem.
• Determine:
  - What
  - Why
  - When
  - Who
  - Where
  - How
33. Problem Management Foundation
Prevailing conditions and business impact
• Take note of the prevailing conditions.
• It is also important to take a snapshot of the prevailing conditions at the time of the problem. If the problem remains unresolved and it happens again, a comparison of prevailing conditions might provide significant insight.
• These might be economic or even weather related. Don't discount prevailing conditions.
• If it is a technical problem it is important to determine and measure the business impact.
• This needs to be assessed from a client and an internal organisational perspective.
• When the probability of an occurrence is low, it is incorrect to assume that it will only happen way into the future.
• Major incidents can happen anytime within the probability period and not at the end of the probability period.
34. Problem Management Foundation
Prevailing conditions
On the morning of Monday, 29th August 2005, hurricane Katrina hit the Gulf coast of the US.
New Orleans, Louisiana suffered the main brunt of the hurricane, but the major damage and loss of life occurred when the levee system catastrophically failed.
Floodwaters surged into 80% of the city and lingered for weeks. At least 1,836 people lost their lives in the hurricane and resulting floods, making it one of the largest natural disasters in the history of the United States.
35. Problem Management Foundation
Prevailing conditions
On July 31, 2006 the Independent Levee Investigation Team released a report on the Greater New Orleans area levee failures. In the report, it was noted that the hypothetical model storm upon which storm protection plans were based (called the Standard Project Hurricane or SPH) was simplistic.
The report found that an inadequate network of levees, flood walls, storm gates and pumps had been established.
The report also found that "the creators of the standard project hurricane, in an attempt to find a representative storm, actually excluded the fiercest storms from the database."
36. Problem Management Foundation
Visualization
• It is one thing to collect data about a problem and record it, but a totally different skill is required to interpret it.
• Here you look at visual representations by graphing the data in an appropriate fashion. As an example, bar graphs are often referred to as Manhattan graphs.
• Just as the large buildings are prominent in the Manhattan skyline, so too are the significant bits of data represented in a graph.
• Convert the data to a visual representation and this will aid in the process of solving the problems.
• The visualisation present in the CMOC should always be designed to assist in diagnosis.
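A minimal sketch of such a Manhattan-style graph, assuming matplotlib is available; the phase durations are hypothetical and would normally come from the recorded incident timeline.

```python
# Sketch: bar ("Manhattan") graph of time spent in each lifecycle phase.
import matplotlib.pyplot as plt

phases = ["detection", "diagnosis", "repair", "recovery", "restoration"]
minutes = [35, 50, 15, 75, 20]   # durations from the incident timeline (hypothetical)

plt.bar(phases, minutes)
plt.ylabel("minutes")
plt.title("Major incident: time spent per lifecycle phase")
plt.tight_layout()
plt.savefig("incident_phases.png")   # or plt.show() on a workstation
```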
38. Problem Management Foundation
Workarounds (aka fire fighting)
Something that is important, especially when the crisis is significant, is to realise that you need to be skilled in fighting fires. Meaning, the problem might require an immediate workaround to maintain service. As such, you might not be solving the problem but, on a temporary basis, alleviating any further negative consequences.
39. Problem Management Foundation
Repair
Following diagnosis are the activities associated with repairing the configuration item (CI) that failed. Hardware may need to be ordered, vendors contacted, consultants brought in, and so forth. The biggest gap here is understanding how a given CI was configured. Groups with accurate configuration management systems (CMS) know right away, whereas others will need to perform forensic archaeology to try and determine that, losing valuable time in the process.
40. Problem Management Foundation
Recover
Once the CI is repaired, it must be brought back online, including reloading any necessary images, applications and/or data. Again, rapid and accurate knowledge about CIs will speed this up, as will having standard builds/images to restore from versus building a unique system from scratch.
41. Problem Management Foundation
Restore
This is the final step and is known as the restoration of the service. It may be that related CIs must be rebooted in a certain order to re-establish connectivity, and so on. Service design documentation and/or standard operating procedures that are readily accessible and accurate will aid groups restoring services.
42. Problem Management Foundation
Collation
• There is a requirement to collate the information from each of the steps in the Major Incident lifecycle.
• This information is utilised as the basis of the Major Incident Report.
• This collation involves all members of the Tiger Team and is typically managed and owned by the SLM/SDM or Process Owner.
• This is generally under a time constraint dictated by a service level agreement.
• The collated report is always issued in draft first and reviewed by all internal parties.
43. Problem Management Foundation
Major Incident reporting
• Generate the Major Incident report.
• It should contain a detailed description of the outage/failure; timing; sequencing; the actions taken; the people involved; resources; next steps and identified/remaining actions.
• Typically a draft is issued to the business/client and discussed for agreement or update.
• A final report is then issued to the client/business.
• There may be resulting actions which need to be dealt with as a service request, a project, or a Problem for further analysis.
• The CMDB (KEDB) is updated if there is one, or a suitable repository.
• If required, this may be fed into the Problem Management Process for further analysis.
Eddy Merckx, born 17 June 1945, is a Belgian considered to be the greatest pro-cyclist ever. He sells his own line of bicycles and I have owned one since 1997. He is one of my heroes and his never-equaled domination while cycling led to his nickname, when the daughter of one French racer said, "That Belgian guy, he doesn't even leave you the crumbs. He's a real cannibal."
The French magazine Vélo described Merckx as "the most accomplished rider that cycling has ever known." Merckx, who turned professional in 1965, won the World Championship thrice, the Tour de France and Giro d'Italia five times each, and the Vuelta a España once. He also won each of professional cycling's classic "monument" races at least twice.
Merckx dominated his first Tour de France winning by 17 minutes, 54 seconds. But it was Stage 17 that was most emblematic. Though comfortably in the yellow jersey, victory assured if he merely followed his rivals as modern champions do, Merckx risked blowing up and losing the Tour when he attacked over the top of the Tourmalet then rode solo for 130 kilometres. He won the stage by nearly eight minutes.
Merckx set the world hour record on 25th October 1972. Merckx covered 49.431 km at high altitude in Mexico City using a Colnago bicycle to break the record, which had been lightened to a weight of 5.75 kg. Over 15 years starting in 1984, various racers improved the record to more than 56 km. However, because of the increasingly exotic design of the bikes and position of the rider, these performances were no longer reasonably comparable to Merckx's achievement. In response, the UCI in 2000 required a standard or more traditional bike to be used. When time trial specialist Chris Boardman, who had retired from road racing and had prepared himself specifically for beating the record, had another go at Merckx's distance 28 years later, he beat it by slightly more than 10 meters (at sea level). To date, only Boardman and Ondřej Sosenka have improved on Merckx's record using traditional equipment.
Although Merckx's great moments came when riding alone, he had leadership qualities: when it counted, he was motivated to win. He didn't just win, he did the best he could, which exceeded expectations, as in that first Tour de France victory. He was also like Amundsen (read about him here) in that he was an expert in the use of his equipment, which was highlighted when he set the benchmark for the world hour record. My Merckx bike is held in such high regard that I keep it in my bedroom to prevent it being stolen!
In the major incident process, timelines are the most important aspect of the process to get right. The reason is that it is the best source of data for problem management, which oversees the process from a quality viewpoint. Deviations from the norm are clear indicators of underlying issues.
The timelines in the major incident process are aligned with ITIL, where they are referred to as the Expanded Incident Lifecycle.
The Expanded Incident Lifecycle has a path of Incident -> Detect -> Diagnose -> Repair -> Restore -> Recover. The times of each of these events should be diligently recorded as well as the time of when a workaround becomes available and is implemented.
For many IT people the times are confusing because they misunderstand the naming of the terms in the Expanded Incident Lifecycle. To better explain these terms, we'll use an analogy: riding a bike.
I am riding my bike. It is a nice Sunday morning ride in the countryside. The incident happens: the rear wheel experiences a puncture. This is the time of the incident. As it is the rear wheel I do not notice it immediately, and only detect the incident when the road starts to feel extremely bumpy. This is the detection time. I stop my bicycle and dismount. My mates do the same. We discuss the issue. It is clear that it is a puncture and that it was caused by a small nail, which is clearly visible. We can remove the nail, and the tire will still be usable, but we need to either repair the tube or replace it. I have a spare tube in my saddle bag, and we agree that replacing the tube is the quickest and best way to continue on our journey. This is the time of diagnosis. We decide that this is a good time to have some water and cool drink before we start replacing the tube. We also notice that the incident has happened at a very scenic location, so we take a few pictures. Finally, we start removing the wheel. This is the time of repair. We remove the wheel, remove the tire, replace the tube and reattach the tire. We put the wheel back on the bike. This is the time of restore. At this point we all decide to answer the call of nature. We then mount our bikes and continue our ride. This is the point and time of recovery.
If we analyse the timelines in the incident above, we will notice a deviation from the norm in two time periods, i.e. time to repair and time to recover. This is the time when we had some drinks and took a pit stop. In the context of our ride this wasn't a big deal, but if we were in a competitive race we would in all probability have skipped those actions. In an actual IT incident the same principles apply.
Diagram of the Major incident process
The notifications and escalations, including the interaction with the service desk and clients, are handled in the communications chapter.
The best example of how time solved a problem is that of Harrison, a carpenter. Time solved the problem of determining longitude and hence your exact position on Earth. Longitude is a geographic coordinate that specifies the east-west position of a point on the Earth's surface, and it is best determined using time measurements. Galileo Galilei proposed that, with accurate knowledge of the orbits of the moons of Jupiter, one could use their positions as a universal clock to determine longitude, but this was practically difficult, especially at sea. An English clockmaker, John Harrison, invented the marine chronometer, helping solve the problem of accurately establishing longitude at sea and thus revolutionising safe long-distance travel. Harrison's watches were rediscovered after the First World War, restored and given the designations H1 to H5 by Rupert T. Gould. Harrison completed the manufacture of H4 in 1759.
When working with problems, time is the most crucial attribute to record.
The time an event happens and the time between events provide the most significant clues into a problem's source.
As an example, it is important to know when the event occurred as opposed to when it was detected. The two might not necessarily have occurred at the same time, and that could in itself be a problem.
An analysis of these times will assist in clarifying some of the following potential issues:
When is the business impacted by major incidents? Is it at recognised stages like month-end?
Is the return to service being prioritised?
Are we detecting incidents quickly? Are the systems being suitably managed or monitored?
Are the incidents correctly diagnosed? Is this diagnosis performed within expected time parameters? Are technicians suitably trained?
Are repair processes initiated within suitable time limits after diagnosis? Is there a logistics issue?
Are restore times adequate? Is there an issue around continuity or dated technology?
Does the system start processing and become functional in a useful manner to the business in an acceptable time period after being restored? Are there cumbersome interface issues?
Timelines
How do you improve? Understand what makes up the time periods from outage to full resolution. Which of those were less than optimal?
Detection – time between when outage occurred and when it was known (does the monitoring tool work?) (Do you detect HD RAID failures?) (Do you detect redundant network path failures?)
Diagnostic time – working out what went wrong. How good are your troubleshooting skills. Have you identified the correct causes?
Ready to repair – being able to gather all required resources to fix what is broken. (Are the parts available?)
Recovered – the failed components have been fixed and are ready to be placed back in production
Restoration time – the system is back in production and cooking on gas
Notification times - customers and users of the system are informed (Do they know they can transact?)
Risk profile completion time - time to gather and analyse risk associated with the incident
Counter measures implementation - time that relevant counter measures are implemented to reduce identified threats
Using time to become effective and efficient
Metrics
Measurements
Detection
When a disaster has occurred it is important to record the events - numerous mechanisms are possible, dependent on the outage
It is possible to use video surveillance or even Smartphone cameras to take pictures of what has occurred
This might help later, as diagnosis and root causation could be expedited by a subsequent review of the material
A source of detection are also logs, typically SYSLOG or the logs from applications such as web servers (use ELK to create a mission control dashboard!)
Use of NETFLOW can assist in providing the precise time of outages and also be a primary tool for root causation
Often it will assist to have screen scraping or enforce logging of access (such as log files when using SSH access and putty)
A disproportionate number of incidents being logged at the Service Desk is a potential indicator of a major incident (but the question should be asked as to why another, more automated tool hasn't detected the problem)
Refer to NetFlow - https://en.wikipedia.org/wiki/NetFlow
Tools and retrofit
“IS – IS NOT” is an example of a tool that facilitates the detection of which components are involved in an outage. This technique eliminates the potential of components being identified falsely. At the end of the exercise, the components involved are confirmed which will allow diagnosis to continue.
Tweetdeck – refer https://tweetdeck.twitter.com/
Diagnosis
Diagnose
Reference: https://lnkd.in/efjZqhr
The predecessor of the Flying Fortress: the birth of the checklist
Still, the Air Corps faced arguments that the aircraft was too big to handle. The Air Corps, however, properly recognized that the limiting factor here was human memory, not the aircraft’s size or complexity. To avoid another accident, Air Corps personnel developed checklists the crew would follow for take-off, flight, before landing, and after landing. The idea was so simple, and so effective, that the checklist was to become the future norm for aircraft operations. The basic concept had already been around for decades, and was in scattered use in aviation worldwide, but it took the Model 299 crash to institutionalize its use.
“The Checklist,” Air Force Magazine
In crisis management, especially during a major incident, the team that is responsible for identifying a potential repair is known as the delta team. The delta team is a tiger team (read more about them here). The team is specifically responsible for diagnosis, which is the process that delivers the potential repair. In this article we will be referring to Information Technology (IT) major incidents, but many of the concepts are generic to all types of crisis management.
Now the team never has a live cat thrown over the wall but a dead one! The team often has to start from a clean slate in diagnosis. The first actions around diagnosis are usually to work through various checklists, dependent on what type of dead cat has been thrown. In an optimized process the dead cat would have a note attached. In the context of a major incident, a preliminary checklist would have been completed and the note would be the results of that checklist.
Checklists can take various forms and are used to compensate for the weaknesses of human memory to help ensure consistency and completeness in carrying out a task. Checklists came into prominence with pilots with the pilot's checklist first being used and developed in 1934 when a serious accident hampered the adoption into the armed forces of a new aircraft (the predecessor to the famous Flying Fortress). The pilots sat down and put their heads together. What was needed was some way of making sure that everything was done; that nothing was overlooked. What resulted was a pilot's checklist. Actually, four checklists were developed - take-off, flight, before landing, and after landing. The new aircraft was not "too much aeroplane for one man to fly", it was simply too complex for any one man's memory. These checklists for the pilot and co-pilot made sure that nothing was forgotten. Additionally, the plane had two pilots to ensure continuity of operations should there be a problem with one of the pilots.
During operations, especially IT ones, it is important to document and record dependencies. Often these are too many for a single individual to remember and thus lists capture those critical requirements that would otherwise have slipped through the cracks.
The concept of using checklists in medicine is explained by Dr Atul Gawande in this youtube video of his presentation to TED here. Although the talk focusses on medicine, it also has great relevance to IT! Strangely enough, there is no suitable checklist app available in any app store, especially for IT.
Well, often the easiest repair for a dead cat, if it is really dead, is to buy a new cat. But is the cat really dead?
Take, for example, a remote branch. If a systems outage is reported at the branch, and a similar outage does not exist at other locations, two obvious scenarios are that the link to the branch is non-operational or that the systems used for access in the remote branch aren't functioning. If we focus for a moment on the latter, it would obviously be difficult to make a determination without an out-of-band mechanism. As an example, in South Africa we have a disproportionate amount of load shedding due to the lack of grid maintenance and oversight by the electrical utility. Normal systems, such as network management systems, use the same infrastructure, which is now not functioning, to determine the status. This is known as in-band. Clearly this type of diagnosis is irrelevant. What is required is an out-of-band system.
An out-of-band system would require a monitoring board with its own separate battery backup pack that uses a 3rd party network connection, such as a mobile network, to poll and sample the state of operations at the branch. This monitoring board would sample the power status, and immediately the delta team would be able to assess whether they are dealing with a power outage or a potential hardware fault. Multiple power probes can determine whether it is utility related or whether the cleaner has unplugged the network equipment to power up the vacuum cleaner. The monitoring board is also a potential Swiss army knife of diagnosis. Wireless asset probes can determine whether the network switch and router have been stolen. A location device on the monitoring board itself can determine if it has been moved or is a target of theft itself. Additional probes, for water floods and overheating, can also be added as examples.
Obviously, when these initial checklists have been completed and further diagnosis is required, the next important step is to put eyes on the problem. Delaying this and attempting to continue endless remote diagnosis is not productive. From personal experience, I was once dealing with intermittent outages at a remote site. The network management system and metrics were analysed till I was blue in the face. This continued for two weeks. Finally, I climbed on an aircraft and went to visit the location, which was a Toyota car manufacturing plant. At the plant we went to the paint shop, where the network equipment had the symptoms of intermittent faults. The network equipment was at the top of the building near the roof and we had to climb the access gangways to the top. Once there, we immediately realized what the problem was when we laid eyes on the equipment. Pigeons were roosting above the network equipment rack and over the course of a few years the pigeon poo had started to cake on the equipment. Well, poo is acidic: it started eating into the casing of the equipment, eventually went through the casing and was now starting on the PCB boards. No amount of remote diagnosis would have solved the pigeon poo problem!
Hardware failures are an obvious issue as they result in a blackout. More difficult to diagnose is the brownout. This is a degradation in service and not a total outage. In this case, in-band tools that provide insight into customer experience are required. Often poor customer experience is a result of the customer's own data pollution. It could be that malware has entered the computer system of a customer, generating excessive spam email traffic which saps the network link. Or peer-to-peer file exchange may be occurring in violation of copyright laws, at the same time absorbing great network capacity. A group of customers might be viewing videos in HD. These sorts of problems can make customers think something is wrong with their network service, when in reality the service is working fine and provides plenty of bandwidth for proper and legitimate usage. The delta team gains access to special flow analysis software and systems available for their networking equipment that provide excellent insight into the exact real-time sources of load on the network links under investigation.
Typically, as a network operator, a team will have access to ITU-T Y.1564 metrics. These metrics provide insight into actual customer bandwidth (usage), latency (response), jitter (variance), loss (congestion), Service Level Agreement (SLA) compliance and availability. These are typically available as attributes of a Carrier Ethernet link and provide accelerated insight into whether an issue is customer related or network operator related.
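A small illustration of using such metrics to separate operator-side degradation from customer-side load: compare measured values against SLA targets. The thresholds and samples below are assumptions, not actual Y.1564 output.

```python
# Sketch: check measured link metrics against contracted SLA targets.
sla = {"latency_ms": 30, "jitter_ms": 5, "loss_pct": 0.1}          # contracted targets (hypothetical)
measured = {"latency_ms": 42, "jitter_ms": 3.2, "loss_pct": 0.05}  # sampled from the link (hypothetical)

violations = {metric: value for metric, value in measured.items()
              if value > sla[metric]}
if violations:
    print("SLA breached:", violations)   # operator-side degradation likely
else:
    print("Link within SLA - investigate customer-side load instead.")
```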
Although more will be written about diagnosis in the major incident process, another large source of investigation that can assist in finding a repair is an analysis of recent changes. Additionally, a repository of the latest changes is beneficial for providing the romeo team with a working configuration of a system. This will be clarified in greater detail in a future article. Checklists are an important and often overlooked tool. Tom Peters has this to say about checklists:
Process & Simplicity: Checklists!! Complexifiers often rule—in part the by-product of far too many “consultants” in the world, determined to demonstrate the fact that their IQs are higher than yours or mine. Enter Johns Hopkins’ Dr Peter Pronovost. Dr P was appalled by the fact that 50% of folks in ICUs (90,000 at any point—in the U.S. alone) develop serious complications as a result of their stay in the ICU, per se. He also discovered that there were 179 steps, on average, required to sustain an ICU patient every day. His answer: Dr P “invented” the … ta-da … checklist! With the religious use of simple paper lists, prevalent ICU “line infection” errors at Hopkins dropped from 11% to zero—and stay-length was halved. (Results have been consistently replicated, from the likes of Hopkins to inner-city ERs.) “[Dr Pronovost] is focused on work that is not normally considered a significant contribution in academic medicine,” Dr Atul Gawande, wrote in “The Checklist” (New Yorker, 1210.07). “As a result, few others are venturing to extend his achievements. Yet his work has already saved more lives than that of any laboratory scientist in the last decade.”
Infographic about checklists
Crime scene
Taiichi Ohno, who refined the Toyota Production System (TPS), would take new managers and engineers to the factory and draw a chalk circle on the floor. The subordinate would be told to stand in the circle and to observe and note down what he saw. When Ohno returned he would check; if the person in the circle had not seen enough he would be asked to keep observing. Ohno was trying to imprint upon his future managers and engineers that the only way to truly understand what happens in the factory was to go there. It was here that value was added and here that waste could be observed. This was known as Genchi Genbutsu and is a primary method to start solving problems. If the problem exists in the factory then it needs to be understood and solved in the factory and not on the top floors of some office block or city skyscraper.
Genchi Genbutsu sets out the expectation that it is a requirement to personally evaluate operations so that a firsthand understanding of situations and problems is derived.
Genchi Genbutsu means "go and see" and it is a key principle of the Toyota Production System. It suggests that in order to truly understand a situation one needs to go to gemba (現場) or, the 'real place' - where work is done.
Recording the account of what happened
Prevailing conditions and business impact
On the morning of Monday, 29th August 2005, hurricane Katrina hit the Gulf coast of the US. New Orleans, Louisiana suffered the main brunt of the hurricane, but the major damage and loss of life occurred when the levee system catastrophically failed. Floodwaters surged into 80% of the city and lingered for weeks. At least 1,836 people lost their lives in the hurricane and resulting floods, making it one of the largest natural disasters in the history of the United States.
On July 31, 2006 the Independent Levee Investigation Team released a report on the Greater New Orleans area levee failures. Their report
“identified flaws in design, construction and maintenance of the levees. But underlying it all, the report stated, were the problems with the initial model used to determine how strong the system should be.”
The hypothetical model storm upon which storm protection plans were based is called the Standard Project Hurricane or SPH. The model storm was simplistic, and led to an inadequate network of levees, flood walls, storm gates and pumps. The report also found that
“the creators of the standard project hurricane, in an attempt to find a representative storm, actually excluded the fiercest storms from the database.”
It is one thing to collect data about a problem and record it, but a totally different skill is required to interpret it. Here you look at visual representations by graphing the data in an appropriate fashion. As an example, bar graphs are often referred to as Manhattan graphs. Just as the large buildings are prominent in the Manhattan skyline, so too are the significant bits of data represented in a graph. Convert the data to a visual representation and this will aid in the process of solving the problems.
The visualization present in the NOC should always be designed to assist in diagnosis.
Refer to the examples of graphing of times in the Major Incident Lifecycle.