Preview of Crisis Management Foundation
This course covers the lifecycle of a crisis (a disaster, outage or major incident) and all aspects of the process of dealing with that lifecycle.
8. Crisis Management Foundation
Recording of time
• When working with problems, time is the most crucial attribute to record.
• The time an event happens and the time between events provide the most significant clues to a problem's source.
• For example, it is important to know when the event occurred as opposed to when it was detected. The two might not have occurred at the same time, and that in itself could be a problem.
9. Crisis Management Foundation
Why record time?
An analysis of times may assist in clarifying the following:
• When was the business impacted by major incidents?
• Is it at recognised stages like month-end?
• Is the return to service being prioritised?
• Are we detecting incidents quickly?
• Are the systems being suitably managed or monitored?
• Are the incidents correctly diagnosed?
• Is this diagnosis performed within expected time parameters?
• Are investigators and technicians suitably trained?
10. Crisis Management Foundation
Why record time? (cont.)
• Are repair processes initiated within suitable time limits after diagnosis?
• Is there a logistics issue?
• Are service restore times for the client adequate?
• Is there an issue around continuity or outdated technology?
• Does the system start processing in an acceptable time period after being restored?
• Are there cumbersome system interface issues?
11. Crisis Management Foundation
Timelines (dates and times): the expanded incident lifecycle
• Time when incident started (actual: something has happened to a CI or a risk event has occurred): <dd/mm/yy> <hh:mm>
• Time when incident was detected (by monitoring tools, IT personnel or, worst case, the user/customer): <dd/mm/yy> <hh:mm>
• Time of diagnosis (underlying cause: we know what happened): <dd/mm/yy> <hh:mm>
• Time of repair (process to fix the failure started, or corrective action initiated): <dd/mm/yy> <hh:mm>
• Time of recovery (component recovered: the CI is back in production, business ready to resume): <dd/mm/yy> <hh:mm>
• Time of restoration (normal operations resume: the service is back in production): <dd/mm/yy> <hh:mm>
• Time of workaround (service is back in production with a workaround): <dd/mm/yy> <hh:mm>
• Time of escalation (to the problem management team): <dd/mm/yy> <hh:mm>
• Time period service was unavailable (SLA measure): <minutes>
• Time period service was degraded (SLA measure): <minutes>
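The lifecycle above maps naturally to a structured record from which the SLA durations at the bottom can be derived. A minimal Python sketch (the `IncidentTimeline` class and its field names are illustrative, not part of any standard):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class IncidentTimeline:
    """Timestamps from the expanded incident lifecycle (illustrative field set)."""
    started: Optional[datetime] = None        # something happened to a CI
    detected: Optional[datetime] = None       # monitoring, IT staff or a user noticed
    diagnosed: Optional[datetime] = None      # underlying cause known
    repair_started: Optional[datetime] = None # corrective action initiated
    recovered: Optional[datetime] = None      # CI back in production
    restored: Optional[datetime] = None       # normal operations resumed

    def minutes_between(self, earlier: str, later: str) -> Optional[float]:
        """Elapsed minutes between two recorded lifecycle points, if both are known."""
        a, b = getattr(self, earlier), getattr(self, later)
        if a is None or b is None:
            return None
        return (b - a).total_seconds() / 60

t = IncidentTimeline(
    started=datetime(2024, 3, 1, 2, 15),
    detected=datetime(2024, 3, 1, 2, 40),
    restored=datetime(2024, 3, 1, 6, 0),
)
print(t.minutes_between("started", "detected"))   # detection time: 25.0
print(t.minutes_between("detected", "restored"))  # unavailability, SLA view: 200.0
```

Recording each point as a full timestamp (rather than a duration) preserves the ability to answer all of the "why record time?" questions later.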
12. Crisis Management Foundation
Measuring time
How do you improve? Understand the different time periods from outage to full resolution, and which ones are not optimal.
• Detection time: between when the outage occurred and when it was known. (Does the monitoring tool work? Do you detect HD RAID failures? Do you detect redundant network path failures?)
• Diagnostic time: working out what went wrong. How good are your troubleshooting skills? Have you identified the correct causes?
• Ready to repair: being able to gather all required resources to fix what is broken. (Are the parts available?)
13. Crisis Management Foundation
Measuring time (cont.)
• Recovery time: the failed components have been fixed and are ready to be placed back in production.
• Restoration time: the system is back in production.
• Notification times: clients and users of the system are informed, e.g. do they know they can transact?
• Risk profile completion time: time to gather and analyse the risk associated with the incident.
• Countermeasure implementation time: time by which relevant countermeasures are implemented to reduce identified threats.
14. Crisis Management Foundation
Representing time
• Understand where the problem is by using graphs.
• It is useful to aggregate these statistics over multiple major incidents to understand trends.
• Extrapolate statistics that will define and set appropriate SLA times.
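The aggregation described above can be sketched in a few lines; the per-incident figures below are invented purely for illustration:

```python
from statistics import mean

# Per-incident stage durations in minutes (figures invented for illustration).
incidents = [
    {"detection": 25, "diagnosis": 60,  "repair": 90, "restoration": 30},
    {"detection": 5,  "diagnosis": 120, "repair": 45, "restoration": 20},
    {"detection": 40, "diagnosis": 30,  "repair": 60, "restoration": 25},
]

# Average duration per lifecycle stage across all major incidents.
averages = {stage: mean(inc[stage] for inc in incidents) for stage in incidents[0]}

# The stage with the largest average is the first candidate for improvement.
worst_stage = max(averages, key=averages.get)
print(worst_stage)  # diagnosis
```

Plotting `averages` per month (rather than printing it) gives the trend view the slide refers to, and the observed distributions are a defensible basis for SLA targets.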
16. Crisis Management Foundation
Measurements
• Availability is typically expressed in nines (from two nines to five nines). For example:
• 99% availability: 5,256 minutes (87.6 hours) of downtime per year
• 99.5% availability: 2,628 minutes (43.8 hours) of downtime per year
• 99.9% availability: 526 minutes (8.8 hours) of downtime per year
• 99.99% availability: 53 minutes of downtime per year
• 99.999% availability: 5 minutes of downtime per year
• Gartner maps these values to the following terms:
• Normal system availability is 99.5%
• High system availability is 99.9%
• Fault resilience is 99.99%
• Fault tolerance is 99.999%
• Continuous processing is as close to 100% as possible.
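The downtime figures above all follow from one formula; a short sketch to reproduce them:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (non-leap year)

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Annual downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for pct in (99.0, 99.5, 99.9, 99.99, 99.999):
    print(f"{pct}%: {downtime_minutes_per_year(pct):,.0f} minutes/year")
```

Note how each extra nine cuts the permitted downtime by a factor of ten, which is why five nines (about 5 minutes a year) demands fault-tolerant design rather than fast repair.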
18. Crisis Management Foundation
Detection
• When a disaster has occurred, it is important to record the events; numerous mechanisms are possible depending on the outage.
• It is possible to use video surveillance or even smartphone cameras to take pictures of what has occurred.
• This might help later, as diagnosis and root causation could be expedited by a review of the material.
• Logs are also a source of detection, typically syslog or the logs from applications such as web servers (use the ELK stack to create a mission-control dashboard!).
• Tools like NetFlow can assist in providing the precise time of outages and can also be a primary tool for root causation.
• It will often assist to have screen scraping or to enforce logging of access (such as log files when using SSH access and PuTTY).
• A disproportionate number of incidents being logged at the Service Desk is a potential indicator of a major incident.
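The last point, detecting a disproportionate number of Service Desk incidents, can be sketched as a simple statistical threshold (the three-sigma cut-off is an assumption for illustration, not from the slides):

```python
from statistics import mean, stdev

def is_ticket_spike(history, current, threshold_sigmas=3.0):
    """Flag a disproportionate Service Desk ticket count.

    history: ticket counts per interval under normal operations.
    current: count in the latest interval.
    The three-sigma threshold is an illustrative choice.
    """
    mu, sigma = mean(history), stdev(history)
    return current > mu + threshold_sigmas * sigma

normal = [12, 9, 14, 11, 10, 13, 12, 11]
print(is_ticket_spike(normal, 15))  # False: within normal variation
print(is_ticket_spike(normal, 40))  # True: possible major incident
```

A real Service Desk would feed this from its ticketing system per time interval; the point is that "disproportionate" should be defined against a measured baseline, not a gut feel.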
19. Crisis Management Foundation
Tools and retrofit
• When an outage happens, it is not possible to retrofit a detection tool.
• Surveillance of IT needs to be in place beforehand.
• Gathering SNMP metrics can provide a guideline for usage and congestion.
• ICMP provides a means of detecting failures and degradation (latency).
• A great poller for ICMP and SNMP is Opmantek's NMIS.
• Refer to the section on tools in this course.
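A minimal reachability and latency probe illustrates the idea. Since genuine ICMP echo requires raw-socket privileges, this sketch substitutes a TCP connect as the probe (my assumption for portability; a production poller such as NMIS uses real ICMP and SNMP):

```python
import socket
import time

def tcp_latency_ms(host: str, port: int, timeout: float = 1.0):
    """Probe a service and return connect latency in milliseconds.

    Returns None when the target is unreachable or refuses the
    connection, which is itself a detection signal. A TCP connect is
    used here as a stand-in for ICMP echo, which needs raw sockets.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None  # unreachable or refused: a potential failure

# Polled on a schedule, rising latency indicates degradation and a
# None result indicates an outage worth alerting on.
```

The same loop structure applies to SNMP polling: sample on a fixed interval, store the series, and alert on deviation from baseline.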
20. Crisis Management Foundation
IS / IS NOT detection tool
For each question, record an IS observation (what is observed) and an IS NOT observation (what could be observed but is not):
• What is the defect?
• Which processes are impacted?
• Where in the processes has the failure occurred?
• Who is affected?
• When did it happen?
• How frequently did it happen?
• Is there a pattern?
• How much is it costing?
21. Crisis Management Foundation
Alternative means
• Detection from the Service Desk: display call-centre queues from the Service Desk to detect increased call volumes, which can be an indication of problems.
• Use social media such as TweetDeck to view notifications from your own company's clients; utilities such as power and water; and local news or traffic.
23. Crisis Management Foundation
Diagnose
• One of the primary triggers for an outage is a change in the environment.
• The first step should be to determine whether there has been a change.
• The importance of recording precise times in the major incident lifecycle is now highlighted, as these are used to correlate the outage with when the last known change was made.
• Unauthorised changes also need to be investigated by reviewing anomalies, preferably in dashboards.
• A key part of diagnosis is referring to the system documentation to see what should have happened.
• Put eyes on the problem as soon as possible.
• As part of the diagnosis process, it is important to refer to previous major incident reports to assess whether the issue has occurred previously and whether the same actions can be followed to solve it.
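Correlating the outage time with recent changes, as described above, can be sketched like this (the change-log structure and the 24-hour window are illustrative choices, not from the slides):

```python
from datetime import datetime, timedelta

def changes_near_outage(change_log, outage_start, window_hours=24):
    """Return changes made in the window before the outage started.

    change_log: (timestamp, description) tuples, e.g. exported from the
    change-management system. Structure and window are illustrative.
    """
    window = timedelta(hours=window_hours)
    return [(ts, desc) for ts, desc in change_log
            if outage_start - window <= ts <= outage_start]

log = [
    (datetime(2024, 3, 1, 22, 0), "Firewall rule update"),
    (datetime(2024, 2, 25, 9, 0), "Patch web tier"),
]
outage = datetime(2024, 3, 2, 2, 15)
print(changes_near_outage(log, outage))  # only the firewall change correlates
```

This is exactly why the lifecycle timestamps must be precise: a sloppy outage-start time widens or misplaces the window and points diagnosis at the wrong change.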
26. Crisis Management Foundation
The predecessor of the Flying Fortress: the birth of the checklist
The Air Corps faced arguments that the aircraft was too big to handle. The Air Corps, however, properly recognised that the limiting factor here was human memory, not the aircraft's size or complexity. To avoid another accident, Air Corps personnel developed checklists the crew would follow for take-off, flight, before landing, and after landing. The idea was so simple, and so effective, that the checklist was to become the future norm for aircraft operations. The basic concept had already been around for decades, and was in scattered use in aviation worldwide, but it took the Model 299 crash to institutionalize its use.
"The Checklist," Air Force Magazine
27. Crisis Management Foundation
Checklists
• Execute a checklist to diagnose failures and outages.
• The checklist can evolve to include items from lessons learnt.
• The most common and most often diagnosed checks should be prioritised and executed first.
• A mechanism to transfer skill and knowledge (the checklist should reflect the knowledge base).
• The ability to improve time to diagnosis.
• Examples of areas for checklists include networks, data centres and information security.
• Refer to the Appendix for a Network Troubleshooting checklist.
(the original checklist)
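A prioritised diagnostic checklist can be sketched as an ordered list of checks executed until one fails (the network checks below are invented examples):

```python
def run_checklist(checks):
    """Run diagnostic checks in priority order (most common causes first).

    checks: (name, check_fn) pairs where check_fn returns True on a pass.
    Returns the name of the first failing check, which becomes the
    leading diagnosis candidate, or None if everything passes.
    """
    for name, check in checks:
        if not check():
            return name
    return None

# Invented network checks, ordered by how often each is the actual cause.
checks = [
    ("Link up?", lambda: True),
    ("Gateway reachable?", lambda: False),
    ("DNS resolving?", lambda: True),
]
print(run_checklist(checks))  # Gateway reachable?
```

Because the list is plain data, adding a lesson learnt is just appending (or re-ranking) an entry, which is how the checklist comes to reflect the knowledge base.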
28. Crisis Management Foundation
Atul Gawande: How to Make Doctors Better
Surgeon and author Atul Gawande says the very vastness of our knowledge gets in the way: doctors make errors because they simply can't remember it all. The solution isn't fancier technology or more training. It's as simple as an old-fashioned checklist, like those used by pilots, restaurateurs and construction engineers. When his research team introduced a checklist in eight hospitals in 2008, major surgery complications dropped 36% and deaths plunged 47%.
from Time magazine
29. The New England Journal of Medicine supports the use of checklists during a surgical emergency for better safety performance results.
In a study of 100 Michigan hospitals, surgical teams skipped one of these five essential steps 30% of the time:
• washing hands
• cleaning the site
• draping the patient
• applying a sterile dressing
• donning surgical mask, gloves and gown
But after 15 months of using a simple checklist, the hospitals cut their infection rate from 4 percent of cases to zero, saving 1,500 lives and nearly $200 million.
30. Crisis Management Foundation
Put eyes on the problem
• The process followed to solve a murder is no different to the process followed when solving a crisis.
• The location where the problem has occurred needs to be investigated.
• It is preferable to secure the area, gather all evidence and log it, just like a crime scene.
• This principle is also used in production and manufacturing environments.
31. Crisis Management Foundation
Crime scene (location of problem)
Taiichi Ohno, who refined the Toyota Production System (TPS), would take new managers and engineers to the factory and draw a chalk circle on the floor. The subordinate would be told to stand in the circle and to observe and note down what he saw. When Ohno returned he would check; if the person in the circle had not seen enough, he would be asked to keep observing. Ohno was trying to imprint upon his future managers and engineers that the only way to truly understand what happens in the factory was to go there. It was here that value was added and here that waste could be observed. This became known as Genchi Genbutsu and is a primary method used for solving problems. If the problem exists in the factory then it needs to be understood and solved in the factory, not on the top floors of some office block or city skyscraper.
32. Crisis Management Foundation
Genchi Genbutsu 現地現物 – go see
• Genchi Genbutsu sets out the expectation that it is a requirement
to personally evaluate operations so that a first-hand
understanding of situations and problems is derived.
• Genchi Genbutsu means "go and see" and it is a key principle of
the Toyota Production System. It suggests that in order to truly
understand a situation one needs to go to gemba (現場) or, the
'real place' - where work is done.
33. Crisis Management Foundation
Recording the event
• An investigator will record the observations
of eye witnesses.
• These records serve as a basis for review.
• What seems insignificant now might be crucial when more becomes known
about the problem.
• Determine:
• What
• Why
• When
• Who
• Where
• How
34. Crisis Management Foundation
Prevailing conditions and business
impact
• Take a snapshot of the prevailing conditions at the time of the problem.
If the problem remains unresolved and it happens again, a comparison of
prevailing conditions might provide significant insight.
• These might be economic or even weather related. Don’t discount
prevailing conditions.
• If it is a technical problem it is important to determine and measure
the business impact.
• This needs to be assessed from a client and an internal organisational
perspective.
• When the probability of an occurrence is low, it is incorrect to assume
that it will only happen way into the future.
• Major incidents can happen anytime within the probability period and
not at the end of the probability period.
35. Crisis Management Foundation
Prevailing conditions
On the morning of Monday, 29th August 2005, Hurricane Katrina hit the
Gulf coast of the US.
New Orleans, Louisiana suffered the main brunt of the hurricane, but the
major damage and loss of life occurred when the levee system
catastrophically failed.
Floodwaters surged into 80% of the city and lingered for weeks. At least
1,836 people lost their lives in the hurricane and resulting floods, making
it one of the worst natural disasters in the history of the United States.
36. Crisis Management Foundation
Prevailing conditions
On July 31, 2006 the Independent Levee Investigation Team
released a report on the Greater New Orleans area levee failures.
The report noted that the hypothetical model storm upon which storm
protection plans were based, called the Standard Project Hurricane
(SPH), was simplistic.
It also found that an inadequate network of levees, flood walls,
storm gates and pumps had been established, and that
“the creators of the standard project hurricane, in an attempt to
find a representative storm, actually excluded the fiercest storms
from the database.”
37. Crisis Management Foundation
Visualization
• It is one thing to collect and record data about a problem, but a
totally different skill is required to interpret it.
• Look at visual representations by graphing the data in an appropriate
fashion. As an example, bar graphs are often referred to as Manhattan
graphs.
• Just as the large buildings are prominent in the Manhattan skyline,
so too are the significant bits of data represented in a graph.
• Converting the data to a visual representation will aid the process of
solving problems.
• The visualisation present in the CMOC should always be designed to
assist in diagnosis.
39. Crisis Management Foundation
Workarounds (aka fire fighting)
When the crisis is significant, it is important to realise that you need
to be skilled in fighting fires: the problem might require an immediate
workaround to maintain service. You might not be solving the problem, but
temporarily alleviating any further negative consequences.
41. Crisis Management Foundation
Repair
Following diagnosis are the activities associated with repairing the
configuration item (CI) that failed. Hardware may need to be ordered,
vendors contacted, consultants brought in, and so forth. The biggest gap
here is understanding how a given CI was configured. Groups with an
accurate configuration management system (CMS) know right away, whereas
others will need to perform forensic archaeology to try to determine that,
losing valuable time in the process.
42. Crisis Management Foundation
Recover
Once the CI is repaired, it must be brought back online, including
reloading any necessary images, applications and/or data. Again, rapid,
accurate knowledge about CIs will speed this up, as will having standard
builds/images to restore from versus building a unique system from
scratch.
43. Crisis Management Foundation
Restore
This is the final step and is known as the restoration of the service.
It may be that related CIs must be rebooted in a certain order to
re-establish connectivity, and so on. Service design documentation
and/or standard operating procedures that are readily accessible
and accurate will aid groups restoring services.
44. Crisis Management Foundation
Collation
• There is a requirement to collate the information from each of the
steps in the Major Incident lifecycle.
• This information is utilised as the basis of the Major Incident
Report.
• This collation involves all members of the Tiger Team and is
typically managed and owned by the SLM/SDM or Process
Owner.
• This is generally under a time constraint dictated by a service
level agreement.
• The collated report is always issued in draft first and reviewed by
all internal parties.
45. Crisis Management Foundation
Major Incident reporting
• Generate the Major Incident report.
• It should contain a detailed description of the outage/failure; timing;
sequencing; the actions taken; the people involved; resources; next
steps and identified/remaining actions.
• Typically a draft is issued to the business/client and discussed for
agreement or update.
• A final report is then issued to the client/business.
• There may be resulting actions which need to be dealt with as a
service request, a project, or a Problem for further analysis.
• The CMDB (KEDB) is updated if there is one, or a suitable repository.
• If required, this may be fed into the Problem Management Process for
further analysis.
Eddy Merckx, born 17 June 1945, is a Belgian considered to be the greatest pro-cyclist ever. He sells his own line of bicycles and I have owned one since 1997. He is one of my heroes and his never-equaled domination while cycling led to his nickname, when the daughter of one French racer said, "That Belgian guy, he doesn't even leave you the crumbs. He's a real cannibal."
The French magazine Vélo described Merckx as
"the most accomplished rider that cycling has ever known." Merckx, who turned professional in 1965, won the World Championship thrice, the Tour de France and Giro d'Italia five times each, and the Vuelta a España once. He also won each of the professional cycling's classic "monument" races at least twice.
Merckx dominated his first Tour de France winning by 17 minutes, 54 seconds. But it was Stage 17 that was most emblematic. Though comfortably in the yellow jersey, victory assured if he merely followed his rivals as modern champions do, Merckx risked blowing up and losing the Tour when he attacked over the top of the Tourmalet then rode solo for 130 kilometres. He won the stage by nearly eight minutes.
Merckx set the world hour record on 25th October 1972. Merckx covered 49.431 km at high altitude in Mexico City using a Colnago bicycle to break the record, which had been lightened to a weight of 5.75 kg. Over 15 years starting in 1984, various racers improved the record to more than 56 km. However, because of the increasingly exotic design of the bikes and position of the rider, these performances were no longer reasonably comparable to Merckx's achievement. In response, the UCI in 2000 required a standard or more traditional bike to be used. When time trial specialist Chris Boardman, who had retired from road racing and had prepared himself specifically for beating the record, had another go at Merckx's distance 28 years later, he beat it by slightly more than 10 meters (at sea level). To date, only Boardman and Ondřej Sosenka have improved on Merckx's record using traditional equipment.
Although Merckx's great moments were achieved alone, he had the leadership quality that, when it counted, he was motivated to win. He didn't just win; he did the best he could, which exceeded expectations, as in that first Tour de France victory. He was also like Amundsen (read about him here) in that he was an expert in the use of his equipment, which was highlighted when he set the benchmark for the world hour record. My Merckx bike is held in such high regard that I keep it in my bedroom to prevent it being stolen!
In the major incident process, timelines are the most important aspect of the process to get right. The reason is that it is the best source of data for problem management, which oversees the process from a quality viewpoint. Deviations from the norm are clear indicators of underlying issues.
The timelines in the major incident process are aligned with ITIL, where they are referred to as the Expanded Incident Lifecycle.
The Expanded Incident Lifecycle has a path of Incident -> Detect -> Diagnose -> Repair -> Recover -> Restore. The times of each of these events should be diligently recorded, as well as the time when a workaround becomes available and is implemented.
For many IT people the times are confusing, as they misunderstand the naming of the terms in the Expanded Incident Lifecycle. To better explain these terms, we'll use an analogy: riding a bike.
I am riding my bike. It is a nice Sunday morning ride in the countryside. The Incident happens: the rear wheel experiences a puncture. This is the time of the Incident. As it is the rear wheel I do not notice it immediately, and only detect the incident when the road starts to feel extremely bumpy. This is the detection time. I stop my bicycle and dismount. My mates do the same. We discuss the issue. It is clear that it is a puncture, caused by a small nail which is clearly visible. We can remove the nail and the tire will still be usable, but we need to either repair the tube or replace it. I have a spare tube in my saddle bag, and we agree that replacing the tube is the quickest and best way to continue our journey. This is the time of diagnosis. We decide that this is a good time to have some water and a cool drink before we start replacing the tube. We also notice that the incident has happened at a very scenic location, so we take a few pictures. Finally, we start removing the wheel. This is the time of repair. We remove the wheel, remove the tire, replace the tube and reattach the tire. We put the wheel back on the bike. This is the time of recovery: the failed component is fixed and back in place. At this point we all decide to answer the call of nature. We then mount our bikes and continue our ride. This is the point and time of restoration: the service, our Sunday ride, is running again.
If we analyse the timelines in the incident above, we will notice a deviation from the norm in two periods: the gap between diagnosis and the start of repair (the drinks and photos), and the pause before we resumed riding. In the context of our ride this wasn't a big deal, but in a competitive race we would in all probability have skipped those actions. In an actual IT incident the same principles apply.
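The bike-ride timeline can be sketched in code. Below is a minimal example of turning recorded lifecycle event times into per-phase durations and flagging the worst one; the timestamps and event names are invented for illustration:

```python
from datetime import datetime

# Illustrative timestamps for the puncture story (assumed values).
events = [
    ("incident",  datetime(2024, 6, 2, 9, 0)),
    ("detected",  datetime(2024, 6, 2, 9, 4)),   # the road feels bumpy
    ("diagnosed", datetime(2024, 6, 2, 9, 10)),  # nail found, plan agreed
    ("repaired",  datetime(2024, 6, 2, 9, 45)),  # long gap: drinks and photos
    ("recovered", datetime(2024, 6, 2, 9, 55)),  # wheel back on the bike
    ("restored",  datetime(2024, 6, 2, 10, 15)), # long gap: pit stop, then riding
]

def phase_durations(events):
    """Minutes spent between each pair of consecutive lifecycle events."""
    return {
        f"{a}->{b}": (tb - ta).total_seconds() / 60
        for (a, ta), (b, tb) in zip(events, events[1:])
    }

durations = phase_durations(events)
worst = max(durations, key=durations.get)  # the phase to investigate first
print(durations)
print("largest deviation:", worst)
```

Run against these sample times, the diagnosis-to-repair gap stands out, which is exactly the "drinks and photos" deviation described above.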
Diagram of the Major incident process
The notifications and escalations, including the interaction with the service desk and clients, are handled in the communications chapter.
The best example of how time solved a problem is that of Harrison, a carpenter. Time solved the problem of determining longitude, and hence your exact position on Earth. Longitude is a geographic coordinate that specifies the east-west position of a point on the Earth's surface and is best determined using time measurements. Galileo Galilei proposed that, with accurate knowledge of the orbits of the moons of Jupiter, one could use their positions as a universal clock to determine longitude, but this was practically difficult, especially at sea. An English clockmaker, John Harrison, invented the marine chronometer, helping solve the problem of accurately establishing longitude at sea and thus revolutionising safe long-distance travel. Harrison's watches were rediscovered after the First World War, restored and given the designations H1 to H5 by Rupert T. Gould. Harrison completed the manufacture of H4 in 1759.
When working with problems time is the most crucial attribute to record.
The time an event happens and the time between events provide the most significant clues to a problem's source.
As an example, it is important to know when the event occurred as opposed to when it was detected. The two might not necessarily have occurred at the same time, and the gap could itself be a problem.
An analysis of these times will assist in clarifying some of the following potential issues:
When is the business impacted by major incidents? Is it at recognised stages like month-end?
Is the return to service being prioritised?
Are we detecting incidents quickly? Are the systems being suitably managed or monitored?
Are the incidents correctly diagnosed? Is this diagnosis performed within expected time parameters? Are technicians suitably trained?
Are repair processes initiated within suitable time limits after diagnosis? Is there a logistics issue?
Are restore times adequate? Is there an issue around continuity or dated technology?
Does the system start processing and become functional in a useful manner to the business in an acceptable time period after being restored? Are there cumbersome interface issues?
Timelines
How do you improve? Understand what makes up the time periods from outage to full resolution. Which of those were less than optimal?
Detection – time between when outage occurred and when it was known (does the monitoring tool work?) (Do you detect HD RAID failures?) (Do you detect redundant network path failures?)
Diagnostic time – working out what went wrong. How good are your troubleshooting skills. Have you identified the correct causes?
Ready to repair – being able to gather all required resources to fix what is broken. (Are the parts available?)
Recovered – the failed components have been fixed and are ready to be placed back in production
Restoration time – the system is back in production and cooking on gas
Notification times – customers and users of the system are informed (Do they know they can transact?)
Risk profile completion time – time to gather and analyse the risk associated with the incident
Counter-measures implementation – time at which relevant counter-measures are implemented to reduce identified threats
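The phase timings above can be aggregated across several incidents to find the weakest link in the lifecycle. A rough sketch, where the phase names and sample durations (in minutes) are invented for illustration:

```python
from statistics import mean

# Invented per-incident phase durations, in minutes.
incidents = [
    {"detection": 12, "diagnosis": 45, "repair": 30, "recovery": 10, "restoration": 25},
    {"detection": 3,  "diagnosis": 90, "repair": 20, "recovery": 15, "restoration": 30},
    {"detection": 25, "diagnosis": 60, "repair": 35, "recovery": 5,  "restoration": 20},
]

def phase_averages(incidents):
    """Average duration of each lifecycle phase across all incidents."""
    phases = incidents[0].keys()
    return {p: mean(i[p] for i in incidents) for p in phases}

averages = phase_averages(incidents)
# The phase with the highest average duration is the first improvement target.
target = max(averages, key=averages.get)
print(averages)
print("improvement target:", target)
```

In this sample the diagnosis phase dominates, which would prompt the "how good are your troubleshooting skills?" question above.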
Using time to become effective and efficient
Metrics
Measurements
Detection
When a disaster has occurred it is important to record the events – numerous mechanisms are possible depending on the outage
It is possible to use video surveillance or even smartphone cameras to take pictures of what has occurred
This might help later, as diagnosis and root causation could be expedited by a subsequent review of the material
Logs are also a source of detection, typically SYSLOG or the logs from applications such as web servers (use ELK to create a mission control dashboard!)
Use of NetFlow can assist in providing the precise time of outages and can also be a primary tool for root causation
Often it will assist to have screen scraping or enforced logging of access (such as log files when using SSH access and PuTTY)
A disproportionate number of incidents being logged at the Service Desk is a potential indicator of a major incident (but the question should be asked as to why another, more automated tool hasn't detected the problem)
Refer Netflow - https://en.wikipedia.org/wiki/NetFlow
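As a sketch of the log-based detection idea, the following pulls the earliest ERROR timestamp out of syslog-like lines to estimate when an outage actually began, as opposed to when it was noticed. The log format and entries are simplified examples, not a real feed:

```python
import re
from datetime import datetime

# Simplified, invented syslog-like log lines.
LOG = """\
2024-06-02T09:00:01 INFO  web01 request served
2024-06-02T09:02:17 ERROR web01 upstream timeout
2024-06-02T09:02:18 ERROR web01 upstream timeout
2024-06-02T09:07:44 INFO  noc   alert raised by monitoring
"""

def first_error_time(log_text):
    """Earliest timestamp on an ERROR line: an estimate of occurrence time."""
    pattern = re.compile(r"^(\S+)\s+ERROR\b", re.MULTILINE)
    times = [datetime.fromisoformat(m.group(1)) for m in pattern.finditer(log_text)]
    return min(times) if times else None

occurred = first_error_time(LOG)
detected = datetime.fromisoformat("2024-06-02T09:07:44")  # when monitoring alerted
gap_minutes = (detected - occurred).total_seconds() / 60
print("occurred:", occurred, "detection gap (min):", gap_minutes)
```

The gap between occurrence and detection is precisely the "does the monitoring tool work?" metric from the timeline list above.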
Tools and retrofit
“IS – IS NOT” is an example of a tool that facilitates the detection of which components are involved in an outage. The technique reduces the chance of components being identified falsely. At the end of the exercise, the components involved are confirmed, which allows diagnosis to continue.
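As an illustrative sketch, an IS / IS-NOT worksheet can be held as a simple table and queried for the confirmed scope. All the entries below are invented examples:

```python
# A minimal IS / IS-NOT worksheet: (dimension, IS, IS NOT) rows.
analysis = [
    ("what",  "login failures on web tier",   "database writes"),
    ("where", "branch offices on MPLS links", "head office LAN"),
    ("when",  "since 09:02 change window",    "before the change"),
    ("who",   "thin-client users",            "VPN users"),
]

def in_scope(analysis):
    """Components confirmed as involved -- the IS column, by dimension."""
    return {dim: is_ for dim, is_, _ in analysis}

scope = in_scope(analysis)
print(scope["where"])
```

The contrast between the two columns is what eliminates falsely suspected components: anything that matches the IS-NOT side is ruled out before diagnosis continues.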
Tweetdeck – refer https://tweetdeck.twitter.com/
Diagnosis
Diagnose
Reference: https://lnkd.in/efjZqhr
The predecessor of the Flying Fortress
The birth of the checklist
Still, the Air Corps faced arguments that the aircraft was too big to handle. The Air Corps, however, properly recognized that the limiting factor here was human memory, not the aircraft’s size or complexity. To avoid another accident, Air Corps personnel developed checklists the crew would follow for take-off, flight, before landing, and after landing. The idea was so simple, and so effective, that the checklist was to become the future norm for aircraft operations. The basic concept had already been around for decades, and was in scattered use in aviation worldwide, but it took the Model 299 crash to institutionalize its use.
“The Checklist,” Air Force Magazine
In crisis management, especially during a major incident, the team responsible for identifying a potential repair is known as the delta team. The delta team is a tiger team (read more about them here). The team is specifically responsible for diagnosis, which is the process that delivers the potential repair. In this article we will refer to information technology (IT) major incidents, but many of the concepts are generic to all types of crisis management.
Now the team is never thrown a live cat over the wall, but a dead one! The team often has to start from a clean slate in diagnosis. The first action around diagnosis is usually to work through various checklists, depending on what type of dead cat has been thrown. In an optimised process the dead cat would have a note attached; in the context of a major incident, a preliminary checklist would have been completed and the note would be the results of that checklist.
Checklists can take various forms and are used to compensate for the weaknesses of human memory, helping to ensure consistency and completeness in carrying out a task. Checklists came into prominence with pilots; the pilot's checklist was first developed in 1935, when a serious accident hampered the adoption into the armed forces of a new aircraft (the predecessor to the famous Flying Fortress). The pilots sat down and put their heads together. What was needed was some way of making sure that everything was done, that nothing was overlooked. What resulted was a pilot's checklist. Actually, four checklists were developed: take-off, flight, before landing, and after landing. The new aircraft was not "too much aeroplane for one man to fly"; it was simply too complex for any one man's memory. These checklists for the pilot and co-pilot made sure that nothing was forgotten. Additionally, the plane had two pilots to ensure continuity of operations should there be a problem with one of the pilots.
During operations, especially IT ones, it is important to document and record dependencies. Often these are too many for a single individual to remember and thus lists capture those critical requirements that would otherwise have slipped through the cracks.
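A dependency or task checklist of this kind can be captured in a few lines of code, in the spirit of the pilot's four lists. A minimal sketch, with item names invented for an IT major-incident context:

```python
# A minimal checklist: each item starts unticked (illustrative item names).
checklist = {
    "confirm scope of outage": False,
    "secure and log evidence": False,
    "snapshot prevailing conditions": False,
    "record detection time": False,
    "notify service desk": False,
}

def outstanding(checklist):
    """Return the items not yet done -- the things memory would drop."""
    return [item for item, done in checklist.items() if not done]

# Tick items off as the team completes them.
checklist["confirm scope of outage"] = True
checklist["record detection time"] = True

remaining = outstanding(checklist)
print(remaining)
```

Even this trivial structure does what the paper list did for the Model 299 crews: it makes the outstanding steps visible instead of trusting anyone's memory.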
The concept of using checklists in medicine is explained by Dr Atul Gawande in this YouTube video of his presentation to TED here. Although the talk focuses on medicine, it also has great relevance to IT! Strangely enough, there is no suitable checklist app available in any app store, especially for IT.
Well, often the easiest repair for a dead cat, if it is really dead, is to buy a new cat. But is the cat really dead?
Take, for example, a remote branch. If a systems outage is reported at the branch, and a similar outage does not exist at other locations, two obvious scenarios are that the link to the branch is non-operational or that the systems used for access in the remote branch aren’t functioning. If we focus for a moment on the latter, it would obviously be difficult to make a determination without an out-of-band mechanism. As an example, in South Africa we have a disproportionate amount of load shedding due to the electrical utility’s lack of grid maintenance and oversight. Normal systems, such as network management systems, use the same infrastructure, which is now not functioning, to determine the status. This is known as in-band. Clearly this type of diagnosis is irrelevant here. What is required is an out-of-band system.
An out-of-band system would require a monitoring board with its own separate battery backup pack that uses a third-party network connection, such as a mobile network, to poll and sample the state of operations at the branch. This monitoring board would sample the power status, and the delta team would immediately be able to assess whether they are dealing with a power outage or a potential hardware fault. Multiple power probes can determine whether it is utility related or whether the cleaner has unplugged the network equipment to power up the vacuum cleaner. The monitoring board is also a potential Swiss army knife of diagnosis. Wireless asset probes can determine whether the network switch and router have been stolen. A location device on the monitoring board itself can determine if it has been moved or is a target of theft itself. Additional probes, for example for flooding and overheating, can also be added.
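The monitoring-board idea can be sketched as a simple classification over probe readings. The probe names, decision order and verdict strings below are assumptions for illustration, not a real product's API:

```python
# Sketch: classify a branch outage from out-of-band probe readings.
def classify_outage(probes):
    """probes: dict of probe name -> boolean reading from the monitoring board."""
    if not probes.get("board_reachable", False):
        return "out-of-band link down: dispatch eyes on the problem"
    if not probes.get("utility_power", True):
        return "power outage (utility)"
    if not probes.get("equipment_power", True):
        return "equipment unplugged or local power fault"
    if not probes.get("asset_present", True):
        return "possible theft: asset probe lost"
    return "power OK: suspect hardware or link fault"

verdict = classify_outage({
    "board_reachable": True,
    "utility_power": True,
    "equipment_power": False,  # e.g. the cleaner's vacuum scenario
    "asset_present": True,
})
print(verdict)
```

The point of the ordering is that each probe eliminates a whole class of cause before the next is considered, which is what lets the delta team skip straight past "is it load shedding?" to the real fault.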
Obviously, when these initial checklists have been completed and further diagnosis is required, the next important step is to put eyes on the problem. Delaying this and attempting to continue endless remote diagnosis is not productive. From personal experience, I was once dealing with intermittent outages at a remote site. The network management system and metrics were analysed till I was blue in the face. This continued for two weeks. Finally, I climbed on an aircraft and went to visit the location, which was a Toyota car manufacturing plant. At the plant we went to the paint shop, where the network equipment had the symptoms of intermittent faults. The equipment was at the top of the building near the roof and we had to climb the access gangways to get there. Once there, we immediately realised what the problem was when we laid eyes on the equipment. Pigeons were roosting above the network equipment rack, and over the course of a few years the pigeon poo had caked on the equipment. Well, poo is acidic: it had eaten into the casing of the equipment, eventually gone through it, and was now starting on the PCBs. No amount of remote diagnosis would have solved the pigeon poo problem!
Hardware failures are an obvious issue as they result in a blackout error. More difficult to diagnose is the brownout: a degradation in service rather than a total outage. In this case, in-band tools that provide insight into customer experience are required. Often poor customer experience is a result of the customer's own data pollution. It could be that malware has entered the computer system of a customer, generating excessive spam email traffic which saps the network link. Or peer-to-peer file exchange may be occurring in violation of copyright laws while absorbing great network capacity. A group of customers might be viewing videos in HD. These sorts of problems can make customers think something is wrong with their network service when, in reality, the service is working fine and provides plenty of bandwidth for proper and legitimate usage. The delta team gains access to special flow analysis software and systems, available for their networking equipment, that provide excellent insight into the exact real-time sources of load on the network links under investigation.
Typically, as a network operator, a team will have access to ITU-T Y.1564 metrics. These provide insight into actual customer bandwidth (usage), latency (response), jitter (variance), loss (congestion), Service Level Agreement (SLA) compliance and availability. These are typically available as attributes of a Carrier Ethernet link and provide accelerated insight into whether an issue is customer related or network operator related.
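A sketch of checking such measurements against SLA thresholds follows. The threshold values, field names and sample readings are invented for illustration; real Y.1564 results come from the test equipment itself:

```python
# Invented SLA thresholds for a Carrier Ethernet link.
SLA = {"latency_ms": 20.0, "jitter_ms": 5.0, "loss_pct": 0.1, "bandwidth_mbps": 100.0}

def sla_breaches(measured, sla):
    """Return which measured attributes breach the SLA thresholds."""
    breaches = []
    if measured["latency_ms"] > sla["latency_ms"]:
        breaches.append("latency")
    if measured["jitter_ms"] > sla["jitter_ms"]:
        breaches.append("jitter")
    if measured["loss_pct"] > sla["loss_pct"]:
        breaches.append("loss")
    if measured["bandwidth_mbps"] < sla["bandwidth_mbps"]:
        breaches.append("bandwidth")
    return breaches

measured = {"latency_ms": 18.2, "jitter_ms": 7.9, "loss_pct": 0.05, "bandwidth_mbps": 96.0}
breaches = sla_breaches(measured, SLA)
# Breaches on the operator side point away from a customer-caused brownout.
print(breaches)
```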
Although more will be written about diagnosis in the major incident process, another large source of investigation that can assist in finding a repair is an analysis of recent changes. Additionally, a repository of the latest changes is beneficial for providing the romeo team with a working configuration of a system. This will be clarified in greater detail in a future article. Checklists are an important and often overlooked tool. Tom Peters has this to say about checklists:
Process & Simplicity: Checklists!! Complexifiers often rule—in part the by-product of far too many “consultants” in the world, determined to demonstrate the fact that their IQs are higher than yours or mine. Enter Johns Hopkins’ Dr Peter Pronovost. Dr P was appalled by the fact that 50% of folks in ICUs (90,000 at any point—in the U.S. alone) develop serious complications as a result of their stay in the ICU, per se. He also discovered that there were 179 steps, on average, required to sustain an ICU patient every day. His answer: Dr P “invented” the … ta-da … checklist! With the religious use of simple paper lists, prevalent ICU “line infection” errors at Hopkins dropped from 11% to zero—and stay-length was halved. (Results have been consistently replicated, from the likes of Hopkins to inner-city ERs.) “[Dr Pronovost] is focused on work that is not normally considered a significant contribution in academic medicine,” Dr Atul Gawande, wrote in “The Checklist” (New Yorker, 1210.07). “As a result, few others are venturing to extend his achievements. Yet his work has already saved more lives than that of any laboratory scientist in the last decade.”
Infographic about checklists
Prevailing conditions and business impact
On July 31, 2006 the Independent Levee Investigation Team released a report on the Greater New Orleans area levee failures. Their report
“identified flaws in design, construction and maintenance of the levees. But underlying it all, the report stated, were the problems with the initial model used to determine how strong the system should be.”
The hypothetical model storm upon which storm protection plans were based is called the Standard Project Hurricane or SPH. The model storm was simplistic, and led to an inadequate network of levees, flood walls, storm gates and pumps. The report also found that
“the creators of the standard project hurricane, in an attempt to find a representative storm, actually excluded the fiercest storms from the database.”
It is one thing to collect and record data about a problem, but a totally different skill is required to interpret it. Here you look at visual representations by graphing the data in an appropriate fashion. As an example, bar graphs are often referred to as Manhattan graphs. Just as the large buildings are prominent in the Manhattan skyline, so too are the significant bits of data represented in a graph. Converting the data to a visual representation will aid the process of solving problems.
The visualization present in the NOC should always be designed to assist in diagnosis.
Refer to examples of graphing of times in Major Incident Lifecycle.
Uptime is about reducing downtime
Firefighting
Video refer: https://lnkd.in/eVF7XUy
Incident consequence analysis
Reference: https://lnkd.in/eCZ4X5c
1. Clarify the problem: align with the ultimate goal or purpose, and identify the ideal situation, the current situation and the gap
2. Break down the problem: break it into manageable pieces using the 4 W’s, finding the prioritised problem, process and point of cause
3. Set a target: set the target at the point of cause and determine “how much” and “by when”
4. Analyse the root cause: brainstorm multiple potential causes by asking WHY, and determine the root cause by going to see the process
5. Develop countermeasures: brainstorm countermeasures, narrow them using criteria, develop a detailed action plan and gain consensus
6. See countermeasures through: share the status of the plan by reporting, informing and consulting; build consensus; never give up; think and act persistently
7. Evaluate: determine whether the target was achieved, evaluate from the three viewpoints, and look at both process and results
8. Standardise: standardise successful practices, share results and start the next round of kaizen