Crisis Management Foundation
ds.co.za
Dealing with incidents that have a severe negative business consequence.
Based on the process from ITIL v3.
6. Major incident lifecycle
Objectives
• The Major incident process
• The importance of Time
• Detection
• Diagnosis
• Checklists
• Workarounds
"I chose to race, so I chose to win."
Major Incident process

[Diagram: timeline of the Major Incident process. A disaster occurs, is detected and diagnosed, and is then either worked around or repaired, recovered and restored, with a timestamp (hh:mm:ss) recorded at every stage through to resolution and the final report. Clients reach the process via the Service Desk, hotline and escalations; a hot ticket is raised, the incident is declared and a plan made, and notification/feedback and progress updates flow back until normal operations resume. Opportunistic and known fixes are captured in the repository / Problem Management process. The horizontal axis is time.]
Time

Time is money…
Lessons from history of how time (and a clock) changed the modern world.
Recording of time

• When working with problems, time is the most crucial attribute to record.
• The time an event happens and the time between events provide the most significant clues to a problem’s source.
• As an example, it is important to know when the event occurred as opposed to when it was detected. The two might not have occurred at the same time, and that in itself could be a problem.
Why record time?
An analysis of times may assist in clarifying the following:
• When was the business impacted by major incidents?
• Is it at recognised stages like month-end?
• Is the return to service being prioritised?
• Are we detecting incidents quickly?
• Are the systems being suitably managed or monitored?
• Are the incidents correctly diagnosed?
• Is this diagnosis performed within expected time parameters?
• Are investigators and technicians suitably trained?
Why record time (cont.)
• Are repair processes initiated within suitable time limits after
diagnosis?
• Is there a logistics issue?
• Are service restore times for the client adequate?
• Is there an issue around continuity or outdated technology?
• Does the system start processing in an acceptable time period
after being restored?
• Are there cumbersome system interface issues?
Timelines (dates and times): the expanded incident lifecycle

• Time when incident started (actual: something has happened to a CI or a risk event has occurred): <dd/mm/yy> <hh:mm>
• Time when incident was detected (by monitoring tools, IT personnel or, worst case, the user/customer): <dd/mm/yy> <hh:mm>
• Time of diagnosis (underlying cause: we know what happened): <dd/mm/yy> <hh:mm>
• Time of repair (process to fix the failure started, or corrective action initiated): <dd/mm/yy> <hh:mm>
• Time of recovery (component recovered: the CI is back in production and business is ready to be resumed): <dd/mm/yy> <hh:mm>
• Time of restoration (normal operations resume: the service is back in production): <dd/mm/yy> <hh:mm>
• Time of workaround (service is back in production with a workaround): <dd/mm/yy> <hh:mm>
• Time of escalation (to the problem management team): <dd/mm/yy> <hh:mm>
• Time period service was unavailable (SLA measure): <minutes>
• Time period service was degraded (SLA measure): <minutes>
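To make the bookkeeping concrete, here is a minimal sketch of capturing these timestamps so that the derived durations fall out automatically. The IncidentTimeline class and its field names are illustrative assumptions, not part of ITIL or any particular toolset:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class IncidentTimeline:
    """Timestamps from the expanded incident lifecycle (optional until known)."""
    started: Optional[datetime] = None        # incident actually began
    detected: Optional[datetime] = None       # noticed by tool, staff or user
    diagnosed: Optional[datetime] = None      # underlying cause understood
    repair_started: Optional[datetime] = None
    recovered: Optional[datetime] = None      # CI back in production
    restored: Optional[datetime] = None       # service back in production
    workaround: Optional[datetime] = None
    escalated: Optional[datetime] = None

    def minutes(self, start: Optional[datetime], end: Optional[datetime]) -> Optional[float]:
        if start is None or end is None:
            return None
        return (end - start).total_seconds() / 60

    @property
    def detection_minutes(self):  # how long the outage went unnoticed
        return self.minutes(self.started, self.detected)

    @property
    def downtime_minutes(self):   # SLA measure: unavailable period
        return self.minutes(self.started, self.restored)

t = IncidentTimeline(
    started=datetime(2024, 5, 6, 9, 0),
    detected=datetime(2024, 5, 6, 9, 25),
    restored=datetime(2024, 5, 6, 11, 40),
)
print(t.detection_minutes, t.downtime_minutes)  # 25.0 160.0
```

Recording raw timestamps and deriving the durations, rather than recording durations directly, keeps the record auditable after the fact.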
Measuring time

How do you improve? Understand the different time periods from outage to full resolution, and which ones are not optimal.
• Detection time – between when the outage occurred and when it became known. (Does the monitoring tool work? Do you detect HD RAID failures? Do you detect redundant network path failures?)
• Diagnostic time – working out what went wrong. How good are your troubleshooting skills? Have you identified the correct causes?
• Ready to repair – being able to gather all required resources to fix what is broken. (Are the parts available?)
Measuring time (cont.)

• Recovery time – the failed components have been fixed and are ready to be placed back in production.
• Restoration time – the system is back in production.
• Notification times – clients and users of the system are informed, e.g. do they know they can transact?
• Risk profile completion time – time to gather and analyse the risk associated with the incident.
• Countermeasure implementation – time until relevant countermeasures are implemented to reduce identified threats.
Representing time

• Understand where the problem is by using graphs.
• It is useful to aggregate these statistics over multiple major incidents to understand trends.
• Extrapolate statistics that will define and set appropriate SLA times.
Creating metrics

[Diagram: an incident timeline running Occurrence → Detection → Diagnosis → Repair → Recovery → Restoration → Closure → next Detection, annotated with the metrics it yields: mean time to repair, mean time to restore service, mean time between system incidents, and mean time between failures.]
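A rough sketch of computing two of these means from recorded incident times; the dict-based record shape is an assumption, and real tooling would read these values from the incident repository:

```python
from datetime import datetime
from statistics import mean

def mtrs_minutes(incidents):
    """Mean time to restore service: average of (restored - started), in minutes."""
    return mean((i["restored"] - i["started"]).total_seconds() / 60 for i in incidents)

def mtbf_minutes(incidents):
    """Mean time between failures: average uptime between a restoration and the next failure."""
    ordered = sorted(incidents, key=lambda i: i["started"])
    return mean(
        (nxt["started"] - prev["restored"]).total_seconds() / 60
        for prev, nxt in zip(ordered, ordered[1:])
    )

incidents = [
    {"started": datetime(2024, 3, 1, 9, 0), "restored": datetime(2024, 3, 1, 12, 0)},
    {"started": datetime(2024, 4, 2, 14, 0), "restored": datetime(2024, 4, 2, 15, 30)},
    {"started": datetime(2024, 5, 6, 9, 0), "restored": datetime(2024, 5, 6, 11, 40)},
]
print(f"MTRS {mtrs_minutes(incidents):.0f} min, MTBF {mtbf_minutes(incidents) / 1440:.1f} days")
```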
Measurements

• The typical values of the above are expressed as 9s (from two 9s to five 9s). For example:
• 99% availability: 5,256 minutes (87.6 hours) of downtime per year
• 99.5% availability: 2,628 minutes (43.8 hours) of downtime per year
• 99.9% availability: 525.6 minutes (8.76 hours) of downtime per year
• 99.99% availability: 52.6 minutes of downtime per year
• 99.999% availability: 5.3 minutes of downtime per year
• Gartner maps these values to the following terms:
• Normal system availability is 99.5%
• High system availability is 99.9%
• Fault resilience is 99.99%
• Fault tolerance is 99.999%
• Continuous processing is as close to 100% as possible.
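The downtime figures follow directly from the unavailable fraction of a 525,600-minute year; a few lines make the arithmetic reproducible:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Annual downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for a in (99.0, 99.5, 99.9, 99.99, 99.999):
    print(f"{a}% -> {downtime_minutes_per_year(a):,.1f} min/year")
# 99.0 -> 5,256.0 | 99.5 -> 2,628.0 | 99.9 -> 525.6 | 99.99 -> 52.6 | 99.999 -> 5.3
```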
Detection
Detection

• When a disaster has occurred, it is important to record the events; numerous mechanisms are possible depending on the outage.
• It is possible to use video surveillance or even smartphone cameras to take pictures of what has occurred.
• This might help later, as diagnosis and root-cause analysis can be expedited by a review of the material.
• Logs are also a source of detection, typically syslog or the logs from applications such as web servers (use ELK to create a mission-control dashboard!).
• Tools like NetFlow can assist in providing the precise time of outages and can also be a primary tool for root-cause analysis.
• Often it will assist to have screen scraping or enforced logging of access (such as log files when using SSH access and PuTTY).
• A disproportionate number of incidents being logged at the Service Desk is a potential indicator of a major incident.
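As a hedged illustration of the log angle, here is a sketch that scans a syslog-format file for lines mentioning an error and reports the earliest timestamp, which can anchor the "time when incident started" field; the file path and search term are assumptions:

```python
import re
from datetime import datetime

SYSLOG_TS = re.compile(r"^(\w{3}\s+\d+\s[\d:]{8})")  # e.g. "May  6 09:00:13"

def first_error_time(path="/var/log/syslog", needle="error"):
    """Return the earliest syslog timestamp on a line mentioning `needle`."""
    hits = []
    with open(path) as fh:
        for line in fh:
            if needle in line.lower():
                m = SYSLOG_TS.match(line)
                if m:
                    # Classic syslog omits the year; assume the current one.
                    hits.append(datetime.strptime(
                        f"{datetime.now().year} {m.group(1)}", "%Y %b %d %H:%M:%S"))
    return min(hits) if hits else None
```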
Tools and retrofit

• When an outage happens it is not possible to retrofit a detection tool; surveillance of IT needs to be in place beforehand.
• Gathering SNMP metrics can provide a guideline for usage and congestion.
• ICMP provides a means of detecting failures and degradation (latency).
• A great poller for ICMP and SNMP is Opmantek’s NMIS.
• Refer to the section on tools in this course.
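A minimal reachability-and-latency probe in the spirit of an ICMP poller; since raw ICMP needs elevated privileges, this sketch times a TCP connect instead, so treat it as a stand-in rather than what NMIS actually does:

```python
import socket
import time

def probe(host: str, port: int = 443, timeout: float = 2.0):
    """Return round-trip latency in ms, or None if the host is unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000
    except OSError:
        return None

latency = probe("example.com")
print("down" if latency is None else f"{latency:.1f} ms")
```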
IS / IS NOT detection tool

For each description, record what IS observed and what IS NOT observed:

Description                                      | IS (Observation) | IS NOT (Observation)
-------------------------------------------------|------------------|---------------------
What is the defect?                              |                  |
Which processes are impacted?                    |                  |
Where in the processes has the failure occurred? |                  |
Who is affected?                                 |                  |
When did it happen?                              |                  |
How frequently did it happen?                    |                  |
Is there a pattern?                              |                  |
How much is it costing?                          |                  |
Alternative means

• Detection from the Service Desk: display call centre queues from the Service Desk to detect increased call volumes, which can be an indication of problems.
• Use social media such as TweetDeck to view notifications from your own company’s clients; from utilities such as power and water; and from local news or traffic.
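A toy sketch of the call-queue idea: flag a possible major incident when the latest interval's Service Desk call count jumps well above the trailing average. The window size and the 3x threshold are arbitrary assumptions:

```python
from collections import deque

class CallSpikeDetector:
    """Flag when the latest interval's call count far exceeds the trailing mean."""
    def __init__(self, window: int = 12, factor: float = 3.0):
        self.history = deque(maxlen=window)
        self.factor = factor

    def observe(self, calls: int) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(calls)
        return baseline is not None and calls > self.factor * baseline

d = CallSpikeDetector()
for n in (10, 12, 9, 11, 55):
    if d.observe(n):
        print(f"Possible major incident: {n} calls this interval")
```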
Diagnosis
Diagnose

• One of the primary triggers for an outage is a change in the environment.
• The first step should be to determine whether there has been a change.
• The importance of recording precise times in the major incident lifecycle is now highlighted, as these times are used to correlate the outage with the last known change.
• Unauthorised changes also need to be investigated by reviewing anomalies, preferably in dashboards.
• A key part of diagnosis is referring to the system documentation to see what should have happened.
• Put eyes on the problem as soon as possible.
• As part of the diagnosis process, it’s important to refer to previous major incident reports to assess whether the issue has occurred previously and whether the same actions can be followed to solve it.
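A sketch of the change-correlation step: given the outage start time and a list of change records, list the changes implemented in the preceding window. The record shape and the 24-hour window are assumptions:

```python
from datetime import datetime, timedelta

def changes_before(outage_start, changes, window_hours=24):
    """Changes implemented within `window_hours` before the outage, newest first."""
    cutoff = outage_start - timedelta(hours=window_hours)
    hits = [c for c in changes if cutoff <= c["implemented"] <= outage_start]
    return sorted(hits, key=lambda c: c["implemented"], reverse=True)

changes = [
    {"id": "CHG-101", "implemented": datetime(2024, 5, 6, 2, 30), "summary": "Firewall rule update"},
    {"id": "CHG-102", "implemented": datetime(2024, 5, 4, 22, 0), "summary": "DB index rebuild"},
]
for c in changes_before(datetime(2024, 5, 6, 9, 0), changes):
    print(c["id"], c["summary"])  # CHG-101 is the prime suspect
```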
Checklists
Checklists
Doing the work right (4 minutes)
The predecessor of the Flying Fortress:
the birth of the checklist
The Air Corps faced arguments that the
aircraft was too big to handle. The Air
Corps, however, properly recognised
that the limiting factor here was human
memory, not the aircraft’s size or
complexity. To avoid another accident,
Air Corps personnel developed
checklists the crew would follow for
take-off, flight, before landing, and
after landing. The idea was so simple,
and so effective, that the checklist was
to become the future norm for aircraft
operations. The basic concept had
already been around for decades, and
was in scattered use in aviation
worldwide, but it took the Model 299
crash to institutionalize its use.
“The Checklist,” Air Force Magazine
Checklists

• Execute checklists to diagnose failures and outages.
• Checklists can evolve to include items from lessons learnt.
• The checks that most commonly yield the diagnosis should be prioritised and executed first.
• Checklists are a mechanism to transfer skill and knowledge (a checklist should reflect the knowledge base).
• They improve the time taken to reach a diagnosis.
• Examples of areas for checklists include networks, data centres and information security.
• Refer to the Appendix for a Network Troubleshooting checklist.

[Image: the original checklist]
Atul Gawande: How to Make Doctors
Better
Surgeon and author Atul Gawande says
the very vastness of our knowledge gets in
the way: doctors make errors because they
simply can't remember it all.
The solution isn't fancier technology or
more training.
It's as simple as an old-fashioned checklist,
like those used by pilots, restaurateurs and
construction engineers.
When his research team introduced a
checklist in eight hospitals in 2008, major
surgery complications dropped 36% and
deaths plunged 47%.
from Time magazine
The New England Journal of Medicine supports the use of checklists during a surgical emergency for better safety performance results.
In a study of 100 Michigan hospitals, surgical teams skipped one of these five essential steps 30% of the time:
• washing hands
• cleaning the site
• draping the patient
• applying a sterile dressing
• donning surgical mask, gloves and gown
But after 15 months of using a simple checklist, the hospitals cut their infection rate from 4 percent of cases to zero, saving 1,500 lives and nearly $200 million.
Put eyes on the problem
• The process followed to solve a murder
is no different to the process followed
when solving a crisis.
• The location where the problem has
occurred needs to be investigated.
• It is preferable to secure the area and
gather all evidence and log it, just like a
crime scene.
• This principle is also used in production
and manufacturing environments.
Crime scene (location of problem)

Taiichi Ohno, who refined the Toyota Production System (TPS), would take new managers and engineers to the factory and draw a chalk circle on the floor. The subordinate would be told to stand in the circle and to observe and note down what he saw. When Ohno returned he would check; if the person in the circle had not seen enough he would be asked to keep observing. Ohno was trying to impress upon his future managers and engineers that the only way to truly understand what happens in the factory was to go there. It was here that value was added and here that waste could be observed. This became known as Genchi Genbutsu and is a primary method used for solving problems. If the problem exists in the factory then it needs to be understood and solved in the factory, and not on the top floors of some office block or city skyscraper.
Genchi Genbutsu 現地現物 – go see

• Genchi Genbutsu sets out the expectation that operations must be evaluated personally, so that a first-hand understanding of situations and problems is derived.
• Genchi Genbutsu means "go and see" and is a key principle of the Toyota Production System. It suggests that in order to truly understand a situation one needs to go to the gemba (現場), the 'real place' where work is done.
Recording the event

• An investigator will record the observations of eyewitnesses.
• These records serve as a basis for review.
• What seems insignificant now might be crucial when more becomes known about the problem.
• Determine:
• What
• Why
• When
• Who
• Where
• How
Prevailing conditions and business impact

• Take note of the prevailing conditions.
• It is important to take a snapshot of the prevailing conditions at the time of the problem. If the problem remains unresolved and happens again, a comparison of prevailing conditions might provide significant insight.
• These might be economic or even weather related. Don’t discount prevailing conditions.
• If it is a technical problem it is important to determine and measure the business impact.
• This needs to be assessed from a client perspective and an internal organisational perspective.
• When the probability of an occurrence is low, it is incorrect to assume that it will only happen far into the future.
• Major incidents can happen at any time within the probability period, not just at its end.
Prevailing conditions

On the morning of Monday, 29 August 2005, Hurricane Katrina hit the Gulf Coast of the US.
New Orleans, Louisiana suffered the main brunt of the hurricane, but the major damage and loss of life occurred when the levee system catastrophically failed.
Floodwaters surged into 80% of the city and lingered for weeks. At least 1,836 people lost their lives in the hurricane and the resulting floods, making it one of the largest natural disasters in the history of the United States.
Prevailing conditions

On July 31, 2006 the Independent Levee Investigation Team released a report on the Greater New Orleans area levee failures.
The report noted that the hypothetical model storm upon which storm protection plans were based (called the Standard Project Hurricane, or SPH) was simplistic.
The report found that an inadequate network of levees, flood walls, storm gates and pumps had been established.
The report also found that
“the creators of the standard project hurricane, in an attempt to find a representative storm, actually excluded the fiercest storms from the database.”
Visualization

• It is one thing to collect and record data about a problem; interpreting it requires a totally different skill.
• Look at visual representations by graphing the data in an appropriate fashion. As an example, bar graphs are often referred to as Manhattan graphs.
• Just as the large buildings are prominent in the Manhattan skyline, so too are the significant bits of data represented in a graph.
• Convert the data to a visual representation and this will aid the process of solving the problem.
• The visualisation present in the CMOC should always be designed to assist in diagnosis.
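As an illustration, here is a minimal matplotlib sketch that turns recorded phase durations into a "Manhattan" bar graph; the numbers are invented:

```python
import matplotlib.pyplot as plt

# Hypothetical minutes spent in each phase of one major incident.
phases = ["Detection", "Diagnosis", "Repair", "Recovery", "Restoration"]
minutes = [25, 90, 40, 30, 15]

plt.bar(phases, minutes)
plt.ylabel("Minutes")
plt.title("Major incident: time per lifecycle phase")
plt.tight_layout()
plt.show()  # the tallest "building" shows where to focus improvement
```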
uptime is about reducing downtime
repair, recover, restore
99.999%
Workarounds (aka firefighting)

Especially when the crisis is significant, you need to be skilled at fighting fires. The problem might require an immediate workaround to maintain service: you might not be solving the problem, but you are, on a temporary basis, alleviating any further negative consequences.
Red Adair
The professional
Repair

Following diagnosis come the activities associated with repairing the configuration item (CI) that failed. Hardware may need to be ordered, vendors contacted, consultants brought in, and so forth. The biggest gap here is understanding how a given CI was configured. Groups with an accurate configuration management system (CMS) know right away, whereas others will need to perform forensic archaeology to try to determine this, losing valuable time in the process.
Recover

Once the CI is repaired, it must be brought back online, including reloading any necessary images, applications and/or data. Again, rapid, accurate knowledge about CIs will speed this up, as will having standard builds/images to restore from versus building a unique system from scratch.
Restore
This is the final step and is known as the restoration of the service.
It may be that related CIs must be rebooted in a certain order to
re-establish connectivity, and so on. Service design documentation
and/or standard operating procedures that are readily accessible
and accurate will aid groups restoring services.
Collation
• There is a requirement to collate the information from each of the
steps in the Major Incident lifecycle.
• This information is utilised as the basis of the Major Incident
Report.
• This collation involves all members of the Tiger Team and is
typically managed and owned by the SLM/SDM or Process
Owner.
• This is generally under a time constraint dictated by a service
level agreement.
• The collated report is always issued in draft first and reviewed by
all internal parties.
Major Incident reporting

• Generate the Major Incident report.
• It contains a detailed description of the outage/failure; timing; sequencing; the actions taken; the people involved; resources; next steps; and identified/remaining actions.
• Typically a draft is issued to the business/client and discussed for agreement or update.
• A final report is then issued to the client/business.
• There may be resulting actions which need to be dealt with as a service request, a problem record or a project.
• The CMDB/KEDB is updated if there is one, or another suitable repository.
• If required, the report may be fed into the Problem Management process for further analysis.
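A hedged sketch of assembling such a report programmatically, with the section list taken from the bullets above; the function and field names are illustrative:

```python
REPORT_SECTIONS = [
    "Description of the outage/failure",
    "Timing and sequencing",
    "Actions taken",
    "People involved and resources",
    "Next steps and remaining actions",
]

def draft_report(incident_id: str, findings: dict) -> str:
    """Render a draft Major Incident report; missing sections are flagged, not dropped."""
    lines = [f"MAJOR INCIDENT REPORT (DRAFT): {incident_id}", ""]
    for section in REPORT_SECTIONS:
        lines.append(section.upper())
        lines.append(findings.get(section, "TBD: awaiting Tiger Team input"))
        lines.append("")
    return "\n".join(lines)

print(draft_report("MI-2024-017", {"Actions taken": "Failed over to the standby database at 10:05."}))
```

Flagging missing sections rather than dropping them mirrors the draft-then-review cycle described above.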
Solving the Toyota Production System
way (5 minutes)
Review
• Genchi Genbutsu
• Time is money
• The art of the workaround

More Related Content

What's hot

Advanced Control of Multiple Sulfur Units
Advanced Control of Multiple Sulfur Units Advanced Control of Multiple Sulfur Units
Advanced Control of Multiple Sulfur Units
Mary Claire Simoneaux
 
The perfect st orm presentation ej lister
The perfect st orm presentation ej listerThe perfect st orm presentation ej lister
The perfect st orm presentation ej listerEdmund (Ted) Lister
 
The key to improving your availability is fracas
The key to improving your availability is fracasThe key to improving your availability is fracas
The key to improving your availability is fracas
Jim Taylor, ASQ-CRE, CPE, CPMM
 
The Perfect Storm
The Perfect StormThe Perfect Storm
The Perfect Storm
EJ (Ted) Lister
 
Deep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the CloudDeep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the Cloud
Bluelock
 
Project Management - Introduction
Project Management - IntroductionProject Management - Introduction
Project Management - Introduction
SAINBAYAR Bayarsaikhan
 
Testing Your Own Emergency Plans
Testing Your Own Emergency PlansTesting Your Own Emergency Plans
Testing Your Own Emergency Plans
Gianmario Gnecchi
 
Testing Emergency Plans
Testing Emergency PlansTesting Emergency Plans
Testing Emergency Plans
Gianmario Gnecchi
 
Varrow Madness 2014 DR Presentation
Varrow Madness 2014 DR PresentationVarrow Madness 2014 DR Presentation
Varrow Madness 2014 DR Presentation
Andrew Miller
 

What's hot (9)

Advanced Control of Multiple Sulfur Units
Advanced Control of Multiple Sulfur Units Advanced Control of Multiple Sulfur Units
Advanced Control of Multiple Sulfur Units
 
The perfect st orm presentation ej lister
The perfect st orm presentation ej listerThe perfect st orm presentation ej lister
The perfect st orm presentation ej lister
 
The key to improving your availability is fracas
The key to improving your availability is fracasThe key to improving your availability is fracas
The key to improving your availability is fracas
 
The Perfect Storm
The Perfect StormThe Perfect Storm
The Perfect Storm
 
Deep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the CloudDeep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the Cloud
 
Project Management - Introduction
Project Management - IntroductionProject Management - Introduction
Project Management - Introduction
 
Testing Your Own Emergency Plans
Testing Your Own Emergency PlansTesting Your Own Emergency Plans
Testing Your Own Emergency Plans
 
Testing Emergency Plans
Testing Emergency PlansTesting Emergency Plans
Testing Emergency Plans
 
Varrow Madness 2014 DR Presentation
Varrow Madness 2014 DR PresentationVarrow Madness 2014 DR Presentation
Varrow Madness 2014 DR Presentation
 

Viewers also liked

HR summit 2013 - Role of HR in Crisis Management & Organizational Sustainability
HR summit 2013 - Role of HR in Crisis Management & Organizational SustainabilityHR summit 2013 - Role of HR in Crisis Management & Organizational Sustainability
HR summit 2013 - Role of HR in Crisis Management & Organizational Sustainability
Marc Ronez
 
Curso Crisis Management - 2011 - versão inglês
Curso Crisis Management - 2011 - versão inglêsCurso Crisis Management - 2011 - versão inglês
Curso Crisis Management - 2011 - versão inglês
Milton R. Almeida
 
Online Crisis Management + Case Studies - Elkottab Workshop
Online Crisis Management + Case Studies - Elkottab WorkshopOnline Crisis Management + Case Studies - Elkottab Workshop
Online Crisis Management + Case Studies - Elkottab Workshop
Ahmed Maher
 
Crisis Management in Social Media Marketing
Crisis Management in Social Media MarketingCrisis Management in Social Media Marketing
Crisis Management in Social Media Marketing
Nick Westergaard
 
DS Crisis Management Foundation Introduction
DS Crisis Management Foundation IntroductionDS Crisis Management Foundation Introduction
DS Crisis Management Foundation Introduction
DS
 
DS Crisis Management Foundation Risk
DS Crisis Management Foundation RiskDS Crisis Management Foundation Risk
DS Crisis Management Foundation Risk
DS
 

Viewers also liked (6)

HR summit 2013 - Role of HR in Crisis Management & Organizational Sustainability
HR summit 2013 - Role of HR in Crisis Management & Organizational SustainabilityHR summit 2013 - Role of HR in Crisis Management & Organizational Sustainability
HR summit 2013 - Role of HR in Crisis Management & Organizational Sustainability
 
Curso Crisis Management - 2011 - versão inglês
Curso Crisis Management - 2011 - versão inglêsCurso Crisis Management - 2011 - versão inglês
Curso Crisis Management - 2011 - versão inglês
 
Online Crisis Management + Case Studies - Elkottab Workshop
Online Crisis Management + Case Studies - Elkottab WorkshopOnline Crisis Management + Case Studies - Elkottab Workshop
Online Crisis Management + Case Studies - Elkottab Workshop
 
Crisis Management in Social Media Marketing
Crisis Management in Social Media MarketingCrisis Management in Social Media Marketing
Crisis Management in Social Media Marketing
 
DS Crisis Management Foundation Introduction
DS Crisis Management Foundation IntroductionDS Crisis Management Foundation Introduction
DS Crisis Management Foundation Introduction
 
DS Crisis Management Foundation Risk
DS Crisis Management Foundation RiskDS Crisis Management Foundation Risk
DS Crisis Management Foundation Risk
 

Similar to DS Crisis Management Foundation - Lifecycle

Service Operation Processes
Service Operation ProcessesService Operation Processes
Service Operation Processesnuwulang
 
ITIL-v3-Incident-Management-Process-PPT-RED.pdf
ITIL-v3-Incident-Management-Process-PPT-RED.pdfITIL-v3-Incident-Management-Process-PPT-RED.pdf
ITIL-v3-Incident-Management-Process-PPT-RED.pdf
ManishKumar526001
 
World-Class Incident Response Management
World-Class Incident Response ManagementWorld-Class Incident Response Management
World-Class Incident Response Management
Keith Smith
 
Disaster Recovery & Business Continuity Overview
Disaster Recovery & Business Continuity Overview Disaster Recovery & Business Continuity Overview
Disaster Recovery & Business Continuity Overview
Aventis Systems, Inc.
 
(Modifid)condition m0nitoring of longwall face supportd
(Modifid)condition m0nitoring of longwall face supportd(Modifid)condition m0nitoring of longwall face supportd
(Modifid)condition m0nitoring of longwall face supportd
rockyraj19
 
Smart Maintenance engineering
Smart Maintenance engineering Smart Maintenance engineering
Smart Maintenance engineering
Michel Mafumba
 
Maintenance management
Maintenance managementMaintenance management
Maintenance management
Prerna Toshniwal
 
Transactional Blackbelts are different
Transactional Blackbelts are differentTransactional Blackbelts are different
Transactional Blackbelts are differentreachab7
 
Designing High Available Cloud Applications
Designing High Available Cloud ApplicationsDesigning High Available Cloud Applications
Designing High Available Cloud Applications
Giovanni Mazzeo
 
Ultan kinahan dr - minasi 2010
Ultan kinahan   dr - minasi 2010Ultan kinahan   dr - minasi 2010
Ultan kinahan dr - minasi 2010
Nathan Winters
 
Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...
Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...
Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...
Laurentide Controls
 
Incident response
Incident responseIncident response
Incident response
Anshul Gupta
 
Maintenance management- Production Management
Maintenance management- Production ManagementMaintenance management- Production Management
Maintenance management- Production Management
shrinivas kulkarni
 
Problem management foundation - Mission control
Problem management foundation - Mission controlProblem management foundation - Mission control
Problem management foundation - Mission control
Ronald Bartels
 
Apdip disaster mgmt
Apdip disaster mgmtApdip disaster mgmt
Apdip disaster mgmt
srinivasan gopalan
 
M6 BLACKBELT PROJECT Rev 6.4
M6 BLACKBELT PROJECT Rev 6.4M6 BLACKBELT PROJECT Rev 6.4
M6 BLACKBELT PROJECT Rev 6.4Neelesh Bhagwat
 
Disaster recovery solution
Disaster recovery solutionDisaster recovery solution
Disaster recovery solution
Anton An
 
FIVE MAINTENANCE TYPES PROCEDURES
FIVE MAINTENANCE TYPES PROCEDURESFIVE MAINTENANCE TYPES PROCEDURES
FIVE MAINTENANCE TYPES PROCEDURES
kifayat ullah
 
Process Mining and Predictive Process Monitoring
Process Mining and Predictive Process MonitoringProcess Mining and Predictive Process Monitoring
Process Mining and Predictive Process Monitoring
Marlon Dumas
 

Similar to DS Crisis Management Foundation - Lifecycle (20)

Service Operation Processes
Service Operation ProcessesService Operation Processes
Service Operation Processes
 
ITIL-v3-Incident-Management-Process-PPT-RED.pdf
ITIL-v3-Incident-Management-Process-PPT-RED.pdfITIL-v3-Incident-Management-Process-PPT-RED.pdf
ITIL-v3-Incident-Management-Process-PPT-RED.pdf
 
World-Class Incident Response Management
World-Class Incident Response ManagementWorld-Class Incident Response Management
World-Class Incident Response Management
 
Disaster Recovery & Business Continuity Overview
Disaster Recovery & Business Continuity Overview Disaster Recovery & Business Continuity Overview
Disaster Recovery & Business Continuity Overview
 
(Modifid)condition m0nitoring of longwall face supportd
(Modifid)condition m0nitoring of longwall face supportd(Modifid)condition m0nitoring of longwall face supportd
(Modifid)condition m0nitoring of longwall face supportd
 
Smart Maintenance engineering
Smart Maintenance engineering Smart Maintenance engineering
Smart Maintenance engineering
 
Maintenance management
Maintenance managementMaintenance management
Maintenance management
 
Transactional Blackbelts are different
Transactional Blackbelts are differentTransactional Blackbelts are different
Transactional Blackbelts are different
 
Designing High Available Cloud Applications
Designing High Available Cloud ApplicationsDesigning High Available Cloud Applications
Designing High Available Cloud Applications
 
Ultan kinahan dr - minasi 2010
Ultan kinahan   dr - minasi 2010Ultan kinahan   dr - minasi 2010
Ultan kinahan dr - minasi 2010
 
Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...
Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...
Systeme de contrôle - Vos opérateurs sont trop occupés pour s'occuper de ce q...
 
Incident response
Incident responseIncident response
Incident response
 
Disaster Recovery
Disaster RecoveryDisaster Recovery
Disaster Recovery
 
Maintenance management- Production Management
Maintenance management- Production ManagementMaintenance management- Production Management
Maintenance management- Production Management
 
Problem management foundation - Mission control
Problem management foundation - Mission controlProblem management foundation - Mission control
Problem management foundation - Mission control
 
Apdip disaster mgmt
Apdip disaster mgmtApdip disaster mgmt
Apdip disaster mgmt
 
M6 BLACKBELT PROJECT Rev 6.4
M6 BLACKBELT PROJECT Rev 6.4M6 BLACKBELT PROJECT Rev 6.4
M6 BLACKBELT PROJECT Rev 6.4
 
Disaster recovery solution
Disaster recovery solutionDisaster recovery solution
Disaster recovery solution
 
FIVE MAINTENANCE TYPES PROCEDURES
FIVE MAINTENANCE TYPES PROCEDURESFIVE MAINTENANCE TYPES PROCEDURES
FIVE MAINTENANCE TYPES PROCEDURES
 
Process Mining and Predictive Process Monitoring
Process Mining and Predictive Process MonitoringProcess Mining and Predictive Process Monitoring
Process Mining and Predictive Process Monitoring
 

Recently uploaded

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 

Recently uploaded (20)

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

DS Crisis Management Foundation - Lifecycle

  • 1. CrisisManagementFoundation Crisis Management Foundation ds.co.za dealing with incidents that have a severe negative business consequence
  • 2. Based on the process from ITIL v3 6. major incident lifecycle
  • 3. CrisisManagementFoundation Objectives • The Major incident process • The importance of Time • Detection • Diagnosis • Checklists • Workarounds
  • 4. "I chose to race, so I chose to win."
  • 7. CrisisManagementFoundation Time Time is money…. Lessons from history of how time (and a clock) changed the modern world.
  • 8. CrisisManagementFoundation Recording of time • When working with problems time is the most crucial attribute to record. • The time an event happens, the time between events provide the most significant clues into a problem’s source. • As an example, it is important to known when the event occurred as opposed to when it was detected. The two might not necessarily have occurred at the same time and that in itself could be a problem.
  • 9. CrisisManagementFoundation Why record time? An analysis of times may assist in clarifying the following: • When was the business impacted by major incidents? • Is it at recognised stages like month-end? • Is the return to service being prioritised? • Are we detecting incidents quickly? • Are the systems being suitably managed or monitored? • Are the incidents correctly diagnosed? • Is this diagnosis performed within expected time parameters? • Are investigators and technicians suitably trained?
  • 10. CrisisManagementFoundation Why record time cont. • Are repair processes initiated within suitable time limits after diagnosis? • Is there a logistics issue? • Are service restore times for the client adequate? • Is there an issue around continuity or outdated technology? • Does the system start processing in an acceptable time period after being restored? • Are there cumbersome system interface issues?
  • 11. CrisisManagementFoundation Timelines (date and times) the expanded incident lifecycle Time when incident started (actual – something has happened to a CI or a risk event has occurred) <dd/mm/yy> <hh:mm> Time when incident was detected (incident is detected either by monitoring tools, IT personnel or, worse case, the user/customer) <dd/mm/yy> <hh:mm> Time of diagnosis (underlying cause – we know what happened?) <dd/mm/yy> <hh:mm> Time of repair (process to fix failure started or corrective action initiated) <dd/mm/yy> <hh:mm> Time of recovery (component recovered – the CI is back in production – business ready to be resumed) <dd/mm/yy> <hh:mm> Time of restoration (normal operations resume – the service is back in production) <dd/mm/yy> <hh:mm> Time of workaround (Service is back in production with workaround) <dd/mm/yy> <hh:mm> Time of escalation (to problem management team) <dd/mm/yy> <hh:mm> Time period service was unavailable (SLA measure) <minutes> Time period service was degraded (SLA measure) <minutes>
  • 12. CrisisManagementFoundation Measuring time How do you improve? Understand the different time periods from outage to full resolution and which ones are not optimal. • Detection time - between when outage occurred and when it was known (does the monitoring tool work?) (Do you detect HD RAID failures?) (Do you detect redundant network path failures?) • Diagnostic time – working out what went wrong. How good are your troubleshooting skills. Have you identified the correct causes? • Ready to repair – being able to gather all required resources to fix what is broken. (Are the parts available?)
  • 13. CrisisManagementFoundation Measuring time cont. • Recovery time – the failed components have been fixed and are ready to be placed back in production. • Restoration time – the system is back in production. • Notification times – clients and users of the system are informed e.g. do they know they can transact? • Risk profile completion time – time to gather and analyse risk associated with incident. • Counter measures implementation – time that relevant counter measures are implement to reduce identified threats.
  • 14. CrisisManagementFoundation Representing time • Understand where the problem is by using graphs. • Useful to aggregate these statistics over multiple Major incidents to understand trends • Extrapolate statistics that will define and set appropriate SLA times
  • 15. CrisisManagementFoundation Creating Metrics Occurrence Diagnosis Mean time between system incidents Mean time between failures Mean time to restore service Mean time to repair Detection Repair Restoration Closure Detection Recovery
  • 16. CrisisManagementFoundation Measurements • The typical values of the above is expressed as 9s (from two 9s to five 9s). Here is an example: • 99% availability: 5,256 minutes (87.6 hours) / year downtime • 99.5% availability: 2,628 minutes (43.8 hours) / year downtime • 99.9% availability: 528 minutes (8.8 hours) / year downtime • 99.99% availability: 53 minutes / year downtime • 99.999% availability: 5 minutes / year downtime • The above values are mapped to the following terms by Gartner: • Normal system availability is 99.5% • High system availability is 99.9% • Fault system resilience is 99.99% • Fault tolerance is 99.999% • Continuous processing is as close to 100% as possible.
  • 18. CrisisManagementFoundation Detection • When a disaster has occurred, it is important to record the events – numerous mechanisms are possible dependant on the outage. • It is possible to use video surveillance or even Smartphone cameras to take pictures of what has occurred. • This might help as a later diagnosis and root causation could be expedited by a review of the material. • A source of detection are also logs, typically SYSLOG or the logs from applications such as web servers (use ELK to create a mission control dashboard! • Tools like NETFLOW can assist in providing the precise time of outages and also be a primary tool for root causation. • Often it will assist to have screen scraping or enforce logging of access (such as log files when using SSH access and putty). • A disproportionate number of incidents being logged at the Service Desk are a potential indicator for a major incident.
  • 19. CrisisManagementFoundation Tools and Retrofit • When an outage happens it is not possible to retrofit a detection tool. • Surveillance of IT needs to be in place. • Gathering of SNMP metrics can provide a guideline for usage and congestion. • ICMP provides a means of detecting failures and degradation (latency). • Great poller for ICMP and SNMP is Opmantek’s NMIS. • Reference the section on tools in this course.
  • 20. CrisisManagementFoundation IS / IS NOT detection tool Description IS (Observation): IS NOT (Observation): What is the defect? Which processes are impacted? Where in the processes has the failure occurred ? Who is affected? When did it happen? How frequently did it happen? Is there a pattern? How much is it costing?
  • 21. CrisisManagementFoundation Alternative means • Detection from the Service Desk - display call centre queues from Service Desk to detect increased call volumes which can be an indication of problems. • Use social media such as tweetdeck to view notifications from own company clients; utilities such as power and water; local news or traffic.
  • 23. CrisisManagementFoundation Diagnose • One of the primary triggers for an outage is a change in the environment. • The first step in should be to determine if there has been a change. • The importance of recording precise times in the major incident lifecycle is now highlighted as this is used to correlate the outage to when the last known change was made. • Unauthorised changes also need to be investigated by reviewing anomalies, preferably in dashboards. • A key part of diagnosis is referring to the system documentation to see what should have happened. • Put eyes on the problem as soon as possible. • As part of the diagnosis process, it’s important to refer to previous major incident reports to assess whether the issue has occurred previously and whether the same actions can be followed to solve the issue.
  • 26. CrisisManagementFoundation The predecessor of the Flying Fortress: the birth of the checklist The Air Corps faced arguments that the aircraft was too big to handle. The Air Corps, however, properly recognised that the limiting factor here was human memory, not the aircraft’s size or complexity. To avoid another accident, Air Corps personnel developed checklists the crew would follow for take-off, flight, before landing, and after landing. The idea was so simple, and so effective, that the checklist was to become the future norm for aircraft operations. The basic concept had already been around for decades, and was in scattered use in aviation worldwide, but it took the Model 299 crash to institutionalize its use. “The Checklist,” Air Force Magazine
  • 27. CrisisManagementFoundation Checklists • Execute checklist to diagnose failures and outages. • Checklist can evolve to include items from lessons learnt. • The most common and often diagnosed checks should be prioritized and executed first. • Mechanism to transfer skill and knowledge (checklist should reflect the knowledge base). • Ability to improve time for diagnosis. • Examples of areas for checklists includes networks, data centres and information security. • Refer to the Appendix for a Network Troubleshooting checklist. the original checklist
  • 28. CrisisManagementFoundation Atul Gawande: How to Make Doctors Better Surgeon and author Atul Gawande says the very vastness of our knowledge gets in the way: doctors make errors because they simply can't remember it all. The solution isn't fancier technology or more training. It's as simple as an old-fashioned checklist, like those used by pilots, restaurateurs and construction engineers. When his research team introduced a checklist in eight hospitals in 2008, major surgery complications dropped 36% and deaths plunged 47%. from Time magazine
  • 29. The New England Journal of Medicine supports the use of checklists during a surgical emergency for better safety performance results. In a study of 100Michigan hospitals 30% of the time, surgical teams skipped one of these five essential steps: • washing hands • cleaning the site • draping the patient • applying a sterile dressing • donning surgical mask, gloves and gown But after 15 months of using a simple checklist, the hospitals cut their infection rate from 4 percent of cases to zero, saving 1,500 lives and nearly $200 million
  • 30. CrisisManagementFoundation Put eyes on the problem • The process followed to solve a murder is no different to the process followed when solving a crisis. • The location where the problem has occurred needs to be investigated. • It is preferable to secure the area and gather all evidence and log it, just like a crime scene. • This principle is also used in production and manufacturing environments.
  • 31. CrisisManagementFoundation Crime scene (location of problem) Taiichi Ohno, who refined the production systems at (TPS) Toyota Production System, would take new managers and engineers to the factory and draw a chalk circle on the floor. The subordinate would be told to stand in the circle and to observe and note down what he saw. When Ohno returned he would check - if the person in the circle had not seen enough he would be asked to keep observing. Ohno was trying to imprint upon his future managers and engineers that the only way to truly understand what happens in the factory was to go there. It was here that value was added and here that waste could be observed. This was known as Genchi Genbutsu and is a primary method used for solving problems. If the problem exists in the factory then it needs to be understood and solved in the factory and not on the top floors of some office block or city skyscraper.
  • 32. CrisisManagementFoundation Genchi Genbutsu 現地現物 – go see • Genchi Genbutsu sets out the expectation that it is a requirement to personally evaluate operations so that a first-hand understanding of situations and problems is derived. • Genchi Genbutsu means "go and see" and it is a key principle of the Toyota Production System. It suggests that in order to truly understand a situation one needs to go to gemba (現場) or, the 'real place' - where work is done.
  • 33. CrisisManagementFoundation Recording the event • An investigator will record the observations of eye witnesses. • These records serve as a basis for review. • What seems insignificant now, might be crucial when more becomes known about the problem. • Determine: • What • Why • When • Who • Where • How
  • 34. CrisisManagementFoundation Prevailing conditions and business impact • Take note of the prevailing conditions. • It is also important to take a snapshot of the prevailing conditions at the time of the problem. If the problem remains unresolved and it happens again, a comparison of prevailing conditions might provide significant insight. • These might be economic or even weather related. Don’t discount prevailing conditions. • If it is a technical problem it is important to determine and measure the business impact. • This needs to be assessed from a client and an internal organisational perspective. • When the probability of an occurrence is low, it is incorrect to assume that it will only happen way into the future. • Major incidents can happen anytime within the probability period and not at the end of the probability period.
  • 35. CrisisManagementFoundation Prevailing conditions On the morning of Monday, 29th August 2005 hurricane Katrina hit the Gulf coast of the US. New Orleans, Louisiana suffered the main brunt of the hurricane but the major damage and loss of life occurred when the levee system catastrophically failed. Floodwaters surged into 80% of the city and lingered for weeks. At least 1,836 people lost their lives in the hurricane and resulting floods, making it the largest natural disaster in the history of the United States. Video or better pic.
  • 36. CrisisManagementFoundation Prevailing conditions On July 31, 2006 the Independent Levee Investigation Team released a report on the Greater New Orleans area levee failures. In the report, it was noted that the hypothetical model storm upon which storm protection plans were based, (called the Standard Project Hurricane or SPH) model was simplistic. The report found that an inadequate network of levees, flood walls, storm gates and pumps were established. The report also found that “the creators of the standard project hurricane, in an attempt to find a representative storm, actually excluded the fiercest storms from the database.” Quote source
  • 37. CrisisManagementFoundation Visualization • It is one thing collecting data of a problem and recording it, but a totally different skill is required to interpret it. • Here you look at visual representations by graphing the data in an appropriate fashion. As an example, bar graphs are often referred to as Manhattan graphs. • Just as with the Manhattan skyline where the large buildings are prominent, so too is those significant bits of data that is represented in a graph. • Convert the data to a visual representation and this will aid in the process of solving the problems. • The visualisation present in the CMOC should always be designed to assist in diagnosis.
  • 38. Uptime is about reducing downtime: repair, recover, restore. 99.999%
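The arithmetic behind the nines is worth making explicit: 99.999% availability leaves only about five minutes of downtime per year. A small sketch:

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in ["99.9", "99.99", "99.999"]:
    availability = float(nines) / 100
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines}% availability -> {downtime:.1f} minutes of downtime per year")

# 99.999% leaves roughly 5.3 minutes per year: repair, recover and
# restore must all fit inside that budget.
```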
  • 39. CrisisManagementFoundation Workarounds (aka fire fighting) Especially when the crisis is significant, you need to be skilled at fighting fires. The problem might require an immediate workaround to maintain service: you are not solving the problem, but temporarily alleviating any further negative consequences.
  • 41. CrisisManagementFoundation Repair Following diagnosis come the activities associated with repairing the configuration item (CI) that failed. Hardware may need to be ordered, vendors contacted, consultants brought in, and so forth. The biggest gap here is understanding how a given CI was configured. Groups with an accurate configuration management system (CMS) know right away, whereas others will need to perform forensic archaeology to determine that, losing valuable time in the process.
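A toy sketch of why this matters: with a CMS record, repair starts from a lookup rather than an excavation. The CI names and fields below are invented for the example:

```python
# Toy configuration management system: CI name -> recorded configuration.
cms = {
    "web-01": {"model": "Dell R640", "os": "Ubuntu 22.04", "image": "std-web-v7"},
    "db-01":  {"model": "Dell R740", "os": "RHEL 9", "image": "std-db-v3"},
}

def repair_info(ci_name: str) -> dict:
    """Return the recorded build for a failed CI, or raise if unknown."""
    try:
        return cms[ci_name]
    except KeyError:
        # Without a CMS record the team is down to forensic archaeology.
        raise LookupError(f"No configuration record for {ci_name}")

print(repair_info("web-01"))
```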
  • 42. CrisisManagementFoundation Recover Once the CI is repaired, it must be brought back online, including reloading any necessary images, applications and/or data. Again, rapid, accurate knowledge about CIs will speed this up, as will having standard builds/images to restore from rather than building a unique system from scratch.
  • 43. CrisisManagementFoundation Restore This final step is the restoration of the service. It may be that related CIs must be rebooted in a certain order to re-establish connectivity, and so on. Service design documentation and/or standard operating procedures that are readily accessible and accurate will aid groups restoring services.
  • 44. CrisisManagementFoundation Collation • The information from each step in the Major Incident lifecycle must be collated. • This information is utilised as the basis of the Major Incident Report. • The collation involves all members of the Tiger Team and is typically managed and owned by the SLM/SDM or Process Owner. • It is generally done under a time constraint dictated by a service level agreement. • The collated report is always issued in draft first and reviewed by all internal parties.
  • 45. CrisisManagementFoundation Major Incident reporting • Generate the Major Incident report. • It contains a detailed description of the outage/failure: timing, sequencing, the actions taken, the people involved, resources, next steps and identified/remaining actions. • Typically a draft is issued to the business/client and discussed for agreement or update. • A final report is then issued to the client/business. • There may be resulting actions which need to be dealt with as a service request, a problem record or a project. • The KEDB (or CMDB, or another suitable repository) is updated. • If required, the incident may be fed into the Problem Management process for further analysis.
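As an illustration only, a minimal skeleton for such a report expressed as a Python dictionary; the field names mirror the bullets above but are an assumption, not a prescribed ITIL schema:

```python
# Skeleton for a Major Incident Report, mirroring the bullets above.
# Field names are illustrative, not a mandated format.
report = {
    "description": "",          # detailed description of the outage/failure
    "timeline": [],             # (timestamp, event) pairs: detect, diagnose, ...
    "actions_taken": [],
    "people_involved": [],
    "resources": [],
    "next_steps": [],
    "remaining_actions": [],    # may become service requests, problems, projects
    "status": "draft",          # draft -> reviewed -> final
}
```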
  • 46. CrisisManagementFoundation Solving the Toyota Production System way (5 minutes)
  • 47. CrisisManagementFoundation Review • Genchi Genbutsu • Time is money • The art of the workaround

Editor's Notes

  1. CM101 – Crisis Management Foundations Refer ITWeb article: https://lnkd.in/ehckK3T
  2. The Major Incident lifecycle
  3. Objectives
  4. Eddy Merckx, born 17 June 1945, is a Belgian considered to be the greatest pro-cyclist ever. He sells his own line of bicycles and I have owned one since 1997. He is one of my heroes, and his never-equaled domination while cycling led to his nickname, when the daughter of one French racer said, "That Belgian guy, he doesn't even leave you the crumbs. He's a real cannibal." The French magazine Vélo described Merckx as "the most accomplished rider that cycling has ever known." Merckx, who turned professional in 1965, won the World Championship thrice, the Tour de France and Giro d'Italia five times each, and the Vuelta a España once. He also won each of professional cycling's classic "monument" races at least twice. Merckx dominated his first Tour de France, winning by 17 minutes, 54 seconds. But it was Stage 17 that was most emblematic. Though comfortably in the yellow jersey, victory assured if he merely followed his rivals as modern champions do, Merckx risked blowing up and losing the Tour when he attacked over the top of the Tourmalet and then rode solo for 130 kilometres. He won the stage by nearly eight minutes. Merckx set the world hour record on 25th October 1972, covering 49.431 km at high altitude in Mexico City on a Colnago bicycle that had been lightened to a weight of 5.75 kg. Over 15 years starting in 1984, various racers improved the record to more than 56 km. However, because of the increasingly exotic design of the bikes and position of the rider, these performances were no longer reasonably comparable to Merckx's achievement. In response, the UCI in 2000 required a standard, more traditional bike to be used. When time trial specialist Chris Boardman, who had retired from road racing and had prepared himself specifically for beating the record, had another go at Merckx's distance 28 years later, he beat it by slightly more than 10 meters (at sea level). To date, only Boardman and Ondřej Sosenka have improved on Merckx's record using traditional equipment. Although Merckx's great moments were alone, he had the leadership quality that, when it counted, he was motivated to win. He didn't just win, he did the best he could, which exceeded expectations, as in that first Tour de France win. He was also like Amundsen (read about him here) in that he was an expert in the use of his equipment, which was highlighted when he set the benchmark for the world hour record. My Merckx bike is held in such high regard that I keep it in my bedroom to prevent it being stolen! In the major incident process, timelines are the most important aspect of the process to get right. The reason is that they are the best source of data for problem management, which oversees the process from a quality viewpoint. Deviations from the norm are clear indicators of underlying issues. The timelines in the major incident process are aligned with the ITIL process, where they are referred to as the Expanded Incident Lifecycle. The Expanded Incident Lifecycle has a path of Incident -> Detect -> Diagnose -> Repair -> Recover -> Restore. The times of each of these events should be diligently recorded, as well as the time at which a workaround becomes available and is implemented. For many IT people the times are confusing as they misunderstand the naming of the terms in the Expanded Incident Lifecycle. To better explain these terms, we'll use an analogy: riding a bike. I am riding my bike. It is a nice Sunday morning ride in the countryside.
The Incident happens: the rear wheel experiences a puncture. This is the time of the Incident. As it is the rear wheel I do not notice it immediately, and only detect the incident when the road starts to feel extremely bumpy. This is the detection time. I stop my bicycle and dismount. My mates with me do the same. We discuss the issue. It is clear that it is a puncture and that it was caused by a small nail, which is clearly visible. We can remove the nail and the tire will still be usable, but we need to either repair the tube or replace it. I have a spare tube in my saddle bag, and we agree that replacing the tube is the quickest and best way to continue on our journey. This is the time of diagnosis. We decide that this is a good time to have some water and a cool drink before we start replacing the tube. We also notice that the incident has happened at a very scenic location, so we take a few pictures. Finally, we start removing the wheel. This is the time of repair. We remove the wheel, remove the tire, replace the tube and reattach the tire. We put the wheel back on the bike. This is the time of recovery: the failed component is back in place. At this point we all decide to answer the call of nature. We then mount our bikes and continue our ride. This is the point and time of restoration: normal service has resumed. If we analyse the timelines in the incident above, we will notice a deviation from the norm in two time periods, i.e. the time to repair and the time to restore. These are where we had some drinks and took a pit stop. In the context of our ride this wasn't a big deal, but if we were in a competitive race we would in all probability have skipped those actions. In an actual IT incident the same principles apply.
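To make the gap analysis concrete, here is a minimal Python sketch with invented timestamps for the puncture story; the event names follow the Expanded Incident Lifecycle:

```python
from datetime import datetime

# Invented timestamps for the puncture story above.
events = {
    "incident": datetime(2024, 6, 2, 9, 0),
    "detected": datetime(2024, 6, 2, 9, 4),
    "diagnosed": datetime(2024, 6, 2, 9, 8),
    "repair_started": datetime(2024, 6, 2, 9, 25),   # drinks and photos first!
    "recovered": datetime(2024, 6, 2, 9, 40),        # wheel back on the bike
    "restored": datetime(2024, 6, 2, 9, 50),         # riding again, after the pit stop
}

order = list(events)
for earlier, later in zip(order, order[1:]):
    gap = (events[later] - events[earlier]).total_seconds() / 60
    print(f"{earlier} -> {later}: {gap:.0f} min")
# Large gaps (diagnosed -> repair_started, recovered -> restored)
# are exactly the deviations problem management should question.
```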
  5. Diagram of the Major Incident process. The notifications and escalations, including the interaction with the service desk and clients, are handled in the communications chapter.
  6. The best example of how time solved a problem is that of Harrison, a carpenter. Time solved the problem of determining longitude, and hence your exact position on Earth. Longitude is a geographic coordinate that specifies the east-west position of a point on the Earth's surface and is best determined using time measurements. Galileo Galilei proposed that, with accurate knowledge of the orbits of the moons of Jupiter, one could use their positions as a universal clock to determine longitude, but this was practically difficult, especially at sea. An English clockmaker, John Harrison, invented the marine chronometer, helping solve the problem of accurately establishing longitude at sea and thus revolutionising safe long-distance travel. Harrison's watches were rediscovered after the First World War, restored and given the designations H1 to H5 by Rupert T. Gould. Harrison completed the manufacturing of H4 in 1759.
  7. When working with problems, time is the most crucial attribute to record. The time an event happens and the time between events provide the most significant clues to a problem's source. As an example, it is important to know when the event occurred as opposed to when it was detected. The two might not necessarily have occurred at the same time, and that in itself could be a problem.
  8. An analysis of these times will assist in clarifying some of the following potential issues: When is the business impacted by major incidents? Is it at recognised stages like month-end? Is the return to service being prioritised? Are we detecting incidents quickly? Are the systems being suitably managed or monitored? Are the incidents correctly diagnosed? Is this diagnosis performed within expected time parameters? Are technicians suitably trained? Are repair processes initiated within suitable time limits after diagnosis? Is there a logistics issue? Are restore times adequate? Is there an issue around continuity or outdated technology? Does the system start processing, and become functional in a useful manner to the business, in an acceptable time period after being restored? Are there cumbersome interface issues?
  9. As per the previous note.
  10. Timelines
  11. How do you improve? Understand what makes up the time periods from outage to full resolution. Which of those were less than optimal? Detection – time between when the outage occurred and when it was known (does the monitoring tool work? Do you detect HD RAID failures? Do you detect redundant network path failures?) Diagnostic time – working out what went wrong. How good are your troubleshooting skills? Have you identified the correct causes? Ready to repair – being able to gather all required resources to fix what is broken (are the parts available?) Recovered – the failed components have been fixed and are ready to be placed back in production. Restoration time – the system is back in production and cooking on gas. Notification times – customers and users of the system are informed (do they know they can transact?) Risk profile completion time – time to gather and analyse the risk associated with the incident. Countermeasures implementation – time by which relevant countermeasures are implemented to reduce identified threats.
  12. As per the previous note.
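A minimal sketch of flagging which periods were less than optimal; both the measured values and the targets below are invented, and in practice the targets would come from the SLA:

```python
# Measured phase durations (minutes) vs assumed targets; both illustrative.
measured = {"detection": 12, "diagnosis": 95, "repair": 30, "recovery": 20, "restoration": 15}
targets  = {"detection": 5,  "diagnosis": 60, "repair": 45, "recovery": 30, "restoration": 20}

for phase, actual in measured.items():
    flag = "OVER TARGET" if actual > targets[phase] else "ok"
    print(f"{phase:12s} {actual:4d} min (target {targets[phase]:3d}) {flag}")
```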
  13. Using time to become effective and efficient
  14. Metrics
  15. Measurements
  16. Detection
  17. When disaster has occurred it is important to record the events; numerous mechanisms are possible depending on the outage. It is possible to use video surveillance or even smartphone cameras to take pictures of what has occurred. This might help later, as diagnosis and root causation could be expedited by a subsequent review of the material. Logs are also a source of detection, typically SYSLOG or the logs from applications such as web servers (use ELK to create a mission-control dashboard!). Use of NetFlow can assist in providing the precise time of outages and is also a primary tool for root causation. It often helps to have screen scraping or to enforce logging of access (such as log files when using SSH access and PuTTY). A disproportionate number of incidents being logged at the Service Desk is a potential indicator of a major incident (but the question should be asked as to why a more automated tool hasn't detected the problem). Refer NetFlow - https://en.wikipedia.org/wiki/NetFlow
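As an illustration, a small sketch of pulling the first error timestamp out of a syslog-style log to compare occurrence time against detection time; the log lines and times are invented:

```python
from datetime import datetime

# Invented syslog-style lines; in practice these come from /var/log/syslog,
# an ELK stack, or the application's own logs.
lines = [
    "Jun  2 09:00:12 web-01 app[311]: ERROR database connection refused",
    "Jun  2 09:00:15 web-01 app[311]: ERROR database connection refused",
]

first_error = next(l for l in lines if " ERROR " in l)
# Classic syslog timestamps omit the year; assume the current one for the sketch.
stamp = datetime.strptime(first_error[:15], "%b %d %H:%M:%S").replace(year=2024)
print("First logged error:", stamp)

# Compare against the time the service desk first heard about it:
detected = datetime(2024, 6, 2, 9, 18)
print("Detection lag:", (detected - stamp).total_seconds() / 60, "minutes")
```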
  18. Tools and retrofit
  19. "IS – IS NOT" is an example of a tool that facilitates the determination of which components are involved in an outage. The technique reduces the risk of components being implicated falsely. At the end of the exercise the components involved are confirmed, which allows diagnosis to continue.
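A minimal, invented illustration of an IS / IS NOT split for a hypothetical branch outage:

```python
# Invented IS / IS-NOT worksheet for a hypothetical branch outage.
analysis = {
    "what":  {"is": "POS transactions failing", "is_not": "email, telephony"},
    "where": {"is": "Branch 014 only",          "is_not": "other branches, head office"},
    "when":  {"is": "since 09:00 Monday",       "is_not": "before the 08:45 router change"},
}
for dim, split in analysis.items():
    print(f"{dim:6s} IS: {split['is']:30s} IS NOT: {split['is_not']}")
```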
  20. Tweetdeck – refer https://tweetdeck.twitter.com/
  21. Diagnosis
  22. Diagnose
  23. Reference: https://lnkd.in/efjZqhr
  24. The predecessor of the Flying Fortress: the birth of the checklist. "Still, the Air Corps faced arguments that the aircraft was too big to handle. The Air Corps, however, properly recognized that the limiting factor here was human memory, not the aircraft's size or complexity. To avoid another accident, Air Corps personnel developed checklists the crew would follow for take-off, flight, before landing, and after landing. The idea was so simple, and so effective, that the checklist was to become the future norm for aircraft operations. The basic concept had already been around for decades, and was in scattered use in aviation worldwide, but it took the Model 299 crash to institutionalize its use." – "The Checklist," Air Force Magazine
  25. In crisis management, especially during a major incident, the team that is responsible for identifying a potential repair is known as the delta team. The delta team is a tiger team (read more about them here). The team is specifically responsible for diagnosis, which is the process that delivers the potential repair. In this article we will be referring to Information Technology (IT) major incidents, but many of the concepts are generic to all types of crisis management. Now the team never has a live cat thrown over the wall, but a dead one! The team often has to start from a clean slate in diagnosis. The first actions around diagnosis are usually to work through various checklists, depending on what type of dead cat has been thrown. In an optimised process the dead cat would have a note attached. In the context of a major incident, a preliminary checklist would have been completed and the note would be the results of that checklist. Checklists can take various forms and are used to compensate for the weaknesses of human memory, helping to ensure consistency and completeness in carrying out a task. Checklists came into prominence with pilots: the pilot's checklist was first developed and used in 1935, when a serious accident hampered the adoption into the armed forces of a new aircraft (the predecessor of the famous Flying Fortress). The pilots sat down and put their heads together. What was needed was some way of making sure that everything was done; that nothing was overlooked. What resulted was a pilot's checklist. Actually, four checklists were developed: take-off, flight, before landing, and after landing. The new aircraft was not "too much aeroplane for one man to fly", it was simply too complex for any one man's memory. These checklists for the pilot and co-pilot made sure that nothing was forgotten. Additionally, the plane had two pilots to ensure continuity of operations should there be a problem with one of the pilots. During operations, especially IT ones, it is important to document and record dependencies. Often these are too many for a single individual to remember, and thus lists capture those critical requirements that would otherwise have slipped through the cracks.
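As an illustration only, a minimal checklist runner in Python; the items are invented and stand in for whatever a team's preliminary major-incident checklist would actually contain:

```python
# Invented preliminary checklist for a reported branch outage.
checklist = [
    "Confirm scope: single branch or multiple sites?",
    "Check power status at the site",
    "Check WAN link status from the provider portal",
    "Verify monitoring picked the incident up (if not, why not?)",
    "Record all timestamps before moving to diagnosis",
]

def run(items: list[str]) -> None:
    """Walk the checklist, requiring an explicit answer per item."""
    for i, item in enumerate(items, 1):
        done = input(f"[{i}/{len(items)}] {item} (y/n) ").strip().lower() == "y"
        print("  ok" if done else "  ** follow up before proceeding **")

# run(checklist)  # interactive; uncomment to walk the list at the console
```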
  26. The concept of using checklists in medicine is explained by Dr Atul Gawande in the YouTube video of his presentation to TED here. Although the talk focusses on medicine, it also has great relevance to IT! Strangely enough, there is no suitable checklist app available in any app store, especially for IT. Well, often the easiest repair for a dead cat, if it is really dead, is to buy a new cat. But is the cat really dead? Take, for example, a remote branch. If a systems outage is reported at the branch, and a similar outage does not exist at other locations, two obvious scenarios are that the link to the branch is non-operational or that the systems used for access in the remote branch aren't functioning. If we focus for a moment on the latter, it would obviously be difficult to make a determination without an out-of-band mechanism. As an example, in South Africa we have a disproportionate amount of load shedding due to the electrical utility's lack of grid maintenance and oversight. Normal systems, such as network management systems, use the same infrastructure (which is now not functioning) to determine status. This is known as in-band, and clearly this type of diagnosis is irrelevant here. What is required is an out-of-band system: a monitoring board with its own separate battery backup pack that uses a third-party network connection, such as a mobile network, to poll and sample the state of operations at the branch. This monitoring board would sample the power status, and the delta team would immediately be able to assess whether they are dealing with a power outage or a potential hardware fault. Multiple power probes can determine whether it is utility related or whether the cleaner has unplugged the network equipment to power the vacuum cleaner. The monitoring board is also a potential Swiss army knife of diagnosis. Wireless asset probes can determine whether the network switch and router have been stolen. A location device on the monitoring board itself can determine if it has been moved or is a target of theft itself. Additional probes, for water floods and overheating for example, can also be added. Obviously, when these initial checklists have been completed and further diagnosis is required, the next important step is to put eyes on the problem. Delaying this and attempting to continue endless remote diagnosis is not productive. From personal experience, I was once dealing with intermittent outages at a remote site. The network management system and metrics were analysed till I was blue in the face. This continued for two weeks. Finally, I climbed on an aircraft and went to visit the location, which was a Toyota car manufacturing plant. At the plant we went to the paint shop, where the network equipment had the symptoms of intermittent faults. The network equipment was at the top of the building near the roof and we had to climb the access gangways to the top. Once there, we immediately realised what the problem was when we laid eyes on the equipment. Pigeons were roosting above the network equipment rack, and over the course of a few years the pigeon poo had started to cake on the equipment. Well, poo is acidic: it had eaten into the casing of the equipment, gone through it, and was now starting on the PCB boards. No amount of remote diagnosis would have solved the pigeon poo problem! Hardware failures are an obvious issue as they result in a blackout. More difficult to diagnose is the brownout.
This is a degradation in service, not a total outage. In this case, in-band tools that provide insight into customer experience are needed. Often poor customer experience is a result of the customer's own data pollution. It could be that malware has entered the computer system of a customer, generating excessive spam email traffic which saps the network link. Or peer-to-peer file exchange may be occurring, in violation of copyright laws, while absorbing great network capacity. A group of customers might be viewing videos in HD. These sorts of problems can make customers think something is wrong with their network service when, in reality, the service is working fine and provides plenty of bandwidth for proper and legitimate usage. The delta team gains access to special flow analysis software and systems available for their networking equipment that provide excellent insight into the exact real-time sources of loads on the network links under investigation. Typically, as a network operator, a team will have access to ITU-T Y.1564 metrics. These metrics provide insight into actual customer bandwidth (usage), latency (response), jitter (variance), loss (congestion), Service Level Agreement (SLA) compliance and availability. They are typically available as attributes of a Carrier Ethernet link and provide accelerated insight into whether an issue is customer related or network-operator related. Although more will be written about diagnosis in the major incident process, another large source of investigation that can assist in finding a repair is an analysis of recent changes. Additionally, a repository of the latest changes is beneficial for providing the romeo team with a working configuration of a system. This will be clarified in greater detail in a future article. Checklists are an important and often overlooked tool. Tom Peters has this to say about checklists: Process & Simplicity: Checklists!! Complexifiers often rule—in part the by-product of far too many "consultants" in the world, determined to demonstrate the fact that their IQs are higher than yours or mine. Enter Johns Hopkins' Dr Peter Pronovost. Dr P was appalled by the fact that 50% of folks in ICUs (90,000 at any point—in the U.S. alone) develop serious complications as a result of their stay in the ICU, per se. He also discovered that there were 179 steps, on average, required to sustain an ICU patient every day. His answer: Dr P "invented" the … ta-da … checklist! With the religious use of simple paper lists, prevalent ICU "line infection" errors at Hopkins dropped from 11% to zero—and stay-length was halved. (Results have been consistently replicated, from the likes of Hopkins to inner-city ERs.) "[Dr Pronovost] is focused on work that is not normally considered a significant contribution in academic medicine," Dr Atul Gawande wrote in "The Checklist" (New Yorker, 12.10.07). "As a result, few others are venturing to extend his achievements. Yet his work has already saved more lives than that of any laboratory scientist in the last decade."
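A minimal sketch of the operator-versus-customer determination described above; the measurements are invented and the thresholds are illustrative (the real numbers live in the customer's SLA, not in Y.1564 itself):

```python
# Invented measurements for a Carrier Ethernet link, checked against
# illustrative SLA thresholds.
measured = {"latency_ms": 38.0, "jitter_ms": 2.1, "loss_pct": 0.4}
limits   = {"latency_ms": 30.0, "jitter_ms": 5.0, "loss_pct": 0.1}

breaches = {m: v for m, v in measured.items() if v > limits[m]}
if breaches:
    print("Operator-side degradation suspected:", breaches)
else:
    print("Link within SLA; look at customer-side load (malware, P2P, HD video).")
```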
  27. Infographic about checklists
  28. Crime scene
  29. Taiichi Ohno, who refined the Toyota Production System (TPS), would take new managers and engineers to the factory and draw a chalk circle on the floor. The subordinate would be told to stand in the circle and to observe and note down what he saw. When Ohno returned he would check; if the person in the circle had not seen enough he would be asked to keep observing. Ohno was trying to impress upon his future managers and engineers that the only way to truly understand what happens in the factory was to go there. It was there that value was added and there that waste could be observed. This became known as Genchi Genbutsu and is a primary method for starting to solve problems. If the problem exists in the factory then it needs to be understood and solved in the factory, not on the top floor of some office block or city skyscraper.
  30. Genchi Genbutsu sets out the expectation that operations must be evaluated personally, so that a first-hand understanding of situations and problems is derived. Genchi Genbutsu means "go and see" and is a key principle of the Toyota Production System. It suggests that in order to truly understand a situation one needs to go to gemba (現場), the 'real place' where work is done.
  31. Recording the account of what happened
  32. Prevailing conditions and business impact
  33. On the morning of Monday, 29th August 2005, Hurricane Katrina hit the Gulf coast of the US. New Orleans, Louisiana suffered the main brunt of the hurricane, but the major damage and loss of life occurred when the levee system catastrophically failed. Floodwaters surged into 80% of the city and lingered for weeks. At least 1,836 people lost their lives in the hurricane and resulting floods, making it one of the worst natural disasters in the history of the United States.
  34. On July 31, 2006 the Independent Levee Investigation Team released a report on the Greater New Orleans area levee failures. Their report “identified flaws in design, construction and maintenance of the levees. But underlying it all, the report stated, were the problems with the initial model used to determine how strong the system should be.” The hypothetical model storm upon which storm protection plans were based is called the Standard Project Hurricane or SPH. The model storm was simplistic, and led to an inadequate network of levees, flood walls, storm gates and pumps. The report also found that “the creators of the standard project hurricane, in an attempt to find a representative storm, actually excluded the fiercest storms from the database.”
  35. It is one thing to collect and record data about a problem, but a different skill altogether is required to interpret it. Here you look at visual representations by graphing the data in an appropriate fashion. As an example, bar graphs are often referred to as Manhattan graphs. Just as the large buildings are prominent in the Manhattan skyline, so too do the significant bits of data stand out in a graph. Convert the data to a visual representation and this will aid in the process of solving the problem. The visualization present in the NOC should always be designed to assist in diagnosis. Refer to examples of graphing of times in the Major Incident Lifecycle.
  36. Uptime is about reducing downtime
  37. Firefighting
  38. Video refer: https://lnkd.in/eVF7XUy
  39. Firefighting
  40. Firefighting
  41. Firefighting
  42. Firefighting
  43. Incident consequence analysis
  44. Reference: https://lnkd.in/eCZ4X5c 1. Clarify the problem: align to the ultimate goal or purpose and identify the ideal situation, the current situation and the gap. 2. Break down the problem: split it into manageable pieces using the 4 Ws, finding the prioritized problem, process and point of cause. 3. Set a target: set the target at the point of cause and determine "how much" and "by when". 4. Analyze the root cause: brainstorm multiple potential causes by asking WHY, and determine the root cause by going to see the process. 5. Develop countermeasures: brainstorm countermeasures, narrow them using criteria, develop a detailed action plan and gain consensus. 6. See countermeasures through: share the status of the plan by reporting, informing and consulting; build consensus; never give up; think and act persistently. 7. Evaluate: determine whether the target was achieved, evaluate from the three viewpoints, and look at both process and results. 8. Standardize: standardize successful practices, share results and start the next round of kaizen.
  45. Review