Major
Incident
Management
Andrew Vermes
Time for some fun!
• Why are major incidents so hard?
• Straightforward ways to share information
• A live, interactive simulation
Caught in headlights
• It’s easy to panic
• Calm atmosphere helps
Keep questions helpful
When will it be fixed?
Is everything under control?
Who’s responsible for this?
Did you change anything?
Why did we let this happen?
Too much vital information is hidden
Case Opened → Some Emails & Calls → Magic Happens → Some more Emails → Case Closed
Sometimes the magic is “Spoke to Mike & Steve”
Most case
documentation
does not include
the actual root
cause of a problem,
therefore knowledge
reuse is minimised
Processes must be second nature
Emergency services
depend on:
•Clear processes
•Checklists
•Repeated training
Limited time to
think: responses
need to be fast
and automatic
Four things need effective control:
• What’s happening?
• How will we restore
service?
• What risks are there?
• What investigation is
needed?
Major
Incidents
Understanding
Investigating
Restoring
Preventing
Separate the workstreams
Four key questions:
What’s happening?
What will we do about it?
What risks are there?
How do we find the cause?
The goal is always effective action
• SITUATION APPRAISAL: To sort out priority actions
• DECISION ANALYSIS: To select the right fix or workaround
• POTENTIAL PROBLEM ANALYSIS: To manage risks
• PROBLEM ANALYSIS: To find true cause
Keep your working information visible
• Separate
into 4 areas
Keep information updated
• Ensure your
dashboard is
updated in
real time
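A four-area dashboard that stays current can be sketched as a small data structure. This is only an illustration, not a Kepner-Tregoe tool: the class name `Dashboard`, its field names, the `is_stale` check, the 10-minute freshness threshold, and the placeholder year are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

AREAS = ("overview", "decisions", "risks", "investigation")


@dataclass
class Dashboard:
    """Four-area incident dashboard; all names here are illustrative assumptions."""
    summary: str
    overview: List[str] = field(default_factory=list)       # What's happening?
    decisions: List[str] = field(default_factory=list)      # What will we do about it?
    risks: List[str] = field(default_factory=list)          # What risks are there?
    investigation: List[str] = field(default_factory=list)  # How do we find the cause?
    updated_at: Optional[datetime] = None

    def add(self, area: str, entry: str, now: datetime) -> None:
        """Append an entry to one area and record when the dashboard last changed."""
        if area not in AREAS:
            raise ValueError(f"unknown area: {area}")
        getattr(self, area).append(entry)
        self.updated_at = now

    def is_stale(self, now: datetime, max_age_minutes: int = 10) -> bool:
        """True if nothing has been recorded within max_age_minutes (assumed threshold)."""
        if self.updated_at is None:
            return True
        return (now - self.updated_at) > timedelta(minutes=max_age_minutes)


board = Dashboard(summary="Bricksorting robot TX72-6 problems today")
t0 = datetime(2000, 8, 30, 11, 15)  # the year is an arbitrary placeholder
board.add("overview", "Sorting system broken down", t0)
board.add("overview", "Impact to client = $1,000/minute", t0)
print(board.is_stale(t0 + timedelta(minutes=5)))   # False
print(board.is_stale(t0 + timedelta(minutes=15)))  # True
```

Keeping the four areas as separate lists mirrors the separate workstreams, and the timestamp gives the team a simple prompt when the dashboard has not been touched for a while.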
Major Incident Simulation
Please go to:
http://mindstormrobotics.com/
You will find information about
the incident there
SITUATION
AUGUST
30
Tom Lewis, production manager at
Stockholm Brick Company, has contacted
you because one of their brick sorting
robots, the TX72-6, has a problem.
You repaired the conveyor belt yesterday, but now the
sorting system has broken down again. The breakdown
costs the company roughly 1,000 dollars per minute.
Tom requests your assistance to solve the problem
immediately.
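To put the stated impact in perspective, the running cost of the outage can be worked out directly from the scenario's figure of roughly 1,000 dollars per minute. The function name `downtime_cost` is hypothetical, and the 10, 30, and 60 minute marks are simply the checkpoints used in this simulation.

```python
COST_PER_MINUTE_USD = 1_000  # "roughly 1,000 dollars per minute" from the scenario


def downtime_cost(minutes_down: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Approximate cost of an outage that has lasted minutes_down minutes."""
    if minutes_down < 0:
        raise ValueError("minutes_down must be non-negative")
    return minutes_down * cost_per_minute


for minutes in (10, 30, 60):
    print(f"{minutes} min down: ${downtime_cost(minutes):,.0f}")
# 10 min down: $10,000
# 30 min down: $30,000
# 60 min down: $60,000
```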
Teamwork is key
In teams:
• Review your information
• Complete the Major Incident Dashboard
• Decide what to do next
• Take the necessary actions and update the
dashboard
Updating your dashboard
• Keep
information
visible to your
team
Check point
• How much has your team spent so far?
• What progress has been made in incident recovery?
• Which moves did you make that added value?
Wrap up
Start with a clear incident statement
(what process, what symptom)
Look early for similar but unaffected CIs for comparison
The solution is…
First 10 minutes
KT INCIDENT DASHBOARD time: August 30, 11:15
Incident Summary: Bricksorting robot TX72-6 problems today
Incident Overview & Action Plan
(Customer Issues, Priorities, Impact - What When Where | Actions Needed | Due | Done | Who)
• Bricksorting robot TX72-6 problems
• Sorting system broken down
• Conveyor repaired yesterday
• Impact to client = $1,000/minute
Problem Investigation - Finding Cause
(Diagnostic Data we have | Possible Causes | Next Investigation step | Due | Done | Who)
• (empty)
Decisions to be made
(Goals fix must meet | Possible fixes | Best Fit? | Due | Who)
• Goals fix must meet: Restore fast; Certainty of fix; Avoid causing other incidents
Risk Management
(Risks and Opportunities list | Preventive Actions | Contingent Actions | Due | Done | Who)
• (empty)
After 30 minutes
KT INCIDENT DASHBOARD time: August 30
Incident Summary: Brick sorting robot sorting incorrectly; possible causes determined.
Incident Overview & Action Plan
(Customer Issues, Priorities, Impact - What When Where | Actions Needed | Due | Done | Who)
• Bricksorting robot TX72-6 problems
• Sorting system broken down
• Conveyor repaired yesterday
• Impact to client = $1,000/minute
Problem Investigation - Finding Cause
(Diagnostic Data we have | Possible Causes | Next Investigation step | Due | Done | Who)
• Diagnostic Data we have: The chute assembly presses long against the touch sensor; IS NOT badly sorting bricks when initially moving right; Happens during reset; Started August 30
• Possible Causes: Application update yesterday; Touchsensor sw update; Database slow to respond
Decisions to be made
(Goals fix must meet | Possible fixes | Best Fit? | Due | Who)
• Goals fix must meet: Restore fast; Certainty of fix; Avoid causing other incidents
• Possible fixes: Reload database; Replace touch sensor; Repair conveyor (again); Replace control unit; Replace Motor B
Risk Management
(Risks and Opportunities list | Preventive Actions | Contingent Actions | Due | Done | Who)
• (empty)
After 60 minutes
KT INCIDENT DASHBOARD time: August 30
Incident Summary: Brick sorting robot TX72-6 now sorting correctly
Incident Overview & Action Plan
(Customer Issues, Priorities, Impact - What When Where | Actions Needed | Due | Done | Who)
• Bricksorting robot TX72-6 problems
• Sorting system broken down
• Conveyor repaired yesterday
• Impact to client = $1,000/minute
Problem Investigation - Finding Cause
(Diagnostic Data we have | Possible Causes | Next Investigation step | Due | Done | Who)
• Diagnostic Data we have: The chute assembly presses long against the touch sensor; IS NOT badly sorting bricks when initially moving right; Happens during reset; Started August 30
• Possible Cause: Application update yesterday. Next Investigation step: Check log files for evidence
• Possible Cause: Touchsensor sw update. Next Investigation step: Check log files for evidence
• Possible Cause: Database slow to respond. Next Investigation step: Contact DB team
Decisions to be made
(Goals fix must meet | Possible fixes | Best Fit? | Due | Who)
• Goals fix must meet: Restore fast; Certainty of fix; Avoid causing other incidents
• Possible fixes: Reload database; Replace touch sensor (Best Fit? X); Repair conveyor (again): temp; Replace control unit; Replace Motor B
Risk Management
(Risks and Opportunities list | Preventive Actions | Contingent Actions | Due | Done | Who)
• Risks and Opportunities list: Replacing Touch sensor
• Preventive Actions: Review work instructions; Check part upfront for DOA
• Contingent Actions: Contact MR support
Takeaways for Major Incident Managers
(and for everyone):
1. Avoid holding people on calls for many hours
2. Keep all information visible in real time
3. Require specific information from participants
4. Run regular simulations to keep skills high
5. Set clear update times and stick to them
Leaders in Problem Solving
twitter.com @KepnerTregoe
facebook.com/KepnerTregoe
linkedin.com/company/kepner-tregoe
Andrew Vermes
Senior Consultant
+44 (0) 7973506628
avermes@kepner-tregoe.com
Andrew Vermes: Major Incident Management


Editor's Notes

  1. KT discovered early on that effective problem solvers focused on one thing at a time. Understanding what’s happening, and its impact, is a necessary prelude to the other four colours. In some cases we can go directly to a solution; in others we need to step back and understand the risk before we take action. If there are choices about the fix, we need to consider the goals and environment in a quick Decision Analysis. Sometimes it’s unwise to move forward if we have no idea about the cause, because we might make things worse, so Problem Analysis needs to be done. This session is about speeding that path to root cause, so now a challenge for you.