What the NTSB teaches us about
incident management & postmortems
Michael Kehoe
Staff Site Reliability Engineer
Agenda and Vision
Today’s
agenda
1 Introductions
2 Background on the NTSB
3 NTSB: Investigative Process
4 Recommendations & Most Wanted List
5 How does this apply to us?
6 Final thoughts
Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team:
• Disaster Recovery
• Incident Response
• Visibility Engineering
• Reliability Principles
• Find me online at:
• @matrixtek
• https://michael-kehoe.io
• linkedin.com/in/michaelkkehoe
Production-SRE Team @ LinkedIn
$ /USR/BIN/WHOAMI
● Disaster Recovery - Planning & Automation
● Incident Response – Process & Automation
● Visibility Engineering – Making use of
operational data
● Reliability Principles – Defining best practice
& automating it
Incident Command System (ICS)
https://training.fema.gov/emiweb/is/icsresource/assets/reviewmaterials.pdf
Background on the NTSB
Background on the NTSB
JURISDICTION
● Aviation
● Surface Transportation
● Marine
● Pipeline
● Assistance to other agencies/ governments
“The NTSB shall investigate or have investigated and
establish the facts, circumstances, and cause or
probable cause of accidents…”
49 U.S. Code § 1131
“… The Board shall report on the facts and
circumstances of each accident investigated…The
Board shall make each report available to the public
at reasonable cost…”
49 U.S. Code § 1131
“The NTSB does not assign fault or blame for an
accident or incident…accident/incident
investigations are fact-finding proceedings with no
formal issues and no adverse parties … and are not
conducted for the purpose of determining the rights
or liabilities of any person.”
49 CFR § 831.4
Similar Organizations
● Italy – Agenzia Nazionale per la
Sicurezza del Volo (ANSV)
● Canada – Transportation Safety Board
of Canada (TSB)
● Indonesia – Komite Nasional
Keselamatan Transportasi (NTSC)
● Netherlands – Dutch Safety Board
(DSB)
● Australia – Australian Transport Safety
Bureau (ATSB)
● United Kingdom – Air Accidents
Investigation Branch (AAIB)
● Germany – Bundesstelle für
Flugunfalluntersuchung (BFU)
● France – Bureau d’Enquêtes et
d’Analyses pour la Sécurité de
l’Aviation Civile (BEA)
NTSB Investigation Process
NTSB Investigation Process
1. Pre-Investigation Preparation
2. Notification & Initial Response
3. On-Scene Activities
4. Post-On-Scene Activities
1. Pre-Investigation
Preparation
Pre-Investigation Preparation
GO TEAM
● Go team: On call investigators ready for
assignments
● Investigator-In-Charge (IIC) pre-assigned
● Full Go team may contain several subject
matter experts; e.g.
○ Human performance
○ Aircraft performance
○ Air Traffic Control
Pre-Investigation Preparation
GO TEAM ROSTER
● On-call roster made available internally
○ Phone & pager numbers
● Updated weekly
● All personnel should be able to arrive at an
airport within 2 hours of notification
○ Should have essentials on them if they
live far away from an airport
● Division Chiefs responsible for testing pagers
2. Notification & Initial
Response
Notification & Initial Response
REGIONAL RESPONSE
1. Regional office notifies headquarters of
incident
2. Closest regional office to accident will
provide at least one investigator to perform
PR & “stakedown”
Notification & Initial Response
HEADQUARTERS RESPONSE
1. After incident occurs: communication center
advises IIC and chief of Major Investigations
(who subsequently inform their superiors)
2. OAS director decides whether to launch a
Go-Team
3. Other executives are made aware by Chief of
Major Investigations
Notification & Initial Response
NOTIFICATION & ASSIGNMENTS
● Go-Team composition determined by
incident circumstances
● Send more specialists if in doubt
Notification & Initial Response
PARTY NOTIFICATION
● IIC gives party status to organizations that
can provide technical assistance (airlines,
aircraft manufacturers etc.)
● Communication center will help with travel
arrangements and on-site administrative
support
● Go-Team will travel together to accident site
3. On-Scene Activities
On-Scene Activities
COMMAND ROOMS
● Have meeting rooms to accommodate at least
30 people
● Have space for media
● Ensure you have equipment in command
room
○ PCs
○ Telephone systems
○ Forms
● IIC is responsible for managing this
On-Scene Activities
COMMAND ROOMS
● For major investigations, administrative
support is provided
● Government purchase card is available for
goods or services
On-Scene Activities
ORGANIZATIONAL MEETING
● Share preliminary information
● Organize (assign) participants
● Organize observers
● Establish lines of authority
“The manner in which the IIC conducts the
organizational meeting will establish the tone of the
investigation. Therefore, the importance of being
organized, articulate, assertive, composed, and
understanding cannot be overstated”
Major Investigations Manual Sec 3.2
On-Scene Activities
ACCIDENT SITE SAFETY PRECAUTIONS
● Safety officer identifies & classifies risks and
then develops counter-measures
● Safety officer performs daily briefings to
accident site team.
On-Scene Activities
OBSERVERS
● Observers may be allowed if they do not have
self-interest
● May include:
○ Congressional oversight committee(s)
○ Military personnel
○ Foreign Governments
○ Federal Agencies
On-Scene Activities
LINE OF AUTHORITY
● IIC is the most senior person on-scene and all
investigative activity is under his/ her control
● If IIC cannot resolve an issue, IIC may talk to
Chief of Major Investigations
● Ability to escalate further if required
On-Scene Activities
PROGRESS MEETINGS
● On-site progress meetings are held daily to:
○ Disseminate information obtained
○ Plan the day’s activities
○ Discuss plans for subsequent
investigative activities
● Generally start at 6pm
● Plan next day’s meeting
On-Scene Activities
DAILY ACTIVITIES OF IIC
● Headquarters briefing
● Safety board staff meeting
● Party coordinator meeting
● Site visit
4. Post-On-Scene Activities
NTSB Report Structure
● Factual Information – Gathering facts about the incident
● Appendices – Extra information
● Analysis – Analyze how the facts contributed to the incident
● Conclusions – Draw conclusions about what happened
● Recommendations – Write detailed recommendations
Post-On-Scene Activities
WORK PLANNING
● Discuss activities that will follow the on-scene
phase of investigation
● Build timelines for work
● Provides avenues for various teams to work
together
Post-On-Scene Activities
FACTS & ANALYSIS REPORT
● A factual report based on the field notes and
subsequent investigation activities
● Each group chairman shall submit an analysis
report based on the information contained in
his or her factual report.
Post-On-Scene Activities
PUBLIC HEARING
● Led by IIC/ Hearing Officer
● Identify witnesses whose testimony is
appropriate
● The witnesses may be from the parties to the
investigation or can be suggested by one or
more of the parties.
● Purpose: To ensure all relevant information is
gathered before writing the report
Post-On-Scene Activities
TECHNICAL REVIEW
● Provides an additional opportunity for all
parties to review all factual information
● Ensures all issues are resolved
● Technical Review is held as soon as possible
after public hearing
Post-On-Scene Activities
PREPARATION OF FINAL REPORT
● Dedicated department to help write report
● Follows a standard template
○ ICAO Annex 13 (to the Convention on
International Civil Aviation)
● Contains formal recommendations to
manufacturers/ transportation authorities
Recommendations &
Most Wanted List
Recommendations & Most Wanted List
● NTSB advocates for particular action items
based on report(s):
○ Generally directed towards Transport
bodies/ manufacturers
● NTSB publicly tracks response of the
responsible body
https://www.ntsb.gov/safety/mwl/Pages/default.aspx
How does this relate to all of us?
1. Pre-Investigation
Preparation
Applying this to operations
PRE-INCIDENT PREPARATION
● Have an Incident commander pre-assigned
● Publish on-call schedules
○ Manager is responsible
● Test on-call pagers regularly
● Ensure that you can respond within SLA
● Keep a printed copy of on-call contact info
● DR
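The checklist above can be turned into a small automated readiness check. This is a minimal sketch, not LinkedIn's tooling: the `OnCallEntry` fields, the 15-minute SLA and the 7-day pager-test window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class OnCallEntry:
    """One row of a hypothetical on-call roster."""
    name: str
    phone: str
    last_pager_test: date   # date of the most recent successful test page
    response_minutes: int   # time this person needs to get online


def readiness_problems(roster, today, sla_minutes=15, max_test_age_days=7):
    """Return human-readable problems; an empty list means the roster is ready."""
    problems = []
    for entry in roster:
        if not entry.phone:
            problems.append(f"{entry.name}: no phone number on file")
        if today - entry.last_pager_test > timedelta(days=max_test_age_days):
            problems.append(f"{entry.name}: pager not tested in the last {max_test_age_days} days")
        if entry.response_minutes > sla_minutes:
            problems.append(f"{entry.name}: cannot respond within the {sla_minutes}-minute SLA")
    return problems


roster = [
    OnCallEntry("alice", "555-0100", date(2024, 1, 8), 10),
    OnCallEntry("bob", "", date(2023, 12, 1), 30),
]
for problem in readiness_problems(roster, today=date(2024, 1, 10)):
    print(problem)
```

Running a check like this on a schedule mirrors the NTSB practice of updating the roster weekly and making someone responsible for testing pagers.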
2. Notification & Initial
Response
Applying this to operations
NOTIFICATION & INITIAL RESPONSE
● NOC/SiteOps team notifies incident
commander + manager
○ Prod-SRE gets engaged
● Prod-SRE Manager/Oncall
○ Assess, Engage, Notify, Mitigate
Applying this to operations
NOTIFICATION & INITIAL RESPONSE
● Once verified, we launch a full response for a Major
Incident
● Incident commander gives “party status” to
observers
● Manager informs executives & PR
○ Periodic updates
● Mitigate
3. On-Scene Activities
Applying this to operations
ON-SCENE ACTIVITIES
● Private + public Slack work-channels
● IC is empowered to make decisions
● Organizational call to ensure:
○ Problem is understood
○ Areas of investigation are assigned
Applying this to operations
ON-SCENE ACTIVITIES
● War room
○ Incident commander drives the war room
○ Roles & responsibilities assigned to each
“party”
○ Communication at regular cadence to
execs
○ Admin ensures supplies and food
● Gathering data and updating timeline doc
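As one way to picture the timeline doc mentioned above, here is a minimal sketch of an append-only incident timeline. The `Timeline` class and its `record`/`render` methods are hypothetical names used for illustration; in practice this is often just a shared document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Timeline:
    """Hypothetical in-memory stand-in for the incident timeline doc."""
    entries: list = field(default_factory=list)

    def record(self, who, what, when=None):
        # Default to the current UTC time so entries from different people line up.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, who, what))

    def render(self):
        # Sort by timestamp only; notes are often added out of order.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{when:%Y-%m-%d %H:%M}Z  {who}: {what}" for when, who, what in ordered
        )


timeline = Timeline()
timeline.record("sre-oncall", "Rolled back the 09:00 deploy",
                datetime(2024, 1, 10, 9, 20, tzinfo=timezone.utc))
timeline.record("ic", "Declared major incident",
                datetime(2024, 1, 10, 9, 5, tzinfo=timezone.utc))
print(timeline.render())
```

Keeping who, what and when together means the timeline can be dropped straight into the factual section of the postmortem.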
4. Post-On-Scene Activities
Applying this to operations
POST ON-SCENE ACTIVITIES
● Postmortem
○ Dedicated team
○ PM Template
○ Blameless
● “Postmortem rollup”
○ Action items are prioritized
○ Weekly reporting on status of action items
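A "postmortem rollup" could be sketched roughly as follows. This is an illustration only: the `ActionItem` fields, the status values and the `weekly_rollup` helper are assumptions, not a description of any real tracking system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionItem:
    """Hypothetical action item coming out of a postmortem."""
    postmortem: str
    title: str
    owner: str
    priority: int  # 1 = highest
    status: str    # "open", "in_progress" or "done"


def weekly_rollup(items):
    """Return (status counts, unfinished items sorted by priority)."""
    counts = Counter(item.status for item in items)
    unfinished = sorted(
        (item for item in items if item.status != "done"),
        key=lambda item: (item.priority, item.postmortem),
    )
    return counts, unfinished


items = [
    ActionItem("PM-1", "Add alert on queue depth", "alice", 2, "open"),
    ActionItem("PM-1", "Fix failover script", "bob", 1, "in_progress"),
    ActionItem("PM-2", "Update runbook", "carol", 3, "done"),
]
counts, unfinished = weekly_rollup(items)
print(dict(counts))
for item in unfinished:
    print(f"P{item.priority} [{item.status}] {item.title} ({item.owner}, {item.postmortem})")
```

Listing each unfinished item with its owner is what makes the weekly report useful for accountability.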
Recommendations:
Most Wanted List
Applying this to operations
MOST WANTED LIST
● Use the post-incident process to improve
and hold people accountable for action
items
● Keep track of recurring issues/ repeaters
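Tracking repeaters can be as simple as grouping incidents by their identified cause. The sketch below assumes each incident already carries a cause label; the `find_repeaters` helper and the sample incident IDs are made up for illustration.

```python
from collections import defaultdict


def find_repeaters(incidents, min_count=2):
    """Group (incident_id, cause) pairs by cause and keep causes seen min_count+ times."""
    by_cause = defaultdict(list)
    for incident_id, cause in incidents:
        by_cause[cause].append(incident_id)
    return {cause: ids for cause, ids in by_cause.items() if len(ids) >= min_count}


incidents = [
    ("INC-1", "expired certificate"),
    ("INC-2", "bad config push"),
    ("INC-3", "expired certificate"),
]
print(find_repeaters(incidents))
```

Causes that keep showing up are natural candidates for an internal "most wanted list".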
Final Thoughts
Final Thoughts
● NTSB Investigative Process – Complete Incident + Postmortem process
● Invest – The more you put in, the more you’ll get out
● Accountability – Accountability for improvements/action items
Questions?
AllDayDevops: What the NTSB teaches us about incident management & postmortems

  • 8. Background on the NTSB JURISDICTION ● Aviation ● Surface Transportation ● Marine ● Pipeline ● Assistance to other agencies/ governments
  • 9. “The NTSB shall investigate or have investigated and establish the facts, circumstances, and cause or probable cause of accidents…” U.S. Code § 1131
  • 10. “… The Board shall report on the facts and circumstances of each accident investigated…The Board shall make each report available to the public at reasonable cost…” U.S. Code § 1131
  • 11. “The NTSB does not assign fault or blame for an accident or incident…accident/incident investigations are fact-finding proceedings with no formal issues and no adverse parties … and are not conducted for the purpose of determining the rights or liabilities of any person.” U.S. Code § 1154
  • 12. Similar Organizations ● Italy – Agenzia nazionale per la Sicurezza del Volo (ANSV) ● Canada – Transportation Safety Board of Canada (TSB) ● Indonesia – Komite Nasional Keselamatan Transportasi (NTSC) ● Netherlands – Dutch Safety Board (DSB) ● Australia – Australian Transport Safety Bureau (ATSB) ● United Kingdom – Air Accidents Investigation Branch (AAIB) ● Germany – Bundesstelle für Flugunfalluntersuchung (BFU) ● France – Bureau d’Enquêtes et d’Analyses pour la Sécurité de l’Aviation Civile (BEA)
  • 14. NTSB Investigation Process 1. Pre-Investigation Preparation 2. Notification & Initial Response 3. On-Scene Activities 4. Post-On-Scene Activities
  • 16. Pre-Investigation Preparation GO TEAM ● Go team: On-call investigators ready for assignments ● Investigator-In-Charge (IIC) pre-assigned ● Full Go team may contain several subject matter experts; e.g. ○ Human performance ○ Aircraft performance ○ Air Traffic Control
  • 17. Pre-Investigation Preparation GO TEAM ROSTER ● On-call roster made available internally ○ Phone & pager numbers ● Updated weekly ● All personnel should be able to arrive at an airport within 2 hours of notification ○ Should have essentials on them if they live far away from an airport ● Division Chiefs responsible for testing pagers
  • 18. 2. Notification & Initial Response
  • 19. Notification & Initial Response REGIONAL RESPONSE 1. Regional office notifies headquarters of incident 2. Closest regional office to accident will provide at least one investigator to perform PR & “stakedown”
  • 20. Notification & Initial Response HEADQUARTERS RESPONSE 1. After incident occurs: communication center advises IIC and chief of Major Investigations (who subsequently inform their superiors) 2. OAS director decides whether to launch a Go-Team 3. Other executives are made aware by Chief of Major Investigations
  • 21. Notification & Initial Response NOTIFICATION & ASSIGNMENTS ● Go-Team composition determined by incident circumstances ● Send more specialists if in doubt
  • 22. Notification & Initial Response PARTY NOTIFICATION ● IIC gives party status to organizations that can provide technical assistance (airlines, aircraft manufacturers etc.) ● Communication center will help with travel arrangements and on-site administrative support ● Go-Team will travel together to accident site
  • 24. On-Scene Activities COMMAND ROOMS ● Have meeting rooms to accommodate at least 30 people ● Have space for media ● Ensure you have equipment in command room ○ PCs ○ Telephone systems ○ Forms ● IIC is responsible for managing this
  • 25. On-Scene Activities COMMAND ROOMS ● For Major investigations, Administrative support is provided ● Government purchase card is available for goods or services
  • 26. On-Scene Activities ORGANIZATIONAL MEETING ● Share preliminary information ● Organize (assign) participants ● Organize observers ● Establish lines of authority
  • 27. “The manner in which the IIC conducts the organizational meeting will establish the tone of the investigation. Therefore, the importance of being organized, articulate, assertive, composed, and understanding cannot be overstated” Major Investigations Manual Sec 3.2
  • 28. On-Scene Activities ACCIDENT SITE SAFETY PRECAUTIONS ● Safety officer identifies & classifies risks and then develops counter-measures ● Safety officer performs daily briefings to accident site team.
  • 29. On-Scene Activities OBSERVERS ● Observers may be allowed if they do not have self-interest ● May include: ○ Congressional oversight committee(s) ○ Military personnel ○ Foreign Governments ○ Federal Agencies
  • 30. On-Scene Activities LINE OF AUTHORITY ● IIC is the most senior person on-scene and all investigative activity is under his/ her control ● If IIC cannot resolve an issue, IIC may talk to Chief of Major Investigations ● Ability to escalate further if required
  • 31. On-Scene Activities PROGRESS MEETINGS ● On-site progress meetings are held daily to: ○ Disseminate information obtained ○ Plan the day’s activities ○ Discuss plans for subsequent investigative activities ● Generally start at 6pm ● Plan next day’s meeting
  • 32. On-Scene Activities DAILY ACTIVITIES OF IIC ● Headquarters briefing ● Safety board staff meeting ● Party coordinator meeting ● Site visit
  • 34. NTSB Report Structure ● Factual Information: Gather facts about the incident ● Analysis: Analyze how the facts contributed to the incident ● Conclusions: Draw conclusions about what happened ● Recommendations: Write detailed recommendations ● Appendices: Extra information
  • 35. Post-On-Scene Activities WORK PLANNING ● Discuss activities that will follow the on-scene phase of investigation ● Build timelines for work ● Provides avenues for various teams to work together
  • 36. Post-On-Scene Activities FACTS & ANALYSIS REPORT ● A factual report based on the field notes and subsequent investigation activities ● Each group chairman shall submit an analysis report based on the information contained in his or her factual report.
  • 37. Post-On-Scene Activities PUBLIC HEARING ● Led by IIC/ Hearing Officer ● Identify witnesses whose testimony is appropriate ● The witnesses may be from the parties to the investigation or can be suggested by one or more of the parties. ● Purpose: To ensure all relevant information is gathered before writing the report
  • 38. Post-On-Scene Activities TECHNICAL REVIEW ● Provides an additional opportunity for all parties to review all factual information ● Ensures all issues are resolved ● Technical Review is held as soon as possible after public hearing
  • 39. Post-On-Scene Activities PREPARATION OF FINAL REPORT ● Dedicated department to help write report ● Follows a standard template ○ ICAO Annex 13 (Annex 13 to the Convention on International Civil Aviation) ● Contains formal recommendations to manufacturers/ transportation authorities
  • 41. Recommendations & Most Wanted List ● NTSB advocates for particular action items based on report(s): ○ Generally directed towards Transport bodies/ manufacturers ● NTSB publicly tracks response of the responsible body https://www.ntsb.gov/safety/mwl/Pages/default.aspx
  • 42. How this relates to all of us?
  • 44. Applying this to operations PRE-INCIDENT PREPARATION ● Have an Incident commander pre-assigned ● Publish on-call schedules ○ Manager is responsible ● Test on-call pagers regularly ● Ensure that you can respond within SLA ● Printed copy of Oncall contact info ● DR http://i.imgur.com/wvg8IDq.gif
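One way to back the "publish on-call schedules" and "respond within SLA" items is to check the published schedule for coverage gaps automatically. The sketch below is illustrative only: the `Shift` record, the `find_coverage_gaps` helper, and the sample names and dates are assumptions made for this example, not something taken from the talk or from any specific on-call tool.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Shift(NamedTuple):
    """One on-call shift (hypothetical record layout)."""
    engineer: str
    start: datetime
    end: datetime


def find_coverage_gaps(shifts, window_start, window_end):
    """Return (gap_start, gap_end) pairs in the window where nobody is on call."""
    gaps = []
    cursor = window_start  # everything before the cursor is known to be covered
    for shift in sorted(shifts, key=lambda s: s.start):
        if shift.end <= cursor:
            continue  # shift ends before the point we have already covered
        if shift.start >= window_end:
            break  # shift starts after the window we care about
        if shift.start > cursor:
            gaps.append((cursor, shift.start))
        cursor = max(cursor, shift.end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Sample data (made up): one uncovered day between two shifts.
week_start = datetime(2024, 1, 1)
week_end = week_start + timedelta(days=7)
shifts = [
    Shift("alice", week_start, week_start + timedelta(days=3)),
    Shift("bob", week_start + timedelta(days=4), week_end),
]
gaps = find_coverage_gaps(shifts, week_start, week_end)
for gap_start, gap_end in gaps:
    print(f"No on-call coverage from {gap_start} to {gap_end}")
```

Running a check like this whenever the schedule is published gives the responsible manager a concrete list of gaps to fill before an incident exposes them.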
  • 45. 2. Notification & Initial Response
  • 46. Applying this to operations NOTIFICATION & INITIAL RESPONSE ● NOC/ SiteOps team notifies incident commander + manager ○ Prod-SRE gets engaged ● Prod-SRE Manager/Oncall ○ Assess, Engage, Notify, Mitigate https://docs.microsoft.com/en-us/windows/uwp/design/shell/tiles-and-notifications/images/toast-mirroring.gif
  • 47. Applying this to operations NOTIFICATION & INITIAL RESPONSE ● Once verified, we launch full response for Major Incident ● Incident commander gives “party status” to observers ● Manager informs executives & PR ○ Periodic updates ● Mitigate http://www.roadrunneremaillogin.com/wp-content/uploads/2018/06/RoadRunner-Email.jpg
  • 49. Applying this to operations ON-SCENE ACTIVITIES ● Private + Public Slack work-channels ● IC is empowered to make decisions ● Organizational call to ensure: ○ Problem is understood ○ Areas of investigation are assigned http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  • 50. Applying this to operations ON-SCENE ACTIVITIES ● War room ○ Incident commander drives the war room ○ Roles & responsibilities assigned to each “party” ○ Communication at regular cadence to execs ○ Admin ensures supplies and food ● Gathering data and updating timeline doc http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  • 52. Applying this to operations POST ON-SCENE ACTIVITIES ● Post mortem ○ Dedicated team ○ PM Template ○ Blameless ● “Postmortem rollup” ○ Action items are prioritized ○ Weekly reporting on status of action items https://www.economist.com/sites/default/files/imagecache/1280-width/20180414_OFP021.gif
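The "postmortem rollup" could be produced by a small script over the action-item list. This is a minimal sketch under stated assumptions: the `ActionItem` fields, the `weekly_rollup` function, and the sample items are hypothetical and only illustrate prioritizing open items and reporting their status weekly.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A postmortem action item (hypothetical fields)."""
    postmortem_id: str
    title: str
    owner: str
    priority: int  # 1 = highest priority
    due: date
    done: bool = False


def weekly_rollup(items, today):
    """Summarize action-item status for a weekly report."""
    open_items = [item for item in items if not item.done]
    # Highest priority first; earlier due date breaks ties.
    open_items.sort(key=lambda item: (item.priority, item.due))
    return {
        "open": len(open_items),
        "closed": len(items) - len(open_items),
        "overdue": [item.title for item in open_items if item.due < today],
        "next_up": [item.title for item in open_items[:3]],
    }


# Sample data (made up).
items = [
    ActionItem("PM-101", "Add alert on queue depth", "dana", 2, date(2024, 3, 1)),
    ActionItem("PM-101", "Document failover runbook", "eli", 1, date(2024, 3, 20)),
    ActionItem("PM-102", "Remove stale DNS entry", "fay", 3, date(2024, 2, 15), done=True),
]
report = weekly_rollup(items, today=date(2024, 3, 10))
print(report)
```

Keeping overdue items in their own list makes the accountability part of the rollup explicit instead of burying it in the totals.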
  • 54. Applying this to operations MOST WANTED LIST ● Use the post-incident process to improve and hold people accountable for action items ● Keep track of recurring issues/ repeaters https://clip2art.com/images/meeting-clipart-animated-gif-2.gif
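Tracking "repeaters" can be as simple as counting how often the same cause shows up across postmortems. The sketch below assumes each incident record carries a list of cause tags; the `cause_tags` field, the `find_repeaters` helper, and the sample incidents are assumptions for illustration only.

```python
from collections import Counter


def find_repeaters(incidents, min_count=2):
    """Return (cause_tag, count) pairs seen in at least min_count incidents."""
    counts = Counter(
        tag
        for incident in incidents
        for tag in set(incident["cause_tags"])  # count each tag once per incident
    )
    # Most frequent first; alphabetical order breaks ties deterministically.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(tag, count) for tag, count in ranked if count >= min_count]


# Sample data (made up).
incidents = [
    {"id": "INC-1", "cause_tags": ["config-push", "dns"]},
    {"id": "INC-2", "cause_tags": ["config-push"]},
    {"id": "INC-3", "cause_tags": ["capacity"]},
    {"id": "INC-4", "cause_tags": ["dns", "config-push"]},
]
most_wanted = find_repeaters(incidents)
print(most_wanted)
```

The resulting ranked list plays the role of an internal "most wanted list": the causes at the top are the ones whose action items most need an owner and a deadline.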
  • 56. Final Thoughts ● NTSB Investigative Process: Complete Incident + Postmortem process ● Invest: The more you put in, the more you’ll get out ● Accountability: Accountability for improvements/ action items