
What the NTSB teaches us about incident management & postmortems


The National Transportation Safety Board (NTSB) is one of the most widely known government bodies in the world. Its role is to respond to an incident, secure the scene and understand everything that happened. Given the important and unpredictable nature of its work, it has an extensive manual that sets out how incidents should be attended and how the investigation should progress.

This session will detail how the NTSB’s approach to its work, and the procedure that drives it, is transferable to us as incident responders. We’ll talk about the NTSB’s pre-incident preparation, incident notification, attendance at the scene, collection of information from the field, writing up a report and holding hearings. Throughout, we’ll draw parallels to IT incident management and show how to create processes and procedures that mimic those of the NTSB.

Published in: Engineering


  1. 1. What the NTSB teaches us about incident management & postmortems ​Jeff Weiner ​Chief Executive Officer ​Michael Kehoe ​Staff Site Reliability Engineer ​Nina Mushiana ​Sr Site Reliability Manager
  2. 2. Agenda and Vision
  3. 3. Today’s agenda 1 Introductions 2 Background on the NTSB 3 NTSB: Investigative Process 4 Recommendations & Most Wanted List 5 How this applies to us? 6 Final thoughts
  4. 4. Michael Kehoe ​$ /USR/BIN/WHOAMI ● Staff Site Reliability Engineer @ LinkedIn ● Production-SRE Team ● Funny accent = Australian + 4 years American
  5. 5. Nina Mushiana ​$ /USR/BIN/WHOAMI ● Sr Site Reliability Engineer Manager @ LinkedIn ● Production-SRE Team & Site-Ops
  6. 6. Production-SRE Team @ LinkedIn ​$ /USR/BIN/WHOAMI ● Disaster Recovery - Planning & Automation ● Incident Response – Process & Automation ● Visibility Engineering – Making use of operational data ● Reliability Principles – Defining best practice & automating it
  7. 7. Incident Command System (ICS) https://training.fema.gov/emiweb/is/icsresource/assets/reviewmaterials.pdf
  8. 8. Background on the NTSB
  9. 9. Background on the NTSB ​JURISDICTION ● Aviation ● Surface Transportation ● Marine ● Pipeline ● Assistance to other agencies/ governments
  10. 10. “The NTSB shall investigate or have investigated and establish the facts, circumstances, and cause or probable cause of accidents…” U.S. Code § 1131
  11. 11. “… The Board shall report on the facts and circumstances of each accident investigated…The Board shall make each report available to the public at reasonable cost…” U.S. Code § 1131
  12. 12. “The NTSB does not assign fault or blame for an accident or incident…accident/incident investigations are fact-finding proceedings with no formal issues and no adverse parties … and are not conducted for the purpose of determining the rights or liabilities of any person.” U.S. Code § 1154
  13. 13. Similar Organizations ● Italy – Agenzia nazionale per la Sicurezza del Volo (ANSV) ● Canada – Transportation Safety Board of Canada (TSB) ● Indonesia – Komite Nasional Keselamatan Transportasi (NTSC) ● Netherlands – Dutch Safety Board (DSB) ● Australia – Australian Transport Safety Bureau (ATSB) ● United Kingdom – Air Accidents Investigation Branch (AAIB) ● Germany – Bundesstelle für Flugunfalluntersuchung ● France – Bureau d’Enquêtes et d’Analyses pour la Sécurité de l’Aviation Civile (BEA)
  14. 14. NTSB Investigation Process
  15. 15. NTSB Investigation Process 1. Pre-Investigation Preparation 2. Notification & Initial Response 3. On-Scene Activities 4. Post-On-Scene Activities
  16. 16. 1. Pre-Investigation Preparation
  17. 17. Pre-Investigation Preparation GO TEAM ● Go team: On-call investigators ready for assignments ● Investigator-In-Charge (IIC) pre-assigned ● Full Go team may contain several subject matter experts; e.g. ○ Human performance ○ Aircraft performance ○ Air Traffic Control
  18. 18. Pre-Investigation Preparation GO TEAM ROSTER ● On-call roster made available internally ○ Phone & Pager numbers ● Updated weekly ● All personnel should be able to arrive at an airport within 2 hours of notification ○ Should have essentials on them if they live far away from an airport ● Division Chiefs responsible for testing pagers
  19. 19. 2. Notification & Initial Response
  20. 20. Notification & Initial Response ​REGIONAL RESPONSE 1. Regional office notifies headquarters of incident 2. Closest regional office to accident will provide at least one investigator to perform PR & “stakedown”
  21. 21. Notification & Initial Response ​HEADQUARTERS RESPONSE 1. After incident occurs: communication center advises IIC and chief of Major Investigations (who subsequently inform their superiors) 2. OAS director decides whether to launch a Go-Team 3. Other executives are made aware by Chief of Major Investigations
  22. 22. Notification & Initial Response ​NOTIFICATION & ASSIGNMENTS ● Go-Team composition determined by incident circumstances ● Send more specialists if in doubt
  23. 23. Notification & Initial Response ​PARTY NOTIFICATION ● IIC gives party status to organizations that can provide technical assistance (airlines, aircraft manufacturers etc.) ● Communication center will help with travel arrangements and on-site administrative support ● Go-Team will travel together to accident site
  24. 24. 3. On-Scene Activities
  25. 25. On-Scene Activities ​COMMAND ROOMS ● Have meeting rooms to accommodate at least 30 people ● Have space for media ● Ensure you have equipment in command room ○ PCs ○ Telephone systems ○ Forms ● IIC is responsible for managing this
  26. 26. On-Scene Activities ​COMMAND ROOMS ● For Major investigations, Administrative support is provided ● Government purchase card is available for goods or services
  27. 27. On-Scene Activities ​ORGANIZATIONAL MEETING ● Share preliminary information ● Organize (assign) participants ● Organize observers ● Establish lines of authority
  28. 28. “The manner in which the IIC conducts the organizational meeting will establish the tone of the investigation. Therefore, the importance of being organized, articulate, assertive, composed, and understanding cannot be overstated” Major Investigations Manual Sec 3.2
  29. 29. On-Scene Activities ​ACCIDENT SITE SAFETY PRECAUTIONS ● Safety officer identifies & classifies risks and then develops counter-measures ● Safety officer performs daily briefings to accident site team.
  30. 30. On-Scene Activities ​OBSERVERS ● Observers may be allowed if they do not have self-interest ● May include: ○ Congressional oversight committee(s) ○ Military personnel ○ Foreign Governments ○ Federal Agencies
  31. 31. On-Scene Activities ​LINE OF AUTHORITY ● IIC is the most senior person on-scene and all investigative activity is under his/ her control ● If IIC cannot resolve an issue, IIC may talk to Chief of Major Investigations ● Ability to escalate further if required
  32. 32. On-Scene Activities ​PROGRESS MEETINGS ● On-site progress meetings are held daily to: ○ Disseminate information obtained ○ Plan the day’s activities ○ Discuss plans for subsequent investigative activities ● Generally start at 6pm ● Plan next day’s meeting
  33. 33. On-Scene Activities ​DAILY ACTIVITIES OF IIC ● Headquarters briefing ● Safety board staff meeting ● Party coordinator meeting ● Site visit
  34. 34. 4. Post-On-Scene Activities
  35. 35. NTSB Report Structure ● Factual Information: Gather facts about the incident ● Analysis: Analyze how the facts contributed to the incident ● Conclusions: Draw conclusions about what happened ● Recommendations: Write detailed recommendations ● Appendices: Extra information
  36. 36. Post-On-Scene Activities ​WORK PLANNING ● Discuss activities that will follow the on-scene phase of investigation ● Build timelines for work ● Provides avenues for various teams to work together
  37. 37. Post-On-Scene Activities ​FACTS & ANALYSIS REPORT ● A factual report based on the field notes and subsequent investigation activities ● Each group chairman shall submit an analysis report based on the information contained in his or her factual report.
  38. 38. Post-On-Scene Activities ​PUBLIC HEARING ● Led by IIC/ Hearing Officer ● Identify witnesses whose testimony is appropriate ● The witnesses may be from the parties to the investigation or can be suggested by one or more of the parties. ● Purpose: To ensure all relevant information is gathered before writing the report
  39. 39. Post-On-Scene Activities ​TECHNICAL REVIEW ● Provides an additional opportunity for all parties to review all factual information ● Ensures all issues are resolved ● Technical Review is held as soon as possible after public hearing
  40. 40. Post-On-Scene Activities PREPARATION OF FINAL REPORT ● Dedicated department to help write report ● Follows a standard template ○ Annex 13 to the Convention on International Civil Aviation (ICAO) ● Contains formal recommendations to manufacturers/ transportation authorities
  41. 41. Recommendations & Most Wanted List
  42. 42. Recommendations & Most Wanted List ● NTSB advocates for particular action items based on report(s): ○ Generally directed towards Transport bodies/ manufacturers ● NTSB publicly tracks response of the responsible body https://www.ntsb.gov/safety/mwl/Pages/default.aspx
  43. 43. How this relates to all of us?
  44. 44. 1. Pre-Investigation Preparation
  45. 45. Applying this to operations PRE-INCIDENT PREPARATION ● Have an incident commander pre-assigned ● Publish on-call schedules ○ Manager is responsible ● Test on-call pagers regularly ● Ensure that you can respond within SLA ● Keep a printed copy of on-call contact info ● DR http://i.imgur.com/wvg8IDq.gif
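The preparation checklist on this slide (published schedule, tested pagers, response within SLA) could be checked automatically. The following is only a minimal sketch under assumptions: the `OnCallEntry` fields, the `readiness_problems` function, the 30-minute SLA and the 7-day limits are hypothetical illustrations and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class OnCallEntry:
    """One person on the published on-call schedule (hypothetical fields)."""
    name: str
    phone: str
    last_pager_test: datetime  # when this person's pager was last tested
    response_minutes: int      # how quickly they can start responding


def readiness_problems(
    roster: List[OnCallEntry],
    published_at: datetime,
    now: datetime,
    sla_minutes: int = 30,                       # assumed SLA, not from the talk
    max_roster_age: timedelta = timedelta(days=7),
    max_test_age: timedelta = timedelta(days=7),
) -> List[str]:
    """Return a list of human-readable readiness problems (empty if none)."""
    problems = []
    # The schedule should be republished regularly.
    if now - published_at > max_roster_age:
        problems.append("roster is stale")
    for entry in roster:
        # Pagers should be tested regularly.
        if now - entry.last_pager_test > max_test_age:
            problems.append(f"{entry.name}: pager test overdue")
        # Everyone on call should be able to respond within the SLA.
        if entry.response_minutes > sla_minutes:
            problems.append(f"{entry.name}: cannot respond within SLA")
    return problems
```

A manager responsible for the schedule could run a check like this on a timer and page themselves when the returned list is not empty.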
  46. 46. 2. Notification & Initial Response
  47. 47. Applying this to operations NOTIFICATION & INITIAL RESPONSE ● NOC/SiteOps team notifies incident commander + manager ○ Prod-SRE gets engaged ● Prod-SRE Manager/On-call ○ Assess, Engage, Notify, Mitigate https://docs.microsoft.com/en-us/windows/uwp/design/shell/tiles-and-notifications/images/toast-mirroring.gif
  48. 48. Applying this to operations ​NOTIFICATION & INITIAL RESPONSE ● Once verified, we launch full response for Major Incident ● Incident commander gives “party status” to observers ● Manager informs executives & PR ○ Periodic updates ● Mitigate http://www.roadrunneremaillogin.com/wp-content/uploads/2018/06/RoadRunner-Email.jpg
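The notification flow from the two slides above can be written down as a small routing rule. This is a sketch under assumptions: the role names are hypothetical labels for the people mentioned on the slides, and the rule that executives and PR are only informed once a major incident is verified is one reading of the slides, not a documented policy.

```python
from typing import List


def notification_targets(verified: bool, major: bool) -> List[str]:
    """Return who should be notified at the current stage of an incident."""
    # Initial response: NOC/SiteOps notifies the incident commander and
    # manager, and Prod-SRE gets engaged.
    targets = ["incident_commander", "manager", "prod_sre"]
    # Full response: only once the incident is verified as major does the
    # manager inform executives and PR.
    if verified and major:
        targets += ["executives", "pr"]
    return targets
```

Keeping the rule in one function makes it easy to review and change when the escalation policy changes.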
  49. 49. 3. On-Scene Activities
  50. 50. Applying this to operations ON-SCENE ACTIVITIES ● Private + Public Slack work-channels ● IC is empowered to make decisions ● Organizational call to ensure: ○ Problem is understood ○ Areas of investigation are assigned http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  51. 51. Applying this to operations ​ON-SCENE ACTIVITIES ● War room ○ Incident commander drives the war-room ○ Roles & responsibilities assigned to each “party” ○ Communication at regular cadence to execs ○ Admin ensures supplies and food ● Gathering data and updating timeline doc http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  52. 52. 4. Post-On-Scene Activities
  53. 53. Applying this to operations ​POST ON-SCENE ACTIVITIES ● Post mortem ○ Dedicated team ○ PM Template ○ Blameless ● “Postmortem rollup” ○ Action items are prioritized ○ Weekly reporting on status of action-items https://www.economist.com/sites/default/files/imagecache/1280-width/20180414_OFP021.gif
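The "postmortem rollup" on this slide (prioritized action items, weekly status reporting) might be produced by something like the sketch below. The `ActionItem` fields, the priority scale and the status names are assumptions made for illustration; the talk does not describe a data model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:
    """A postmortem action item (hypothetical fields)."""
    incident_id: str
    title: str
    owner: str
    priority: int  # 1 = highest priority (assumed scale)
    status: str    # "open", "in_progress" or "done" (assumed values)


def weekly_rollup(items: List[ActionItem]) -> dict:
    """Summarize action items for a weekly status report."""
    # Count how many items are in each status.
    status_counts = Counter(item.status for item in items)
    # List unfinished items, highest priority first.
    outstanding = sorted(
        (item for item in items if item.status != "done"),
        key=lambda item: (item.priority, item.incident_id),
    )
    return {"status_counts": dict(status_counts), "outstanding": outstanding}
```

The `outstanding` list gives the weekly report its order: the highest-priority unfinished items, with their owners, come first.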
  54. 54. Recommendations: Most Wanted List
  55. 55. Applying this to operations ​MOST WANTED LIST ● Use the post-incident process to improve and hold people accountable for action items ● Keep track of recurring issues/ repeaters https://clip2art.com/images/meeting-clipart-animated-gif-2.gif
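Tracking "repeaters" can be as simple as counting how often the same cause shows up across postmortems. The sketch below assumes each postmortem is a dictionary with a `causes` list of free-form tags; that layout and the `find_repeaters` name are hypothetical, not something the talk specifies.

```python
from collections import Counter
from typing import Dict, List


def find_repeaters(postmortems: List[Dict], min_count: int = 2) -> Dict[str, int]:
    """Return causes that appear in at least `min_count` postmortems."""
    counts = Counter()
    for pm in postmortems:
        # Count each cause once per postmortem, even if it is listed twice.
        counts.update(set(pm.get("causes", [])))
    return {cause: n for cause, n in counts.items() if n >= min_count}
```

Causes that cross the threshold are natural candidates for an internal "most wanted list".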
  56. 56. Final Thoughts
  57. 57. Final Thoughts ● NTSB Investigative Process: Complete Incident + Postmortem process ● Invest: The more you put in, the more you’ll get out ● Accountability: Accountability for improvements/ action items
  58. 58. Questions?
