AllDayDevops: What the NTSB teaches us about incident management & postmortems

The National Transportation Safety Board (NTSB) is one of the most widely known government bodies in the world. Its role is to run into an incident, secure the scene and understand everything that happened. Given the important and unpredictable nature of this work, it has an extensive manual that sets out how incidents should be attended and how the investigation should progress.

This session will detail how the NTSB’s approach to its work, and the procedure that drives it, is transferable to us as incident responders. We’ll talk about the NTSB’s pre-incident preparation, incident notification, attending the scene, collecting information from the field, writing up a report and holding hearings. We’ll consistently draw parallels to IT incident management and show how to create processes and procedures that mimic those of the NTSB.

Published in: Engineering

  1. 1. What the NTSB teaches us about incident management & postmortems Michael Kehoe Staff Site Reliability Engineer
  2. 2. Agenda and Vision
  3. 3. Today’s agenda 1 Introductions 2 Background on the NTSB 3 NTSB: Investigative Process 4 Recommendations & Most Wanted List 5 How this applies to us? 6 Final thoughts
  4. 4. Michael Kehoe $ WHOAMI • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team; • Disaster Recovery • Incident Response • Visibility Engineering • Reliability Principles • Find me online at: • @matrixtek • https://michael-kehoe.io • linkedin.com/in/michaelkkehoe
  5. 5. Production-SRE Team @ LinkedIn $ /USR/BIN/WHOAMI ● Disaster Recovery - Planning & Automation ● Incident Response – Process & Automation ● Visibility Engineering – Making use of operational data ● Reliability Principles – Defining best practice & automating it
  6. 6. Incident Command System (ICS) https://training.fema.gov/emiweb/is/icsresource/assets/reviewmaterials.pdf
  7. 7. Background on the NTSB
  8. 8. Background on the NTSB JURISDICTION ● Aviation ● Surface Transportation ● Marine ● Pipeline ● Assistance to other agencies/ governments
  9. 9. “The NTSB shall investigate or have investigated and establish the facts, circumstances, and cause or probable cause of accidents…” U.S. Code § 1131
  10. 10. “… The Board shall report on the facts and circumstances of each accident investigated…The Board shall make each report available to the public at reasonable cost…” U.S. Code § 1131
  11. 11. “The NTSB does not assign fault or blame for an accident or incident…accident/incident investigations are fact-finding proceedings with no formal issues and no adverse parties … and are not conducted for the purpose of determining the rights or liabilities of any person.” U.S. Code § 1154
  12. 12. Similar Organizations ● Italy – Agenzia nazionale per la Sicurezza del Volo (ANSV) ● Canada – Transportation Safety Board of Canada (TSB) ● Indonesia – Komite Nasional Keselamatan Transportasi (NTSC) ● Netherlands – Dutch Safety Board (DSB) ● Australia – Australian Transport Safety Bureau (ATSB) ● United Kingdom – Air Accidents Investigation Branch (AAIB) ● Germany – Bundesstelle für Flugunfalluntersuchung (BFU) ● France – Bureau d’Enquêtes et d’Analyses pour la Sécurité de l’Aviation Civile (BEA)
  13. 13. NTSB Investigation Process
  14. 14. NTSB Investigation Process 1. Pre-Investigation Preparation 2. Notification & Initial Response 3. On-Scene Activities 4. Post-On-Scene Activities
  15. 15. 1. Pre-Investigation Preparation
  16. 16. Pre-Investigation Preparation GO TEAM ● Go team: On call investigators ready for assignments ● Investigator-In-Charge (IIC) pre-assigned ● Full Go team may contain several subject matter experts; e.g. ○ Human performance ○ Aircraft performance ○ Air Traffic Control
  17. 17. Pre-Investigation Preparation GO TEAM ROSTER ● On-call roster made available internally ○ Phone & Pager numbers ● Updated weekly ● All personnel should be able to arrive at an airport within 2 hours of notification ○ Should have essentials on them if they live far away from an airport ● Division Chiefs responsible for testing pagers
  18. 18. 2. Notification & Initial Response
  19. 19. Notification & Initial Response REGIONAL RESPONSE 1. Regional office notifies headquarters of incident 2. Closest regional office to accident will provide at least one investigator to perform PR & “stakedown”
  20. 20. Notification & Initial Response HEADQUARTERS RESPONSE 1. After incident occurs: communication center advises IIC and chief of Major Investigations (who subsequently inform their superiors) 2. OAS director decides whether to launch a Go-Team 3. Other executives are made aware by Chief of Major Investigations
  21. 21. Notification & Initial Response NOTIFICATION & ASSIGNMENTS ● Go-Team composition determined by incident circumstances ● Send more specialists if in doubt
  22. 22. Notification & Initial Response PARTY NOTIFICATION ● IIC gives party status to organizations that can provide technical assistance (airlines, aircraft manufacturers etc.) ● Communication center will help with travel arrangements and on-site administrative support ● Go-Team will travel together to accident site
  23. 23. 3. On-Scene Activities
  24. 24. On-Scene Activities COMMAND ROOMS ● Have meeting rooms to accommodate at least 30 people ● Have space for media ● Ensure you have equipment in command room ○ PCs ○ Telephone systems ○ Forms ● IIC is responsible for managing this
  25. 25. On-Scene Activities COMMAND ROOMS ● For Major investigations, Administrative support is provided ● Government purchase card is available for goods or services
  26. 26. On-Scene Activities ORGANIZATIONAL MEETING ● Share preliminary information ● Organize (assign) participants ● Organize observers ● Establish lines of authority
  27. 27. “The manner in which the IIC conducts the organizational meeting will establish the tone of the investigation. Therefore, the importance of being organized, articulate, assertive, composed, and understanding cannot be overstated” Major Investigations Manual Sec 3.2
  28. 28. On-Scene Activities ACCIDENT SITE SAFETY PRECAUTIONS ● Safety officer identifies & classifies risks and then develops counter-measures ● Safety officer performs daily briefings to accident site team.
  29. 29. On-Scene Activities OBSERVERS ● Observers may be allowed if they do not have self-interest ● May include: ○ Congressional oversight committee(s) ○ Military personnel ○ Foreign Governments ○ Federal Agencies
  30. 30. On-Scene Activities LINE OF AUTHORITY ● IIC is the most senior person on-scene and all investigative activity is under his/ her control ● If IIC cannot resolve an issue, IIC may talk to Chief of Major Investigations ● Ability to escalate further if required
  31. 31. On-Scene Activities PROGRESS MEETINGS ● On-site progress meetings are held daily to: ○ Disseminate information obtained ○ Plan the day’s activities ○ Discuss plans for subsequent investigative activities ● Generally start at 6pm ● Plan next day’s meeting
  32. 32. On-Scene Activities DAILY ACTIVITIES OF IIC ● Headquarters briefing ● Safety board staff meeting ● Party coordinator meeting ● Site visit
  33. 33. 4. Post-On-Scene Activities
  34. 34. NTSB Report Structure ● Factual Information: Gathering facts about the incident ● Appendices: Extra information ● Analysis: Analyze how the facts contributed to the incident ● Conclusions: Draw conclusions about what happened ● Recommendations: Write detailed recommendations
  35. 35. Post-On-Scene Activities WORK PLANNING ● Discuss activities that will follow the on-scene phase of investigation ● Build timelines for work ● Provides avenues for various teams to work together
  36. 36. Post-On-Scene Activities FACTS & ANALYSIS REPORT ● A factual report based on the field notes and subsequent investigation activities ● Each group chairman shall submit an analysis report based on the information contained in his or her factual report.
  37. 37. Post-On-Scene Activities PUBLIC HEARING ● Led by IIC/ Hearing Officer ● Identify witnesses whose testimony is appropriate ● The witnesses may be from the parties to the investigation or can be suggested by one or more of the parties. ● Purpose: To ensure all relevant information is gathered before writing the report
  38. 38. Post-On-Scene Activities TECHNICAL REVIEW ● Provides an additional opportunity for all parties to review all factual information ● Ensures all issues are resolved ● Technical Review is held as soon as possible after public hearing
  39. 39. Post-On-Scene Activities PREPARATION OF FINAL REPORT ● Dedicated department to help write report ● Follows a standard template ○ Annex 13 to the International Civil Aviation Organization (ICAO) ● Contains formal recommendations to manufacturers/ transportation authorities
  40. 40. Recommendations & Most Wanted List
  41. 41. Recommendations & Most Wanted List ● NTSB advocates for particular action items based on report(s): ○ Generally directed towards Transport bodies/ manufacturers ● NTSB publicly tracks response of the responsible body https://www.ntsb.gov/safety/mwl/Pages/default.aspx
  42. 42. How this relates to all of us?
  43. 43. 1. Pre-Investigation Preparation
  44. 44. Applying this to operations PRE-INCIDENT PREPARATION ● Have an Incident commander pre-assigned ● Publish on-call schedules ○ Manager is responsible ● Test on-call pagers regularly ● Ensure that you can respond within SLA ● Printed copy of Oncall contact info ● DR http://i.imgur.com/wvg8IDq.gif
  45. 45. 2. Notification & Initial Response
  46. 46. Applying this to operations NOTIFICATION & INITIAL RESPONSE ● NOC/ SiteOps team notifies incident commander + manager ○ Prod-SRE gets engaged ● Prod-SRE Manager/Oncall ○ Access, Engage, Notify, Mitigate https://docs.microsoft.com/en-us/windows/uwp/design/shell/tiles-and-notifications/images/toast-mirroring.gif
  47. 47. Applying this to operations NOTIFICATION & INITIAL RESPONSE ● Once verified, we launch full response for Major Incident ● Incident commander gives “party status” to observers ● Manager informs executives & PR ○ Periodic updates ● Mitigate http://www.roadrunneremaillogin.com/wp-content/uploads/2018/06/RoadRunner-Email.jpg
  48. 48. 3. On-Scene Activities
  49. 49. Applying this to operations ON-SCENE ACTIVITIES ● Private + Public slack work-channels ● IC is empowered to make decisions ● Organizational call to ensure: ○ Problem is understood ○ Areas of investigation are assigned http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  50. 50. Applying this to operations ON-SCENE ACTIVITIES ● War room ○ Incident commander drives the war room ○ Roles & responsibilities assigned to each “party” ○ Communication at regular cadence to execs ○ Admin ensures supplies and food ● Gathering data and updating timeline doc http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  51. 51. 4. Post-On-Scene Activities
  52. 52. Applying this to operations POST ON-SCENE ACTIVITIES ● Post mortem ○ Dedicated team ○ PM Template ○ Blameless ● “Postmortem rollup” ○ Action items are prioritized ○ Weekly reporting on status of action items https://www.economist.com/sites/default/files/imagecache/1280-width/20180414_OFP021.gif
  53. 53. Recommendations: Most Wanted List
  54. 54. Applying this to operations MOST WANTED LIST ● Use the post-incident process to improve and hold people accountable for action items ● Keep track of recurring issues/ repeaters https://clip2art.com/images/meeting-clipart-animated-gif-2.gif
  55. 55. Final Thoughts
  56. 56. Final Thoughts ● NTSB Investigative Process: Complete Incident + Postmortem process ● Invest: The more you put in, the more you’ll get out ● Accountability: Accountability for improvements/ action items
  57. 57. Questions?
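
Slides 17 and 44 describe keeping an on-call roster published, pre-assigning an incident commander and testing pagers regularly. The Python sketch below shows one way such a check might look. The `OnCallEntry` fields, the sample names and phone numbers, and the 7-day pager-test threshold are all assumptions made for illustration; the talk does not name any tooling or specific thresholds.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class OnCallEntry:
    """One row of a hypothetical on-call roster."""
    name: str
    role: str               # e.g. "incident commander"
    phone: str
    last_pager_test: date   # date of the last successful test page


def stale_pager_tests(roster, today, max_age=timedelta(days=7)):
    """Return names of responders whose last pager test is older than max_age."""
    return [e.name for e in roster if today - e.last_pager_test > max_age]


def has_incident_commander(roster):
    """Check that at least one incident commander is pre-assigned."""
    return any(e.role == "incident commander" for e in roster)


# Example data; every name, number and date here is made up.
ROSTER = [
    OnCallEntry("alice", "incident commander", "555-0100", date(2018, 10, 15)),
    OnCallEntry("bob", "sre", "555-0101", date(2018, 10, 1)),
]

if __name__ == "__main__":
    today = date(2018, 10, 17)
    print("Incident commander assigned:", has_incident_commander(ROSTER))
    print("Pager test overdue:", stale_pager_tests(ROSTER, today))
```

A check like this could run on a schedule and notify the responsible manager, mirroring the slide's point that division chiefs (or managers) own pager testing.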

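
Slides 52 and 54 describe a weekly "postmortem rollup" that reports on action-item status and keeps track of recurring issues. As a rough sketch of that idea, the Python below lists overdue action items and flags cause labels that appear in more than one incident. The `ActionItem` fields, the `root_cause_tag` label and the sample incidents are hypothetical and are not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """A hypothetical postmortem action item."""
    incident_id: str
    description: str
    owner: str
    due: date
    root_cause_tag: str               # assumed label used to spot repeaters
    closed_on: Optional[date] = None  # None while the item is still open


def overdue(items, today):
    """Open action items whose due date has passed."""
    return [i for i in items if i.closed_on is None and i.due < today]


def repeaters(items, min_incidents=2):
    """Cause tags seen in at least min_incidents distinct incidents."""
    pairs = {(i.root_cause_tag, i.incident_id) for i in items}
    per_tag = Counter(tag for tag, _ in pairs)
    return sorted(tag for tag, n in per_tag.items() if n >= min_incidents)


def weekly_rollup(items, today):
    """Summarize open, overdue and recurring items for a weekly report."""
    return {
        "open": sum(1 for i in items if i.closed_on is None),
        "overdue": [f"{i.incident_id}: {i.description} ({i.owner})"
                    for i in overdue(items, today)],
        "repeaters": repeaters(items),
    }


# Example data; incidents, owners and dates are made up.
ITEMS = [
    ActionItem("INC-1", "Automate certificate renewal", "alice",
               date(2018, 10, 1), "expired-cert"),
    ActionItem("INC-2", "Alert on certificate expiry", "bob",
               date(2018, 11, 1), "expired-cert"),
    ActionItem("INC-2", "Add capacity headroom", "carol",
               date(2018, 10, 5), "capacity", closed_on=date(2018, 10, 4)),
]

if __name__ == "__main__":
    print(weekly_rollup(ITEMS, date(2018, 10, 17)))
```

Publishing the rollup where everyone can see it is the operational analogue of the NTSB publicly tracking responses to its recommendations.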