AA261: DevOps lessons in collaborative maintenance

On January 31, 2000, Alaska Airlines Flight 261 plunged into the Pacific Ocean in an extreme "nose down" position, killing all 88 crew and passengers on board. The NTSB concluded that the jackscrew in AA261's horizontal stabiliser trim system was inadequately maintained, causing the pilots to lose all control of the plane.

The 30 years of give and take between the aircraft manufacturer's engineers, airline maintenance staff, and federal regulators that preceded AA261's simple mechanical failure have striking parallels with the problems we face daily in IT operations and software development.

In this talk, Lindsay looks at the complex interplay between the parties in the AA261 crash through a DevOps lens, investigating the collaborative approach to maintenance and operation of the MD-83 aircraft, and relating the complexities back to the complex IT systems we build and maintain.

Published in: Technology

AA261: DevOps lessons in collaborative maintenance

  1. 1. AA261: DevOps lessons in collaborative maintenance
  2. 2. Lindsay Holmwood @auxesis
  3. 3. Software Manager @Bulletproof Networks
  4. 4. Trigger warning: death
  5. 5. January 31, 2000, Puerto Vallarta
  6. 6. Seattle
  7. 7. Departed PVR at 13.37 PST
  8. 8. Ascended to 31,000 ft
  9. 9. 2 hours into flight: Jammed horizontal stabiliser
  10. 10. No trim control
  11. 11. Redirected to LAX
  12. 12. Pilots unjammed horizontal stabilisers
  13. 13. 2 pilots, 3 crew, 83 passengers
  14. 14. This is a maintenance accident. Alaska Airlines maintenance and inspection of its horizontal stabilizer activation system was poorly conceived and woefully executed. The failure was compounded by poor oversight... had any of the managers, mechanics, inspectors, supervisors or FAA overseers whose job it was to protect this mechanism done their job conscientiously, this accident cannot happen. -- John J. Goglia, NTSB Board Member
  15. 15. hindsight != foresight
  16. 16. [hindsight] converts a once vague, unlikely future into an immediate, certain past -- Sidney Dekker
  17. 17. This is a maintenance accident. Alaska Airlines maintenance and inspection of its horizontal stabilizer activation system was poorly conceived and woefully executed. The failure was compounded by poor oversight... had any of the managers, mechanics, inspectors, supervisors or FAA overseers whose job it was to protect this mechanism done their job conscientiously, this accident cannot happen. -- John J. Goglia, NTSB Board Member
  19. 19. “poorly conceived and woefully executed”
  20. 20. DC-9 -> MD-80 -> MD-83
  21. 21. Evolutionary product development
  22. 22. Appropriated maintenance schedules
  23. 23. Jackscrew lubrication interval
  24. 24. 1965 | every 300-350 hours | launch of DC-9
          1985 | every 700 hours | industry deregulation
          1987 | every 1000 hours | industry standardisation
          1991 | every 1200 hours | industry standardisation
          1994 | every 1600 hours | industry standardisation
          1996 | every 8 months (2550 hours) | Alaska Airlines policy change
  31. 31. Decrementalism
  32. 32. Complex system constraints -- Jens Rasmussen
  33. 33. workload
  34. 34. workload, economy
  35. 35. workload, economy, safety
  36. 36. time
  37. 37. time, cost
  38. 38. quality, cost, time
  39. 39. workload, economy, safety
  48. 48. outside: failure of foresight (workload, economy, safety)
  49. 49. outside: failure of foresight; inside: trade-offs in direction of greater efficiency (workload, economy, safety)
  50. 50. trade-offs in direction of greater efficiency
  52. 52. Constraints on knowledge
  53. 53. Why would they make bad decisions intentionally?
  54. 54. Decisions seemed rational
  55. 55. Local rationalisation
  56. 56. “people make what they consider to be the best decision based on available knowledge at the time”
  57. 57. This is a maintenance accident. Alaska Airlines maintenance and inspection of its horizontal stabilizer activation system was poorly conceived and woefully executed. The failure was compounded by poor oversight... had any of the managers, mechanics, inspectors, supervisors or FAA overseers whose job it was to protect this mechanism done their job conscientiously, this accident cannot happen. -- John J. Goglia, NTSB Board Member
  58. 58. workload, economy, safety
  59. 59. quality, cost, time
  60. 60. DevOps constraints
  61. 61. “God, our ops team are arseholes. I just want to deploy this change and go home!”
  62. 62. “God, our ops team are arseholes. I just want to deploy this change and go home!” (workload, economy, safety)
  63. 63. “God, our ops team are arseholes. I just want to deploy this change and go home!” (workload, economy, safety) (workload, economy, safety)
  64. 64. What are the circumstances?
  65. 65. Where are the tensions?
  66. 66. Have ops been burnt before?
  67. 67. Is there deployment friction? Why?
  68. 68. Is deployment high-risk?
  69. 69. Is deployment time consuming?
  70. 70. Is deployment important to the business?
  71. 71. “It’s 3am and the pager has gone off again. Why can’t these devs just write code that works?”
  72. 72. “It’s 3am and the pager has gone off again. Why can’t these devs just write code that works?” (workload, economy, safety)
  73. 73. “It’s 3am and the pager has gone off again. Why can’t these devs just write code that works?” (workload, economy, safety) (workload, economy, safety)
  74. 74. [hindsight] converts a once vague, unlikely future into an immediate, certain past -- Sidney Dekker
  75. 75. What are the circumstances?
  76. 76. Where are the tensions?
  77. 77. Why didn’t the dev know the code would fail like this?
  78. 78. Why weren’t you involved when the code was written?
  79. 79. How is code reviewed?
  80. 80. Is the infrastructure anti-fragile?
  81. 81. Is the code anti-fragile?
  82. 82. Hindsight bias
  83. 83. [hindsight] converts a once vague, unlikely future into an immediate, certain past -- Sidney Dekker
  84. 84. What are the motivations?
  85. 85. “amoral actors”
  86. 86. workload, economy, safety
  88. 88. “root cause” is simply the point you stop looking -- Sidney Dekker
  89. 89. What are the circumstances?
  90. 90. Where are the tensions?
  91. 91. Thank you!
  92. 92. Thank you! Liked the talk? Let @auxesis know!
  93. 93. Sidney Dekker [books]: The Field Guide to Understanding Human Error; Drift Into Failure; Just Culture
          Dan Manges [blog]: How incidents affect infrastructure priorities
