AA261: DevOps lessons in collaborative maintenance
Lindsay Holmwood (@auxesis)
Software Manager @ Bulletproof Networks
Trigger warning: death
January 31, 2000




Puerto Vallarta
Seattle
Departed PVR at 13.37 PST
Ascended to 31,000ft
2 hours into flight:
Jammed horizontal stabiliser
No trim control
Redirected to LAX
Pilots unjammed
horizontal stabilisers
2 pilots
3 crew
83 passengers
This is a maintenance accident. Alaska
Airlines' maintenance and inspection of its
horizontal stabilizer activation system was
poorly conceived and woefully executed. The
failure was compounded by poor oversight...
had any of the managers, mechanics,
inspectors, supervisors or FAA overseers
whose job it was to protect this mechanism
done their job conscientiously, this accident
cannot happen.
                  -- John J. Goglia, NTSB Board Member
hindsight != foresight
[hindsight] converts a once
vague, unlikely future into an
   immediate, certain past
                   -- Sidney Dekker
This is a maintenance accident. Alaska
Airlines' maintenance and inspection of its
horizontal stabilizer activation system was
poorly conceived and woefully executed. The
failure was compounded by poor oversight...
had any of the managers, mechanics,
inspectors, supervisors or FAA overseers
whose job it was to protect this mechanism
done their job conscientiously, this accident
cannot happen.
                  -- John J. Goglia, NTSB Board Member
“poorly conceived and woefully executed”
DC-9 -> MD-80 -> MD-83
Evolutionary
product development
Appropriated
maintenance schedules
Jackscrew
lubrication interval
Year   Interval                      Context
1965   every 300-350 hours           launch of DC-9
1985   every 700 hours               industry deregulation
1987   every 1000 hours              industry standardisation
1991   every 1200 hours              industry standardisation
1994   every 1600 hours              industry standardisation
1996   every 8 months (2550 hours)   Alaska Airlines policy change
Decrementalism
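The interval table above can be read as a series of individually modest steps. A minimal sketch of that arithmetic follows. It assumes the upper end of the 1965 range (350 hours) as the baseline, and the function name step_changes is purely illustrative.

```python
# Lubrication intervals from the table above, in flight hours.
# Assumption: the 1965 entry uses the upper end of the quoted 300-350 hour range.
intervals = [
    (1965, 350),
    (1985, 700),
    (1987, 1000),
    (1991, 1200),
    (1994, 1600),
    (1996, 2550),
]


def step_changes(rows):
    """Return (year, hours, growth vs previous entry, growth vs first entry)."""
    baseline = rows[0][1]
    changes = []
    # Pair each entry with the one before it to measure the local step.
    for (_, prev_hours), (year, hours) in zip(rows, rows[1:]):
        changes.append((year, hours, hours / prev_hours, hours / baseline))
    return changes


for year, hours, step, total in step_changes(intervals):
    print(f"{year}: {hours:>4} h  x{step:.2f} vs previous  x{total:.2f} vs 1965")
```

Under that assumption no single step more than doubles the previous interval, yet the 1996 interval is over seven times the 1965 one.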
Complex system constraints
     Jens Rasmussen
[Diagram: workload, economy, safety]
[Diagram: time, cost, quality]
[Diagram: workload, economy, safety]
[Diagram: workload, economy, safety]
outside: failure of foresight
inside: trade-offs in direction of greater efficiency

trade-offs in direction of greater efficiency
Constraints on knowledge
Why would they make bad
decisions intentionally?
Decisions seemed rational
Local rationalisation
“people make what they
  consider to be the best
decision based on available
 knowledge at the time”
This is a maintenance accident. Alaska
Airlines' maintenance and inspection of its
horizontal stabilizer activation system was
poorly conceived and woefully executed. The
failure was compounded by poor oversight...
had any of the managers, mechanics,
inspectors, supervisors or FAA overseers
whose job it was to protect this mechanism
done their job conscientiously, this accident
cannot happen.
                  -- John J. Goglia, NTSB Board Member
[Diagram: workload, economy, safety]
[Diagram: time, cost, quality]
DevOps constraints
“God, our ops team are arseholes. I just want
to deploy this change and go home!”
[Diagram: workload, economy, safety]
[Diagram: two copies of workload, economy, safety, side by side]
What are the circumstances?
Where are the tensions?
Have ops been burnt before?
Is there deployment friction?
            Why?
Is deployment high-risk?
Is deployment time consuming?
Is deployment important
    to the business?
“It’s 3am and the pager has gone off again. Why
can’t these devs just write code that works?”
[Diagram: workload, economy, safety]
[Diagram: two copies of workload, economy, safety, side by side]
[hindsight] converts a once
vague, unlikely future into an
   immediate, certain past
                   -- Sidney Dekker
What are the circumstances?
Where are the tensions?
Why didn’t the dev know the
 code would fail like this?
Why weren’t you involved
when the code was written?
How is code reviewed?
Is the infrastructure anti-fragile?
Is the code anti-fragile?
Hindsight bias
[hindsight] converts a once
vague, unlikely future into an
   immediate, certain past
                   -- Sidney Dekker
What are the motivations?
“amoral actors”
[Diagram: workload, economy, safety]
“root cause” is simply the
point you stop looking
                    -- Sidney Dekker
What are the circumstances?
Where are the tensions?
Thank you!
Liked the talk? Let @auxesis know!
Sidney Dekker [books]
The Field Guide to Understanding Human Error
Drift Into Failure
Just Culture

Dan Manges [blog]
How incidents affect infrastructure priorities
