Building_the_Coded_Enterprise
chef.io
Understanding Common Failures
System Accidents
Who is Galen?
● Chef Software for 6 Years
● Based in San Diego
● Current: Lead Compliance & Security Architect
● Previously: Solutions & Customer Architect
● Global Traveler
Key Components
● Systems are Complex
● Systems are Tightly Coupled
● Systems may have Common-mode Components
● Systems are inclusive of the technical, mechanical and human
components
● Systems have Potential for Catastrophic Failure
System Accident: Unanticipated interaction of multiple failures
Example: Three Mile Island
● Complexity: Nuclear reactors are inherently complex. They are non-linear, transformational systems.
● Coupling: The safety systems are tightly coupled to reactor operation: the cooling system that protects the core also generates the steam for the turbines
● Common-mode: A pressure relief valve stuck open, which made the secondary system’s pumping of water into the reactor less effective
● Safety-System Failure: A critical backup water system was inoperable because its valves were closed
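The Three Mile Island indicator problem (the panel showed that an order was given, not the actual valve state) can be sketched as a check that compares commanded state against independently observed state. This is a minimal illustration only; the `Valve` class, its fields, and the valve names are hypothetical and not taken from any real plant control system.

```python
from dataclasses import dataclass


@dataclass
class Valve:
    """Hypothetical valve model: what was ordered vs. what a sensor reports."""
    name: str
    commanded_open: bool  # last order sent by the operator
    observed_open: bool   # position reported by an independent sensor


def find_mismatches(valves):
    """Return the names of valves whose observed state differs from the order."""
    return [v.name for v in valves if v.commanded_open != v.observed_open]


# Assumed example data: the relief valve was ordered closed but is still open.
valves = [
    Valve("relief_valve", commanded_open=False, observed_open=True),
    Valve("feedwater_block_valve", commanded_open=True, observed_open=True),
]
mismatches = find_mismatches(valves)
print(mismatches)
```

The point of the sketch is that monitoring should alert on the gap between intent and reality, rather than display intent alone.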
Example: Google Cloud Catch-22 (Incident #19009)
● Core issue: A misconfiguration removed core Google network services from multiple regions
● Complexity: Google Cloud operates across multiple regions, with redundancy for “defense in depth”
● Coupling: Systems are designed to keep operating for a time after changes, but the problem could not be found and fixed in that time
● Common-Mode: Debugging tools relied upon the same network infrastructure that had become unresponsive due to congestion, which significantly increased the time to troubleshoot and fix
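The idea of continuing to operate for a time without the control plane can be sketched as a small cache that keeps serving the last known-good configuration when a refresh fails. This is an assumption-laden illustration, not a description of Google's implementation; `FailStaticConfig` and `fake_fetch` are hypothetical names.

```python
class FailStaticConfig:
    """Serve the last known-good config while the control plane is unreachable."""

    def __init__(self, fetch):
        self._fetch = fetch     # callable returning a config dict; may raise
        self._last_good = None
        self.stale = False      # True while a cached config is being served

    def get(self):
        try:
            self._last_good = self._fetch()
            self.stale = False
        except ConnectionError:
            if self._last_good is None:
                raise           # nothing cached yet, so nothing to fall back on
            self.stale = True   # surface this flag to monitoring
        return self._last_good


# Assumed example: the first fetch succeeds, the second one fails.
responses = iter([{"route": "a"}, ConnectionError("control plane unreachable")])


def fake_fetch():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


cfg = FailStaticConfig(fake_fetch)
first = cfg.get()
second = cfg.get()
print(second, cfg.stale)
```

Note that the cache in this sketch lives only in memory. If the process restarts while the control plane is down, the previous state is gone, so the last known-good config would also need to be persisted somewhere that survives a restart.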
Example: Emory University/CommBank Config Mgr
● Core issue: An OS re-image is scheduled for a set of desktop systems, but is accidentally applied to all systems
● Complexity: The config mgmt tool is complex and limited only by operator experience
● Coupling: Once the order to re-image is given, all systems take it immediately, with no buffer
● Process: System definitions are built manually in a UI and applied through the same UI. There is no way to check and review changes short of adding a people process
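One way to add the missing check-and-review step is a scope guard that runs before any change is applied. The sketch below is hypothetical and does not use any Configuration Manager API; `plan_change`, `ScopeTooLargeError`, and the 10% threshold are assumptions chosen for illustration.

```python
class ScopeTooLargeError(Exception):
    """Raised when a change targets more of the fleet than the guard allows."""


def plan_change(targets, inventory, max_fraction=0.10):
    """Validate the scope of a change before anything is applied.

    targets:   hostnames selected for the change
    inventory: every hostname the tool manages
    """
    targets, inventory = set(targets), set(inventory)
    if not inventory:
        raise ValueError("inventory is empty")
    unknown = targets - inventory
    if unknown:
        raise ValueError(f"targets not in inventory: {sorted(unknown)}")
    fraction = len(targets) / len(inventory)
    if fraction > max_fraction:
        raise ScopeTooLargeError(
            f"change would hit {len(targets)} of {len(inventory)} systems "
            f"({fraction:.0%}); limit is {max_fraction:.0%}"
        )
    return sorted(targets)  # the reviewed list that is allowed to proceed


# Assumed example fleet: 15 desktops and 5 servers.
inventory = [f"desktop-{i:02d}" for i in range(15)] + [f"server-{i}" for i in range(5)]
approved = plan_change(["desktop-00"], inventory)
print(approved)
```

Calling `plan_change(inventory, inventory)` raises `ScopeTooLargeError` instead of re-imaging everything. A larger rollout would then have to go out in reviewed batches, which also gives the system a buffer.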
Example: Maersk Shipping
● Core issue: The NotPetya virus infects the Maersk network, taking down the shipping giant for weeks
● Complexity: An initially simple (flat) network design lets the virus spread widely before systems can be shut down. Complexity in backups of critical infrastructure caused massive delays
● Common Mode: Active Directory is the linchpin for almost all enterprise services. There was no full backup of Active Directory
● Process: Without a test of a full systems failure, no clear remediation was in place for this type of event
Managing Complexity
● Are your interactions documented?
● Are they automated or manual?
● Are your systems scalable?
● What complexity is irreducible?
● How do you handle dependencies?
● One Path to Production
● API-Oriented Architecture
● Microservices
● Versioned Artifacts
● Codified Interactions (CI/CD/DevOps)
Reducing Coupling
● What services rely upon other services?
● How do we test those interactions?
● What happens if a service cannot reach a necessary service?
● Are we relying upon the product of the service or upon specific systems? Example: “Dagobah: 10.45.20.15”
● Buffer. Ensure that systems (& subsystems) have time to recover from events
● Resilience. Build in service discovery
● Linearity. Ensure a linear flow of changes to the system (One Path)
● Observability. Have effective monitoring around system state
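The "Buffer" and "Resilience" points can be sketched together: look a service up by logical name instead of hard-coding a specific host, and retry with backoff so a brief outage has time to clear. The registry here is a plain dict standing in for a real service-discovery system; `resolve`, `call_with_backoff`, and the `inventory-api` name are hypothetical.

```python
import time


def resolve(service_name, registry):
    """Look up the current endpoint for a logical service name."""
    try:
        return registry[service_name]
    except KeyError:
        raise LookupError(f"no endpoint registered for {service_name!r}") from None


def call_with_backoff(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry an operation with exponential backoff; re-raise after the last try."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Assumed example: callers know "inventory-api", never the address behind it.
registry = {"inventory-api": "10.45.20.15:8443"}
failures_left = 2
delays = []


def flaky_call():
    global failures_left
    endpoint = resolve("inventory-api", registry)  # re-resolved on every attempt
    if failures_left > 0:
        failures_left -= 1
        raise ConnectionError("temporarily unreachable")
    return f"ok from {endpoint}"


result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

Because the endpoint is resolved on every attempt, replacing the host behind `inventory-api` only requires a registry update, not a change to each caller.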
Common Mode Failures
● What core components do all systems rely upon?
● What components do you rely upon that you don’t own?
● How tightly coupled are they to each other?
● Identify Risk. Identify Common Mode Components and perform a risk assessment
● Failure untested is failure unknown. Test failure regularly (Chaos Engineering)
● Focus on resilience, not redundancy
● Build agnostic systems (loosely coupled, easily replaceable)
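Identifying common-mode components can start from a dependency map: collect everything each service relies on, directly or indirectly, and intersect the results. The sketch below assumes such a map already exists; the service names and the `depends_on` structure are invented for illustration.

```python
def transitive_deps(service, depends_on):
    """All components a service relies on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def common_mode_components(services, depends_on):
    """Components that every listed service depends on."""
    dep_sets = [transitive_deps(s, depends_on) for s in services]
    return set.intersection(*dep_sets) if dep_sets else set()


# Assumed example map: service -> components it calls directly.
depends_on = {
    "email": ["active-directory", "dns"],
    "file-share": ["active-directory"],
    "booking-app": ["database"],
    "database": ["active-directory"],
}
common = common_mode_components(["email", "file-share", "booking-app"], depends_on)
print(common)
```

Anything in the resulting set is a candidate for the risk assessment and for regular failure testing, since losing it affects every service at once.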
People & Process
● Small, resilient agile teams work
○ Team of Teams (Special Forces)
○ Lean / Kaizen “The Goal” (Toyota Production System)
○ Agile Software Delivery (Continuous Delivery)
○ Black Box Thinking (All Major Airlines)
● Command & Control structures are too slow
○ Decision-making cycles are too slow, too rigid
○ Sunk-cost and other fallacies often prevent critical redesign
● Focus on Minimum Viable Product
○ Short Sprints
○ Outcome and Business Oriented Objectives
System Accidents (Recap)
● Component failure is normal in complex systems
● Managing Complexity
○ One Path to Production
○ Define all interactions in code (Coded Enterprise)
● Reducing Coupling
○ Design systems to buffer/queue
○ Focus on the output (API-Oriented)
○ Look for changes to the system state (Observability)
● Common-Mode
○ Test Failure (Chaos Engineering)
○ Risk Analysis on critical services
○ Build for resiliency
● People & Process: Agile


Editor's Notes

  1. Examples: Three Mile Island

     The primary cooling system is the water within the reactor (high pressure, high heat, radioactive). It heats the secondary cooling system, which turns the steam turbines.

     A cupful of water leaked from the secondary system. The location of the leak tripped 2 pumps, which triggered a pump stoppage. When this flow is interrupted, the steam turbine shuts down automatically as a safety measure. However, the heat from the core still has to be dissipated, and it now lacks its normal cooling mechanism (secondary system/steam turbine).

     Emergency pumps turn on to pump water through the secondary system; in this instance they were blocked by a closed valve. (The valve is required never to be closed during normal operation, but it was.) The operator verified that the pumps came on, but didn't know the valve was closed and that water wasn’t flowing into the reactor. (Complexity/Observability)

     The reactor scrammed because no heat was being removed (tightly coupled). Another automatic safety device then triggers: the pressure relief valve opens to reduce pressure in the core. This dumps water out into an overflow tank. After sufficient pressure was relieved, the operators ordered the pressure valve to close. The indicator on the control panel only shows whether the order was given, not the actual valve state. (Common-Mode/Observability) This stuck valve resulted in ⅓ of the reactor coolant draining through the valve.
     - This all happened in 13 seconds

     To recap: a false signal caused pumps to fail, emergency cooling was out of position with its indicator obscured, and a relief valve failed to reseat with a failed indicator.
     -- Not in direct sequence
     -- Notices of radioactive water were deemed to be from "an unknown source", because the relief valve couldn't be open
     -- That water went into a different tank than intended, due to the complexity of the system

     Now, that’s an industrial example, but very few of us run industrial systems. So let’s look at some more relatable items.
  2. Incident Report: https://status.cloud.google.com/incident/cloud-networking/19009

     Google makes maintenance changes constantly. Their services are globally distributed and they employ a defense-in-depth approach to security and resiliency. The short version of the root cause of this outage is: a change was pushed to their network management systems, and a bug allowed multiple, independent cluster management systems to be pulled offline for maintenance at the same time.

     Complexity: Google’s network management clusters are distributed logically across regions, but the maintenance event triggered clusters to go offline even though the entire cluster was not in scope for the planned maintenance.

     Coupling: The system is designed to “fail-static”, i.e. to operate without cluster management for a period of time in order to give Google engineers a chance to fix issues before they become incidents.

     Common-Mode: Google figured out the issue relatively quickly, but was hampered when the tools used to troubleshoot and resolve it were unusable due to the extremely high network congestion on the remaining networks. This significantly increased the time to resolve the issue.

     Complexity: Once Google had determined the issue and got the tools working well enough to apply the correct configuration, they found that, because all network control clusters were down, the previous configuration state was lost. So Google had to rebuild it. This increased the downtime by an hour.

     As we can see, the outage isn’t just about complexity or coupling or process; a combination of all of those things together creates this large-scale outage.

     Okay, but I’m not Google. I’m just Financial Company X, or a University. How could I possibly have a system that is complex enough for this to apply?
  3. Example: System Center Configuration Manager (SCCM) https://myitforum.com/sccm-task-sequence-blew-up-australias-commbank/

     Same story, two victims (that we know of). Configuration Manager is a really powerful tool that can push patches and updates, modify users and settings, etc. It is, however, a UI-based tool and can be used as a very, very big 'foot-gun'.

     In both instances, an operator was creating a change to re-image a set of systems. In both cases these were desktop/laptop systems, either due for an OS upgrade or just a planned re-image. However, the targeted scope of systems was not applied correctly, and instead the change was applied immediately to every single system System Center manages. That is all desktops. Laptops connected to the network. Servers. Servers running Active Directory, Exchange, CRM applications, banking applications. You name it, it's re-imaged. Including the System Center server itself.

     So, in one instant, every system in the network has been told to format itself in preparation for being imaged. And some systems even started the imaging process before the System Center server went offline.

     Complexity: The tool is UI-based, can select all systems, and has no direct feedback loop telling you which systems you are about to affect.

     Coupling: Once the order is given, there is no undo or rollback to revert to the previous state. The change isn't stored somewhere, reviewed, etc.

     Okay, so we have some examples of how these things can occur, and there are some themes we can identify and start designing systems around.
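     The missing feedback loop can be sketched as a guard that resolves the target list up front and refuses scopes that look accidental. This is a minimal sketch, not a Configuration Manager feature; `plan_change`, `max_fraction`, and the host names are hypothetical.

```python
def plan_change(all_systems, selected, max_fraction=0.25):
    """Return the targets for a change, refusing scopes that look accidental."""
    inventory = set(all_systems)
    targets = sorted(set(selected) & inventory)
    if not targets:
        raise ValueError("no systems matched the selection")
    fraction = len(targets) / len(inventory)
    if fraction > max_fraction:
        raise ValueError(
            f"refusing to target {len(targets)} of {len(inventory)} systems "
            f"({fraction:.0%}); exceeds the {max_fraction:.0%} limit"
        )
    return targets


# Hypothetical inventory: a few desktops plus the servers nobody meant to touch.
inventory = ["desk-01", "desk-02", "laptop-07", "dc-01",
             "exch-01", "sccm-01", "crm-01", "bank-01"]

print(plan_change(inventory, ["desk-01", "desk-02"]))  # intended scope is shown first

try:
    plan_change(inventory, inventory)  # the "select all" mistake
except ValueError as err:
    print(err)
```

     The operator sees exactly which systems are in scope before anything runs, and a selection covering the whole estate fails loudly instead of executing.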
  4. What we want to do here is think about the questions we should ask of the components that would allow us to manage complexity.

     So, think about it: can we understand the system? If we can, do we have to get that information from someone's head, or can we pull it from documentation? Ideally it'd be an API that operates exactly as it's documented.

     The interactions in the system: are they known? Are they automatic, or are there people in the process? Moving code from one area to another? Turning services off and on, etc.?

     Scalability is a component of complexity. How scalable is the system? Does it need to be?

     What parts of our complexity are irreducible? In other words, which components are necessary? Which ones are unnecessary? Where can we simplify the system?

     Dependency management is a critical, often overlooked component of complexity.
  5. These are the high-level components necessary to manage complexity.

     Simplify the pipeline. One path to production, regardless of the change.

     Documentation via API. If it's an API, it's (a) documented, or at least queryable, and (b) automatic rather than manual.

     Microservices are a way to break the system's complexity into smaller, more manageable pieces, which also lets us influence scalability.

     Creating a versioned artifact creates a point-in-time record of our system state. Working with those artifacts lets us ensure a known system state.

     Codifying interactions between teams by adopting that API-oriented architecture is critical.
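     One way to picture a versioned artifact as a record of system state: derive the version from the content itself, so the same state always yields the same version. This is a minimal sketch under that assumption; `build_artifact` and `matches` are hypothetical helpers, not part of any particular tool.

```python
import hashlib
import json


def build_artifact(system_state: dict) -> dict:
    """Bundle a system description with a version derived from its content."""
    # Canonical JSON (sorted keys, fixed separators) makes the hash stable.
    canonical = json.dumps(system_state, sort_keys=True, separators=(",", ":"))
    version = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return {"version": version, "state": system_state}


def matches(artifact: dict, running_state: dict) -> bool:
    """True when a running system is in exactly the state the artifact recorded."""
    return build_artifact(running_state)["version"] == artifact["version"]


artifact = build_artifact({"nginx": "1.24", "tls": "1.3"})
print(matches(artifact, {"tls": "1.3", "nginx": "1.24"}))  # same state, different key order
print(matches(artifact, {"nginx": "1.24", "tls": "1.2"}))  # drifted state
```

     Because the version is a function of the content, "is this system in a known state?" becomes a simple comparison rather than a question for whoever last touched it.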
  6. Same idea as complexity. How do we allow systems to be more loosely associated? Are they talking to a specific hostname or IP address? Or are they grabbing their configuration from a central location?
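     The hostname question can be sketched like this: callers ask a central registry for a service by name and never hard-code where it lives. This is a minimal sketch, with a JSON string standing in for the central location; `ServiceRegistry`, the service name, and the address are all hypothetical.

```python
import json


class ServiceRegistry:
    """Hypothetical central lookup: callers ask for a service name, not a host."""

    def __init__(self, document: str):
        # In practice this document would come from a central config store.
        self._endpoints = json.loads(document)

    def endpoint(self, service: str) -> str:
        try:
            return self._endpoints[service]
        except KeyError:
            raise LookupError(f"no endpoint registered for {service!r}") from None


registry = ServiceRegistry('{"billing-db": "db-03.example.internal:5432"}')
print(registry.endpoint("billing-db"))
```

     Moving the database then means changing one registry entry, not every client that used to know its hostname.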