Complex systems fail. Failure is a normal occurrence, and how a system responds to component failures determines whether it is resilient or whether the failure becomes an incident. This is true not just for information systems but also for industrial and mechanical systems. We can learn from the experience of these systems: which common themes permeate them, and how best to build our software factories to minimize the effect of component failures and prevent incidents from becoming accidents.
This talk walks through the issues and experiences of complex systems and identifies the common failure points within them. With a focus on complexity, tight coupling, operator error, and transitions, attendees can learn from the industrial and system accidents that precede them as they work to build new, modern, efficient, safe software and system designs. The talk draws on both industrial and software examples of how things go wrong and the common failure points among them. By the end of this talk, attendees will understand which decisions increase the chance of failure and which decisions and designs reduce it.
3. chef.io
Who is Galen?
● Chef Software for 6 Years
● Based in San Diego
● Current: Lead Compliance & Security Architect
● Previously: Solutions & Customer Architect
● Global Traveler
4.
Key Components
● Systems are Complex
● Systems are Tightly Coupled
● Systems may have Common-mode Components
● Systems include technical, mechanical, and human components
● Systems have Potential for Catastrophic Failure
6.
● Complexity: Nuclear reactors are inherently
complex. They are transformational systems that
are non-linear.
● Coupling: The safety systems are tightly coupled to
reactor operation. The same systems used for safety
also serve as the primary steam-generation path
● Common-mode: A pressure relief valve became
stuck open, making the secondary system’s
pumping of water into the reactor less
effective
● Safety-System Failure: A critical backup water
system was inoperable due to valves being closed
Example: Three Mile Island
7.
● Core issue: Misconfiguration removed core Google
network services from multiple regions
● Complexity: Google Cloud operates with multiple
regions, with redundancy for “defense in depth”
● Coupling: Systems are designed to keep operating for a
time after changes, but engineers were unable to find
and fix the issue in that window
● Common-Mode: Debugging tools relied upon the same
network infrastructure that was unresponsive due
to congestion, which significantly increased the time to
troubleshoot and fix
Example: Google Cloud Catch-22 (Incident #19009)
8.
● Core issue: An OS re-image is scheduled for a set
of desktop systems, but is accidentally applied to
all systems
● Complexity: The tool for config mgmt is complex
and limited only by operator experience
● Coupling: Once the order is given to re-image, all
systems immediately take that order with no
buffer
● Process: The definitions for systems are built
manually in a UI and applied through the same
UI. There is no way to check and review
changes short of implementing a manual review process
Example: Emory University/CommBank Config Mgr
9.
● Core issue: The NotPetya virus infects the Maersk
network, taking down the shipping giant for weeks
● Complexity: Initially simple network design (flat
network) results in a large spread of the virus
before systems can be shut down. Complexity in
backups of critical infrastructure caused massive
delays
● Common Mode: Active Directory is the linchpin for
almost all enterprise services, and there was no full
backup of Active Directory
● Process: Without a test of a full systems failure, no
clear remediation was in place for this type of event
Example: Maersk Shipping
10.
Managing Complexity
● Are your interactions documented?
● Are they automated or manual?
● Are your systems scalable?
● What complexity is irreducible?
● How do you handle dependencies?
11.
Managing Complexity
● Are your interactions documented?
● Are they automated or manual?
● Are your systems scalable?
● What complexity is irreducible?
● How do you handle dependencies?
● One Path to Production
● API-Oriented Architecture
● Microservices
● Versioned Artifacts
● Codified Interactions (CI/CD/DevOps)
12.
● What services rely upon other
services?
● How do we test those interactions?
● What happens if a service cannot
reach a necessary service?
● Are we relying upon the product of
the service or upon specific systems?
Example: “Dagobah: 10.45.20.15”
Reducing Coupling
13.
● Buffer. Ensure that systems (&
subsystems) have time to recover
from events
● Resilience. Build in service discovery
● Linearity. Ensure a linear flow of
changes to the system (One Path)
● Observability. Have effective
monitoring around system state
● What services rely upon other
services?
● How do we test those interactions?
● What happens if a service cannot
reach a necessary service?
● Are we relying upon the product of
the service or upon specific systems?
Example: “Dagobah: 10.45.20.15”
Reducing Coupling
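The “Buffer” idea above can be sketched as a retry with exponential backoff: give a dependency time to recover instead of failing on the first error. The function names here are illustrative, not a specific library’s API:

```python
import time

def call_with_backoff(operation, attempts=4, base_delay=0.01):
    """Buffer: retry with increasing waits so the dependency can recover."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of buffer; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # wait, then retry

# Simulated flaky dependency that recovers after two failures.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service warming up")
    return "ok"

result = call_with_backoff(flaky_service)  # succeeds on the third attempt
```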
14.
● What core components do all
systems rely upon?
● What components do you rely upon
that you don’t own?
● How tightly coupled are they to each
other?
Common Mode Failures
15.
● Identify Risk. Identify Common Mode
Components and perform a risk
assessment
● Failure untested is failure unknown.
Test failure regularly (Chaos
Engineering)
● Focus on resilience, not redundancy
● Build agnostic systems (loosely coupled, easily replaceable)
● What core components do all
systems rely upon?
● What components do you rely upon
that you don’t own?
● How tightly coupled are they to each
other?
Common Mode Failures
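“Failure untested is failure unknown” can be sketched as a tiny chaos experiment: inject failures into a dependency and verify the caller degrades gracefully. This is a hypothetical, in-process illustration (real chaos-engineering tools operate at the infrastructure level; `with_chaos` and the fallback logic are made up for this example):

```python
import random

def with_chaos(operation, failure_rate, rng):
    """Wrap a call so it randomly fails, simulating an unreliable dependency."""
    def chaotic():
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return operation()
    return chaotic

def fetch_config():
    return {"mode": "primary"}

def resilient_fetch(operation, fallback):
    """A resilient caller degrades to a fallback instead of crashing."""
    try:
        return operation()
    except ConnectionError:
        return fallback

# Run the experiment: half the calls fail, yet every call yields usable config.
rng = random.Random(42)  # seeded for a repeatable experiment
chaotic_fetch = with_chaos(fetch_config, failure_rate=0.5, rng=rng)
results = [resilient_fetch(chaotic_fetch, {"mode": "cached"}) for _ in range(100)]
```

If the assertion that every result is usable ever fails, the test has found a common-mode weakness before production did.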
16.
● Small, resilient agile teams work
○ Team of Teams (Special Forces)
○ Lean / Kaizen “The Goal” (Toyota Production System)
○ Agile Software Delivery (Continuous Delivery)
○ Black Box Thinking (All Major Airlines)
● Command & Control structures are too slow
○ Decision-making cycles are too slow, too rigid
○ Sunk-cost and other fallacies often prevent critical redesign
● Focus on Minimum Viable Product
○ Short Sprints
○ Outcome and Business Oriented Objectives
People & Process
17.
● Component failure is normal in complex systems
● Managing Complexity
○ One Path to Production
○ Define all interactions in code (Coded Enterprise)
● Reducing Coupling
○ Design systems to buffer/queue
○ Focus on the output (API-Oriented)
○ Look for changes to the system state (Observability)
● Common-Mode
○ Test Failure (Chaos Engineering)
○ Risk Analysis on critical services
○ Build for resiliency
● People & Process: Agile
System Accidents
(Recap)
Examples
Three Mile Island
Primary Cooling System is the water within the reactor (High Pressure, High Heat, Radioactive)
Heats Secondary cooling system which turns steam turbines
A cupful of water leaked from the secondary system. The location of the leak tripped two pumps, triggering a pump stoppage.
When this flow is interrupted, the steam turbine shuts down automatically as a safety device
However, the heat from the core still has to be dissipated, but now lacks its normal cooling mechanism (secondary system/steam turbine)
Emergency pumps turn on to pump water through the secondary system; in this instance they were blocked by a closed valve. (Valve is required to never be closed during normal operation, but it was)
Operator verified the pumps came on, but didn't know the valve was closed and water wasn’t flowing into the reactor. (Complexity/Observability)
Reactor scrammed because no heat was being removed (tightly coupled)
Another Automatic Safety Device triggers, the pressure relief valve to reduce pressure in the core. This dumps water out into an overflow tank.
After sufficient pressure was relieved, the operators ordered the pressure valve to close. The indicator on the control panel only shows if the order was given, not the actual valve state. (Common-Mode/Observability)
This stuck valve resulted in one-third of the reactor coolant draining through the valve
-This all happened in 13 seconds
To recap: False Signal caused pumps to fail, emergency cooling out of position, indicator obscured, and a relief valve failed to reseat with a failed indicator
--Not in direct sequence
--Notices of radioactive water were attributed to "an unknown source", because, per its indicator, the relief valve couldn't be open
--That water went into a different tank than intended, due to complexity of the system
Now, that’s an industrial example, but very few of us run industrial systems. So let’s look at some more relatable examples.
Incident Report: https://status.cloud.google.com/incident/cloud-networking/19009
Google makes maintenance changes constantly. Their services are globally distributed and they employ a defense-in-depth approach to security and resiliency. The short version of the root cause of this outage: a change was pushed to their network management systems, and a bug allowed multiple, independent cluster management systems to be pulled offline for maintenance at the same time.
Complexity: Google’s network management clusters are distributed logically across regions, but the maintenance event triggered clusters to go offline even though the entire cluster was not in scope for the planned maintenance.
Coupling: The system is designed to “fail static”, i.e. operate without cluster management for a period of time, in order to give Google engineers a chance to fix issues before they become incidents.
Common-Mode: Google figured out the issue relatively quickly, but was hampered when the tools used to troubleshoot and resolve the issue were unusable due to the extremely high network congestion on the remaining networks. This significantly increased the time to resolve the issue.
Complexity: Once Google determined the issue and got the tools working well enough to apply the correct configuration, they found that, with all network control clusters down, the previous configuration state was lost, so Google had to rebuild it. This increased the downtime by an hour.
As we can see, the outage isn’t just about complexity, coupling, or process; it is the combination of all of those things together that creates a large-scale outage.
Okay, but I’m not Google. I’m just Financial Company X, or a University. How could I possibly have a system that is complex enough for this to apply?
Example: System Center Config Manager (SCCM) https://myitforum.com/sccm-task-sequence-blew-up-australias-commbank/
Same story, two victims (that we know of).
Configuration Manager is a really powerful tool that can push patches and updates and modify users, settings, etc. It is, however, a UI-based tool with the ability to be used as a very big ‘foot-gun’.
In both instances, an operator was creating a change to re-image a set of systems. In both cases these were desktop/laptop systems, either due for an OS upgrade or just a planned re-image. However, the targeted scope of systems was not applied correctly, and instead the change was applied, immediately, to every single system System Center manages.
That is all desktops. Laptops connected to the network. Servers. Servers running Active Directory. Exchange. CRM applications. Banking applications. You name it, it’s re-imaged. Including the System Center server itself.
So, in one instant, every system in the network has been told to format itself in preparation for being imaged, and some systems even started the imaging process before the System Center server went offline.
Complexity: The tool used is UI-based, has the ability to select all systems, and has no direct feedback loop telling you which systems you are going to affect.
Coupling: Once the order is given, there is no undo or rollback to revert to the previous state. It isn’t stored somewhere, reviewed, etc.
Okay, so we have some examples of how these things can occur and there are some themes we can identify around and start designing systems.
What we want to do here is think about the questions on these components that would allow us to manage complexity.
So, think about: can we understand the system? If we can, do we have to get that information from someone’s head, or can we pull it from documentation? Ideally it’d be an API that operates exactly as it’s documented.
The interactions in the system: are they known? Are they automatic, or are there people in the process, moving code from one area to another, turning services off and on, etc.?
Scalability is a component of complexity. How scalable is the system? Does it need to be?
What parts of our complexity are irreducible? That is, which components are necessary, and which are unnecessary? Where can we simplify the system?
Dependency management is a critical, often overlooked component of complexity.
========
These are the high level components necessary to manage complexity.
Simplify the pipeline. One path to production, regardless of the change
Documentation via API. If it’s an API, it’s (a) documented (or at least queryable) and (b) automatic rather than manual.
Microservices are a way to reduce the system complexity into smaller pieces that are more manageable, allowing us to influence scalability.
Creating a versioned artifact creates a point-in-time record of our system state. Working with that artifact (or artifacts) allows us to ensure a known system state.
Codifying interactions between teams and adopting that API-oriented architecture is critical.
Same idea as complexity. How do we allow systems to be more loosely associated? Are they talking to a specific hostname or IP address? Or are they grabbing their configuration from a central location?