Complex systems fail. Failure is a normal occurrence, and how a system responds to component failures determines whether it is resilient or whether the failure becomes an incident. This is true not just for information systems but also for industrial and mechanical systems. We can learn from the experience of these systems: which common themes permeate them, and how best to build our software factories to minimize the effect of component failures and prevent incidents from becoming accidents.
This talk walks through the issues and experiences of complex systems and identifies the common failure points within them. With a focus on complexity, tight coupling, operator error, and transitions, attendees can learn from the industrial and system accidents that precede them as they work to build new, modern, efficient, safe software and system designs. The talk draws on both industrial and software examples of how things go wrong and the common failure points among them. By the end of this talk, attendees will understand which decisions increase the chance of failure and which decisions and designs reduce it.
3. chef.io
Who is Galen?
● Chef Software for 6 Years
● Based in San Diego
● Current: Lead Compliance & Security Architect
● Previously: Solutions & Customer Architect
● Global Traveler
4.
Key Components
● Systems are Complex
● Systems are Tightly Coupled
● Systems may have Common-mode Components
● Systems include technical, mechanical, and human components
● Systems have Potential for Catastrophic Failure
6.
● Complexity: Nuclear reactors are inherently
complex. They are transformational systems that
are non-linear.
● Coupling: The safety systems are tightly coupled to
reactor operation. The same systems used for safety
also serve as the primary steam-generation path
● Common-mode: A pressure relief valve became
stuck open, making the secondary system’s
pumping of water into the reactor less
effective
● Safety-System Failure: A critical backup water
system was inoperable due to valves being closed
Example: Three Mile Island
7.
● Core issue: Misconfiguration removed core Google
network services from multiple regions
● Complexity: Google Cloud operates with multiple
regions, with redundancy for “defense in depth”
● Coupling: Systems are designed to keep operating for a
time after changes, but engineers were unable to find
and fix the issue in that window
● Common-Mode: Debugging tools relied upon the same
network infrastructure that was unresponsive due
to congestion, which significantly increased the time to
troubleshoot and fix
Example: Google Cloud Catch-22 (Incident #19009)
8.
● Core issue: An OS re-image is scheduled for a set
of desktop systems, but is accidentally applied to
all systems
● Complexity: The tool for config mgmt is complex
and limited only by operator experience
● Coupling: Once the order is given to re-image, all
systems immediately take that order with no
buffer
● Process: The definitions for systems are built
manually in a UI and applied through the same
UI. There is no way to check and review
changes short of implementing a manual review process
Example: Emory University/CommBank Config Mgr
9.
● Core issue: The NotPetya virus infects the Maersk
network, taking down the shipping giant for weeks
● Complexity: Initially simple network design (flat
network) results in a large spread of the virus
before systems can be shut down. Complexity in
backups of critical infrastructure caused massive
delays
● Common Mode: Active Directory is the linchpin for
almost all enterprise services, and there was no full
backup of Active Directory
● Process: Without a test of a full systems failure, no
clear remediation was in place for this type of event
Example: Maersk Shipping
10.
Managing Complexity
● Are your interactions documented?
● Are they automated or manual?
● Are your systems scalable?
● What complexity is irreducible?
● How do you handle dependencies?
11.
Managing Complexity
● Are your interactions documented?
● Are they automated or manual?
● Are your systems scalable?
● What complexity is irreducible?
● How do you handle dependencies?
● One Path to Production
● API-Oriented Architecture
● Microservices
● Versioned Artifacts
● Codified Interactions (CI/CD/DevOps)
12.
● What services rely upon other
services?
● How do we test those interactions?
● What happens if a service cannot
reach a necessary service?
● Are we relying upon the product of
the service or upon specific systems?
Example: “Dagobah: 10.45.20.15”
Reducing Coupling
13.
● Buffer. Ensure that systems (&
subsystems) have time to recover
from events
● Resilience. Build in service discovery
● Linearity. Ensure a linear flow of
changes to the system (One Path)
● Observability. Have effective
monitoring around system state
● What services rely upon other
services?
● How do we test those interactions?
● What happens if a service cannot
reach a necessary service?
● Are we relying upon the product of
the service or upon specific systems?
Example: “Dagobah: 10.45.20.15”
Reducing Coupling
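The “Buffer” idea above can be sketched as a retry with exponential backoff: give a dependency time to recover instead of failing on the first error. The function names here are illustrative, not a specific library’s API:

```python
import time

def call_with_backoff(operation, attempts=4, base_delay=0.01):
    """Buffer: retry with increasing waits so the dependency can recover."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of buffer; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # wait, then retry

# Simulated flaky dependency that recovers after two failures.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service warming up")
    return "ok"

result = call_with_backoff(flaky_service)  # succeeds on the third attempt
```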
14.
● What core components do all
systems rely upon?
● What components do you rely upon
that you don’t own?
● How tightly coupled are they to each
other?
Common Mode Failures
15.
● Identify Risk. Identify Common Mode
Components and perform a risk
assessment
● Failure untested is failure unknown.
Test failure regularly (Chaos
Engineering)
● Focus on resilience, not redundancy
● Build agnostic systems (loosely coupled, easily replaceable)
● What core components do all
systems rely upon?
● What components do you rely upon
that you don’t own?
● How tightly coupled are they to each
other?
Common Mode Failures
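“Failure untested is failure unknown” can be sketched as a tiny chaos experiment: inject failures into a dependency and verify the caller degrades gracefully. This is a hypothetical, in-process illustration (real chaos-engineering tools operate at the infrastructure level; `with_chaos` and the fallback logic are made up for this example):

```python
import random

def with_chaos(operation, failure_rate, rng):
    """Wrap a call so it randomly fails, simulating an unreliable dependency."""
    def chaotic():
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return operation()
    return chaotic

def fetch_config():
    return {"mode": "primary"}

def resilient_fetch(operation, fallback):
    """A resilient caller degrades to a fallback instead of crashing."""
    try:
        return operation()
    except ConnectionError:
        return fallback

# Run the experiment: half the calls fail, yet every call yields usable config.
rng = random.Random(42)  # seeded for a repeatable experiment
chaotic_fetch = with_chaos(fetch_config, failure_rate=0.5, rng=rng)
results = [resilient_fetch(chaotic_fetch, {"mode": "cached"}) for _ in range(100)]
```

If the assertion that every result is usable ever fails, the test has found a common-mode weakness before production did.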
16.
● Small, resilient agile teams work
○ Team of Teams (Special Forces)
○ Lean / Kaizen “The Goal” (Toyota Production System)
○ Agile Software Delivery (Continuous Delivery)
○ Black Box Thinking (All Major Airlines)
● Command & Control structures are too slow
○ Decision-making cycles are too slow, too rigid
○ Sunk-cost and other fallacies often prevent critical redesign
● Focus on Minimum Viable Product
○ Short Sprints
○ Outcome and Business Oriented Objectives
People & Process
17.
● Component failure is normal in complex systems
● Managing Complexity
○ One Path to Production
○ Define all interactions in code (Coded Enterprise)
● Reducing Coupling
○ Design systems to buffer/queue
○ Focus on the output (API-Oriented)
○ Look for changes to the system state (Observability)
● Common-Mode
○ Test Failure (Chaos Engineering)
○ Risk Analysis on critical services
○ Build for resiliency
● People & Process: Agile
System Accidents
(Recap)
Examples
Three Mile Island
Primary Cooling System is the water within the reactor (High Pressure, High Heat, Radioactive)
Heats Secondary cooling system which turns steam turbines
A cupful of water leaked from the secondary system. The location of the leak tripped two pumps, triggering a pump stoppage.
When this flow is interrupted, the steam turbine shuts down automatically as a safety device
However, the heat from the core still has to be dissipated, but now lacks its normal cooling mechanism (secondary system/steam turbine)
Emergency pumps turn on to pump water through the secondary system; in this instance they were blocked by a closed valve. (Valve is required to never be closed during normal operation, but it was)
Operator verified the pumps came on, but didn't know the valve was closed and water wasn’t flowing into the reactor. (Complexity/Observability)
Reactor scrammed because no heat was being removed (tightly coupled)
Another Automatic Safety Device triggers, the pressure relief valve to reduce pressure in the core. This dumps water out into an overflow tank.
After sufficient pressure was relieved, the operators ordered the pressure valve to close. The indicator on the control panel only shows if the order was given, not the actual valve state. (Common-Mode/Observability)
This stuck valve resulted in one-third of the reactor coolant draining through the valve
-This all happened in 13 seconds
To recap: False Signal caused pumps to fail, emergency cooling out of position, indicator obscured, and a relief valve failed to reseat with a failed indicator
--Not in direct sequence
--Notices of radioactive water were attributed to "an unknown source", because, per its indicator, the relief valve couldn't be open
--That water went into a different tank than intended, due to complexity of the system
Now, that’s an industrial example, but very few of us run industrial systems. So let’s look at some more relatable examples.
Incident Report: https://status.cloud.google.com/incident/cloud-networking/19009
Google makes maintenance changes constantly. Their services are globally distributed and they employ a defense-in-depth approach to security and resiliency. The short version of the root cause of this outage: a change was pushed to their network management systems, and a bug allowed multiple, independent cluster management systems to be pulled offline for maintenance at the same time.
Complexity: Google’s network management clusters are distributed logically across regions, but the maintenance event triggered clusters to go offline even though the entire cluster was not in scope for the planned maintenance.
Coupling: The system is designed to “fail static”, i.e. operate without cluster management for a period of time, in order to give Google engineers a chance to fix issues before they become incidents.
Common-Mode: Google figured out the issue relatively quickly, but was hampered when the tools used to troubleshoot and resolve the issue were unusable due to the extremely high network congestion on the remaining networks. This significantly increased the time to resolve the issue.
Complexity: Once Google determined the issue and got the tools working well enough to apply the correct configuration, they found that, with all network control clusters down, the previous configuration state was lost, so Google had to rebuild it. This increased the downtime by an hour.
As we can see, the outage isn’t just about complexity, coupling, or process; it is the combination of all of those things together that creates a large-scale outage.
Okay, but I’m not Google. I’m just Financial Company X, or a University. How could I possibly have a system that is complex enough for this to apply?
Example: System Center Config Manager (SCCM) https://myitforum.com/sccm-task-sequence-blew-up-australias-commbank/
Same story, two victims (that we know of).
Configuration Manager is a really powerful tool that can push patches and updates and modify users, settings, etc. It is, however, a UI-based tool with the ability to be used as a very big ‘foot-gun’.
In both instances, an operator was creating a change to re-image a set of systems. In both cases these were desktop/laptop systems, either due for an OS upgrade or just a planned re-image. However, the targeted scope of systems was not applied correctly, and instead the change was applied, immediately, to every single system System Center manages.
That is all desktops. Laptops connected to the network. Servers. Servers running Active Directory. Exchange. CRM applications. Banking applications. You name it, it’s re-imaged. Including the System Center server itself.
So, in one instant, every system in the network has been told to format itself in preparation for being imaged, and some systems even started the imaging process before the System Center server went offline.
Complexity: The tool used is UI-based, has the ability to select all systems, and has no direct feedback loop telling you which systems you are going to affect.
Coupling: Once the order is given, there is no undo or rollback to revert to the previous state. It isn’t stored somewhere, reviewed, etc.
Okay, so we have some examples of how these things can occur and there are some themes we can identify around and start designing systems.
What we want to do here is think about the questions on these components that would allow us to manage complexity.
So, think about: can we understand the system? If we can, do we have to get that information from someone’s head, or can we pull it from documentation? Ideally it’d be an API that operates exactly as it’s documented.
The interactions in the system: are they known? Are they automatic, or are there people in the process, moving code from one area to another, turning services off and on, etc.?
Scalability is a component of complexity. How scalable is the system? Does it need to be?
What parts of our complexity are irreducible? That is, which components are necessary, and which are unnecessary? Where can we simplify the system?
Dependency management is a critical, often overlooked component of complexity.
========
These are the high level components necessary to manage complexity.
Simplify the pipeline. One path to production, regardless of the change
Documentation via API. If it’s an API, it’s (a) documented (or at least queryable) and (b) automatic rather than manual.
Microservices are a way to reduce the system complexity into smaller pieces that are more manageable, allowing us to influence scalability.
Creating a versioned artifact creates a point-in-time record of our system state. Working with that artifact (or artifacts) allows us to ensure a known system state.
Codifying interactions between teams and adopting that API-oriented architecture is critical.
Same idea as complexity. How do we allow systems to be more loosely associated? Are they talking to a specific hostname or IP address? Or are they grabbing their configuration from a central location?