This talk will cover a security-focused project that evolved into a chaos injection system.
The system is called “Lifespan Management”, and it enforces a lifespan on a cloud-hosted VM. After the lifespan expires, the host is terminated and a replacement is brought up. This makes it easier to apply fixes for CVEs (a CVE comes out on day X, and we know hosts will age out by day Y), and it reduces the value of a compromised machine (“I’ve finally captured a host! It’s being shut down?? No!”).
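The expiry rule described above can be sketched in a few lines. This is only an illustration, not the system from the talk: the `Host` record, the `hosts_to_replace` helper, and the 14-day lifespan in the usage note are all assumptions made for the example, and a real implementation would call a cloud provider's API to terminate and replace the hosts it selects.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Host:
    """Hypothetical host record; the field names are illustrative only."""
    host_id: str
    launched_at: datetime  # timezone-aware launch time


def is_expired(host: Host, max_lifespan: timedelta, now: datetime) -> bool:
    """A host is expired once its age reaches the configured lifespan."""
    return now - host.launched_at >= max_lifespan


def hosts_to_replace(hosts, max_lifespan: timedelta, now: datetime):
    """Return the IDs of hosts that should be terminated and replaced."""
    return [h.host_id for h in hosts if is_expired(h, max_lifespan, now)]
```

With an assumed 14-day lifespan, a host launched on day X is guaranteed to be selected by day X + 14, which is what gives the "hosts will age out by day Y" bound for patch rollout.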
This seemed simple enough, but the complexity it uncovered made for a fun, year-long adventure in chaos engineering.
In this talk, I’ll cover the evolution of the system and some lessons we learned along the way, such as:
● Not all termination API calls are created equal
● Zero failing health checks does not mean a host is healthy
● Answering “Was this the chaos system?” quickly is essential
I’ll also include anecdotes, such as how it helped with Spectre/Meltdown mitigations, how it mercilessly killed all our Kubernetes workers, and how it locked us out of our QA environment.
39. Breaking it up with labels
● Stateless: Safe to replace.
● Stateful Automated: Replaceable with some graceful state hand-off.
● Requires Operator: Not safe to replace automatically. Want someone watching.
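One way to picture how these three labels could drive the terminator is a small dispatch function. This is a hedged sketch only: the `ReplacementLabel` enum, the `plan_replacement` function, and the action names it returns are hypothetical and are not taken from the slides.

```python
from enum import Enum


class ReplacementLabel(Enum):
    """The three labels from the slide; the string values are assumptions."""
    STATELESS = "stateless"
    STATEFUL_AUTOMATED = "stateful-automated"
    REQUIRES_OPERATOR = "requires-operator"


def plan_replacement(label: ReplacementLabel) -> str:
    """Map a host's label to a (hypothetical) action for an expired host."""
    if label is ReplacementLabel.STATELESS:
        # Safe to replace: terminate immediately.
        return "terminate"
    if label is ReplacementLabel.STATEFUL_AUTOMATED:
        # Replaceable, but only after a graceful state hand-off.
        return "handoff-then-terminate"
    # Not safe to replace automatically: ask a human to watch the replacement.
    return "notify-operator"
```

Treating anything that is not explicitly stateless or stateful-automated as operator-only keeps the default on the cautious side.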
107. @paulcarletonjr
Death by a thousand JIRA tickets
● File against ourselves first, then automate
● 1% case matters more with 10x terminations
● Measure Quantity and Reliability of tickets
113. Credits
● Photo by rawpixel on Unsplash
● Photo by Jens Lelie on Unsplash
● Photo by JohnsonMartin https://pixabay.com/en/wormhole-space-time-light-tunnel-739872/