Today’s systems are inherently complex, with some component parts often operating in or close to suboptimal or failure modes. Left unchecked, as complexity increases, the compounding of failure modes will inevitably lead to catastrophic system failure.
Chaos Days help us address this risk by spending time deliberately inducing failures, then analysing the response.
This session summarises our experience of running Chaos Days on a large-scale platform. We’ll explore the what, why, how and when of running a Chaos Day, plus tips for running them remotely.
2. Photo by Darius Bashar on Unsplash
What is chaos engineering and why should we care?
3. Building vital, high-traffic services, fast
● Delivered 10 days early!
● Built in 4 weeks.
● 140,000 claims processed on launch day.
● No production incidents
5. Operating on the edge of chaos
http://bit.ly/2ZavoyP
http://bit.ly/2QVeWzA
“Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage”
6. How can your system fail?
● What are the component parts?
● How are they connected?
● How reliable is each part?
● How reliable are the connections?
● What happens when X fails? (reliability sketch below)
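These questions lend themselves to a quick back-of-the-envelope calculation: when every hop on a request path has to succeed, component reliabilities multiply, so even a chain of "three nines" parts is noticeably less reliable than any single one of them. A minimal sketch (the availabilities and topology are invented for illustration, not taken from our platform):

```python
# Illustrative only: a back-of-the-envelope model of how component
# reliability compounds along a request path.

def serial_availability(availabilities):
    """Availability of a chain where every component must succeed."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel_availability(availabilities):
    """Availability of redundant components where any one suffices."""
    failure = 1.0
    for a in availabilities:
        failure *= (1.0 - a)
    return 1.0 - failure

# A request that touches an edge proxy, two services and a database,
# each individually "three nines":
chain = [0.999, 0.999, 0.999, 0.999]
print(f"End to end: {serial_availability(chain):.4%}")   # ~99.60%

# The same database made redundant across two zones:
print(f"Redundant DB: {parallel_availability([0.999, 0.999]):.6%}")  # ~99.9999%
```

Redundancy works the other way round: failure probabilities multiply, which is part of why the multi-active work described later buys so much resilience.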
7. Addressing the risk of unexpected failure
[Diagram: a simple two-node system (A→B) growing into a many-node, highly connected system]
● Address risk by deliberately inducing failure (a minimal sketch follows below)
● Observe, reflect and improve
● Build resilience in (like quality)
● Think about production (and failure) all the time
(Simples → Hard)
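At its simplest, "deliberately inducing failure" can be a one-off script rather than a tooling investment. Below is a minimal sketch of the induce-then-observe loop, assuming a hypothetical service running as a Docker container named `orders` with a `/health` endpoint on localhost; your injection mechanism and telemetry will differ:

```python
# A minimal sketch of "deliberately induce failure, then observe".
# The container name and health endpoint are assumptions for the example.
import subprocess
import time
import urllib.request

def healthy(url="http://localhost:8080/health", timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

# 1. Record the steady state before doing anything.
assert healthy(), "System not healthy - abort the experiment"

# 2. Induce the failure (here: kill one container).
subprocess.run(["docker", "kill", "orders"], check=True)

# 3. Observe: how long until the system recovers?
start = time.monotonic()
while not healthy():
    time.sleep(1)
print(f"Recovered after {time.monotonic() - start:.0f}s")
```

The shape matters more than the mechanism: check steady state, make one change, measure how long recovery takes, then restore normal service.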
12. In-process chaos engineering
● Part of normal engineering process (see the test sketch below)
● Focus for all roles in the team
● Production thinking / building resilience in
Roles: Product Owner · Dev · QA · Dev Ops
Focus on: Quality AND Production AND Resilience
Pipeline: Define → Build → Explore → Deploy
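One way to make this part of the normal engineering process is to express "dependency X is down" as an ordinary test that runs in the pipeline. The sketch below is illustrative (the handler and failure mode are invented), but it shows the idea: the degraded path is exercised on every build, not just on Chaos Days:

```python
# A sketch of "building resilience in" as part of the normal process:
# an ordinary test, run on every build, asserting that the code degrades
# gracefully when a dependency fails. All names here are illustrative.

class RecommendationsDown(Exception):
    """Stand-in for a connection error from a downstream service."""

def fetch_recommendations(product_id):
    # Stand-in for a real HTTP client call to the recommendations service.
    raise RecommendationsDown("recommendations unavailable")

def product_page(product_id, recommender=fetch_recommendations):
    """Render the product page; recommendations are a nice-to-have."""
    try:
        recs = recommender(product_id)
    except RecommendationsDown:
        recs = []  # degrade, don't fail
    return {"status": 200, "product": product_id, "recommendations": recs}

def test_degrades_gracefully_when_recommendations_fail():
    page = product_page("123")
    assert page["status"] == 200          # the page still renders
    assert page["recommendations"] == []  # the feature degrades, the page does not break
```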
13. (Unplanned chaos)
● Every day is a school day
● Handle incidents well
● Learn from incidents - post-incident reviews
● Start simple then incorporate tooling
14. How does it help?
People: Knowledge, Behaviour, Expertise
Process: Managing incidents, Learning from incidents, Engineering approach
Product: Simplification, Observability, Runbooks, Resilience
15. Photo by Darius Bashar on Unsplash
Running a Chaos Day - when and how?
16. Our context
Legacy systems
x100 million internal requests (busiest day)
x100 million log messages (busiest day)
x850 microservices
x100M Customers
60 Delivery teams
~1000 Microservices
6 Platform teams (AWS PaaS)
17. When were we ready for chaos?
2013-2014: Cloud, Docker, Scala, Mongo, ELK; fast growth (teams, services, traffic)
18. When were we ready for chaos?
2013-2014: Cloud, Docker, Scala, Mongo, ELK; fast growth (teams, services, traffic)
2015-2016: multi-active WIP → multi-active
19. When were we ready for chaos?
2013-2014: Cloud, Docker, Scala, Mongo, ELK; fast growth (teams, services, traffic)
2015-2016: multi-active WIP → multi-active
2017-2018: more multi-active (to AWS), self-serve deploys, AWS; ready for Chaos
20. Photo by Darius Bashar on Unsplash
Who, where and exactly how?
21. Agents of chaos
● Virtual, closed team
● Draw from component teams
● Experts / veterans
● Highest bus factor
22. Chaos scope - know thyself
● Know your architecture
● Know your steady state (see the steady-state sketch below)
● Know your constraints
○ What’s in your control?
○ What’s not?
○ What needs protecting?
X00 million internal requests (busiest day)
X00 million log messages (busiest day)
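Knowing your steady state is easier if it is scripted rather than tribal knowledge. A minimal sketch, assuming metrics are queryable from a Prometheus-style endpoint (the endpoint, queries and thresholds below are illustrative, not ours):

```python
# A sketch of checking "steady state" before (and during) a Chaos Day.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.internal:9090"   # assumed endpoint

def query(promql):
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return float(json.load(resp)["data"]["result"][0]["value"][1])

def steady_state():
    """Return True if the platform looks like 'just another ordinary day'."""
    error_rate = query('sum(rate(http_requests_total{code=~"5.."}[5m]))'
                       ' / sum(rate(http_requests_total[5m]))')
    p95_latency = query('histogram_quantile(0.95,'
                        ' sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
    return error_rate < 0.01 and p95_latency < 0.5   # thresholds are examples only

if __name__ == "__main__":
    print("Steady state OK" if steady_state() else "NOT steady - investigate first")
```

The same check can be re-run throughout the day to tell you when normal service has genuinely resumed.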
23. Chaos scope - trust the brains-storm
http://bit.ly/2XzR7Q9
24. Chaos scope - brainstorm, then plan the detail
Team X · Team Y · Team Z (example plan record below)
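Whatever tool you use to capture the plan (we used Slack and Trello), it helps to give every intervention the same shape: hypothesis, expected response, blast radius and rollback, with space to record what actually happened. A minimal sketch of such a record (the fields and example values are illustrative):

```python
# A sketch of how each planned intervention could be captured up front.
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str            # what we believe will happen
    intervention: str          # what we will actually do
    expected_response: str     # alerts, failover, degradation we expect
    blast_radius: str          # what could be affected
    rollback: str              # how normal service is resumed
    owner: str
    actual_response: str = ""  # filled in on the day

plan = [
    ChaosExperiment(
        name="Kill one AZ's app instances",
        hypothesis="Traffic fails over to the remaining AZs with no 5xx spike",
        intervention="Terminate all app instances in one availability zone",
        expected_response="Autoscaling replaces instances; alerts fire; no customer impact",
        blast_radius="Internal routing layer; no customer-facing services",
        rollback="Autoscaling restores capacity; manual scale-up if not",
        owner="Team X",
    ),
]
```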
26. Deciding where
● Production or closest to it
● Production(-like) load (see the load sketch below)
● Production(-like) telemetry
● Decide the blast radius
● Decide comms channel(s)
Development → QA → Staging → Production
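If you can't run against real production traffic, you need production-like load from somewhere. The sketch below is a deliberately naive generator (the target URL, rate and duration are placeholders); a real run should mirror your busiest-day traffic mix rather than hammering one endpoint:

```python
# A naive load-generator sketch for chaos runs outside production.
import concurrent.futures
import time
import urllib.request

TARGET = "http://staging.internal/api/orders"   # assumed endpoint
REQUESTS_PER_SECOND = 50
DURATION_SECONDS = 300

def hit(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except Exception:
        return "error"

def generate_load():
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        end = time.monotonic() + DURATION_SECONDS
        while time.monotonic() < end:
            futures = [pool.submit(hit, TARGET) for _ in range(REQUESTS_PER_SECOND)]
            results.extend(f.result() for f in futures)
            time.sleep(1)
    errors = sum(1 for r in results if r != 200)
    print(f"{len(results)} requests, {errors} errors")

if __name__ == "__main__":
    generate_load()
```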
28. Deciding when
● To warn or not
● It was just another ordinary day …
● What else is going on?
● Chaos cut-off
29. Keep calm and chaos on (agents)
● (Virtually) co-locate the agents
● Collaborate and coordinate well
● Time-box, cover ground
● (Self) document well
30. Keep calm and chaos on (everyone else)
● It was just another ordinary day ...
● Also (self) document well
● Pretend it’s Production
32. Divide and conquer, then regroup
● Component team retros / incident reviews first
● Major on engineering improvements (people, process, product)
● Then team-of-teams retro
● Minor on Chaos Day improvements
[Diagram: Team X, Y and Z retros feeding a team-of-teams retro, covering People, Process and Product]
33. What did we learn?
● Start small
● Manage/limit the pain
● Production is a tough step
● Production-like is also hard!
● Have fun!
35. What’s your next chaos step?
Manual
In process
Automated
Unplanned
● Where are you at in the journey?
● What’s the next (baby) step?
● Need any help?
○ Talk to us
○ Check out our playbooks
37. Simple solutions to big business problems.
Contact us
Our experienced teams deliver software
all around the globe.
London
+44 203 603 7830
helloUK@equalexperts.com
Manchester
+44 203 603 7830
helloUK@equalexperts.com
Pune
+91 20 6687 2400
helloIndia@equalexperts.com
Bengaluru
+91 99 7298 0224
helloIndia@equalexperts.com
Lisbon
+351 211 378 414
helloPortugal@equalexperts.com
New York
+1 866-943-9737
helloUSA@equalexperts.com
Calgary
+1 403 775-4861
helloCanada@equalexperts.com
Berlin
helloDE@equalexperts.com
Sydney
+612 8999 6661
helloAUS@equalexperts.com
Cape Town
+27 21 680 5252
helloSA@equalexperts.com
Editor's Notes
Hello, my name is Lyndsay Prewer. Over the last couple of years, I’ve been leading a group of teams that develop and operate a Platform-as-a-Service for a very large public sector client. In this talk I’ll describe how we’ve used Chaos Days to improve the resilience of our platform and the effectiveness with which the platform and its teams handle catastrophic failures.
Chaos engineering is particularly relevant to distributed systems, as these have a scale and high level of complexity that make it impossible to determine their emergent properties and behaviour, let alone every possible failure mode, its impact and possible mitigation.
Although distributed systems have been around for decades, recent advances in technology, such as serverless, combined with agile and lean practices have led to teams being able to get more complex stuff into production faster and at lower cost.
We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!?
Complex/distributed systems will fail - not if but when - our systems operate on the edge of chaos
Consider your own system...
As component parts and connections increase we get an exponential increase in the complexity of the emergent behaviour and thus the number of possible failure modes. This equates to a decrease in our ability to predict failures and their impact zone.
Building resilience in, similar to Build quality in
Production thinking
“It’s a mindset, not a toolset: you don’t need to be running EKS on AWS to benefit from ….”
It doesn’t mean we build systems that never fail, that are perfect and indestructible. It means we build systems that cope with failure well, that recover well, that are elastic.
Chaos Days (focus on what, not why, as why comes later)
Chaos testing (focus is very narrow/local to new/changed components)
Chaos Monkey, Simian Army et al
AWS and GCP alternatives (spot instances, etc.)
(Semi-automated) - Super K8S Chaos Bro
Making this part of normal flow - link back to Production thinking / Building resilience in
It’s not just about more resilient components.
It starts with people, their knowledge, their expertise, their behaviours.
It covers process - how we respond to and manage incidents, how we learn from them, how we fold these learnings into our engineering practices.
On the product front, it’s more than just resilience improvements. It’s also making systems easier to observe, easier to understand and reason about. Systems that automatically heal and tolerate failure are the goal, but improvements in things such as telemetry, alerting and runbooks matter too.
Describe size, scale and architecture of Public sector client
At various other clients, ranging from retail to payment systems, we’ve setup and run kube-monkey in all environments, opted for preemptible VMs, and run Game Days to help teams learn how to diagnose and debug Production issues.
For large platforms, owning teams should each provide a Chaos Agent to plot and scheme in secret with the others.
Who knows your system the best?
Who do you turn to when the shit hits the fan?
Should be a high-bus-factor person.
Map out your architecture and dependencies
Define steady state
What’s normal load/throughput?
How do you know the system is healthy? (heart rate, VO2-Max, metrics, 5XX / 499 (check this) responses, alerts)
What do you have control over? What services / teams do you want to protect?
Apollo 13 picture
Map out your architecture and dependencies
Doesn’t need to be a big diagram - just get the experts together and brainstorm.
Give them a clear intent, a goal, a direction and some constraints, then leave them to figure it out.
Define hypothesis for specific interventions and expected response, e.g.
Instance failures, app failures, AZ failures, volumes filling up, connections failing/slowing, database failing. Security attacks (break-the-bank approaches, malicious engineer)
Map out sequencing, e.g. what should go together, what kept apart, what can be done independently.
How will normal service be resumed?
Chaos Days are a perfect time to also run security attacks (break-the-bank approaches, malicious engineer)
Production or not? If not, how production-like are things (cookie-cutter environments, telemetry)?
How will load be generated?
Who will be impacted if chaos does reign?
What comms channel is normally used?
Some warning?
Anything else happening at that time (e.g. peak loads, major releases)
How will you ensure normal service is resumed - story from our first day
[Photo from 1st chaos day?] Co-locate agents of chaos, plus comms channel
Collaborate and coordinate in response to chaos and how it’s handled.
Timebox to ensure enough chaos variants covered and normal service is resumed
[Slack and trello screen shots?] Record what you’re doing (slack, trello - hypothesis, expected response, actual)
Just an ordinary day (i.e. all teams working as normal)
Record what you’re doing (slack)
Treat chaos environment as production
Team-based retros, then team-of-teams
Separate resilience improvements (e.g. tech, process, people) from chaos day improvements
[Slide, check our own list] Lessons learnt
What’s not worked well
Things we’d do next time
What’s your next step?
Describe various possible contexts, and possible next steps for each