Today’s systems are inherently complex, with some component parts often operating in or close to suboptimal or failure modes. Left unchecked, as complexity increases, the compounding of failure modes will inevitably lead to catastrophic system failure.
Chaos Days help us address this risk by spending time deliberately inducing failures, then analysing the response.
This session summarises our experience of running Chaos Days on a large-scale platform. We’ll explore the what, why, how and when of running a Chaos Day, plus tips for running them remotely.
2. Photo by Darius Bashar on Unsplash
What is chaos engineering and why should we care?
3. Building vital, high-traffic services, fast
● Delivered 10 days early!
● Built in 4 weeks.
● 140,000 claims processed on launch day.
● No production incidents
5. Operating on the edge of chaos
http://bit.ly/2ZavoyP
http://bit.ly/2QVeWzA
“Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage”
6. How can your system fail?
● What are the component parts?
● How are they connected?
● How reliable is each part?
● How reliable are the connections?
● What happens when X fails? (reliability sketch below)
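These questions lend themselves to a quick back-of-the-envelope calculation: when every hop on a request path has to succeed, component reliabilities multiply, so even a chain of "three nines" parts is noticeably less reliable than any single one of them. A minimal sketch (the availabilities and topology are invented for illustration, not taken from our platform):

```python
# Illustrative only: a back-of-the-envelope model of how component
# reliability compounds along a request path.

def serial_availability(availabilities):
    """Availability of a chain where every component must succeed."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel_availability(availabilities):
    """Availability of redundant components where any one suffices."""
    failure = 1.0
    for a in availabilities:
        failure *= (1.0 - a)
    return 1.0 - failure

# A request that touches an edge proxy, two services and a database,
# each individually "three nines":
chain = [0.999, 0.999, 0.999, 0.999]
print(f"End to end: {serial_availability(chain):.4%}")   # ~99.60%

# The same database made redundant across two zones:
print(f"Redundant DB: {parallel_availability([0.999, 0.999]):.6%}")  # ~99.9999%
```

Redundancy works the other way round: failure probabilities multiply, which is part of why the multi-active work described later buys so much resilience.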
7. Addressing the risk of unexpected failure
[Diagram: a simple two-node system (A→B) growing into a many-node, highly connected system]
● Address risk by deliberately inducing failure (a minimal sketch follows below)
● Observe, reflect and improve
● Build resilience in (like quality)
● Think about production (and failure) all the time
(Simples → Hard)
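At its simplest, "deliberately inducing failure" can be a one-off script rather than a tooling investment. Below is a minimal sketch of the induce-then-observe loop, assuming a hypothetical service running as a Docker container named `orders` with a `/health` endpoint on localhost; your injection mechanism and telemetry will differ:

```python
# A minimal sketch of "deliberately induce failure, then observe".
# The container name and health endpoint are assumptions for the example.
import subprocess
import time
import urllib.request

def healthy(url="http://localhost:8080/health", timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

# 1. Record the steady state before doing anything.
assert healthy(), "System not healthy - abort the experiment"

# 2. Induce the failure (here: kill one container).
subprocess.run(["docker", "kill", "orders"], check=True)

# 3. Observe: how long until the system recovers?
start = time.monotonic()
while not healthy():
    time.sleep(1)
print(f"Recovered after {time.monotonic() - start:.0f}s")
```

The shape matters more than the mechanism: check steady state, make one change, measure how long recovery takes, then restore normal service.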
12. In-process chaos engineering
● Part of normal engineering process (see the test sketch below)
● Focus for all roles in the team
● Production thinking / building resilience in
Roles: Product Owner · Dev · QA · Dev Ops
Focus on: Quality AND Production AND Resilience
Pipeline: Define → Build → Explore → Deploy
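One way to make this part of the normal engineering process is to express "dependency X is down" as an ordinary test that runs in the pipeline. The sketch below is illustrative (the handler and failure mode are invented), but it shows the idea: the degraded path is exercised on every build, not just on Chaos Days:

```python
# A sketch of "building resilience in" as part of the normal process:
# an ordinary test, run on every build, asserting that the code degrades
# gracefully when a dependency fails. All names here are illustrative.

class RecommendationsDown(Exception):
    """Stand-in for a connection error from a downstream service."""

def fetch_recommendations(product_id):
    # Stand-in for a real HTTP client call to the recommendations service.
    raise RecommendationsDown("recommendations unavailable")

def product_page(product_id, recommender=fetch_recommendations):
    """Render the product page; recommendations are a nice-to-have."""
    try:
        recs = recommender(product_id)
    except RecommendationsDown:
        recs = []  # degrade, don't fail
    return {"status": 200, "product": product_id, "recommendations": recs}

def test_degrades_gracefully_when_recommendations_fail():
    page = product_page("123")
    assert page["status"] == 200          # the page still renders
    assert page["recommendations"] == []  # the feature degrades, the page does not break
```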
13. (Unplanned chaos)
● Every day is a school day
● Handle incidents well
● Learn from incidents - post-incident reviews
● Start simple then incorporate tooling
14. How does it help?
People: Knowledge, Behaviour, Expertise
Process: Managing incidents, Learning from incidents, Engineering approach
Product: Simplification, Observability, Runbooks, Resilience
15. Photo by Darius Bashar on Unsplash
Running a Chaos Day - when and how?
16. Our context
Legacy systems
x100 million internal requests (busiest day)
x100 million log messages (busiest day)
x850 microservices
x100M Customers
60 Delivery teams
~1000 Microservices
6 Platform teams (AWS PaaS)
17. When were we ready for chaos?
2013-2014: Cloud, Docker, Scala, Mongo, ELK; fast growth (teams, services, traffic)
18. When were we ready for chaos?
2013-2014: Cloud, Docker, Scala, Mongo, ELK; fast growth (teams, services, traffic)
2015-2016: multi-active WIP → multi-active
19. When were we ready for chaos?
2013-2014: Cloud, Docker, Scala, Mongo, ELK; fast growth (teams, services, traffic)
2015-2016: multi-active WIP → multi-active
2017-2018: more multi-active (to AWS), self-serve deploys, AWS; ready for Chaos
20. Photo by Darius Bashar on Unsplash
Who, where and exactly how?
21. Agents of chaos
● Virtual, closed team
● Draw from component teams
● Experts / veterans
● Highest bus factor
22. Chaos scope - know thyself
● Know your architecture
● Know your steady state (see the steady-state sketch below)
● Know your constraints
○ What’s in your control?
○ What’s not?
○ What needs protecting?
X00 million internal requests (busiest day)
X00 million log messages (busiest day)
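Knowing your steady state is easier if it is scripted rather than tribal knowledge. A minimal sketch, assuming metrics are queryable from a Prometheus-style endpoint (the endpoint, queries and thresholds below are illustrative, not ours):

```python
# A sketch of checking "steady state" before (and during) a Chaos Day.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.internal:9090"   # assumed endpoint

def query(promql):
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return float(json.load(resp)["data"]["result"][0]["value"][1])

def steady_state():
    """Return True if the platform looks like 'just another ordinary day'."""
    error_rate = query('sum(rate(http_requests_total{code=~"5.."}[5m]))'
                       ' / sum(rate(http_requests_total[5m]))')
    p95_latency = query('histogram_quantile(0.95,'
                        ' sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
    return error_rate < 0.01 and p95_latency < 0.5   # thresholds are examples only

if __name__ == "__main__":
    print("Steady state OK" if steady_state() else "NOT steady - investigate first")
```

The same check can be re-run throughout the day to tell you when normal service has genuinely resumed.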
23. Chaos scope - trust the brains-storm
http://bit.ly/2XzR7Q9
24. Chaos scope - brainstorm, then plan the detail
Team X · Team Y · Team Z (example plan record below)
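Whatever tool you use to capture the plan (we used Slack and Trello), it helps to give every intervention the same shape: hypothesis, expected response, blast radius and rollback, with space to record what actually happened. A minimal sketch of such a record (the fields and example values are illustrative):

```python
# A sketch of how each planned intervention could be captured up front.
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str            # what we believe will happen
    intervention: str          # what we will actually do
    expected_response: str     # alerts, failover, degradation we expect
    blast_radius: str          # what could be affected
    rollback: str              # how normal service is resumed
    owner: str
    actual_response: str = ""  # filled in on the day

plan = [
    ChaosExperiment(
        name="Kill one AZ's app instances",
        hypothesis="Traffic fails over to the remaining AZs with no 5xx spike",
        intervention="Terminate all app instances in one availability zone",
        expected_response="Autoscaling replaces instances; alerts fire; no customer impact",
        blast_radius="Internal routing layer; no customer-facing services",
        rollback="Autoscaling restores capacity; manual scale-up if not",
        owner="Team X",
    ),
]
```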
26. Deciding where
● Production or closest to it
● Production(-like) load (see the load sketch below)
● Production(-like) telemetry
● Decide the blast radius
● Decide comms channel(s)
Development → QA → Staging → Production
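If you can't run against real production traffic, you need production-like load from somewhere. The sketch below is a deliberately naive generator (the target URL, rate and duration are placeholders); a real run should mirror your busiest-day traffic mix rather than hammering one endpoint:

```python
# A naive load-generator sketch for chaos runs outside production.
import concurrent.futures
import time
import urllib.request

TARGET = "http://staging.internal/api/orders"   # assumed endpoint
REQUESTS_PER_SECOND = 50
DURATION_SECONDS = 300

def hit(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except Exception:
        return "error"

def generate_load():
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        end = time.monotonic() + DURATION_SECONDS
        while time.monotonic() < end:
            futures = [pool.submit(hit, TARGET) for _ in range(REQUESTS_PER_SECOND)]
            results.extend(f.result() for f in futures)
            time.sleep(1)
    errors = sum(1 for r in results if r != 200)
    print(f"{len(results)} requests, {errors} errors")

if __name__ == "__main__":
    generate_load()
```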
28. Deciding when
● To warn or not
● It was just another ordinary day …
● What else is going on?
● Chaos cut-off
29. Keep calm and chaos on (agents)
● (Virtually) co-locate the agents
● Collaborate and coordinate well
● Time-box, cover ground
● (Self) document well
30. Keep calm and chaos on (everyone else)
● It was just another ordinary day ...
● Also (self) document well
● Pretend it’s Production
32. Divide and conquer, then regroup
● Component team retros / incident reviews first
● Major on engineering improvements (people, process, product)
● Then team-of-teams retro
● Minor on Chaos Day improvements
[Diagram: Team X, Y and Z retros feeding a team-of-teams retro, covering People, Process and Product]
33. What did we learn?
● Start small
● Manage/limit the pain
● Production is a tough step
● Production-like is also hard!
● Have fun!
35. What’s your next chaos step?
Manual
In process
Automated
Unplanned
● Where are you at in the journey?
● What’s the next (baby) step?
● Need any help?
○ Talk to us
○ Check out our playbooks
37. Simple solutions to big business problems.
Contact us
Our experienced teams deliver software
all around the globe.
London
+44 203 603 7830
helloUK@equalexperts.com
Manchester
+44 203 603 7830
helloUK@equalexperts.com
Pune
+91 20 6687 2400
helloIndia@equalexperts.com
Bengaluru
+91 99 7298 0224
helloIndia@equalexperts.com
Lisbon
+351 211 378 414
helloPortugal@equalexperts.com
New York
+1 866-943-9737
helloUSA@equalexperts.com
Calgary
+1 403 775-4861
helloCanada@equalexperts.com
Berlin
helloDE@equalexperts.com
Sydney
+612 8999 6661
helloAUS@equalexperts.com
Cape Town
+27 21 680 5252
helloSA@equalexperts.com
Editor's Notes
Hello, my name is Lyndsay Prewer. Over the last couple of years, I’ve been leading a group of teams that develop and operate a Platform-as-a-Service for a very large public sector client. In this talk I’ll describe how we’ve used Chaos Days to improve the resilience of our platform and the effectiveness with which the platform and its teams handle catastrophic failures.
Chaos engineering is particularly relevant to distributed systems, as these have a scale and high level of complexity that make it impossible to determine their emergent properties and behaviour, let alone every possible failure mode, its impact and possible mitigation.
Although distributed systems have been around for decades, recent advances in technology, such as serverless, combined with agile and lean practices have led to teams being able to get more complex stuff into production faster and at lower cost.
We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!?
Complex/distributed systems will fail - not if but when - our systems operate on the edge of chaos
Consider your own system...
As component parts and connections increase we get an exponential increase in the complexity of the emergent behaviour and thus the number of possible failure modes. This equates to a decrease in our ability to predict failures and their impact zone.
Building resilience in, similar to Build quality in
Production thinking
“It’s a mindset, not a toolset: you don’t need to be running EKS on AWS to benefit from ….”
It doesn’t mean we build systems that never fail, that are perfect and indestructible. It means we build systems that cope with failure well, that recover well, that are elastic.
Chaos Days (focus on what, not why, as why comes later)
Chaos testing (focus is very narrow/local to new/changed components)
Chaos Monkey, Simian Army et al
AWS and GCP alternatives (spot instances, etc.)
(Semi-automated) - Super K8S Chaos Bro
Making this part of normal flow - link back to Production thinking / Building resilience in
It’s not just about more resilient components.
It starts with people, their knowledge, their expertise, their behaviours.
It covers process - how we respond to and manage incidents, how we learn from them, how we fold these learnings into our engineering practices.
On the product front, it’s more than just resilience improvements. It’s also making systems easier to observe, easier to understand and reason about. Systems that automatically heal and tolerate failure are the goal, but improvements in things such as telemetry, alerting and runbooks matter too.
Describe size, scale and architecture of Public sector client
At various other clients, ranging from retail to payment systems, we’ve setup and run kube-monkey in all environments, opted for preemptible VMs, and run Game Days to help teams learn how to diagnose and debug Production issues.
For large platforms, owning teams should each provide a Chaos Agent to plot and scheme in secret with the others.
Who knows your system the best?
Who do you turn to when the shit hits the fan?
Should be a high-bus-factor person.
Map out your architecture and dependencies
Define steady state
What’s normal load/throughput?
How do you know the system is healthy? (heart rate, VO2-Max, metrics, 5XX / 499 (check this) responses, alerts)
What do you have control over? What services / teams do you want to protect?
Apollo 13 picture
Map out your architecture and dependencies
Doesn’t need to be a big diagram - just get the experts together and brainstorm.
Give them a clear intent, a goal, a direction and some constraints, then leave them to figure it out.
Define hypothesis for specific interventions and expected response, e.g.
Instance failures, app failures, AZ failures, volumes filling up, connections failing/slowing, database failing. Security attacks (break-the-bank approaches, malicious engineer)
Map out sequencing, e.g. what should go together, what kept apart, what can be done independently.
How will normal service be resumed?
Chaos Days are a perfect time to also run security attacks (break-the-bank approaches, malicious engineer)
Production or not? If not, how production-like are things (cookie-cutter environments, telemetry)?
How will load be generated?
Who will be impacted if chaos does reign?
What comms channel is normally used?
Some warning?
Anything else happening at that time (e.g. peak loads, major releases)
How will you ensure normal service is resumed - story from our first day
[Photo from 1st chaos day?] Co-locate agents of chaos, plus comms channel
Collaborate and coordinate in response to chaos and how it’s handled.
Timebox to ensure enough chaos variants covered and normal service is resumed
[Slack and trello screen shots?] Record what you’re doing (slack, trello - hypothesis, expected response, actual)
Just an ordinary day (i.e. all teams working as normal)
Record what you’re doing (slack)
Treat chaos environment as production
Team-based retros, then team-of-teams
Separate resilience improvements (e.g. tech, process, people) from chaos day improvements
[Slide, check our own list] Lessons learnt
What’s not worked well
Things we’d do next time
What’s your next step?
Describe various possible contexts, and possible next steps for each