Embracing collaborative chaos
Running chaos days on large platforms
Lyndsay Prewer | @equalexperts
Photo by Darius Bashar on Unsplash
What is chaos engineering and why should we care?
Building vital, high traffic services, fast
Google Cloud Dataflow In the Smart Home Data Pipeline
● Delivered 10 days early!
● Built in 4 weeks.
● 140,000 claims processed on launch day.
● No production incidents
Building cool, planet-scale services, fast
Operating on the edge of chaos
http://bit.ly/2ZavoyP
http://bit.ly/2QVeWzA
“Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage”
How can your system fail?
● What are the component parts?
● How are they connected?
● How reliable is each part?
● How reliable are the connections?
● What happens when X fails?
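The questions above can be explored with a small model before any failure is injected. The sketch below is illustrative only: the service names and the `DEPENDS_ON` map are hypothetical, and it assumes parts fail independently, which real systems often do not.

```python
from math import prod

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "frontend": ["auth", "claims"],
    "claims": ["database"],
    "auth": ["database"],
    "database": [],
}


def series_availability(availabilities):
    """Availability of a chain in which every part must work.

    Assumes independent failures, so the individual values multiply.
    """
    return prod(availabilities)


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service == failed or service in impacted:
                continue
            if any(d == failed or d in impacted for d in deps):
                impacted.add(service)
                changed = True
    return impacted
```

Three parts that are each 99.9% available give a chain that is a little under 99.7% available, and `impacted_by("database", DEPENDS_ON)` lists everything that sits on top of the database in this made-up graph.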
Addressing the risk of unexpected failure
[Diagram: nodes A and B; nodes A, B, C, D, E, F, G, H, I, Z]
● Address risk by deliberately inducing failure
● Observe, reflect and improve
● Build resilience in (like quality)
● Think about production (and failure) all the time
Simples Hard
What do we mean by resilience?
Four chaos engineering approaches
Manual
In process
Automated
Unplanned
Manual chaos
● Chaos Days
● AWS Game Days
● Change specific chaos
Automated chaos
● Chaos monkey
● AWS spot instances / GCP Preemptible VMs
● Randomised pod killer
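As a rough illustration of the "randomised pod killer" idea, the sketch below only chooses a victim and reports what it would do. The function names, the `protected` set and the `dry_run` flag are assumptions made for this example; it deliberately calls no real platform API.

```python
import random


def pick_victim(pods, protected=(), rng=None):
    """Pick one pod at random, never choosing a protected one.

    Returns None when no pod is eligible.
    """
    if rng is None:
        rng = random.Random()
    candidates = [p for p in pods if p not in protected]
    if not candidates:
        return None
    return rng.choice(candidates)


def kill(pod, dry_run=True):
    """Placeholder: a real version would call the platform's own API here."""
    action = "would terminate" if dry_run else "terminating"
    return f"{action} {pod}"
```

Passing a seeded `random.Random` makes a run repeatable, which can help when a chaos experiment needs to be replayed during a review.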
In process chaos engineering
● Part of normal engineering process
● Focus for all roles in the team
● Production thinking / building resilience in
Product Owner, Dev, QA, Dev Ops
Focus on: Quality AND Production AND Resilience
Define, Build, Explore, Deploy
(Unplanned chaos)
● Every day is a school day
● Handle incidents well
● Learn from incidents - post-incident reviews
● Start simple then incorporate tooling
How does it help?
People, Process, Product
Knowledge
Behaviour
Expertise
Managing incidents
Learning from incidents
Engineering approach
Simplification
Observability
Runbooks
Resilience
Running a Chaos Day - when and how?
Our context
Legacy systems
x100 million internal requests (busiest day)
x100 million log messages (busiest day)
x850
microservices
x100M Customers
60 Delivery teams
~1000 Microservices
6 Platform teams
(AWS PaaS)
When were we ready for chaos?
Timeline, 2013 to 2018:
● Cloud, Docker, Scala, Mongo, ELK
● Fast growth (teams, services, traffic)
● Multi active WIP
● Multi active
● More multi active (to AWS)
● Self serve deploys
● AWS
● Ready for Chaos
Who, where and exactly how?
Agents of chaos
● Virtual, closed team
● Draw from component teams
● Experts / veterans
● Highest bus factor
Chaos scope - know thyself
● Know your architecture
● Know your steady state
● Know your constraints
○ What’s in your control?
○ What’s not?
○ What needs protecting?
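"Know your steady state" can be made concrete by recording baseline metrics and comparing them with what is observed during an experiment. The sketch below is one possible shape, not a prescribed method: the metric names and the 10% default tolerance are assumptions chosen for illustration.

```python
def steady_state_breaches(baseline, observed, tolerance=0.10):
    """Return the metrics that drift from the baseline by more than `tolerance`.

    `tolerance` is a fraction of the baseline value. A metric that is missing
    from `observed` is reported with the value None.
    """
    breaches = {}
    for name, expected in baseline.items():
        actual = observed.get(name)
        if actual is None:
            breaches[name] = None
        elif abs(actual - expected) > tolerance * abs(expected):
            breaches[name] = actual
    return breaches


# Hypothetical numbers, for illustration only.
baseline = {"requests_per_second": 1000.0, "error_rate": 0.01}
observed = {"requests_per_second": 950.0, "error_rate": 0.05}
breaches = steady_state_breaches(baseline, observed)
```

Here the request rate stays within tolerance while the error rate does not, so only `error_rate` appears in `breaches`.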
X00 million internal requests (busiest day)
X00 million log messages (busiest day)
Chaos scope - trust the brains-storm
http://bit.ly/2XzR7Q9
Chaos scope - brainstorm, then plan the detail
Team X, Team Y, Team Z
Chaos scope - hack the chaos
Team X, Team Y, Team Z
Deciding where
● Production or closest to it
● Production (like) load
● Production (like) telemetry
● Decide the blast radius
● Decide comms channel(s)
Production
Staging
QA
Development
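The decisions on this slide can be written down as a small plan and checked before the day starts. The sketch below is a hypothetical structure: the `ExperimentPlan` class, its field names and the example service names are assumptions, while the environment names come from the slide.

```python
from dataclasses import dataclass, field

ENVIRONMENTS = ("Development", "QA", "Staging", "Production")


@dataclass
class ExperimentPlan:
    name: str
    environment: str
    targets: list        # services the experiment may touch (the blast radius)
    comms_channel: str
    protected: list = field(default_factory=list)  # services that must not be touched

    def problems(self):
        """Return a list of reasons why this plan is not ready to run."""
        issues = []
        if self.environment not in ENVIRONMENTS:
            issues.append(f"unknown environment: {self.environment}")
        if not self.targets:
            issues.append("blast radius is empty")
        overlap = sorted(set(self.targets) & set(self.protected))
        if overlap:
            issues.append(f"protected services in blast radius: {', '.join(overlap)}")
        if not self.comms_channel:
            issues.append("no comms channel chosen")
        return issues


plan = ExperimentPlan("kill-cache", "Staging", ["cache"], "#chaos-day",
                      protected=["payments"])
```

An empty list from `plan.problems()` means the where, the blast radius and the comms channel have all been decided.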
Execution
Deciding when
● To warn or not
● It was just another ordinary day …
● What else is going on?
● Chaos cut-off
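One reading of "chaos cut-off" is a time after which no new failure is injected, so that teams have room to recover before the day ends. That reading is an assumption, as are the function name and the times used below.

```python
from datetime import datetime, timedelta


def may_start(now, cut_off, time_box):
    """A new experiment may start only if its whole time-box ends by the cut-off."""
    return now + time_box <= cut_off


# Hypothetical schedule: it is 14:00 and the cut-off is 15:00.
now = datetime(2020, 4, 1, 14, 0)
cut_off = datetime(2020, 4, 1, 15, 0)
short_ok = may_start(now, cut_off, timedelta(minutes=45))
long_ok = may_start(now, cut_off, timedelta(minutes=90))
```

With these values the 45-minute experiment fits before the cut-off and the 90-minute one does not.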
Keep calm and chaos on (agents)
● (Virtually) co-locate the agents
● Collaborate and coordinate well
● Time-box, cover ground
● (Self) document well
Keep calm and chaos on (everyone else)
● It was just another ordinary day ...
● Also (self) document well
● Pretend it’s Production on
Retrospection
Divide and conquer, then regroup
● Component team retros / incident reviews first
● Major on engineering improvements (people, process, product)
● Then team-of-teams retro
● Minor on chaos day improvements
People, Process, Product
Team X, Team Y, Team Z, Team of teams
What did we learn?
● Start small
● Manage/limit the pain
● Production is a tough step
● Production-like is also hard!
● Have fun!
What next?
What’s your next chaos step?
Manual
In process
Automated
Unplanned
● Where are you at in the journey?
● What’s the next (baby) step?
● Need any help?
○ Talk to us
○ Check out our playbooks
Thank You
Simple solutions to big business problems.
Contact us
Our experienced teams deliver software all around the globe.
London
+44 203 603 7830
helloUK@equalexperts.com
Manchester
+44 203 603 7830
helloUK@equalexperts.com
Pune
+91 20 6687 2400
helloIndia@equalexperts.com
Bengaluru
+91 99 7298 0224
helloIndia@equalexperts.com
Lisbon
+351 211 378 414
helloPortugal@equalexperts.com
New York
+1 866-943-9737
helloUSA@equalexperts.com
Calgary
+1 403 775-4861
helloCanada@equalexperts.com
Berlin
helloDE@equalexperts.com
Sydney
+612 8999 6661
helloAUS@equalexperts.com
Cape Town
+27 21 680 5252
helloSA@equalexperts.com


Embracing collaborative chaos (April 2020) by Lyndsay Prewer

  • 1. Embracing collaborative chaos Running chaos days on large platforms Lyndsay Prewer | @equalexperts
  • 2. Photo by Darius Bashar on Unsplash What is chaos engineering and why should we care?
  • 3. Building vital, high traffic services, fast Google Cloud Dataflow In the Smart Home Data Pipeline ● Delivered 10 days early! ● Built in 4 weeks. ● 140,000 claims processed on launch day. ● No production incidents
  • 4. Building cool, planet-scale, services, fast Google Cloud Dataflow In the Smart Home Data Pipeline
  • 5. Operating on the edge of chaos http://bit.ly/2ZavoyP http://bit.ly/2QVeWzA “Two normally- benign misconfigurations, and a specific software bug, combined to initiate the outage”
  • 6. How can your system fail? Google Cloud Dataflow In the Smart Home Data Pipeline ● What are the component parts? ● How are they connected? ● How reliable is each part? ● How reliable are the connections? ● What happens when X fails?
  • 7. Addressing the risk of unexpected failure A B A B D C Z E G H F I ● Address risk by deliberately inducing failure ● Observe, reflect and improve ● Build resilience in (like quality) ● Think about production (and failure) all the time Simples Hard
  • 8. What do we mean by resilience?
  • 9. Four chaos engineering approaches Manual In process Automated
  • 10. Manual chaos ● Chaos Days ● AWS Game Days ● Change specific chaos
  • 11. ● Chaos monkey ● AWS spot instances / GCP Preemptible VMs ● Randomised pod killer Automated chaos
  • 12. In process chaos engineering ● Part of normal engineering process ● Focus for all roles in the team ● Production thinking / building resilience in Product Owner Dev QA Dev Ops Focus on: Quality AND Production AND Resilience Define Build Explore Deploy
  • 13. (Unplanned chaos) ● Every day is a school day ● Handle incidents well ● Learn from incidents - post incident reviews ● Start simple then incorporate tooling A B D C Z E G H F I
  • 14. How does it help? People Process Product Knowledge Behaviour Expertise Managing incidents Learning from incidents Engineering approach Simplification Observability Runbooks Resilience
  • 15. Photo by Darius Bashar on Unsplash Running a Chaos Day - when and how?
  • 16. Our context Legacy systems x100 million internal requests (busiest day) x100 million log messages (busiest day) x850 microservices x100M Customers 60 Delivery teams ~1000 Microservices 6 Platform teams (AWS PaaS)
  • 17. When were we ready for chaos? 2013 2014 Cloud Docker Scala Mongo ELK Fast growth (teams, services, traffic)
  • 18. When were we ready for chaos? 2013 2014 2015 2016 Cloud Docker Scala Mongo ELK Fast growth (teams, services, traffic) Multi active WIP Multi active
  • 19. When were we ready for chaos? 2013 2014 2015 2016 2017 2018 Cloud Docker Scala Mongo ELK Fast growth (teams, services, traffic) Multi active WIP Multi active More multi active (to AWS) Self serve deploys AWS Ready for Chaos
  • 20. Photo by Darius Bashar on Unsplash Who, where and exactly how?
  • 21. Agents of chaos ● Virtual, closed team ● Draw from component teams ● Experts / veterans ● Highest bus factor
  • 22. Chaos scope - know thyself ● Know your architecture ● Know your steady state ● Know your constraints ○ What’s in your control? ○ What’s not? ○ What needs protecting? X00 million internal requests (busiest day) X00 million log messages (busiest day)
  • 23. Chaos scope - trust the brains-storm http://bit.ly/2XzR7Q9
  • 24. Chaos scope - brainstorm, then plan the detail Team X Team Y Team Z
  • 25. Chaos scope - hack the chaos Team X Team Y Team Z
  • 26. Deciding where ● Production or closest to it ● Production (like) load ● Production (like) telemetry ● Decide the blast radius ● Decide comm’s channel(s) Production Staging QA Development
  • 27. Photo by Darius Bashar on Unsplash Execution
  • 28. Deciding when ● To warn or not ● It was just another ordinary day … ● What else is going on? ● Chaos cut-off
  • 29. Keep calm and chaos on (agents) ● (Virtually) co-locate the agents ● Collaborate and coordinate well ● Time-box, cover ground ● (Self) document well
  • 30. Keep calm and chaos on (everyone else) ● It was just another ordinary day ... ● Also (self) document well ● Pretend it’s Production on
  • 31. Photo by Darius Bashar on Unsplash Retrospection
  • 32. Divide and conquer, then regroup ● Component team retros / incident reviews first ● Major on engineering improvements (people, process, product) ● Then team-of-teams retro ● Minor on chaos day improvements People Process Product Team X Team Y Team Z Team of teams
  • 33. What did we learn? ● Start small ● Manage/limit the pain ● Production is a tough step ● Production-like is also hard! ● Have fun!
  • 34. Photo by Darius Bashar on Unsplash What next?
  • 35. What’s your next chaos step? Manual In process Automated Unplanned ● Where are you at in the journey? ● What’s the next (baby) step? ● Need any help? ○ Talk to us ○ Check out our playbooks
  • 36. Thank You Simple solutions to big business problems.
  • 37. Simple solutions to big business problems. Contact us Our experienced teams deliver software all around the globe. London +44 203 603 7830 helloUK@equalexperts.com Manchester +44 203 603 7830 helloUK@equalexperts.com Pune +91 20 6687 2400 helloIndia@equalexperts.com Bengaluru +91 99 7298 0224 helloIndia@equalexperts.com Lisbon +351 211 378 414 helloPortugal@equalexperts.com New York +1 866-943-9737 helloUSA@equalexperts.com Calgary +1 403 775-4861 helloCanada@equalexperts.com Berlin helloDE@equalexperts.com Sydney +612 8999 6661 helloAUS@equalexperts.com Cape Town +27 21 680 5252 helloSA@equalexperts.com
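
Slides 11, 22 and 26 mention a randomised pod killer, knowing what needs protecting, and deciding the blast radius. As a minimal sketch of how those three ideas could fit together, the Python below picks random targets while excluding a protected set and capping the number of victims. The function name, parameters and instance names are hypothetical; the deck does not describe the tooling that was actually used.

```python
import random
from typing import Iterable, List, Optional, Set


def pick_victims(
    instances: Iterable[str],
    protected: Set[str],
    blast_radius: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick at most `blast_radius` instances to terminate, never a protected one."""
    if blast_radius < 0:
        raise ValueError("blast_radius must be non-negative")
    rng = rng or random.Random()
    # Sort so the result depends only on the candidate set and the RNG seed.
    candidates = sorted(set(instances) - protected)
    return rng.sample(candidates, min(blast_radius, len(candidates)))


if __name__ == "__main__":
    # Hypothetical instance names, for illustration only.
    running = ["api-1", "api-2", "auth-1", "billing-1", "search-1"]
    print(pick_victims(running, protected={"billing-1"}, blast_radius=2))
```

Passing a seeded `random.Random` makes a run repeatable, which may help when agents want to replay the same sequence of failures in a retrospective.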

Editor's Notes

  1. Hello, my name is Lyndsay Prewer. Over the last couple of years, I’ve been leading a group of teams that develop and operate a Platform-as-a-Service for a very large public sector client. In this talk I’ll describe how we’ve used Chaos Days to improve the resilience of our platform, and the effectiveness of the platform and its teams in gracefully handling catastrophic failures.
  2. Chaos engineering is particularly relevant to distributed systems, as these have a scale and high level of complexity that make it impossible to determine their emergent properties and behaviour, let alone every possible failure mode, its impact and possible mitigation. Although distributed systems have been around for decades, recent advances in technology, such as serverless, combined with agile and lean practices have led to teams being able to get more complex stuff into production faster and at lower cost. We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!?
  3. Chaos engineering is particularly relevant to distributed systems, as these have a scale and high level of complexity that make it impossible to determine their emergent properties and behaviour, let alone every possible failure mode, its impact and possible mitigation. Although distributed systems have been around for decades, recent advances in technology, such as serverless, combined with agile and lean practices have led to teams being able to get more complex stuff into production faster and at lower cost. We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!?
  4. We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!? Complex/distributed systems will fail - not if but when - our systems operate on the edge of chaos
  5. Consider your own system...
  6. As component parts and connections increase, we get an exponential increase in the complexity of the emergent behaviour, and thus in the number of possible failure modes. This equates to a decrease in our ability to predict failures and their impact zone. Building resilience in is similar to building quality in. Production thinking: “It’s a mindset, not a toolset: you don’t need to be running EKS on AWS to benefit from ….”
  7. It doesn’t mean we build systems that never fail, that are perfect and indestructible. It means we build systems that cope with failure well, that recover well, that are elastic.
  8. Chaos Days (focus on what, not why, as why comes later) Chaos testing (focus is very narrow/local to new/changed components)
  9. Chaos Monkey, Simian Army et al. AWS and GCP alternatives (spot instances, etc.). (Semi-automated) - Super K8S Chaos Bro
  10. Making this part of normal flow - link back to Production thinking / Building resilience in
  11. Reference https://medium.com/@NetflixTechBlog/introducing-dispatch-da4b8a2a8072
  12. It’s not just about more resilient components. It starts with people: their knowledge, their expertise, their behaviours. It covers process: how we respond to and manage incidents, how we learn from them, how we fold these learnings into our engineering practices. On the product front, it’s more than just resilience improvements. It’s also making systems easier to observe, easier to understand and reason about. Systems that automatically heal and tolerate failure are the goal, but improvements in things such as telemetry, alerting and runbooks help too.
  13. Describe size, scale and architecture of Public sector client. At various other clients, ranging from retail to payment systems, we’ve set up and run kube-monkey in all environments, opted for preemptible VMs, and run Game Days to help teams learn how to diagnose and debug Production issues.
  14. For large platforms, owning teams should each provide a Chaos Agent to plot and scheme in secret with the others. Who knows your system the best? Who do you turn to when the shit hits the fan? It should be a high bus factor person.
  15. Map out your architecture and dependencies Define steady state What’s normal load/throughput? How do you know the system is healthy? (heart rate, VO2-Max, metrics, 5XX / 499 (check this) responses, alerts) What do you have control over? What services / teams do you want to protect?
  16. Apollo 13 picture Map out your architecture and dependencies Doesn’t need to be a big diagram - just get the experts together and brainstorm. Give them a clear intent, a goal, a direction and some constraints, then leave them to figure it out.
  17. Define a hypothesis for each specific intervention and its expected response, e.g. instance failures, app failures, AZ failures, volumes filling up, connections failing/slowing, database failing. Security attacks (break-the-bank approaches, malicious engineer). Map out sequencing, e.g. what should go together, what should be kept apart, what can be done independently. How will normal service be resumed?
  18. Chaos Days are a perfect time to also run security attacks (break-the-bank approaches, malicious engineer)
  19. Production or not? If not how production like are things (cookie cutter environments, telemetry) How will load be generated? Who will be impacted if chaos does reign? What comm’s channel is normally used?
  20. Some warning? Anything else happening at that time (e.g. peak loads, major releases) How will you ensure normal service is resumed - story from our first day
  21. [Photo from 1st chaos day?] Co-locate agents of chaos, plus comm’s channel Collaborate and coordinate in response to chaos and how it’s handled. Timebox to ensure enough chaos variants covered and normal service is resumed [Slack and trello screen shots?] Record what you’re doing (slack, trello - hypothesis, expected response, actual)
  22. Just an ordinary day (i.e. all teams working as normal) Record what you’re doing (slack) Treat chaos environment as production
  23. Team based retro’s then team of teams Separate resilience improvements (e.g. tech, process, people) from chaos day improvements
  24. [Slide, check our own list] Lessons learnt What’s not worked well Things we’d do next time
  25. What’s your next step?
  26. Describe various possible contexts, and possible next steps for each
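
The notes above suggest defining steady state partly through 5XX responses, and recording each experiment as hypothesis, expected response and actual response. The sketch below illustrates both in Python, under assumptions: the 1% error-rate threshold, the function names and the `Experiment` fields are all hypothetical choices rather than anything taken from the talk, and a real platform would read status counts from its own telemetry.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def error_rate(status_counts: Dict[int, int]) -> float:
    """Fraction of responses with a 5xx status; 0.0 when there is no traffic."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return errors / total


def in_steady_state(status_counts: Dict[int, int], max_error_rate: float = 0.01) -> bool:
    """True when the 5xx rate is at or below the (assumed) threshold."""
    return error_rate(status_counts) <= max_error_rate


@dataclass
class Experiment:
    """One chaos intervention, recorded as hypothesis / expected / actual."""
    hypothesis: str
    expected: str
    actual: Optional[str] = None

    def log_line(self) -> str:
        # A single line that could be pasted into a chat channel or a card.
        actual = self.actual if self.actual is not None else "not yet recorded"
        return f"hypothesis: {self.hypothesis} | expected: {self.expected} | actual: {actual}"


if __name__ == "__main__":
    exp = Experiment(
        hypothesis="Losing one instance does not breach steady state",
        expected="Traffic shifts to the remaining instances",
    )
    # Hypothetical status counts observed after the intervention.
    after = {200: 9950, 499: 20, 503: 30}
    exp.actual = "steady" if in_steady_state(after) else "error rate above threshold"
    print(exp.log_line())
```

Checking the same predicate before and after each intervention gives agents a simple signal for whether normal service has resumed.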