Chaos Engineering 101:
A Field Guide
Matthew Brahms | SRE | @matthewbrahms
What you will get from this talk in exchange for your time:
● Understand the definitions of Chaos Engineering (CE)
● Hear a brief history of the field
● Describe the mindset and methodologies of CE
● Know what steps you can take to start doing CE “in the wild”
● Realize the valuable outcomes of having a CE group at your org
● Prepare for common CE myths
● Have some resources for further investigation of the discipline
Who are we in this room?
dev/ops/devops/qa/qe/swe/sre/management
Chaos Engineering is the discipline of experimenting on a distributed
system in order to build confidence in the system’s capability to
withstand turbulent conditions in production.
- http://principlesofchaos.org/
All of this really means...
Bad things will happen (and are happening) to your
system, no matter how well designed it is.
You cannot afford to ignore them.
History of Chaos Engineering
A *brief* history of the CE field
● 2010 - Chaos Monkey
● 2011 - Simian Army
● 2012 - Chaos Monkey OSS
● 2014 - Chaos Engineer
role @ Netflix
● 2017 - Chaos Toolkit on
GitHub (OSS)
● 2018 - Gremlin hosts first
ChaosConf in SF
● 2018 - CNCF Chaos
working group
Where else can CE be found?
● Airline industry
○ Air Traffic Control
○ Plane construction
○ Pilot procedures
● Naval Air Operations at Sea
● Electrical Power Systems
● Public Water Systems
● Medical devices
○ Hospitals
○ Implanted devices
● Highway infrastructure
● Car crash safety ratings
Methodology/Mindset of Chaos Engineering
CE is a discipline
● This implies rigor, in the
academic sense
● Each org/person is unique in
their implementation
● It’s not a process we can “say
we do” and then file it into the
abyss of “the wiki”
Form a hypothesis
● You should know your
app/tech stack well
● Whiteboard your entire
system with another senior
engineer and always with new
onboards
● Find a domain/service where a
failure is likely to exist and
start there
Test your ideas
● Goal is to either validate or
invalidate your failure-case
hypothesis
● The act of testing your
hypothesis should *not* result
in any harm to the user
experience!
Analyze results
● Lessons learned from the
experiment are priceless
● The results and lessons
learned should be
communicated to the entire
team
● Action items should be
started to increase resiliency if
there were issues discovered
Final Step: Repeat!
Chaos Engineering: “In the Wild”
Level 0 - The Basics
1. You will need team/engineering buy-in
2. You will need full support from your engineering and business leadership
3. You will need *observability* in your application/infrastructure/user experience.
Note: if you cannot detect/observe failure states when not formally doing chaos
engineering, that is an area to focus on before adopting chaos engineering.
4. You will need a fully documented and robust SEV (outage) procedure, complete with
incident commanders, blameless post-mortems, etc. Note: if this process is not yet
mature, build it up before adopting chaos engineering.
** All of these could be *entire talks* on their own
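As a minimal sketch of what "observable" can mean in practice, the snippet below computes a request success rate from a window of HTTP status codes. The function name, the choice to count only 5xx responses as failures, and the sample values are illustrative assumptions, not something taken from the talk.

```python
from typing import Iterable


def success_rate(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not fail with a server error (5xx).

    Assumption: only 5xx responses count as failures; adjust for your service.
    """
    codes = list(status_codes)
    if not codes:
        # No traffic observed: we cannot claim the system is healthy.
        raise ValueError("no requests observed in this window")
    failures = sum(1 for code in codes if 500 <= code <= 599)
    return 1 - failures / len(codes)


# Hypothetical window of observed responses.
window = [200, 200, 503, 200, 404, 200, 200, 200, 500, 200]
rate = success_rate(window)
print(f"success rate: {rate:.0%}")
```

If a number like this cannot be produced for a service during normal operation, it will not be available during an experiment either.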
Level 1 - Assemble team Time: varies
Two things are needed before going to level 2:
- A defined product/domain/service, etc. that you wish to test for failure
- A group of engineers (ops/dev/security/support/business):
- You need this group to be made up of people who are involved end-to-end with your service
- They need to have time to attend the pre-game meeting, the experiment, and the follow-up
- Involve/inform as many people as possible in case of a failure during the experiment
- Include Senior and Junior Engineers and even business people related to the service
- Be sure to set the expectations for the level of involvement you need
Example: “We will test our resiliency at the base layer of our infrastructure compute
nodes.”
Level 2 - Formulate Hypothesis Time: 1-2 hours
Get everyone together and formulate your hypothesis.
Whiteboard the entire service/hypothesis until everyone has a clear and thorough
understanding of the system and the actions that will be taken to experiment with
resiliency.
Also assign each person the role and responsibilities they will have during the
gameday. (Have someone to document, have a QRF (quick reaction force) team, have
someone just to operate the experiment, etc.)
Document all of the above and socialize this documentation to other teams.
Example: “If we delete (lose) a cloud compute node, our Kubernetes cluster will
recover and re-provision, with no downtime or negative user experience.”
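One way to keep a hypothesis like the one above falsifiable is to write it down with explicit thresholds. The sketch below is an assumption about how that could look; the `Hypothesis` class, its fields, and the numeric limits are hypothetical and not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A falsifiable statement about how the system behaves under failure."""
    description: str
    min_success_rate: float      # lowest acceptable API success rate (assumed metric)
    max_recovery_seconds: float  # longest acceptable time to re-provision (assumed)


def evaluate(h: Hypothesis, observed_success_rate: float,
             observed_recovery_seconds: float) -> bool:
    """Return True if the observations support the hypothesis."""
    return (observed_success_rate >= h.min_success_rate
            and observed_recovery_seconds <= h.max_recovery_seconds)


node_loss = Hypothesis(
    description="Losing one compute node causes no user-visible downtime",
    min_success_rate=0.999,
    max_recovery_seconds=300,
)

# Hypothetical observations from two experiment runs.
supported = evaluate(node_loss, observed_success_rate=0.9995,
                     observed_recovery_seconds=140)
refuted = evaluate(node_loss, observed_success_rate=0.97,
                   observed_recovery_seconds=140)
print(supported, refuted)
```

Agreeing on the thresholds in the whiteboard session means nobody has to argue after the gameday about whether the result counted as a pass.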
Level 3 - Gameday Time: 1-4 hours
Ideally, game day looks like a launch at NASA. Each of the assigned persons knows
their role and you can do a pre-launch checklist, ensuring each team is ready.
If there are any issues impacting the system or anything that the gameday *might*
interfere with or make worse, abort the launch.
If you are ready, proceed with initiating the experiment, keeping a keen eye on
its progress.
Example: “Our infrastructure is currently not degraded in any way, it is not Black
Friday, and we have SRE, SWE, Support, Security, and a few business folks here. We will
now delete a node and watch the success rates of our APIs while expecting
and monitoring for the node recovery/re-provisioning.”
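The pre-launch checklist can be as simple as a go/no-go poll where any single failed check aborts the launch. The following is a rough sketch under that assumption; the check names and the `go_no_go` helper are hypothetical.

```python
from typing import Dict, List, Tuple


def go_no_go(checks: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (go, blockers). Any failed check aborts the launch."""
    blockers = [name for name, ok in checks.items() if not ok]
    return (not blockers, blockers)


# Hypothetical checklist, polled just before starting the experiment.
checklist = {
    "infrastructure not degraded": True,
    "no peak business event (e.g. Black Friday)": True,
    "SRE, SWE, Support, Security present": True,
    "dashboards and alerts visible": False,
}
go, blockers = go_no_go(checklist)
print("GO" if go else f"ABORT: {blockers}")
```

Returning the list of blockers, rather than only a yes/no answer, gives the team a record of why a gameday was called off.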
Level 4 - Recap Lessons Learned Time: 30 minutes
Gather everyone involved and recap what happened. Whether the experiment succeeded or
failed and needed remediation, be sure to go over the timeline of what happened.
Gather the lessons everyone learned, being sure to highlight what the experiment taught us
that we didn’t know before (this is a good way to show its value).
Plan work for engineering teams as necessary to close any resiliency gaps that the
experiment discovered.
Communicate the value of all that has occurred in this process to the business. This is
work that has directly contributed to the bottom line of the company.
Gameday Templates!
If you are very new to doing this, Gremlin has a complete set of templates and
checklists to help you get started! (They really are quite excellent!)
https://www.gremlin.com/gameday/
Outcomes for Chaos Engineering
1. Avoid costs of downtime.
Do we really *know* how much
downtime really costs our enterprise in:
Sales, Engineering, Loss of Productivity, etc.?
User experience will go up!
2. Decrease pages to Ops/Dev/SRE
Do we all like sleep?
Do we track the number of pages our teams get?
The blast-radius/cost of an outage event is large (lurkers & active)
3. Increase Productivity
Less time and money spent on outages
and reactive work will increase our time
and resources for proactive work/features.
What value could our Ops teams add if they were distracted less?
4. Increase the spread of knowledge
throughout your organization
Tired of running into lack of documentation/runbooks?
Tired of people leaving with *heaps* of “tribal knowledge” ?
Tired of people saying “I don’t know...that’s Johnny’s expertise” ?
Top Chaos Engineering Myths
(...not an exhaustive list)
Top Chaos Engineering Myths
1. It’s not my job!
2. *Now* what tool do we have to buy & learn?
3. It costs how much??
4. We have too much work to do (i.e. features,
bug-fixes, etc.)
5. We can just deal with outages JIT, right!?
6. Our uptime target is 100% right? Why should
we ever introduce “experiments” in
production?
7. Why do you think we even have an ops/sre
team?
8. We don’t even have SLO/SLI/SLA in
place...even if we wanted to, how could we
start?
*IMMEDIATE* thoughts/responses
from an SRE to these myths...
But wait...
this.
Busting CE myths & takeaways
It is *everyone’s*
job to care about
functionality,
reliability, and
ultimately #profit
Take the time to be
data-driven about
the whole cost
argument.
There is a
learning/implementation
curve when Engineering
Chaos, but continuous
learning and
improvement are job
req’s, right?
Do we really expect and
employ a strategy of
hope that only OPS/SRE
should be doing Chaos
Engineering?
Chaos Engineering != tooling
(necessarily)
Start with preemptible/spot instances for services in lower environments :)
What can you do about implementing chaos engineering:
1. Evangelize the idea and principles of chaos engineering to our organizations
2. Ensure that your systems are measurable (can detect chaos even if it is
unplanned) and that there is a really solid SEV process in-place.
3. Start with whiteboarding sessions/high-level discussions about how our
applications/services are architected and function--gain “herd immunity”
regarding knowledge
4. Pick 1 service or application that is well-documented, very observable, not in a
critical production path, etc. to serve as the subject of your first chaos
experiment. Stop immediately if things go wrong.
5. If you need/feel like ramping up quickly, Gremlin may be a good choice
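"Stop immediately if things go wrong" is easier to honor when the stop condition is decided before the experiment starts. Below is a minimal sketch of such a condition, assuming a stream of success-rate readings; the `first_breach` helper, the threshold, and the readings are hypothetical. Requiring two consecutive bad readings is a design choice made here to tolerate one noisy sample; set `consecutive=1` to stop on the first bad reading.

```python
from typing import Iterable, Optional


def first_breach(samples: Iterable[float], threshold: float,
                 consecutive: int = 2) -> Optional[int]:
    """Return the index at which the experiment should stop, or None.

    Stops once `consecutive` samples in a row fall below `threshold`,
    which avoids aborting on a single noisy reading.
    """
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value < threshold else 0
        if run >= consecutive:
            return i
    return None


# Hypothetical success-rate readings taken during an experiment.
readings = [0.999, 0.998, 0.97, 0.999, 0.96, 0.95, 0.999]
stop_at = first_breach(readings, threshold=0.99)
print(stop_at)
```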
Chaos Engineering: Additional resources
Additional online resources
- Chaos Conf 2018 talks
- Gremlin (Chaos-as-a-service, Documentation, Community Labs, etc.)
- Gremlin Free Edition
- Chaos Slack community - https://slofile.com/slack/chaosengineering
- Talks by: Adrian Cockcroft, Lorin Hochstein, Kolton Andrus, Tammy Butow, John
Allspaw
- CNCF Chaos WG (https://github.com/chaoseng/wg-chaoseng)
- Netflix Simian Army (https://github.com/Netflix/SimianArmy)
- Chaos Toolkit (https://github.com/chaostoolkit)
- Kubernetes Chaos Lab (https://github.com/matthewbrahms/kubernetes-chaos-lab)
Additional reading
Books for further academic reading:
- Release It! 2nd Edition by Michael Nygard
- Drift Into Failure by Sidney Dekker
- Chaos Engineering (O’Reilly)
- The Safety Anarchist by Sidney Dekker
Questions | Comments | Discussions | Ideas ?
Are you interested in
Chaos Engineering?
Join us at the meetup!
www.meetup.com/Austin-Chaos-Engineering-Meetup/

More Related Content

What's hot

Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsC4Media
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureAna Medina
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos EngineeringGremlin
 
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in ProductionChaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in ProductionKeet Sugathadasa
 
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSChaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSBilal Aybar
 
From Monolithic to Microservices
From Monolithic to Microservices From Monolithic to Microservices
From Monolithic to Microservices Amazon Web Services
 
Aligning to the NIST Cybersecurity Framework in the AWS
Aligning to the NIST Cybersecurity Framework in the AWSAligning to the NIST Cybersecurity Framework in the AWS
Aligning to the NIST Cybersecurity Framework in the AWSAmazon Web Services
 
Chaos engineering and chaos testing
Chaos engineering and chaos testingChaos engineering and chaos testing
Chaos engineering and chaos testingjeetendra mandal
 
Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.Richard Langlois P. Eng.
 
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...Ana Medina
 
Principles Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgPrinciples Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgNils Meder
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersAmazon Web Services
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...DevOpsDays Tel Aviv
 
ENT204 The AWS Cloud Value Framework
ENT204 The AWS Cloud Value FrameworkENT204 The AWS Cloud Value Framework
ENT204 The AWS Cloud Value FrameworkAmazon Web Services
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Waysmalltown
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos EngineeringSIGHUP
 
K8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKSK8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKSAmazon Web Services
 

What's hot (20)

Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos Engineering
 
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in ProductionChaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in Production
 
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSChaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWS
 
From Monolithic to Microservices
From Monolithic to Microservices From Monolithic to Microservices
From Monolithic to Microservices
 
Aligning to the NIST Cybersecurity Framework in the AWS
Aligning to the NIST Cybersecurity Framework in the AWSAligning to the NIST Cybersecurity Framework in the AWS
Aligning to the NIST Cybersecurity Framework in the AWS
 
Chaos engineering and chaos testing
Chaos engineering and chaos testingChaos engineering and chaos testing
Chaos engineering and chaos testing
 
Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.
 
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
 
Principles Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgPrinciples Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering Hamburg
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for Partners
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
 
(ARC307) Infrastructure as Code
(ARC307) Infrastructure as Code(ARC307) Infrastructure as Code
(ARC307) Infrastructure as Code
 
Devops as a service
Devops as a serviceDevops as a service
Devops as a service
 
ENT204 The AWS Cloud Value Framework
ENT204 The AWS Cloud Value FrameworkENT204 The AWS Cloud Value Framework
ENT204 The AWS Cloud Value Framework
 
Cloud Economics
Cloud EconomicsCloud Economics
Cloud Economics
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos Engineering
 
K8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKSK8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKS
 

Similar to Chaos Engineering 101: A Field Guide

Site-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdfSite-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdfDeepakGupta747774
 
DevOps - Boldly Go for Distro
DevOps - Boldly Go for DistroDevOps - Boldly Go for Distro
DevOps - Boldly Go for DistroPaul Boos
 
How to improve Developer Documentations ?
How to improve Developer Documentations ?How to improve Developer Documentations ?
How to improve Developer Documentations ?Utsav Parashar
 
30 days or less: New Features to Production
30 days or less: New Features to Production30 days or less: New Features to Production
30 days or less: New Features to ProductionKarthik Gaekwad
 
Scrum an extension pattern language for hyperproductive software development
Scrum an extension pattern language  for hyperproductive software developmentScrum an extension pattern language  for hyperproductive software development
Scrum an extension pattern language for hyperproductive software developmentShiraz316
 
Putting Devs On-Call: How to Empower Your Team
Putting Devs On-Call: How to Empower Your TeamPutting Devs On-Call: How to Empower Your Team
Putting Devs On-Call: How to Empower Your TeamVictorOps
 
Current Trends in Agile - opening keynote for Agile Israel 2014
Current Trends in Agile - opening keynote for Agile Israel 2014Current Trends in Agile - opening keynote for Agile Israel 2014
Current Trends in Agile - opening keynote for Agile Israel 2014Yuval Yeret
 
Continuous Deployment
Continuous DeploymentContinuous Deployment
Continuous DeploymentBrian Henerey
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018Christophe Rochefolle
 
Week 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docxWeek 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docxestefana2345678
 
Bcn devcon jose luis soria - patterns & antipatterns for delivery
Bcn devcon   jose luis soria - patterns & antipatterns for deliveryBcn devcon   jose luis soria - patterns & antipatterns for delivery
Bcn devcon jose luis soria - patterns & antipatterns for deliveryJose Luis Soria
 
5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-Testing5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-TestingMary Clemons
 
Process Evolution and Product Maturity
Process Evolution and Product MaturityProcess Evolution and Product Maturity
Process Evolution and Product MaturityQAware GmbH
 
Successful Software Projects - What you need to consider
Successful Software Projects - What you need to considerSuccessful Software Projects - What you need to consider
Successful Software Projects - What you need to considerLloydMoore
 
Choosing Automation for DevOps & Continuous Delivery in the Enterprise
Choosing Automation for DevOps & Continuous Delivery in the EnterpriseChoosing Automation for DevOps & Continuous Delivery in the Enterprise
Choosing Automation for DevOps & Continuous Delivery in the EnterpriseXebiaLabs
 
Scrum And The Enterprise
Scrum And The EnterpriseScrum And The Enterprise
Scrum And The EnterpriseJames Peckham
 
How to adapt the SDLC to the era of DevSecOps
How to adapt the SDLC to the era of DevSecOpsHow to adapt the SDLC to the era of DevSecOps
How to adapt the SDLC to the era of DevSecOpsZane Lackey
 

Similar to Chaos Engineering 101: A Field Guide (20)

Site-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdfSite-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdf
 
DevOps - Boldly Go for Distro
DevOps - Boldly Go for DistroDevOps - Boldly Go for Distro
DevOps - Boldly Go for Distro
 
How to improve Developer Documentations ?
How to improve Developer Documentations ?How to improve Developer Documentations ?
How to improve Developer Documentations ?
 
30 days or less: New Features to Production
30 days or less: New Features to Production30 days or less: New Features to Production
30 days or less: New Features to Production
 
Scrum an extension pattern language for hyperproductive software development
Scrum an extension pattern language  for hyperproductive software developmentScrum an extension pattern language  for hyperproductive software development
Scrum an extension pattern language for hyperproductive software development
 
Putting Devs On-Call: How to Empower Your Team
Putting Devs On-Call: How to Empower Your TeamPutting Devs On-Call: How to Empower Your Team
Putting Devs On-Call: How to Empower Your Team
 
Debugging
DebuggingDebugging
Debugging
 
Current Trends in Agile - opening keynote for Agile Israel 2014
Current Trends in Agile - opening keynote for Agile Israel 2014Current Trends in Agile - opening keynote for Agile Israel 2014
Current Trends in Agile - opening keynote for Agile Israel 2014
 
Core define and_win_cmd_line gr
Core define and_win_cmd_line grCore define and_win_cmd_line gr
Core define and_win_cmd_line gr
 
Continuous Deployment
Continuous DeploymentContinuous Deployment
Continuous Deployment
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018
 
Week 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docxWeek 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docx
 
Bcn devcon jose luis soria - patterns & antipatterns for delivery
Bcn devcon   jose luis soria - patterns & antipatterns for deliveryBcn devcon   jose luis soria - patterns & antipatterns for delivery
Bcn devcon jose luis soria - patterns & antipatterns for delivery
 
5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-Testing5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-Testing
 
Process Evolution and Product Maturity
Process Evolution and Product MaturityProcess Evolution and Product Maturity
Process Evolution and Product Maturity
 
Successful Software Projects - What you need to consider
Successful Software Projects - What you need to considerSuccessful Software Projects - What you need to consider
Successful Software Projects - What you need to consider
 
DRP.ppt
DRP.pptDRP.ppt
DRP.ppt
 
Choosing Automation for DevOps & Continuous Delivery in the Enterprise
Choosing Automation for DevOps & Continuous Delivery in the EnterpriseChoosing Automation for DevOps & Continuous Delivery in the Enterprise
Choosing Automation for DevOps & Continuous Delivery in the Enterprise
 
Scrum And The Enterprise
Scrum And The EnterpriseScrum And The Enterprise
Scrum And The Enterprise
 
How to adapt the SDLC to the era of DevSecOps
How to adapt the SDLC to the era of DevSecOpsHow to adapt the SDLC to the era of DevSecOps
How to adapt the SDLC to the era of DevSecOps
 

Recently uploaded

Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsSachinPawar510423
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 

Recently uploaded (20)

POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 

Chaos Engineering 101: A Field Guide

  • 1. Chaos Engineering 101: A Field Guide Matthew Brahms | SRE | @matthewbrahms
  • 2. What you will get from this talk in exchange for your time:
      ● Understand the definitions of Chaos Engineering (CE)
      ● Hear a brief history of the field
      ● Describe the mindset and methodologies of CE
      ● Know what steps you can take to start doing CE “in the wild”
      ● Realize the valuable outcomes of having a CE group at your org
      ● Prepare for common CE myths
      ● Have some resources for further investigation of the discipline
  • 3. Who are we in this room? dev/ops/devops/qa/qe/swe/sre/management
  • 5. Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. - http://principlesofchaos.org/
  • 6. Bad things will happen (and are happening) to your system, no matter how well designed it is. You cannot afford to be ignorant of them.
  • 7. All of this really means...
  • 9. History of Chaos Engineering
  • 10. A *brief* history of the CE field
      ● 2010 - Chaos Monkey
      ● 2011 - Simian Army
      ● 2012 - Chaos Monkey OSS
      ● 2014 - Chaos Engineer role @ Netflix
      ● 2017 - Chaos Toolkit on GitHub (OSS)
      ● 2018 - Gremlin hosts first ChaosConf in SF
      ● 2018 - CNCF Chaos working group
  • 11. Where else can CE be found?
  • 12.
      ● Airline industry
          ○ Air Traffic Control
          ○ Plane construction
          ○ Pilot procedures
      ● Naval Air Operations at Sea
      ● Electrical Power Systems
      ● Public Water Systems
      ● Medical devices
          ○ Hospitals
          ○ Implanted devices
      ● Highway infrastructure
      ● Car crash safety ratings
  • 15. CE is a discipline
      ● This implies rigor, as in the academic sense
      ● Each org/person is unique in their implementation
      ● It’s not a process we can “say we do” and then file it into the abyss of “the wiki”
  • 16. Form a hypothesis
      ● You should know your app/tech stack well
      ● Whiteboard your entire system with another senior engineer and always with new onboards
      ● Find a domain/service where a failure is likely to exist and start there
  • 17. Test your ideas
      ● Goal is to either validate or invalidate your failure-case hypothesis
      ● The act of testing your hypothesis should *not* result in any harm to the user experience!
  • 18. Analyze results
      ● Lessons learned from the experiment are priceless
      ● The results and lessons learned should be communicated to the entire team
      ● Action items should be started to increase resiliency if there were issues discovered
  • 21. Level 0 - The Basics
      1. You will need team/engineering buy-in
      2. You will need full support from your engineering and business leadership
      3. You will need *observability* in your application/infrastructure/user experience. Note: if you cannot detect/observe failure states when not formally doing chaos engineering, that is an area to focus on before adopting chaos engineering.
      4. You will need a fully-documented and robust SEV outage procedure (replete with incident commanders, blameless post-mortems, etc.) Note: if this procedure is not yet mature, build it up before doing chaos engineering.
      ** All of these could be *entire talks* on their own
  • 22. Level 1 - Assemble team
      Time: varies
      Two things are needed before going to level 2:
      - A defined product/domain/service, etc. that you wish to test for failure
      - A group of engineers (ops/dev/security/support/business):
          - This group needs to be made up of people who are involved end-to-end with your service
          - They need to have time to attend the pre-game meeting, the experiment, and the follow-up
          - Involve/inform as many people as possible in case of a failure during the experiment
          - Include Senior and Junior Engineers and even business people related to the service
          - Be sure to set the expectations for the level of involvement you need
      Example: “We will test our resiliency at the base layer of our infrastructure compute nodes.”
  • 23. Level 2 - Formulate Hypothesis
      Time: 1-2 hours
      Get everyone together and formulate your hypothesis. Whiteboard the entire service/hypothesis until everyone has a clear and thorough understanding of the system and the actions that will be taken to experiment with resiliency.
      Also assign each person a role and responsibilities for the gameday. (Have someone to document, have a QRF team, have someone just to operate the experiment, etc.)
      Document all of the above and socialize this documentation to other teams.
      Example: “If we delete (lose) a cloud compute node, our Kubernetes cluster will recover and re-provision, with no downtime or negative user experience.”
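One way to make a hypothesis like the example above checkable is to write it down as a small record with a measurable steady-state threshold. The following is only a minimal sketch: the `Hypothesis` class, its field names, and the 99.9% threshold are illustrative assumptions, not something prescribed by the talk.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A written-down failure-case hypothesis (all field names are illustrative)."""
    action: str              # what the experiment will do
    expectation: str         # what the team believes will happen
    min_success_rate: float  # assumed steady-state threshold, between 0.0 and 1.0


def success_rate(ok_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded during the experiment window."""
    if total_requests == 0:
        # Assumption: with no traffic there is nothing to judge, so fail loudly.
        raise ValueError("no requests observed; cannot evaluate the hypothesis")
    return ok_requests / total_requests


def hypothesis_holds(h: Hypothesis, ok_requests: int, total_requests: int) -> bool:
    """True if the observed success rate stayed at or above the threshold."""
    return success_rate(ok_requests, total_requests) >= h.min_success_rate


node_loss = Hypothesis(
    action="delete one cloud compute node",
    expectation="cluster recovers and re-provisions with no user impact",
    min_success_rate=0.999,  # hypothetical threshold; pick your own
)
```

Calling `hypothesis_holds(node_loss, ok_requests, total_requests)` with counts taken from your own monitoring during the gameday then gives a yes/no answer that can go straight into the recap.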
  • 24. Level 3 - Gameday
      Time: 1-4 hours
      Ideally, gameday looks like a launch at NASA. Each of the assigned persons knows their role and you can do a pre-launch checklist, ensuring each team is ready. If there are any issues impacting the system or anything that the gameday *might* interfere with or make worse, abort the launch. If you are ready, then proceed with initiating the experiment, keeping a keen eye on its progress.
      Example: “Our infrastructure is currently not degraded in any way, it is not Black Friday, we have SRE, SWE, Support, Security, and a few business folks here. We will now begin to delete a node and watch the success rates of our APIs while expecting and monitoring for the node recovery/re-provisioning.”
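The pre-launch checklist described above can be sketched as a simple go/no-go function in which any single failed check aborts the launch. The function name and the check labels below are hypothetical; the first three paraphrase the slide's example and the last one is an added illustration.

```python
def go_no_go(checks: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (go, failed_checks). Any failed check means abort the launch."""
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)


# Hypothetical checklist values, filled in by the team right before the experiment.
checks = {
    "infrastructure not degraded": True,
    "not a peak business day (e.g. Black Friday)": True,
    "SRE, SWE, Support, Security present": True,
    "dashboards and alerts visible": False,  # illustrative extra check
}

go, failed = go_no_go(checks)
print("GO" if go else f"ABORT: {failed}")
```

Keeping the checklist as data rather than as tribal knowledge also makes it easy to reuse and extend on the next gameday.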
  • 25. Level 4 - Recap Lessons Learned
      Time: 30 minutes
      Gather everyone involved and recap what happened. Whether the experiment ended in success or in failure and remediation, be sure to go over the timeline of what happened.
      Gather the lessons that everyone learned, being sure to highlight what we learned from the experiment that we didn’t know before (this is a good way to show value).
      Plan work for engineering teams as necessary to close any resiliency gaps that the experiment discovered.
      Communicate the value of all that has occurred in this process to the business. This is work that has directly contributed to the bottom line of the company.
  • 26. Gameday Templates! If you are very new to doing this, Gremlin has a complete set of templates and checklists to help you get started! (They really are quite excellent!) https://www.gremlin.com/gameday/
  • 27. Outcomes for Chaos Engineering
  • 28. 1. Avoid costs of downtime. Do we really *know* how much downtime costs our enterprise in sales, engineering, lost productivity, etc.? User experience will improve!
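A back-of-the-envelope estimate is often enough to start the cost conversation. The sketch below is a deliberately simple model with assumed inputs (revenue per minute, number of responders, a loaded hourly rate); a real figure would come from your own finance and incident data, and it ignores harder-to-measure costs such as lost customer trust.

```python
def downtime_cost(minutes: float, revenue_per_minute: float,
                  responders: int, loaded_hourly_rate: float) -> float:
    """Rough outage cost: lost sales plus the time of everyone pulled into the incident."""
    lost_sales = minutes * revenue_per_minute
    responder_cost = responders * loaded_hourly_rate * (minutes / 60)
    return lost_sales + responder_cost


# Hypothetical numbers for illustration only.
estimate = downtime_cost(minutes=30, revenue_per_minute=100.0,
                         responders=4, loaded_hourly_rate=120.0)
print(f"Estimated cost of a 30-minute outage: ${estimate:,.2f}")
```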
  • 29. 2. Decrease pages to Ops/Dev/SRE
      Do we all like sleep? Do we track the number of pages our teams get? The blast-radius/cost of an outage event is large (lurkers & active)
  • 30. 3. Increase Productivity
      Less time and money spent on outages and reactive work will increase our time and resources for proactive work/features. What value could our Ops teams add if they were distracted less?
  • 31. 4. Increase the spread of knowledge throughout your organization
      Tired of running into lack of documentation/runbooks? Tired of people leaving with *heaps* of “tribal knowledge”? Tired of people saying “I don’t know...that’s Johnny’s expertise”?
  • 33. Top Chaos Engineering Myths (...not an exhaustive list)
  • 34. Top Chaos Engineering Myths
      1. It’s not my job!
      2. *Now* what tool do we have to buy & learn?
      3. It costs how much??
      4. We have too much work to do (i.e. features, bug-fixes, etc.)
      5. We can just deal with outages JIT, right!?
      6. Our uptime target is 100% right? Why should we ever introduce “experiments” in production?
      7. Why do you think we even have an ops/sre team?
      8. We don’t even have SLO/SLI/SLA in place...even if we wanted to, how could we start?
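Myths 6 and 8 both turn on uptime targets. As a rough illustration (not taken from the talk), the arithmetic below converts an availability target into the minutes of downtime it permits over a window. The function name and the 30-day default window are assumptions made for this sketch.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0.0 and 1.0")
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


for target in (0.99, 0.999, 1.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

A target of 100% leaves no room for any failure at all, planned or unplanned, which is one way to open a discussion about what a realistic target should be.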
  • 35. *IMMEDIATE* thoughts/responses from an SRE to these myths...
  • 39. Busting CE myths & takeaways
  • 41. It is *everyone’s* job to care about functionality, reliability, and ultimately #profit
  • 42. Take the time to be data-driven about the whole cost argument.
  • 43. There is a learning/implementation curve when adopting Chaos Engineering, but continuous learning and improvement are job requirements, right?
  • 44. Do we really want to rely on hope as a strategy, expecting that only Ops/SRE should be doing Chaos Engineering?
  • 45. Chaos Engineering != tooling (necessarily). Start with preemptible/spot instances for services in lower environments :)
  • 46. What you can do about implementing chaos engineering:
      1. Evangelize the idea and principles of chaos engineering to your organization
      2. Ensure that your systems are measurable (can detect chaos even if it is unplanned) and that there is a really solid SEV process in place.
      3. Start with whiteboarding sessions/high-level discussions about how your applications/services are architected and function, and gain “herd immunity” regarding knowledge
      4. Pick 1 service or application that is well-documented, very observable, not in a critical production path, etc. to serve as the subject of your first chaos experiment. Stop immediately if things go wrong.
      5. If you need/feel like ramping up quickly, Gremlin may be a good choice
  • 48. Additional online resources
      - Chaos Conf 2018 talks
      - Gremlin (Chaos-as-a-service, Documentation, Community Labs, etc.)
      - Gremlin Free Edition
      - Chaos Slack community - https://slofile.com/slack/chaosengineering
      - Talks by: Adrian Cockcroft, Lorin Hochstein, Kolton Andrus, Tammy Butow, John Allspaw
      - CNCF Chaos WG (https://github.com/chaoseng/wg-chaoseng)
      - Netflix Simian Army (https://github.com/Netflix/SimianArmy)
      - Chaos Toolkit (https://github.com/chaostoolkit)
      - Kubernetes Chaos Lab (https://github.com/matthewbrahms/kubernetes-chaos-lab)
  • 49. Additional reading
      Books for further academic reading:
      - Release It! 2nd Edition by Michael Nygard
      - Drift Into Failure by Sidney Dekker
      - Chaos Engineering (O’Reilly)
      - The Safety Anarchist by Sidney Dekker
  • 50. Questions | Comments | Discussions | Ideas ?
  • 51. Are you interested in Chaos Engineering? Join us at the meetup! www.meetup.com/Austin-Chaos-Engineering-Meetup/