Chaos Engineering:
Why the world needs more resilient systems
@tammybutow
InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/chaos-engineering-resilient-systems
Presented at QCon London
www.qconlondon.com
Oh hai, nice to meet you!
@tammybutow
tb@gremlin.com
Principal SRE @ Gremlin
Tech Advisory Board @ Greenpeace
Enjoys Skateboarding, Snowboarding, Metal, Punk & Breaking Things On Purpose.
Our Gremlin Team Were Previously @
Dropbox, DigitalOcean, National Australia Bank, Queensland University of Technology, Netflix, Amazon, Salesforce, Google, PagerDuty, Datadog
Why the world needs:
More Resilient Systems!
What is a resilient system?
• A resilient system is a highly available and durable system.
• A resilient system can maintain an acceptable level of service in the face of failure.
• A resilient system can weather the storm (a misconfiguration, a large scale natural disaster or controlled chaos engineering).
Let’s review industry examples to understand why we need:
Resilient Systems
Med Tech Industry:
Cardiac monitoring is now done via a Bluetooth device implanted in the body and a mobile app.
The patient takes no action.
Resilience of the device is the only thing the patient cares about.
Fin Tech Industry:
People are changing jobs, moving homes, traveling and more. Systems need to not only keep up but also provide value anytime/anywhere.
A “technical issue related to some routine maintenance” impacted the purchase of over 2000 homes.
Transport Tech Industry:
People travel frequently for work and leisure. They need to be able to get where they need to go with no hassles.
Edu Tech Industry:
Remote learning is more common than ever before. Students who learn remotely need reliable access to teachers, other students and learning materials.
Enviro Tech Industry:
People need protection from bushfires, tsunamis, earthquakes and storms. Many of the warning systems for these disasters are unreliable legacy systems.
Saturday, 7 February 2009 - Australia’s all-time worst bushfire disaster
What do these systems have in common?
The primary concern of the user is resilience of the system,
in particular high availability.
Let’s figure out how to create:
A great future for everyone
What does a great future look like?
How do we create:
More Resilient Systems?
Introducing:
Chaos Engineering
What is
Chaos Engineering?
Chaos Engineering:
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.
Inject something harmful, in order
to build an immunity
We can inject harm in hosts,
containers, pods, applications and
more.
What is a
Chaos Engineer?
Chaos Engineer:
A vaccine research computer scientist.
SREs / Production Engineers commonly practice Chaos Engineering.
http://www.cancerresearchuk.org/about-cancer/cancer-in-general/treatment/immunotherapy/types/vaccines-to-treat-cancer
The Bad Database Vaccine
• What happens when the database is unreachable?
• Does the database have reliable and trustworthy monitoring?
• Does the database fail gracefully?
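The three questions above can be probed with a very small experiment. The sketch below is only an illustration in Python: `FlakyDatabase`, `ProfileService` and `run_experiment` are hypothetical names invented here (they are not Gremlin or DynamoDB APIs), and the "database" is an in-memory dict with a switch that simulates it becoming unreachable. It shows one possible form of failing gracefully (serve the last known value and mark it stale) and one possible monitoring hook (an error counter).

```python
from dataclasses import dataclass, field


class DatabaseUnreachable(Exception):
    """Raised by the stand-in database while the injected fault is active."""


@dataclass
class FlakyDatabase:
    """Hypothetical in-memory stand-in for a real database client."""
    rows: dict = field(default_factory=dict)
    unreachable: bool = False  # the injected "harm"

    def get(self, key):
        if self.unreachable:
            raise DatabaseUnreachable(f"cannot reach database for key {key!r}")
        return self.rows[key]


@dataclass
class ProfileService:
    """Reads from the database and falls back to the last good value."""
    db: FlakyDatabase
    cache: dict = field(default_factory=dict)
    db_errors: int = 0  # a counter that a monitoring system could scrape

    def get_profile(self, user_id):
        try:
            value = self.db.get(user_id)
            self.cache[user_id] = value
            return {"value": value, "stale": False}
        except DatabaseUnreachable:
            self.db_errors += 1
            # Degrade instead of crashing: serve the cached value if we have one.
            return {"value": self.cache.get(user_id), "stale": True}


def run_experiment():
    """Observe behaviour before, during and after the injected failure."""
    db = FlakyDatabase(rows={"u1": "Tammy"})
    service = ProfileService(db)
    before = service.get_profile("u1")
    db.unreachable = True               # inject the fault
    during = service.get_profile("u1")  # previously seen key, served from cache
    cold = service.get_profile("u2")    # never seen key, nothing to fall back to
    db.unreachable = False              # end the experiment
    after = service.get_profile("u1")
    return before, during, cold, after, service.db_errors
```

Running `run_experiment()` answers each question for this toy service: reads keep working for keys that were seen before, unseen keys return an empty but well-formed response, and `db_errors` records how many calls hit the fault.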
Injecting Harm in DynamoDB
https://www.gremlin.com/community/tutorials/gremlin-gameday-breaking-dynamodb/
What do you need before you can start doing:
Chaos Engineering
Prerequisites for Chaos Engineering
1. High Severity Incident Management
2. Monitoring
3. Measure the Impact of Downtime
Chaos Engineering Prerequisite #1:
High Severity Incident Management
High Severity Incident Management:
The practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems.
gremlin.com/community
What are SEVs?
The term SEV is derived from “High Severity Incident”
How Do You Determine SEV levels?
What is an example of SEV 0?
SEV Name: SEV 0 Runaway Cow (auto-generated code names help your team remember and refer to SEVs!)
SEV Description: Nintendo Switch eShop is down and not working
SEV Start Time: 08:40am Dec 25 2017 (Christmas Day)
What is the availability impact? 100%
What is the outage duration? 5 hours and 40 minutes
What is the
SEV Lifecycle?
How To Run A GameDay
gremlin.com/community
How do you identify your
critical systems?
What are your critical tier 0 systems?
Traffic
Database
Storage
Chaos Engineering Prerequisite #2:
Monitoring
Why Do You Need:
Monitoring
Why Monitor - The Google SRE Book
https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
How Should You Use Monitoring
Critical Services Dashboard
gremlin.com/community
The Four Golden Signals - The Google SRE Book
https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
Monitoring: Signal / Description / Example
Latency
  Description: The time it takes to service a request.
  Example: An HTTP 500 error triggered due to loss of connection to a database might be served very quickly, so track the latency of failed requests separately from successful ones.
Traffic
  Description: A measure of how much demand is being placed on your system.
  Example: For a web service, this measurement is usually HTTP requests per second.
Errors
  Description: The rate of requests that fail, either explicitly, implicitly or by policy.
  Example: Catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests.
Saturation
  Description: How "full" your service is. Should also signal impending saturation.
  Example: It looks like your database will fill its hard drive in 4 hours.
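To make the four signals concrete, here is a minimal Python sketch that derives them from a batch of request records. Everything in it is an assumption for illustration: the `Request` record, the function names, the choice of a nearest-rank 99th percentile, and the use of disk usage as the saturation measure. A real setup would normally get these numbers from a monitoring system rather than compute them by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    status: int         # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, disk_used_bytes, disk_total_bytes):
    """Summarise one observation window as the four golden signals."""
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        # Latency: successful and failed requests are tracked separately.
        "latency_p99_ok_ms": percentile(ok, 99) if ok else None,
        "latency_p99_failed_ms": percentile(failed, 99) if failed else None,
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: fraction of requests that returned a 5xx status.
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how "full" the disk is, from 0.0 to 1.0.
        "saturation": disk_used_bytes / disk_total_bytes,
    }


def hours_until_full(disk_used_bytes, disk_total_bytes, growth_bytes_per_hour):
    """Linear forecast of impending saturation; infinite if usage is not growing."""
    if growth_bytes_per_hour <= 0:
        return math.inf
    return (disk_total_bytes - disk_used_bytes) / growth_bytes_per_hour
```

The `hours_until_full` helper is the kind of calculation behind a message like "your database will fill its hard drive in 4 hours"; it assumes steady linear growth, which is a simplification.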
What Happens If You Do Chaos Engineering
Without Monitoring?
You won’t know what’s happening
Chaos Engineering Prerequisite #3:
Measure The Impact Of Downtime
We need to understand how SEV 0s impact our customers and business.
System Impact:
• Availability
• Durability
Customer/Business Impact:
• Outcome
• Cost
• Time
What is the impact of the Nintendo Switch eShop SEV 0?
SEV Description: Nintendo Switch eShop is down and not working
What is the availability impact? 100%
Time? 5 hours and 40 minutes
Cost? ______
Outcome? Switch users all over the world can’t buy games
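The time and availability figures above can be turned into an availability percentage for a reporting period. The Python sketch below is one simple way to do that arithmetic; the function names and the choice of 30-day and 365-day periods are assumptions made here, and `revenue_per_hour` is a purely hypothetical input, since the slide leaves the cost blank.

```python
from datetime import timedelta


def availability_percent(downtime, period, impact_fraction=1.0):
    """Availability over `period`, given `downtime` at `impact_fraction` impact.

    impact_fraction is 1.0 for a full outage (100% availability impact).
    """
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    lost_seconds = downtime.total_seconds() * impact_fraction
    return 100.0 * (1.0 - lost_seconds / period.total_seconds())


def downtime_cost(downtime, revenue_per_hour):
    """Rough cost estimate; revenue_per_hour is a hypothetical figure you supply."""
    return downtime.total_seconds() / 3600 * revenue_per_hour


# The SEV 0 from the slides: 5 hours and 40 minutes at 100% impact.
outage = timedelta(hours=5, minutes=40)
monthly = availability_percent(outage, timedelta(days=30))
yearly = availability_percent(outage, timedelta(days=365))
```

For this outage, `monthly` comes to about 99.21% and `yearly` to about 99.94%, which shows how much a single SEV 0 consumes of an availability target.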
Now we’re ready to get started with:
Chaos Engineering
Chaos Engineering Case Study: Twilio
Ratequeue Chaos has 3 goals:
1. Pick a shard
2. Kill primary
3. Monitor recovery.
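The three steps above can be sketched as a toy simulation. This is not Twilio's Ratequeue Chaos tooling, which the slides do not show; it is a hypothetical in-memory model written in Python, where `Node`, `Shard`, `failover_tick`, `build_shards` and `run_chaos_round` are all names invented for this illustration. Failover is modelled as "promote the first live replica", which is an assumption about how the system under test recovers.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alive: bool = True


@dataclass
class Shard:
    name: str
    primary: Node
    replicas: list


def failover_tick(shard):
    """Stand-in for the system's own failover logic: promote a live replica."""
    if shard.primary.alive:
        return
    for replica in shard.replicas:
        if replica.alive:
            shard.replicas.remove(replica)
            shard.primary = replica
            return


def build_shards(count, replicas_per_shard=2):
    """Create `count` shards, each with one primary and some replicas."""
    return [
        Shard(
            name=f"shard-{i}",
            primary=Node(f"shard-{i}-primary"),
            replicas=[Node(f"shard-{i}-replica-{j}") for j in range(replicas_per_shard)],
        )
        for i in range(count)
    ]


def run_chaos_round(shards, rng, max_checks=5):
    """Run one experiment: pick a shard, kill its primary, monitor recovery."""
    shard = rng.choice(shards)             # 1. Pick a shard
    killed = shard.primary.name
    shard.primary.alive = False            # 2. Kill primary
    for check in range(1, max_checks + 1): # 3. Monitor recovery
        failover_tick(shard)
        if shard.primary.alive:
            return {"shard": shard.name, "killed": killed,
                    "new_primary": shard.primary.name,
                    "recovered": True, "checks": check}
    return {"shard": shard.name, "killed": killed,
            "new_primary": None, "recovered": False, "checks": max_checks}
```

A call such as `run_chaos_round(build_shards(3), random.Random(0))` returns a small report of which primary was killed and whether a replica took over. Passing a seeded `random.Random` keeps the shard choice repeatable, which helps when comparing runs.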
Share The Chaos Engineering Journey Widely
• Do a Chaos Engineering Kick Off @ All Hands
• Send email updates & progress reports
• Run Monthly Metrics Reviews
• Deliver Presentations
Don’t Surprise Everyone!
What is Gremlin?
Gremlin Chaos Engineering Attacks
There is a range of attacks built in and ready to run on Linux.
Type of Attack / Attack / Gremlin Support (March 2018)
Resource: CPU ✅, Disk ✅, IO ✅, Memory ✅
State: Process Killer ✅, Shutdown ✅, Time Travel ✅
Network: Blackhole ✅, DNS ✅, Latency ✅, Packet Loss ✅
Live Chaos Engineering Demo
Create a Kubernetes Cluster
gremlin.com/community
Master: 159.65.85.204
Node 1: 159.65.85.158
Node 2: 159.65.85.169
Node 3: 159.65.85.202
Host Level Chaos Engineering With Kubernetes
Create a Kubernetes Daemonset For Gremlin
View Your Kubernetes Pods
Run An Attack From The Gremlin Control Panel
Monitor Your Chaos Engineering Attack
Notify Your Team
Let’s Review:
The Path To Chaos Engineering
• High Severity Incident Management
• Monitoring
• Make & Measure Improvements
• Chaos Engineering
• Measure the impact of downtime
Blast Radius and Advanced Chaos
• High Severity Incident Management
• Monitoring
• Make & Measure Improvements
• Chaos Engineering
• Measure the impact of downtime
How do you make improvements?
1. Build - Build a new system / improve existing
2. Borrow - Use open source / contribute to open source
3. Buy - Use 3rd party systems
4. Brush up - GameDays / Team training
5. Break - Chaos Engineering / Failure injection
6. Begone - Decommission systems / delete code
Always Measure Improvements
Tell a story of before and after with metrics
The world needs:
More Resilient Systems
You can create:
More Resilient Systems!
Join us on this journey!
gremlin.com/community
gremlin.com/slack
Thanks!
@tammybutow
gremlin.com
