Chaos Engineering
Injecting failure for building
resilience in systems
Nice to meet you
YURY NIÑO
Software Engineer and Chaos
Engineer Advocate.
Loves building software applications, solving
resilience issues and teaching. Passionate about
reading, writing and cycling.
Agenda
● Resilience vs Reliability
● Why does the world need Resilience and
Reliability?
● Chaos Engineering
● Principles of Chaos
● Chaos in Practice
● Game Days
How many of you
have encountered a crash of
your systems in production?
A recognition for ...
This talk is dedicated to
the well-caffeinated #SystemAdministrators
who get woken up in the
middle of the night when “things go
bump”.
#EngineeringTeam #DigitalFactory
@jnhernandz @
What is a
Resilient System?
A resilient system can maintain an acceptable level
of service in the face of failure.
A resilient system can weather a storm, such as a
large-scale natural disaster or a controlled chaos
engineering experiment.
Tammy Bütow Principal SRE at Gremlin
https://securethegrid.com
A distributed system in production needs to be
resilient in order to be reliable, and this is precisely
the target that we Software Engineers, Systems
Engineers, Site Reliability Engineers and Chaos
Engineers always aim for.
Mine :)
Why does the world need
Resilient Systems?
Because ...
We are surrounded by
distributed systems.
When we read the news on our
cellphones, send an email or buy our
lunch ...
We do not tolerate
their failure!
February 28th, 2017 will be remembered
● Amazon Simple Storage Service (S3) went down in US-EAST-1.
● The outage lasted about 4 hours.
● More than 100,000 websites across the world were impacted.
Me :(
The World is Chaotic!
● Distributed systems contain many moving
parts.
● Many things can go wrong.
○ Hard disks can fail.
○ The network can go down.
○ Customer traffic can overload the system.
How many of you know
what Chaos
Engineering is?
Chaos Engineering
It is the discipline of experimenting in
production on a distributed system in
order to reveal its weaknesses and to
build confidence in its resilience
capability.
https://principlesofchaos.org/
Chaos Engineering
It is deliberately inducing stress or
fault into software and/or hardware as
a way of learning/verifying things
about systems.
https://www.gremlin.com
Chaos Engineering is about
● Simulating the failure of a datacenter.
● Injecting latency between services.
● Randomly causing exceptions.
● Changing the system time (time travel).
● Emulating I/O errors.
http://principlesofchaos.org/
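Two of the items above, injecting latency and randomly causing exceptions, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of any real chaos tool: the names chaos, InjectedFault and fetch_dashboard are hypothetical and exist only for this example.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a failing dependency."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so that each call is delayed and may randomly fail.

    latency_s    -- extra delay added before every call, in seconds
    failure_rate -- probability in [0, 1] that a call raises InjectedFault
    rng, sleep   -- injectable so the behaviour can be made deterministic
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # inject latency before the real call
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical downstream call: 50 ms of added latency, about 30% failures.
@chaos(latency_s=0.05, failure_rate=0.3, rng=random.Random(42))
def fetch_dashboard():
    return {"status": "ok"}
```

Passing rng and sleep as parameters keeps the fault reproducible in tests. In a real experiment the fault would more likely be injected at the network or infrastructure level than inside the application code.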
History of Chaos Engineering
● 2008: Chaos Engineering began at Netflix.
● 2010: Chaos Monkey was launched.
● 2014: The role of Chaos Engineer was created.
● 2018: A lot of resources for Chaos Engineering.
Kolton Andrus
Chaos in Practice
Principles of
Chaos
https://principlesofchaos.org/
1. Steady State
2. Hypothesis: Circuit Breaker builds Resilience
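The first two steps can be expressed as a small experiment skeleton: define a measurable steady state, then check whether it still holds while a fault is active. The sketch below is only an illustration; SteadyState, run_experiment and the probe, inject and rollback callables are hypothetical names, not part of any tool mentioned in this talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """A measurable definition of 'normal', e.g. an error rate below a bound."""
    name: str
    probe: Callable[[], float]  # returns the current value of the metric
    max_value: float            # tolerated upper bound

    def holds(self) -> bool:
        return self.probe() <= self.max_value


def run_experiment(steady_state, inject, rollback):
    """Return True if the steady state still holds while the fault is active."""
    if not steady_state.holds():
        raise RuntimeError("system is not in steady state; aborting experiment")
    inject()
    try:
        return steady_state.holds()
    finally:
        rollback()  # always remove the fault, even if the probe raises
```

Refusing to start when the system is already unhealthy keeps the experiment from adding stress to an ongoing incident. The finally block makes sure the fault is removed whatever the outcome.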
4. Run the Experiment
Application name: Finer
Observability: DataDog
Hypothesis: Circuit Breaker works
Environment: My Home
Results:
Duration: 5 - 10 seconds
Load: 1 request
Actions:

4. Run the Experiment
Application name: Finer
Observability: DataDog
Hypothesis: Circuit Breaker works
Facing latencies > 5 seconds between
dashboard_api and smart_api to open
the circuit.
Environment: My Home
Results:
Duration: 20 milliseconds
Load: 1 request
Actions:
● Issue #4356
● Configure the proper Hystrix parameters
according to the results.
● Implement a fallback.
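To make the two follow-up actions concrete, here is a minimal circuit breaker with a fallback, written in Python. It is a sketch of the general pattern and not Hystrix itself (a Java library with its own configuration properties). The class name, failure_threshold and reset_timeout_s are assumptions chosen for this example, and a slow call is assumed to surface as an exception such as a timeout.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after consecutive failures, then
    short-circuits to the fallback until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return False  # cool-down elapsed: allow a trial call (half-open)
        return True

    def call(self, func, fallback):
        if self.is_open():
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result
```

Tuning failure_threshold and reset_timeout_s against the latencies observed during the experiment is the kind of adjustment the first action refers to. The fallback argument corresponds to the second.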
Game Days
Game Day: Roles
● Master of Disaster
● First on-call
● Team
https://www.pinterest.es/pin/824299538021645731/
Game Days can Transform our Teams
Even though Game Days are not real incidents, they
help engineers gain confidence.
Since we engineers experience failure as part
of our job, we should start designing for failure.
Me :)
The best time to learn about fire
is when you’re on fire.
—Jen Hammond, New Relic engineering manager
How to begin ...
https://chaosengineering.slack.com
https://github.com/dastergon/awesome-chaos-engineering
https://www.infoq.com/chaos-engineering
@yurynino
