From Duke of DevOps
To Queen of Chaos
APIdays.io Paris
December 11 & 12, 2018
Christophe ROCHEFOLLE
Director Operational Experience @OUI.sncf
@crochefolle
Experienced IT executive
providing tech & organization
to improve #quality & #agility
for IT systems,
#ChaosEngineering fan
Co-author of French
DevOps book
Who am I ?
French National Railway
Company
Founded in 1938.
First e-Commerce website
in France
IT Leader in mobility,
transform your journey into
an amazing experience
Where is my playground ?
99.997%
SLA availability
OUR RECORD
39
TICKETS SOLD PER SECOND
SPEED RECORD
574.8
KM/H
2008 Andrew Shafer and Patrick Debois held a "birds of a feather" session at
'Agile Toronto'
2009 “DevOpsDays” conference started in Belgium by Patrick Debois, and the term
“DevOps” coined
2009 “10 Deploys per Day at Flickr” talk by John Allspaw and Paul Hammond
in “Velocity” conference
2009 At the “Velocity” conference, Andrew Clay Shafer coined the "wall of confusion"
2009 Mike Rother wrote Toyota Kata and defined 'Improvement Kata'
2010 “Continuous Delivery” book from Jez Humble and David Farley, defined
"deployment pipeline"
2011 “The Phoenix Project” book from Gene Kim and Kevin Behr
2011 Amazon deploys to production every 11.6 seconds
2014 “DevOps for Dummies” book by Sanjeev Sharma
2014 Etsy deploys more than 50 times a day
2016 “The DevOps Handbook” book by Gene Kim and Jez Humble
2016 First “DevOpsREX” conference in Paris
2018 “Mettre en oeuvre DevOps – 2nd Edition” book by Alain Sacquet and me
DevOps: Shorten design to cash and
get quick feedback
Duke of DevOps
Time is money.
Your TTM rocks !
You have a master in
CI/CD
Queen of Chaos
But the evil
is coming !!!
The automation paradox U-curve
[Figure: TTM and MTTR against TIME, on axes running from slow to fast and from low to high]
Increasing automation
Faster release cycles
Ephemeral knowledge
Increasing complexity
For the first time, availability is
the main concern for European IT
management, ahead of security.
Source: Master of Machines III
Real life
Focus was on the left side
CI/CD
Test automation
Application Lifecycle Management
Artifact management
IaaS / PaaS / CaaS
Deployment
RIGHT
LEFT
Time for Shift-Right
We need new ways to develop
a concern for reliability in our teams
…(an) error budget provides a clear,
objective metric that determines how
unreliable the service is allowed to
be…
SRE Error budget
• paying off some technical debt
• improve the logging to ease support
• add some additional integration or end-to-end tests
• do those first steps to enable blue/green
deployments
• implement service mesh
But, when was the last time that your product owner
willingly added any of those technical stories to the
next sprint?
Why have an error budget?
SRE Error
budget
Where to start ?
1. Convert unavailability to cash
2. Define Service Level Objective with business team
3. Define Error budget
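As a rough illustration of step 1, an availability target can be turned into minutes of tolerated downtime, and those minutes into money. This is only a sketch: `revenue_per_minute` and the 10,000 value passed to it are invented placeholders, not OUI.sncf figures, and the cost model assumes that sales stop completely during an outage.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(slo: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of unavailability that an availability target tolerates."""
    return (1.0 - slo) * period_minutes


def downtime_cost(minutes_down: float, revenue_per_minute: float) -> float:
    """Revenue lost during an outage, assuming sales stop completely."""
    return minutes_down * revenue_per_minute


if __name__ == "__main__":
    for slo in (0.999, 0.9999, 0.99997):
        minutes = allowed_downtime_minutes(slo)
        # 10_000.0 per minute is a made-up figure, for illustration only.
        cost = downtime_cost(minutes, revenue_per_minute=10_000.0)
        print(f"SLO {slo:.3%}: {minutes:8.1f} min/year, up to {cost:,.0f} lost")
```

Putting a cash figure next to each candidate target gives the business team something concrete to weigh when the Service Level Objective is negotiated in step 2.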
Availability = successful requests / (successful requests + failed requests)
A failed request can be:
1. A 500 response, due to some bug.
2. No response, due to the service being down.
3. A slow response: if the client gives up before the response is available, it is as good as no response.
4. Incorrect data, due to some bug.
Error budget = (1 - availability) = failed requests / (successful requests + failed requests)
So if a service SLO is 99.9%, it has a 0.1% error budget. If the service serves one million
requests per quarter, the error budget tells us that it can fail up to one thousand times.
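The definitions above can be sketched in a few lines of Python. The `Request` record, its field names, and the two-second `CLIENT_TIMEOUT_S` are assumptions made for this example; a real service would derive them from its own access logs and client behaviour.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

CLIENT_TIMEOUT_S = 2.0  # assumed client patience; not a value from the slides


@dataclass
class Request:
    status: Optional[int]   # HTTP status, or None when no response arrived
    latency_s: float        # time until the response was available
    data_ok: bool = True    # False when the payload was incorrect


def is_failed(req: Request) -> bool:
    """Apply the four failure cases listed above."""
    if req.status is None:                # no response: service down
        return True
    if req.status >= 500:                 # server error caused by a bug
        return True
    if req.latency_s > CLIENT_TIMEOUT_S:  # too slow: the client gave up
        return True
    return not req.data_ok                # incorrect data


def availability(requests: Iterable[Request]) -> float:
    """Successful requests divided by all requests."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("availability is undefined without requests")
    failed = sum(is_failed(r) for r in reqs)
    return (len(reqs) - failed) / len(reqs)


def allowed_failures(slo: float, total_requests: int) -> int:
    """Error budget as a request count, rounded to the nearest request."""
    return round((1.0 - slo) * total_requests)


if __name__ == "__main__":
    print(allowed_failures(0.999, 1_000_000))  # 99.9% SLO, one million requests
```

Treating slow and incorrect responses as failures keeps the measured availability aligned with what the customer actually experiences, rather than with what the server believes it returned.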
SRE Error
budget
How to use it ?
Company agreement:
Teams may no longer make any new release
without spending time improving the reliability
of the service when error budget is 0.
In practice, they had better make improvements before the budget reaches 0.
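One way such an agreement could be wired into a delivery pipeline is sketched below. The function names, the three decision labels, and the 25% warning threshold are assumptions made for this example; the slides only state the rule that releases stop when the budget is 0.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    budget = round((1.0 - slo) * total_requests)  # allowed failures
    if budget <= 0:
        raise ValueError("this SLO and traffic volume leave no error budget")
    return (budget - failed_requests) / budget


def release_decision(remaining: float, warn_below: float = 0.25) -> str:
    """Map the remaining budget to a hypothetical pipeline decision."""
    if remaining <= 0:
        return "freeze"                   # budget spent: reliability work only
    if remaining < warn_below:
        return "prioritize-reliability"   # act before the budget reaches 0
    return "release"


if __name__ == "__main__":
    for failed in (250, 900, 1200):
        left = budget_remaining(0.999, 1_000_000, failed)
        print(failed, f"{left:.0%}", release_decision(left))
```

The intermediate state reflects the last line of the agreement: a team that reacts while some budget is left rarely reaches the freeze.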
We need new ways to know
what f$$$ happens in production
Monitoring systems have not changed significantly in 20 years and have
fallen behind the way we build software.
Our software is now large distributed systems made up of many non-
uniform interacting components while the core functionality of
monitoring systems has stagnated.
Monitoring is dead
@grepory, Monitorama 2016
Why do we need observability?
Observability
Complexity is exploding everywhere,
but our tools are designed for
a predictable world.
• Can you understand what’s happening inside your
code and systems, simply by asking questions
using your tools?
• Can you answer any new question you think of, or
only the ones you prepared for?
• Having to ship new code every time you want to
ask a new question … SUCKS.
Low
Medium
High
Microservice that does one thing
Function with no side effects
Monolith with logging
Monolith with tracing and logging
Monitoring
Thresholds, alerts, watching the
health of a system by checking for a
long list of symptoms. Black box-
oriented.
Observability
What can you learn about the
running state of a program by
observing its outputs?
(Instrumentation, tracing,
debugging)
Observability
What do we want ?
a system is observable
when your team can quickly
and reliably track down any
new problem with no prior
knowledge.
Observability
Where to start ?
Observability
• Rich instrumentation: internal state from software. Wrap every network call, every data call. Structured data only.
• Events, not metrics: arbitrarily wide events mean you can amass more and more context over time. Use sampling to control costs and bandwidth.
• No aggregation: aggregates destroy your precious details. We need MORE detail and MORE context.
• Few dashboards: dashboards focus on specific, known failure modes. We need to explore raw data to discover what we don’t know. If you already know the answer, do self-healing!
• Test in production: software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world.
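The instrumentation advice above (wrap every call, emit structured and arbitrarily wide events, sample to control cost) might look like the following sketch. `wide_event`, its parameters, and the field names in the usage example are hypothetical; this is not the API of any particular observability product.

```python
import json
import random
import time
from contextlib import contextmanager


@contextmanager
def wide_event(name, sink, sample_rate=1, rng=random, **context):
    """Collect context for one unit of work and emit it as one JSON event."""
    event = {"name": name, "sample_rate": sample_rate, "ok": False, **context}
    start = time.perf_counter()
    try:
        yield event          # callers add fields as they learn them
        event["ok"] = True
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        if not event["ok"]:
            event["sample_rate"] = 1      # failures are never sampled away
            sink(json.dumps(event))
        elif rng.randrange(sample_rate) == 0:
            sink(json.dumps(event))       # keep about 1 in sample_rate successes


if __name__ == "__main__":
    # "GET /booking", user_id and db_calls are made-up example fields.
    with wide_event("GET /booking", print, user_id="u-123") as ev:
        ev["db_calls"] = 2
```

Recording `sample_rate` inside each event lets a query tool weight the kept events back up to approximate totals, so sampling reduces volume without pre-aggregating the details away.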
Need more information ?
https://www.d2si.io/observabilite
Follow @mipsytipsy
engineer/cofounder/CEO
“the only good diff is a red diff”
We need shift-right testing
RIGHT
LEFT
https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1
The performance of complex systems
is typically optimized at the edge of
chaos, just before system behavior
will become unrecognizably turbulent.
Chaos Engineering
—Sidney Dekker, Drift Into Failure
How much confidence can we have in the
complex systems that we put into production?
Why do we need Chaos
Engineering ?
Chaos
Engineering
With so many interacting components, the number of
things that can go wrong in a distributed system is
enormous.
You’ll never be able to prevent all possible failure modes,
but you can identify many of the weaknesses in your
system before they’re triggered by these events.
Queen of Chaos
So, to fight the evil
Chaos
Engineering
Chaos engineering
is the discipline of experimenting
on a distributed system in order
to build confidence in the system's
capacity to withstand turbulent conditions
in production
Principles of Chaos Engineering
Chaos
Engineering
Chaos engineering
2004 Amazon: Jesse Robbins, Master of Disaster
2010 Netflix: Greg Orzell (@chaosimia), first implementation of Chaos Monkey to enforce use of auto-scaled stateless services
2012 NetflixOSS open sources Simian Army
2016 Gremlin Inc founded
2017 Netflix chaos engineering book. Chaos Toolkit open source project
2018 Chaos concepts getting adopted widely!
Where to start ?
Chaos
Engineering
Hypothesis testing
We think we have safety margin in this dimension, let’s
carefully test to be sure
In production
Without causing an issue
1. Start by defining ‘steady state’ as some measurable output of a system that
indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group
and the experimental group.
3. Introduce variables that reflect real world events like servers that crash,
hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state
between the control group and the experimental group.
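The four steps can be sketched as a comparison between two groups of requests. Here the steady state is taken to be the fraction of successful requests and the hypothesis holds when the two groups differ by no more than a tolerance; both the choice of metric and the 1% tolerance are assumptions made for this example, and a real experiment would more likely rely on a proper statistical test.

```python
from statistics import mean


def steady_state(success_flags) -> float:
    """Steady-state metric: fraction of successful requests in a group."""
    flags = list(success_flags)
    if not flags:
        raise ValueError("a group needs at least one observation")
    return mean(1.0 if ok else 0.0 for ok in flags)


def hypothesis_holds(control, experiment, tolerance: float = 0.01) -> bool:
    """True when the experimental group stays within tolerance of control."""
    return abs(steady_state(control) - steady_state(experiment)) <= tolerance


if __name__ == "__main__":
    control = [True] * 995 + [False] * 5        # no fault injected
    mild = [True] * 990 + [False] * 10          # fault injected, well absorbed
    severe = [True] * 950 + [False] * 50        # fault injected, users impacted
    print(hypothesis_holds(control, mild))      # hypothesis survives
    print(hypothesis_holds(control, severe))    # hypothesis disproved
```

A disproved hypothesis is the useful outcome: it points at a weakness found under controlled conditions rather than during an incident.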
• Simulating the failure of an entire region or datacenter.
• Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in
production.
• Injecting latency between services for a select percentage of traffic over a predetermined period
of time.
• Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
• Code insertion: Adding instructions to the target program and allowing fault injection to occur
prior to certain instructions.
• Time travel: forcing system clocks out of sync with each other.
• Executing a routine in driver code emulating I/O errors.
• Maxing out CPU cores on an Elasticsearch cluster.
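Function-based chaos and latency injection from the list above can be illustrated with a small decorator. This is a minimal sketch, not the mechanism of Chaos Monkey, Gremlin, or any other tool named in this deck; `chaos`, `InjectedFault`, and `price_quote` are hypothetical names.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised by the chaos wrapper, never by the wrapped function itself."""


def chaos(failure_rate=0.0, latency_rate=0.0, latency_s=0.0, rng=None, enabled=True):
    """Randomly delay or fail calls to the decorated function."""
    rng = rng if rng is not None else random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                if rng.random() < latency_rate:
                    time.sleep(latency_s)            # simulate a slow dependency
                if rng.random() < failure_rate:
                    raise InjectedFault(f"forced failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.1, rng=random.Random(42))
def price_quote(km: float) -> float:
    """Made-up business function used only to demonstrate the wrapper."""
    return round(km * 0.12, 2)


if __name__ == "__main__":
    failures = 0
    for _ in range(1000):
        try:
            price_quote(100)
        except InjectedFault:
            failures += 1
    print(f"{failures} injected failures out of 1000 calls")
```

The `enabled` flag and the low default rates act as a blast-radius control: the experiment can be switched off instantly and only ever touches a chosen share of calls.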
Injecting Chaos
Chaos
Engineering
Different experiments for
every stage
Chaos
Engineering
Infrastructure
Switching
Application
People
Game days
Simian Army
chaostoolkit
ChAP
Gremlin
Our story of Chaos Engineering @OUI.sncf
2015
2016 2018
Birth of an
ambition :
Chaos Monkey
EXPERIMENTATION
INDUSTRIALIZATION
All critical
applications run
Chaos experiment
2017
OUR BESTIARY IS BORN IN OCTOBER
1ST DAYS OF CHAOS
Detection : 87%
Diagnostic : 73%
Resolution : 45%
RUN IN PRODUCTION
First Chaos Monkey in
production…
…and production is
still up
2ND DAYS OF CHAOS 3RD DAYS OF CHAOS
To follow our experiment, birth of the Paris Chaos Engineering Meetup
http://meetu.ps/c/3BMlX/xNjMx/f
https://days-of-chaos.slack.com
https://chaosengineering.slack.com
http://days-of-chaos.com/
https://medium.com/paris-chaos-engineering-community
SRE Error Budget
Observability
Test in production
Chaos Engineering
Continuous Quality
CI/CD
Test automation
Application Lifecycle Management
Artifact management
IaaS / PaaS / CaaS
Deployment
Thank you
And
Bon appétit!!!

More Related Content

What's hot

Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos Engineering
Yury Roa
 
Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in Systems
Yury Roa
 
DevOps for Defenders in the Enterprise
DevOps for Defenders in the EnterpriseDevOps for Defenders in the Enterprise
DevOps for Defenders in the Enterprise
James Wickett
 
Chaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field GuideChaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field Guide
matthewbrahms
 
Chaos Engineering when you're not Netflix
Chaos Engineering when you're not NetflixChaos Engineering when you're not Netflix
Chaos Engineering when you're not Netflix
Martez Reed
 
SecOps - Bringing Agility into Security
SecOps - Bringing Agility into SecuritySecOps - Bringing Agility into Security
SecOps - Bringing Agility into Security
Atlassian
 
Craft 2019 - Security Chaos Engineering - Security Precognition
Craft 2019 - Security Chaos Engineering - Security PrecognitionCraft 2019 - Security Chaos Engineering - Security Precognition
Craft 2019 - Security Chaos Engineering - Security Precognition
Aaron Rinehart
 
Chaos Engineering – why we should all practice breaking things on purpose by ...
Chaos Engineering – why we should all practice breaking things on purpose by ...Chaos Engineering – why we should all practice breaking things on purpose by ...
Chaos Engineering – why we should all practice breaking things on purpose by ...
Alex Cachia
 
The New Ways of Chaos, Security, and DevOps
The New Ways of Chaos, Security, and DevOpsThe New Ways of Chaos, Security, and DevOps
The New Ways of Chaos, Security, and DevOps
James Wickett
 
GDS-Austin - DevSecOps & Security Chaos Engineering
GDS-Austin - DevSecOps & Security Chaos EngineeringGDS-Austin - DevSecOps & Security Chaos Engineering
GDS-Austin - DevSecOps & Security Chaos Engineering
Aaron Rinehart
 
ChaoSlingr: Introducing Security based Chaos Testing
ChaoSlingr: Introducing Security based Chaos TestingChaoSlingr: Introducing Security based Chaos Testing
ChaoSlingr: Introducing Security based Chaos Testing
Aaron Rinehart
 
The Seven Habits of the Highly Effective DevSecOp
The Seven Habits of the Highly Effective DevSecOpThe Seven Habits of the Highly Effective DevSecOp
The Seven Habits of the Highly Effective DevSecOp
James Wickett
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos Engineering
Gremlin
 
OWASP AppSec Global 2019 Security & Chaos Engineering
OWASP AppSec Global 2019 Security & Chaos EngineeringOWASP AppSec Global 2019 Security & Chaos Engineering
OWASP AppSec Global 2019 Security & Chaos Engineering
Aaron Rinehart
 
DevSecOps and the New Path Forward
DevSecOps and the New Path ForwardDevSecOps and the New Path Forward
DevSecOps and the New Path Forward
James Wickett
 
Adversary Driven Defense in the Real World
Adversary Driven Defense in the Real WorldAdversary Driven Defense in the Real World
Adversary Driven Defense in the Real World
James Wickett
 
DevOps - Understanding Core Concepts (Old)
DevOps - Understanding Core Concepts (Old)DevOps - Understanding Core Concepts (Old)
DevOps - Understanding Core Concepts (Old)
Nitin Bhide
 
Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019
Squadcast Inc
 
DevOps for the Discouraged
DevOps for the Discouraged DevOps for the Discouraged
DevOps for the Discouraged
James Wickett
 
Road to DevOps ROI
Road to DevOps ROIRoad to DevOps ROI
Road to DevOps ROI
Cloudmunch
 

What's hot (20)

Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos Engineering
 
Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in Systems
 
DevOps for Defenders in the Enterprise
DevOps for Defenders in the EnterpriseDevOps for Defenders in the Enterprise
DevOps for Defenders in the Enterprise
 
Chaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field GuideChaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field Guide
 
Chaos Engineering when you're not Netflix
Chaos Engineering when you're not NetflixChaos Engineering when you're not Netflix
Chaos Engineering when you're not Netflix
 
SecOps - Bringing Agility into Security
SecOps - Bringing Agility into SecuritySecOps - Bringing Agility into Security
SecOps - Bringing Agility into Security
 
Craft 2019 - Security Chaos Engineering - Security Precognition
Craft 2019 - Security Chaos Engineering - Security PrecognitionCraft 2019 - Security Chaos Engineering - Security Precognition
Craft 2019 - Security Chaos Engineering - Security Precognition
 
Chaos Engineering – why we should all practice breaking things on purpose by ...
Chaos Engineering – why we should all practice breaking things on purpose by ...Chaos Engineering – why we should all practice breaking things on purpose by ...
Chaos Engineering – why we should all practice breaking things on purpose by ...
 
The New Ways of Chaos, Security, and DevOps
The New Ways of Chaos, Security, and DevOpsThe New Ways of Chaos, Security, and DevOps
The New Ways of Chaos, Security, and DevOps
 
GDS-Austin - DevSecOps & Security Chaos Engineering
GDS-Austin - DevSecOps & Security Chaos EngineeringGDS-Austin - DevSecOps & Security Chaos Engineering
GDS-Austin - DevSecOps & Security Chaos Engineering
 
ChaoSlingr: Introducing Security based Chaos Testing
ChaoSlingr: Introducing Security based Chaos TestingChaoSlingr: Introducing Security based Chaos Testing
ChaoSlingr: Introducing Security based Chaos Testing
 
The Seven Habits of the Highly Effective DevSecOp
The Seven Habits of the Highly Effective DevSecOpThe Seven Habits of the Highly Effective DevSecOp
The Seven Habits of the Highly Effective DevSecOp
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos Engineering
 
OWASP AppSec Global 2019 Security & Chaos Engineering
OWASP AppSec Global 2019 Security & Chaos EngineeringOWASP AppSec Global 2019 Security & Chaos Engineering
OWASP AppSec Global 2019 Security & Chaos Engineering
 
DevSecOps and the New Path Forward
DevSecOps and the New Path ForwardDevSecOps and the New Path Forward
DevSecOps and the New Path Forward
 
Adversary Driven Defense in the Real World
Adversary Driven Defense in the Real WorldAdversary Driven Defense in the Real World
Adversary Driven Defense in the Real World
 
DevOps - Understanding Core Concepts (Old)
DevOps - Understanding Core Concepts (Old)DevOps - Understanding Core Concepts (Old)
DevOps - Understanding Core Concepts (Old)
 
Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019Making Observability Actionable At Scale - DBS DevConnect 2019
Making Observability Actionable At Scale - DBS DevConnect 2019
 
DevOps for the Discouraged
DevOps for the Discouraged DevOps for the Discouraged
DevOps for the Discouraged
 
Road to DevOps ROI
Road to DevOps ROIRoad to DevOps ROI
Road to DevOps ROI
 

Similar to From Duke of DevOps to Queen of Chaos - Api days 2018

An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
Brian Brazil
 
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as CodeConfoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
Steve Mercier
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
Joakim Lindbom
 
A Tale of Contemporary Software
A Tale of Contemporary SoftwareA Tale of Contemporary Software
A Tale of Contemporary Software
Yun Zhi Lin
 
Embracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeEmbracing Failure - AzureDay Rome
Embracing Failure - AzureDay Rome
Alberto Acerbis
 
From Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.auFrom Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.au
evanbottcher
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
Brian Brazil
 
Evolving to Cloud-Native - Nate Schutta (2/2)
Evolving to Cloud-Native - Nate Schutta (2/2)Evolving to Cloud-Native - Nate Schutta (2/2)
Evolving to Cloud-Native - Nate Schutta (2/2)
VMware Tanzu
 
Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015
Mirco Hering
 
Moving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed TracesMoving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed Traces
KP Kaiser
 
Accelerate your Application Delivery with DevOps and Microservices
Accelerate your Application Delivery with DevOps and MicroservicesAccelerate your Application Delivery with DevOps and Microservices
Accelerate your Application Delivery with DevOps and Microservices
Amazon Web Services
 
2016 - 10 questions you should answer before building a new microservice
2016 - 10 questions you should answer before building a new microservice2016 - 10 questions you should answer before building a new microservice
2016 - 10 questions you should answer before building a new microservice
devopsdaysaustin
 
Workshop - The Little Pattern That Could.pdf
Workshop - The Little Pattern That Could.pdfWorkshop - The Little Pattern That Could.pdf
Workshop - The Little Pattern That Could.pdf
TobiasGoeschel
 
Model-Based Testing for Cypress
Model-Based Testing for CypressModel-Based Testing for Cypress
Model-Based Testing for Cypress
Curiosity Software Ireland
 
Scaling Engineering with Docker
Scaling Engineering with DockerScaling Engineering with Docker
Scaling Engineering with Docker
Tom Leach
 
Monitoring As Code: How to Integrate App Monitoring Into Your Developer Cycle
Monitoring As Code: How to Integrate App Monitoring Into Your Developer CycleMonitoring As Code: How to Integrate App Monitoring Into Your Developer Cycle
Monitoring As Code: How to Integrate App Monitoring Into Your Developer Cycle
Atlassian
 
Week 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docxWeek 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docx
estefana2345678
 
Code instrumentation
Code instrumentationCode instrumentation
Code instrumentation
Mennan Tekbir
 
Innovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsInnovate Better Through Machine data Analytics
Innovate Better Through Machine data Analytics
Hal Rottenberg
 
Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...
Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...
Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...
Curiosity Software Ireland
 

Similar to From Duke of DevOps to Queen of Chaos - Api days 2018 (20)

An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as CodeConfoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
 
A Tale of Contemporary Software
A Tale of Contemporary SoftwareA Tale of Contemporary Software
A Tale of Contemporary Software
 
Embracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeEmbracing Failure - AzureDay Rome
Embracing Failure - AzureDay Rome
 
From Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.auFrom Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.au
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Evolving to Cloud-Native - Nate Schutta (2/2)
Evolving to Cloud-Native - Nate Schutta (2/2)Evolving to Cloud-Native - Nate Schutta (2/2)
Evolving to Cloud-Native - Nate Schutta (2/2)
 
Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015
 
Moving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed TracesMoving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed Traces
 
Accelerate your Application Delivery with DevOps and Microservices
Accelerate your Application Delivery with DevOps and MicroservicesAccelerate your Application Delivery with DevOps and Microservices
Accelerate your Application Delivery with DevOps and Microservices
 
2016 - 10 questions you should answer before building a new microservice
2016 - 10 questions you should answer before building a new microservice2016 - 10 questions you should answer before building a new microservice
2016 - 10 questions you should answer before building a new microservice
 
Workshop - The Little Pattern That Could.pdf
Workshop - The Little Pattern That Could.pdfWorkshop - The Little Pattern That Could.pdf
Workshop - The Little Pattern That Could.pdf
 
Model-Based Testing for Cypress
Model-Based Testing for CypressModel-Based Testing for Cypress
Model-Based Testing for Cypress
 
Scaling Engineering with Docker
Scaling Engineering with DockerScaling Engineering with Docker
Scaling Engineering with Docker
 
Monitoring As Code: How to Integrate App Monitoring Into Your Developer Cycle
Monitoring As Code: How to Integrate App Monitoring Into Your Developer CycleMonitoring As Code: How to Integrate App Monitoring Into Your Developer Cycle
Monitoring As Code: How to Integrate App Monitoring Into Your Developer Cycle
 
Week 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docxWeek 4 Assignment - Software Development PlanScenario-Your team has be.docx
Week 4 Assignment - Software Development PlanScenario-Your team has be.docx
 
Code instrumentation
Code instrumentationCode instrumentation
Code instrumentation
 
Innovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsInnovate Better Through Machine data Analytics
Innovate Better Through Machine data Analytics
 
Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...
Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...
Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test...
 

More from Christophe Rochefolle

Agile Secteur Public - Numérique Responsable
Agile Secteur Public - Numérique ResponsableAgile Secteur Public - Numérique Responsable
Agile Secteur Public - Numérique Responsable
Christophe Rochefolle
 
Une App responsable pour de la mobilité durable
Une App responsable pour de la mobilité durableUne App responsable pour de la mobilité durable
Une App responsable pour de la mobilité durable
Christophe Rochefolle
 
#DevOps - Et si on déployait le vendredi
#DevOps - Et si on déployait le vendredi#DevOps - Et si on déployait le vendredi
#DevOps - Et si on déployait le vendredi
Christophe Rochefolle
 
Cloud Expo Europe 2018 - "Et si on testait en production ?"
Cloud Expo Europe 2018 - "Et si on testait en production ?"Cloud Expo Europe 2018 - "Et si on testait en production ?"
Cloud Expo Europe 2018 - "Et si on testait en production ?"
Christophe Rochefolle
 
Paris Chaos Engineering Meetup #6
Paris Chaos Engineering Meetup #6Paris Chaos Engineering Meetup #6
Paris Chaos Engineering Meetup #6
Christophe Rochefolle
 
Qualité Logiciel - Outils Open Source pour Java et Web
Qualité Logiciel - Outils Open Source pour Java et WebQualité Logiciel - Outils Open Source pour Java et Web
Qualité Logiciel - Outils Open Source pour Java et Web
Christophe Rochefolle
 
Qualité logiciel - Generalités
Qualité logiciel - GeneralitésQualité logiciel - Generalités
Qualité logiciel - Generalités
Christophe Rochefolle
 
Automatisation des tests - objectifs et concepts - partie 2
Automatisation des tests  - objectifs et concepts - partie 2Automatisation des tests  - objectifs et concepts - partie 2
Automatisation des tests - objectifs et concepts - partie 2
Christophe Rochefolle
 
Automatisation des tests - objectifs et concepts - partie 1
Automatisation des tests  - objectifs et concepts - partie 1Automatisation des tests  - objectifs et concepts - partie 1
Automatisation des tests - objectifs et concepts - partie 1
Christophe Rochefolle
 
Paris Chaos Engineering Meetup #5
Paris Chaos Engineering Meetup #5Paris Chaos Engineering Meetup #5
Paris Chaos Engineering Meetup #5
Christophe Rochefolle
 
Jftl 2018 chaos engineering
Jftl 2018   chaos engineeringJftl 2018   chaos engineering
Jftl 2018 chaos engineering
Christophe Rochefolle
 
Paris Chaos Engineering Meetup #2
Paris Chaos Engineering Meetup #2Paris Chaos Engineering Meetup #2
Paris Chaos Engineering Meetup #2
Christophe Rochefolle
 
Paris Chaos Engineering Meetup #1
Paris Chaos Engineering Meetup #1 Paris Chaos Engineering Meetup #1
Paris Chaos Engineering Meetup #1
Christophe Rochefolle
 

More from Christophe Rochefolle (13)

Agile Secteur Public - Numérique Responsable
Agile Secteur Public - Numérique ResponsableAgile Secteur Public - Numérique Responsable
Agile Secteur Public - Numérique Responsable
 
Une App responsable pour de la mobilité durable
Une App responsable pour de la mobilité durableUne App responsable pour de la mobilité durable
Une App responsable pour de la mobilité durable
 
#DevOps - Et si on déployait le vendredi
#DevOps - Et si on déployait le vendredi#DevOps - Et si on déployait le vendredi
#DevOps - Et si on déployait le vendredi
 
Cloud Expo Europe 2018 - "Et si on testait en production ?"
Cloud Expo Europe 2018 - "Et si on testait en production ?"Cloud Expo Europe 2018 - "Et si on testait en production ?"
Cloud Expo Europe 2018 - "Et si on testait en production ?"
 
Paris Chaos Engineering Meetup #6
Paris Chaos Engineering Meetup #6Paris Chaos Engineering Meetup #6
Paris Chaos Engineering Meetup #6
 
Qualité Logiciel - Outils Open Source pour Java et Web
Qualité Logiciel - Outils Open Source pour Java et WebQualité Logiciel - Outils Open Source pour Java et Web
Qualité Logiciel - Outils Open Source pour Java et Web
 
Qualité logiciel - Generalités
Qualité logiciel - GeneralitésQualité logiciel - Generalités
Qualité logiciel - Generalités
 
Automatisation des tests - objectifs et concepts - partie 2
Automatisation des tests  - objectifs et concepts - partie 2Automatisation des tests  - objectifs et concepts - partie 2
Automatisation des tests - objectifs et concepts - partie 2
 
Automatisation des tests - objectifs et concepts - partie 1
Automatisation des tests  - objectifs et concepts - partie 1Automatisation des tests  - objectifs et concepts - partie 1
Automatisation des tests - objectifs et concepts - partie 1
 
Paris Chaos Engineering Meetup #5
Paris Chaos Engineering Meetup #5Paris Chaos Engineering Meetup #5
Paris Chaos Engineering Meetup #5
 
Jftl 2018 chaos engineering
Jftl 2018   chaos engineeringJftl 2018   chaos engineering
Jftl 2018 chaos engineering
 
Paris Chaos Engineering Meetup #2
Paris Chaos Engineering Meetup #2Paris Chaos Engineering Meetup #2
Paris Chaos Engineering Meetup #2
 
Paris Chaos Engineering Meetup #1
Paris Chaos Engineering Meetup #1 Paris Chaos Engineering Meetup #1
Paris Chaos Engineering Meetup #1
 


From Duke of DevOps to Queen of Chaos - Api days 2018

  • 1. From Duke of DevOps To Queen of Chaos APIdays.io Paris December 11 & 12, 2018 Christophe ROCHEFOLLE Director Operational Experience @OUI.sncf @crochefolle
  • 2. Experienced IT executive providing tech & organization to improve #quality & #agility for IT systems, #ChaosEngineering fan Co-author of French DevOps book Who am I ?
  • 3. French National Railway Company Founded in 1938. First e-Commerce website in France IT Leader in mobility, transform your journey into an amazing experience Where is my playground ? 99,997% SLA availability OUR RECORD 39 TICKETS SOLD by SECOND SPEED RECORD 574.8 KM/H
  • 4. DevOps
      2008: Andrew Shafer and Patrick Debois held a "birds of a feather" session at Agile Toronto
      2009: "DevOpsDays" conference started in Belgium by Patrick Debois, and the term "DevOps" coined
      2009: "10 Deploys per Day at Flickr" talk by John Allspaw and Paul Hammond at the Velocity conference
      2009: At the Velocity conference, Andrew Clay Shafer coined the "wall of confusion"
      2009: Mike Rother wrote Toyota Kata and defined the "Improvement Kata"
      2010: "Continuous Delivery" book by Jez Humble and David Farley defined the "deployment pipeline"
      2011: "The Phoenix Project" book by Gene Kim and Kevin Behr
      2011: Amazon deploys to production every 11.6 seconds
      2014: "DevOps for Dummies" book by Sanjeev Sharma
      2014: Etsy deploys more than 50 times a day
      2016: "The DevOps Handbook" book by Gene Kim and Jez Humble
      2016: First "DevOpsREX" conference in Paris
      2018: "Mettre en oeuvre DevOps – 2nd Edition" book by Alain Sacquet and me
  • 5. DevOps: Shorten design to cash and get quick feedback
  • 6. Duke of DevOps Time is money. Your TTM rocks ! You have a master in CI/CD
  • 7. Queen of Chaos: But the evil is coming!!! The automation paradox U-curve: increasing automation, faster release cycles, ephemeral knowledge, increasing complexity. [Chart labels: TIME, TTM, MTTR; slow / fast; low / high]
  • 8. For the first time, availability is the main concern for European IT management, ahead of security. Source: Master of Machines III
  • 9. Real life: the focus was on the left side: CI/CD, test automation, application lifecycle management, artifact management, IaaS / PaaS / CaaS, deployment
  • 11. We need new ways to develop a concern for reliability in our teams
  • 12. …(an) error budget provides a clear, objective metric that determines how unreliable the service is allowed to be… SRE Error budget
  • 13. SRE Error budget: Why have an error budget?
      • paying off some technical debt
      • improving the logging to ease support
      • adding some additional integration or end-to-end tests
      • taking those first steps to enable blue/green deployments
      • implementing a service mesh
      But when was the last time that your product owner willingly added any of those technical stories to the next sprint?
  • 14. SRE Error budget: Where to start?
      1. Convert unavailability to cash
      2. Define a Service Level Objective with the business team
      3. Define the error budget
      Availability = successful requests / (successful requests + failed requests)
      A failed request can be:
      1. A 500 response, due to some bug.
      2. No response, due to the service being down.
      3. A slow response: if the client gives up before the response is available, it is as good as no response.
      4. Incorrect data, due to some bug.
      Error budget = (1 - availability) = failed requests / (successful requests + failed requests)
      So if a service SLO is 99.9%, it has a 0.1% error budget. If the service serves one million requests per quarter, the error budget tells us that it can fail up to one thousand times.
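
The availability and error-budget formulas from slide 14 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and how requests are counted as failed (500s, timeouts, slow or incorrect responses) is assumed to be decided elsewhere. With a 99.9% SLO and one million requests per quarter, the arithmetic gives 1,000 allowed failures.

```python
def availability(successful: int, failed: int) -> float:
    """Availability = successful requests / (successful + failed requests)."""
    total = successful + failed
    if total == 0:
        raise ValueError("no requests observed")
    return successful / total


def error_budget_ratio(slo: float) -> float:
    """Error budget as a fraction of requests: 1 - SLO."""
    return 1 - slo


def allowed_failures(slo: float, total_requests: int) -> int:
    """Number of requests that may fail in the period without breaking the SLO."""
    return round(error_budget_ratio(slo) * total_requests)


if __name__ == "__main__":
    # Hypothetical quarter: one million requests against a 99.9% SLO.
    print(allowed_failures(0.999, 1_000_000))
    print(availability(successful=999_400, failed=600))
```

Rounding is used because `1 - 0.999` is not exactly representable in floating point; a stricter policy might round down instead.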
  • 15. SRE Error budget: How to use it? Company agreement: when the error budget reaches 0, teams may no longer ship any new release without first spending time improving the reliability of the service. In fact, they had better make improvements before that point.
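
One way to picture the company agreement from slide 15 is as a small release gate. The sketch below is an assumption-laden illustration, not the OUI.sncf process: the `BudgetStatus` type, the three decision labels, and the 25% early-warning threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    slo: float             # e.g. 0.999
    total_requests: int    # requests served in the period
    failed_requests: int   # requests counted as failed in the period

    @property
    def allowed_failures(self) -> int:
        return round((1 - self.slo) * self.total_requests)

    @property
    def remaining(self) -> int:
        return self.allowed_failures - self.failed_requests


def release_decision(status: BudgetStatus, warn_ratio: float = 0.25) -> str:
    """Return 'freeze', 'prioritize-reliability' or 'release'.

    warn_ratio is a hypothetical threshold: below this share of remaining
    budget, reliability work is suggested before the budget hits zero.
    """
    if status.remaining <= 0:
        return "freeze"
    if status.remaining <= warn_ratio * status.allowed_failures:
        return "prioritize-reliability"
    return "release"


if __name__ == "__main__":
    print(release_decision(BudgetStatus(0.999, 1_000_000, 100)))
    print(release_decision(BudgetStatus(0.999, 1_000_000, 800)))
    print(release_decision(BudgetStatus(0.999, 1_000_000, 1_000)))
```

The intermediate state reflects the slide's remark that teams should improve reliability before the budget is exhausted rather than after.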
  • 16. We need new ways to know what f$$$ happens in production
  • 17. Monitoring is dead: Monitoring systems have not changed significantly in 20 years and have fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components, while the core functionality of monitoring systems has stagnated. (@grepory, Monitorama 2016)
  • 18. Observability: Why do we need observability? Complexity is exploding everywhere, but our tools are designed for a predictable world. • Can you understand what’s happening inside your code and systems, simply by asking questions using your tools? • Can you answer any new question you think of, or only the ones you prepared for? • Having to ship new code every time you want to ask a new question … SUCKS.
  • 19. Observability. Monitoring: thresholds, alerts, watching the health of a system by checking for a long list of symptoms. Black-box oriented. Observability: what can you learn about the running state of a program by observing its outputs? (Instrumentation, tracing, debugging.) [Figure: observability scale from Low through Medium to High, with the examples "Microservice that does one thing", "Function with no side effects", "Monolith with logging", "Monolith with tracing and logging"]
  • 20. What do we want ? a system is observable when your team can quickly and reliably track down any new problem with no prior knowledge. Observability
  • 21. Observability: Where to start?
      • Rich instrumentation: internal state from software. Wrap every network call, every data call.
      • Events, not metrics: structured data only. Arbitrarily wide events mean you can amass more and more context over time. Use sampling to control costs and bandwidth.
      • No aggregation: aggregates destroy your precious details. We need MORE detail and MORE context.
      • Few dashboards: dashboards focus on specific, known possible failures. We need to explore raw data to discover what we don’t know. If you already know the answer, do self-healing!
      • Test in production: software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world.
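
The "events, not metrics" advice from slide 21 can be illustrated with a minimal sketch that emits one wide, structured event per request and samples successful requests to control volume. Everything here is hypothetical: the field names, the `build_event` / `emit` helpers, and the rule of always keeping server errors are assumptions for the example, not the API of any particular observability tool.

```python
import json
import random
import time


def build_event(service: str, route: str, status: int, duration_ms: float, **context) -> dict:
    """Build one wide event; any extra keyword becomes an additional field."""
    event = {
        "timestamp": time.time(),
        "service": service,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
    }
    event.update(context)  # arbitrarily wide: add as much context as you have
    return event


def emit(event: dict, sample_rate: int = 10, rng=random.random):
    """Print the event as one JSON line, or return None if it is sampled out.

    Server errors (status >= 500) are always kept; other events are kept
    with probability 1 / sample_rate. The applied rate is stored in the
    event so counts can be re-weighted later.
    """
    is_error = event["status"] >= 500
    if not is_error and rng() >= 1 / sample_rate:
        return None
    kept = dict(event, sample_rate=1 if is_error else sample_rate)
    line = json.dumps(kept, sort_keys=True)
    print(line)
    return line


if __name__ == "__main__":
    evt = build_event("booking", "/search", 500, 812.5,
                      customer_tier="gold", upstream="inventory", retries=2)
    emit(evt)
```

Recording the sample rate on each kept event is what allows raw events to stay unaggregated while still supporting approximate totals.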
  • 22. Need more information ? https://www.d2si.io/observabilite Follow @mipsytipsy engineer/cofounder/CEO “the only good diff is a red diff”
  • 23. We need shift-right testing [Figure labels: RIGHT, LEFT]
  • 25. The performance of complex systems is typically optimized at the edge of chaos, just before system behavior will become unrecognizably turbulent. Chaos Engineering —Sidney Dekker, Drift Into Failure
  • 26. Chaos Engineering: Why do we need Chaos Engineering? How much confidence can we have in the complex systems that we put into production? With so many interacting components, the number of things that can go wrong in a distributed system is enormous. You’ll never be able to prevent all possible failure modes, but you can identify many of the weaknesses in your system before they’re triggered by these events.
  • 27. Queen of Chaos So, to fight the evil Chaos Engineering
  • 28. Chaos Engineering: Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capacity to withstand turbulent conditions in production. (Principles of Chaos Engineering)
  • 29. Chaos engineering
      2004: Amazon, Jesse Robbins. Master of disaster
      2010: Netflix, Greg Orzell (@chaosimia). First implementation of Chaos Monkey to enforce use of auto-scaled stateless services
      2012: NetflixOSS open sources Simian Army
      2016: Gremlin Inc founded
      2017: Netflix chaos engineering book. Chaos Toolkit open source project
      2018: Chaos concepts getting adopted widely!
  • 30. Chaos Engineering: Where to start? Hypothesis testing: we think we have a safety margin in this dimension, so let’s carefully test to be sure, in production, without causing an issue.
      1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
      2. Hypothesize that this steady state will continue in both the control group and the experimental group.
      3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
      4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
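
The four steps on slide 30 can be sketched as a comparison between a control group and an experimental group. This is a deliberately simplified illustration: it assumes the steady state is measured as a request success rate, that outcomes have already been collected as lists of booleans, and that a fixed tolerance (a hypothetical 1 percentage point here) decides whether the two groups differ. A real experiment would also need a blast-radius limit and an abort condition.

```python
from statistics import mean


def success_rate(outcomes) -> float:
    """Steady-state metric: share of requests that succeeded."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)


def hypothesis_holds(control, experimental, tolerance: float = 0.01) -> bool:
    """True if the steady state is the same in both groups, within tolerance.

    control:      outcomes from traffic with no fault injected
    experimental: outcomes from traffic exposed to the injected fault
    """
    difference = abs(success_rate(control) - success_rate(experimental))
    return difference <= tolerance


if __name__ == "__main__":
    # Hypothetical observations gathered during an experiment window.
    control = [True] * 1000
    experimental = [True] * 995 + [False] * 5
    if hypothesis_holds(control, experimental):
        print("steady state held: confidence gained")
    else:
        print("hypothesis disproved: weakness found, stop and fix")
```

A disproved hypothesis is the useful outcome here: it points at a weakness found under controlled conditions rather than during an incident.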
  • 31. Chaos Engineering: Injecting Chaos
      • Simulating the failure of an entire region or datacenter.
      • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
      • Injecting latency between services for a select percentage of traffic over a predetermined period of time.
      • Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
      • Code insertion: adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
      • Time travel: forcing system clocks out of sync with each other.
      • Executing a routine in driver code emulating I/O errors.
      • Maxing out CPU cores on an Elasticsearch cluster.
  • 32. Chaos Engineering: Different experiments for every stage. [Figure labels: Infrastructure, Switching, Application, People; Game days, Simian Army, chaostoolkit, ChAP, Gremlin]
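
The "function-based chaos" item on slide 31 (randomly causing functions to throw exceptions) can be illustrated with a small decorator. This is a minimal sketch under stated assumptions: the `chaos` decorator, the `ChaosError` type, and the `enabled` switch are hypothetical names invented for the example and do not correspond to the API of Chaos Monkey, Gremlin, or Chaos Toolkit.

```python
import functools
import random


class ChaosError(RuntimeError):
    """Raised in place of the real call when a fault is injected."""


def chaos(failure_rate: float, rng=random.random, enabled=lambda: True):
    """Make the decorated function fail with probability failure_rate.

    rng and enabled are injectable so the behavior can be made
    deterministic in tests or switched off entirely.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled() and rng() < failure_rate:
                raise ChaosError(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.1)
def fetch_price(train_id: str) -> float:
    # Stand-in for a real downstream call.
    return 42.0


if __name__ == "__main__":
    failures = 0
    for _ in range(1000):
        try:
            fetch_price("TGV-6201")
        except ChaosError:
            failures += 1
    print(f"{failures} injected failures out of 1000 calls")
```

Keeping the switch and the random source injectable makes it possible to restrict the experiment to a small share of traffic and to turn it off immediately.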
  • 33. Our story of Chaos Engineering @OUI.sncf 2015 2016 2018 Birth of an ambition : Chaos Monkey EXPERIMENTATION INDUSTRIALIZATION All critical applications run Chaos experiment 2017 OUR BESTIARY IS BORN IN OCTOBER 1ST DAYS OF CHAOS Detection : 87% Diagnostic : 73% Resolution : 45% RUN IN PRODUCTION First Chaos Monkey in production… …and production is still up 2ND DAYS OF CHAOS 3RD DAYS OF CHAOS To follow our experiment, birth of the
  • 34. https://days-of-chaos.slack.com Paris Chaos Engineering Meetup http://meetu.ps/c/3BMlX/xNjMx/f https://chaosengineering.slack.com http://days-of-chaos.com/ https://medium.com/paris-chaos-engineering-community
  • 35. SRE Error Budget Observability Test in production Chaos Engineering Continuous Quality CI/CD Test automation Application Lifecycle Management Artifact management IaaS / PaaS / CaaS Deployment

Editor's Notes

  1. Wouldn’t it be nice to spend the next sprint or two paying off some of that technical debt that your project had accrued? Wouldn’t it be nice to improve the logging to ease support? Or add some additional integration or end-to-end tests? Or maybe do those first steps to enable blue/green deployments? But, when was the last time that your product owner willingly added any of those technical stories to the next sprint?
  2. Zoom in on the next slide: how we got there