SlideShare a Scribd company logo
can be applied to the nascent world of microservices.
Put some SRE
in your microservices
Hard-won lessons from the world of SRE…
The many faces of
Theo Schlossnagle
@postwait
CEO Circonus
The nature of the problem
Software Sucks
Once you’ve run software at scale,
you have a deep understanding of
how it is all tied together with
loose string and hope.
All software will fail, but
good software
fails well
• Consider the phrase:
“have you used X in anger.”
Never undervalue grace in failure.
Rule . 𝛌1 Crash landings should be both
fast and controlled.
What it means to
fail quickly & safely
• The scope of failure should
collapse completely.
• The time to failure should be
measured in small multiples of
normal service time
• Nothing outside the scope of
failure should be impacted.
https://www.youtube.com/watch?v=5SL1A2d2e7M
Autopsies: not just for medicine.
Rule . 𝛌2 Post-mortems are
fundamental.
Pragmatic analysis is required to
understand failure’s
true nature
• Post-mortem analysis is critical
• Stack traces
• Forensic logs
• Images (cores, dumps, etc.)
The difference between a shock and electrocution is real.
Rule . 𝛌3 Use circuit breakers.
Circuit breakers are designed to
avoid
cascading failure
• it’s not all about,
especially with microservices
• protect yourselves and others
• circuit breakers of many type
• timing
• queue depth
• concurrency
http://melissaomarkham.com
You cannot understand what you cannot measure.
Rule . 𝛌4 Behavior is complex.
Understand it.
Don’t measure to assess availability
measure to understand
Build robust models of behavior
Understand performance changes
Don’t use averages
Don’t use percentiles alone
Don’t measure to assess availability
measure to understand
Build robust models of behavior
Understand performance changes
Don’t use averages
Don’t use percentiles alone
It’s easy to demand perfection; it’s also stupid.
Rule . 𝛌5 Have an failure budget.
Avoid failure is simply impossible,
expect and manage
failure
• use failure budgets
• set expectations reasonably
• define and reward successes on
improvement and competency,
not just uptime.
Justice should be blind; operations should not.
Rule . 𝛌6 Instrumentation &
Observability have no equals.
For every “I wonder what X is right now?”
in production,
you must have answers
DTrace
eBPF
Instrument code for observability
https://www.pinterest.com/pin/441775044670412234/
Thank you.

More Related Content

What's hot

Chaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field GuideChaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field Guide
matthewbrahms
 
Shift Left. Wait, what? No, Shift Right!!!
Shift Left. Wait, what? No, Shift Right!!!Shift Left. Wait, what? No, Shift Right!!!
Shift Left. Wait, what? No, Shift Right!!!
Phillip Maddux
 
The left is not wrong, just not right; It's time to shift right!
The left is not wrong, just not right; It's time to shift right!The left is not wrong, just not right; It's time to shift right!
The left is not wrong, just not right; It's time to shift right!
Phillip Maddux
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
Ashutosh Agarwal
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
Squadcast Inc
 
OpsStack Overview 20170806.1
OpsStack Overview 20170806.1OpsStack Overview 20170806.1
OpsStack Overview 20170806.1
Siglos
 
Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?
Adam Goucher
 
Sigma Open Tech Week: Bitter Truth About Software Security
Sigma Open Tech Week: Bitter Truth About Software SecuritySigma Open Tech Week: Bitter Truth About Software Security
Sigma Open Tech Week: Bitter Truth About Software Security
Vlad Styran
 
Ops Happen: Improve Security Without Getting in the Way
Ops Happen: Improve Security Without Getting in the WayOps Happen: Improve Security Without Getting in the Way
Ops Happen: Improve Security Without Getting in the Way
SeniorStoryteller
 
Introduction to Chaos Engineering
Introduction to Chaos EngineeringIntroduction to Chaos Engineering
Introduction to Chaos Engineering
Raymond Adrian (Rad) Butalid
 
Tales from a radically polyglot team
Tales from a radically polyglot teamTales from a radically polyglot team
Tales from a radically polyglot team
Thoughtworks
 
CSA Raleigh application security and deception in the cloud
CSA Raleigh   application security and deception in the cloudCSA Raleigh   application security and deception in the cloud
CSA Raleigh application security and deception in the cloud
Phillip Maddux
 
What We Learned from Three Years of Sciencing the Crap Out of DevOps
What We Learned from Three Years of Sciencing the Crap Out of DevOpsWhat We Learned from Three Years of Sciencing the Crap Out of DevOps
What We Learned from Three Years of Sciencing the Crap Out of DevOps
SeniorStoryteller
 
The Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can StealThe Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can Steal
mozilla.presentations
 
Your Data Scientist Hates You
Your Data Scientist Hates YouYour Data Scientist Hates You
Your Data Scientist Hates You
Bradford Stephens
 
Chaos engineering intro
Chaos engineering introChaos engineering intro
Chaos engineering intro
Shantanu Deshpande
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos Engineering
Gremlin
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
Bradford Stephens
 
Building a culture where software projects get done
Building a culture where software projects get doneBuilding a culture where software projects get done
Building a culture where software projects get done
thegdb
 
Blend it up - leancamp london presentation
Blend it up - leancamp london presentationBlend it up - leancamp london presentation
Blend it up - leancamp london presentationAntonio Terreno
 

What's hot (20)

Chaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field GuideChaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field Guide
 
Shift Left. Wait, what? No, Shift Right!!!
Shift Left. Wait, what? No, Shift Right!!!Shift Left. Wait, what? No, Shift Right!!!
Shift Left. Wait, what? No, Shift Right!!!
 
The left is not wrong, just not right; It's time to shift right!
The left is not wrong, just not right; It's time to shift right!The left is not wrong, just not right; It's time to shift right!
The left is not wrong, just not right; It's time to shift right!
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
 
OpsStack Overview 20170806.1
OpsStack Overview 20170806.1OpsStack Overview 20170806.1
OpsStack Overview 20170806.1
 
Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?
 
Sigma Open Tech Week: Bitter Truth About Software Security
Sigma Open Tech Week: Bitter Truth About Software SecuritySigma Open Tech Week: Bitter Truth About Software Security
Sigma Open Tech Week: Bitter Truth About Software Security
 
Ops Happen: Improve Security Without Getting in the Way
Ops Happen: Improve Security Without Getting in the WayOps Happen: Improve Security Without Getting in the Way
Ops Happen: Improve Security Without Getting in the Way
 
Introduction to Chaos Engineering
Introduction to Chaos EngineeringIntroduction to Chaos Engineering
Introduction to Chaos Engineering
 
Tales from a radically polyglot team
Tales from a radically polyglot teamTales from a radically polyglot team
Tales from a radically polyglot team
 
CSA Raleigh application security and deception in the cloud
CSA Raleigh   application security and deception in the cloudCSA Raleigh   application security and deception in the cloud
CSA Raleigh application security and deception in the cloud
 
What We Learned from Three Years of Sciencing the Crap Out of DevOps
What We Learned from Three Years of Sciencing the Crap Out of DevOpsWhat We Learned from Three Years of Sciencing the Crap Out of DevOps
What We Learned from Three Years of Sciencing the Crap Out of DevOps
 
The Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can StealThe Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can Steal
 
Your Data Scientist Hates You
Your Data Scientist Hates YouYour Data Scientist Hates You
Your Data Scientist Hates You
 
Chaos engineering intro
Chaos engineering introChaos engineering intro
Chaos engineering intro
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos Engineering
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
 
Building a culture where software projects get done
Building a culture where software projects get doneBuilding a culture where software projects get done
Building a culture where software projects get done
 
Blend it up - leancamp london presentation
Blend it up - leancamp london presentationBlend it up - leancamp london presentation
Blend it up - leancamp london presentation
 

Similar to Applying SRE techniques to micro service design

Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
Alberto Acerbis
 
Making disaster routine
Making disaster routineMaking disaster routine
Making disaster routine
Peter Varhol
 
Working Effectively with PeopleSoft Support
Working Effectively with PeopleSoft SupportWorking Effectively with PeopleSoft Support
Working Effectively with PeopleSoft Support
Smart ERP Solutions, Inc.
 
Orchestration, the conductor's score
Orchestration, the conductor's scoreOrchestration, the conductor's score
Orchestration, the conductor's score
Salesforce Engineering
 
Normal accidents and outpatient surgeries
Normal accidents and outpatient surgeriesNormal accidents and outpatient surgeries
Normal accidents and outpatient surgeries
Jonathan Creasy
 
Reliability Engineering Q&A - LCE
Reliability Engineering Q&A - LCEReliability Engineering Q&A - LCE
Reliability Engineering Q&A - LCE
Jan van Rooyen
 
Evil Tester's Guide to Agile Testing
Evil Tester's Guide to Agile TestingEvil Tester's Guide to Agile Testing
Evil Tester's Guide to Agile Testing
Alan Richardson
 
Reanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that WorkReanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that Work
DevOpsDays Baltimore
 
High Reliabilty Systems
High Reliabilty SystemsHigh Reliabilty Systems
High Reliabilty Systems
LloydMoore
 
01. foundamentals of testing
01. foundamentals of testing01. foundamentals of testing
01. foundamentals of testing
Tricia Karina
 
Startup Operating Systems
Startup Operating SystemsStartup Operating Systems
Startup Operating Systems
Dean Haritos
 
Mucon microservices and innovation
Mucon microservices and innovationMucon microservices and innovation
Mucon microservices and innovation
Gawain Hammond
 
CS5032 Lecture 5: Human Error 1
CS5032 Lecture 5: Human Error 1CS5032 Lecture 5: Human Error 1
CS5032 Lecture 5: Human Error 1John Rooksby
 
No more excuses QASymphony
No more excuses QASymphonyNo more excuses QASymphony
No more excuses QASymphony
QASymphony
 
Advanced Maintenance And Reliability (Maintenance and Reliability Best Pract...
Advanced Maintenance And Reliability  (Maintenance and Reliability Best Pract...Advanced Maintenance And Reliability  (Maintenance and Reliability Best Pract...
Advanced Maintenance And Reliability (Maintenance and Reliability Best Pract...
Ricky Smith CMRP, CMRT
 
Design testabilty
Design testabiltyDesign testabilty
Design testabilty
Richard Neeve
 
Introduction to Software Engineering and Software Process Models
Introduction to Software Engineering and Software Process ModelsIntroduction to Software Engineering and Software Process Models
Introduction to Software Engineering and Software Process Models
santoshkawade5
 
Adapting Scrum in an Organization with Tailored Processes
Adapting Scrum in an Organization with Tailored ProcessesAdapting Scrum in an Organization with Tailored Processes
Adapting Scrum in an Organization with Tailored Processes
Prabhat Sinha
 
A real-life overview of Agile and Scrum
A real-life overview of Agile and ScrumA real-life overview of Agile and Scrum
A real-life overview of Agile and Scrum
mtoppa
 
SOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptx
SOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptxSOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptx
SOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptx
Financial Services Innovators
 

Similar to Applying SRE techniques to micro service design (20)

Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
 
Making disaster routine
Making disaster routineMaking disaster routine
Making disaster routine
 
Working Effectively with PeopleSoft Support
Working Effectively with PeopleSoft SupportWorking Effectively with PeopleSoft Support
Working Effectively with PeopleSoft Support
 
Orchestration, the conductor's score
Orchestration, the conductor's scoreOrchestration, the conductor's score
Orchestration, the conductor's score
 
Normal accidents and outpatient surgeries
Normal accidents and outpatient surgeriesNormal accidents and outpatient surgeries
Normal accidents and outpatient surgeries
 
Reliability Engineering Q&A - LCE
Reliability Engineering Q&A - LCEReliability Engineering Q&A - LCE
Reliability Engineering Q&A - LCE
 
Evil Tester's Guide to Agile Testing
Evil Tester's Guide to Agile TestingEvil Tester's Guide to Agile Testing
Evil Tester's Guide to Agile Testing
 
Reanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that WorkReanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that Work
 
High Reliabilty Systems
High Reliabilty SystemsHigh Reliabilty Systems
High Reliabilty Systems
 
01. foundamentals of testing
01. foundamentals of testing01. foundamentals of testing
01. foundamentals of testing
 
Startup Operating Systems
Startup Operating SystemsStartup Operating Systems
Startup Operating Systems
 
Mucon microservices and innovation
Mucon microservices and innovationMucon microservices and innovation
Mucon microservices and innovation
 
CS5032 Lecture 5: Human Error 1
CS5032 Lecture 5: Human Error 1CS5032 Lecture 5: Human Error 1
CS5032 Lecture 5: Human Error 1
 
No more excuses QASymphony
No more excuses QASymphonyNo more excuses QASymphony
No more excuses QASymphony
 
Advanced Maintenance And Reliability (Maintenance and Reliability Best Pract...
Advanced Maintenance And Reliability  (Maintenance and Reliability Best Pract...Advanced Maintenance And Reliability  (Maintenance and Reliability Best Pract...
Advanced Maintenance And Reliability (Maintenance and Reliability Best Pract...
 
Design testabilty
Design testabiltyDesign testabilty
Design testabilty
 
Introduction to Software Engineering and Software Process Models
Introduction to Software Engineering and Software Process ModelsIntroduction to Software Engineering and Software Process Models
Introduction to Software Engineering and Software Process Models
 
Adapting Scrum in an Organization with Tailored Processes
Adapting Scrum in an Organization with Tailored ProcessesAdapting Scrum in an Organization with Tailored Processes
Adapting Scrum in an Organization with Tailored Processes
 
A real-life overview of Agile and Scrum
A real-life overview of Agile and ScrumA real-life overview of Agile and Scrum
A real-life overview of Agile and Scrum
 
SOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptx
SOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptxSOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptx
SOFTWARE TESTING TRAFUNDAMENTALS OF SOFTWARE TESTING.pptx
 

More from Theo Schlossnagle

Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to Complexity
Theo Schlossnagle
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped Software
Theo Schlossnagle
 
Distributed Systems - Like It Or Not
Distributed Systems - Like It Or NotDistributed Systems - Like It Or Not
Distributed Systems - Like It Or Not
Theo Schlossnagle
 
SRECon Coherent Performance
SRECon Coherent PerformanceSRECon Coherent Performance
SRECon Coherent Performance
Theo Schlossnagle
 
Commandments of scale
Commandments of scaleCommandments of scale
Commandments of scale
Theo Schlossnagle
 
Adaptive availability
Adaptive availabilityAdaptive availability
Adaptive availability
Theo Schlossnagle
 
Project reality
Project realityProject reality
Project reality
Theo Schlossnagle
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
Theo Schlossnagle
 
The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.
Theo Schlossnagle
 
Understanding Slowness
Understanding SlownessUnderstanding Slowness
Understanding Slowness
Theo Schlossnagle
 
OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012Theo Schlossnagle
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observabilityTheo Schlossnagle
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observability
Theo Schlossnagle
 
Xtreme Deployment
Xtreme DeploymentXtreme Deployment
Xtreme Deployment
Theo Schlossnagle
 
Atldevops
AtldevopsAtldevops
It's all about telemetry
It's all about telemetryIt's all about telemetry
It's all about telemetry
Theo Schlossnagle
 
Monitoring is easy, why are we so bad at it presentation
Monitoring is easy, why are we so bad at it  presentationMonitoring is easy, why are we so bad at it  presentation
Monitoring is easy, why are we so bad at it presentationTheo Schlossnagle
 
Social improvements in monitoring
Social improvements in monitoringSocial improvements in monitoring
Social improvements in monitoringTheo Schlossnagle
 
What's in a number?
What's in a number?What's in a number?
What's in a number?
Theo Schlossnagle
 

More from Theo Schlossnagle (20)

Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to Complexity
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped Software
 
Distributed Systems - Like It Or Not
Distributed Systems - Like It Or NotDistributed Systems - Like It Or Not
Distributed Systems - Like It Or Not
 
SRECon Coherent Performance
SRECon Coherent PerformanceSRECon Coherent Performance
SRECon Coherent Performance
 
Commandments of scale
Commandments of scaleCommandments of scale
Commandments of scale
 
Adaptive availability
Adaptive availabilityAdaptive availability
Adaptive availability
 
Project reality
Project realityProject reality
Project reality
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
 
The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.
 
Understanding Slowness
Understanding SlownessUnderstanding Slowness
Understanding Slowness
 
OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observability
 
Omnios and unix
Omnios and unixOmnios and unix
Omnios and unix
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observability
 
Xtreme Deployment
Xtreme DeploymentXtreme Deployment
Xtreme Deployment
 
Atldevops
AtldevopsAtldevops
Atldevops
 
It's all about telemetry
It's all about telemetryIt's all about telemetry
It's all about telemetry
 
Monitoring is easy, why are we so bad at it presentation
Monitoring is easy, why are we so bad at it  presentationMonitoring is easy, why are we so bad at it  presentation
Monitoring is easy, why are we so bad at it presentation
 
Social improvements in monitoring
Social improvements in monitoringSocial improvements in monitoring
Social improvements in monitoring
 
What's in a number?
What's in a number?What's in a number?
What's in a number?
 

Recently uploaded

Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 

Recently uploaded (20)

Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 

Applying SRE techniques to micro service design

  • 1. can be applied to the nascent world of microservices. Put some SRE in your microservices Hard-won lessons from the world of SRE…
  • 2. The many faces of Theo Schlossnagle @postwait CEO Circonus
  • 3. The nature of the problem Software Sucks Once you’ve run software at scale, you have a deep understanding of how it is all tied together with loose string and hope.
  • 4. All software will fail, but good software fails well • Consider the phrase: “have you used X in anger.”
  • 5. Never undervalue grace in failure. Rule . 𝛌1 Crash landings should be both fast and controlled.
  • 6. What it means to fail quickly & safely • The scope of failure should collapse completely. • The time to failure should be measured in small multiples of normal service time • Nothing outside the scope of failure should be impacted. https://www.youtube.com/watch?v=5SL1A2d2e7M
  • 7. Autopsies: not just for medicine. Rule . 𝛌2 Post-mortems are fundamental.
  • 8. Pragmatic analysis is required to understand failure’s true nature • Post-mortem analysis is critical • Stack traces • Forensic logs • Images (cores, dumps, etc.)
  • 9. The difference between a shock and electrocution is real. Rule . 𝛌3 Use circuit breakers.
  • 10. Circuit breakers are designed to avoid cascading failure • it’s not all about, especially with microservices • protect yourselves and others • circuit breakers of many type • timing • queue depth • concurrency http://melissaomarkham.com
  • 11. You cannot understand what you cannot measure. Rule . 𝛌4 Behavior is complex. Understand it.
  • 12. Don’t measure to assess availability measure to understand Build robust models of behavior Understand performance changes Don’t use averages Don’t use percentiles alone
  • 13. Don’t measure to assess availability measure to understand Build robust models of behavior Understand performance changes Don’t use averages Don’t use percentiles alone
  • 14. It’s easy to demand perfection; it’s also stupid. Rule . 𝛌5 Have an failure budget.
  • 15. Avoid failure is simply impossible, expect and manage failure • use failure budgets • set expectations reasonably • define and reward successes on improvement and competency, not just uptime.
  • 16. Justice should be blind; operations should not. Rule . 𝛌6 Instrumentation & Observability have no equals.
  • 17. For every “I wonder what X is right now?” in production, you must have answers DTrace eBPF Instrument code for observability https://www.pinterest.com/pin/441775044670412234/