Proprietary & Confidential
Reducing Trauma in
Organizations with SLOs and
Chaos Engineering
Mandi Walls
DevOps Advocate
PagerDuty
@Lnxchk
Julie Gunderson
Sr. Reliability Advocate
Gremlin
@julie_gund
Introduction
Centering user experience is key
to prioritizing what will best serve
the users and increase
engagement.
We do this with qualitative practices
like Full Service Ownership and
quantitative practices like SLIs and
SLOs.
Measuring the Cost of Downtime
Cost = R + E + C + ( B + A )
During the Outage
R = Revenue Lost
E = Employee Productivity
After the Outage
C = Customer Chargebacks
(SLA Breaches)
Unquantifiable
B = Brand Defamation
A = Employee Attrition
Amazon is estimated to lose $13.22MM/hour or $220,000/min
Your company? Average is $300,000/hour
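As a rough illustration of the formula above, a minimal Python sketch can add up the quantifiable terms (R, E, C) and leave out B and A, which the slide marks as unquantifiable. The function names and the sample amounts for E and C are hypothetical; only the $300,000/hour and $13.22MM/hour figures come from the slide.

```python
def downtime_cost(revenue_lost, employee_productivity, customer_chargebacks):
    """Quantifiable part of Cost = R + E + C + (B + A).

    B (brand defamation) and A (employee attrition) are left out on
    purpose: the slide lists them as unquantifiable.
    """
    return revenue_lost + employee_productivity + customer_chargebacks


def per_minute(hourly_cost):
    """Convert an hourly cost figure to a per-minute figure."""
    return hourly_cost / 60


# Hypothetical 2-hour outage at the slide's average of $300,000/hour
# in lost revenue; the E and C amounts are made-up placeholders.
example_cost = downtime_cost(
    revenue_lost=2 * 300_000,
    employee_productivity=40_000,
    customer_chargebacks=25_000,
)
print(f"Quantifiable cost: ${example_cost:,}")
print(f"$300,000/hour is ${per_minute(300_000):,.0f}/min")
```

Whatever you estimate for B and A sits on top of this number, so treat the result as a lower bound.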
What are SLAs, SLIs,
and SLOs?
SLIs and SLOs
Indicators - our metrics
Objectives - our goals for those metrics
Error Budgets
Product
Development
Capacity Planning
Testing & Release Procedures
Post-incident Analysis
Incident Response
Monitoring & Observability
E.g. SLOs & SLIs
E.g. Blameless Postmortems
E.g. Canary Deployments
E.g. Error Budgets
Centering the User
Experience
The Users are the Point
What is important to the users?
How do you know?
Is it different for different parts of
the application?
Focus on What Users Care About
News Stream Loads Fast on Scroll
What else?
No Missing Images
Center Module Loads First
No Errors on Main Page
Fast Load Time
Translate User Experience to Useful Metrics
Photo by Luke Chesser on Unsplash
How do we know if our
SLOs/SLIs are working as
expected?
We inject failure
proactively to validate
SLOs/SLIs.
Prerequisites for
setting SLIs and SLOs
Telemetry
• Monitoring - keep track of what you know
• Logging - scan for errors after the fact
• Tracing - follow the user through the service ecosystem
• Observability - the baseline characteristic
Photo by Mikail McVerry on Unsplash
Dependency Mapping
The success of your SLOs
depends on the SLOs of your
service's upstream
dependencies
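One way to make this concrete: if a request needs every upstream dependency to succeed, and failures are independent (both are simplifying assumptions, not something the slides state), the best availability your service can offer is the product of the dependencies' availabilities. The function name and the sample numbers below are hypothetical.

```python
from math import prod


def max_achievable_availability(dependency_availabilities):
    """Upper bound on availability when every dependency is required.

    Assumes hard, serial dependencies with independent failures.
    Availabilities are fractions, e.g. 0.999 for 99.9%.
    """
    return prod(dependency_availabilities)


# Hypothetical service that calls three upstreams on every request.
upstreams = [0.999, 0.999, 0.995]
ceiling = max_achievable_availability(upstreams)
print(f"Best-case availability: {ceiling:.4%}")
```

Under these assumptions the ceiling is always lower than the weakest dependency, so an SLO stricter than that ceiling cannot be met without caching, retries, or fallbacks.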
Creating SLIs, SLOs,
and Error Budgets
[SLI][SLO][t]
SLI = (good events / valid events) * 100
Error budget = 100 - SLO
Example
Valid Events   Good Events   SLO       Error Budget   Allowable Bad Events
100            99            99%       1%             1
1,000          999           99.9%     0.1%           1
1,000          990           99%       1%             10
10,000         9,999         99.99%    0.01%          1
100,000        99,000        99%       1%             1,000
100,000        99,999        99.999%   0.001%         1
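The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, and it takes the error budget as 100 minus the SLO target, which is what the table's Error Budget column shows. `Decimal` is used so that targets like 99.999% do not pick up floating-point noise.

```python
from decimal import Decimal


def sli_percent(good_events, valid_events):
    """SLI = (good / valid) * 100."""
    return Decimal(good_events) / Decimal(valid_events) * 100


def error_budget_percent(slo_percent):
    """Error budget = 100 - SLO, both in percent."""
    return Decimal(100) - Decimal(str(slo_percent))


def allowable_bad_events(valid_events, slo_percent):
    """Number of bad events the error budget permits, rounded down."""
    budget = error_budget_percent(slo_percent)
    return int(Decimal(valid_events) * budget / 100)


# Rows taken from the table: (valid events, SLO in percent).
for valid, slo in [(100, "99"), (1000, "99.9"), (100_000, "99.999")]:
    print(valid, slo, error_budget_percent(slo), allowable_bad_events(valid, slo))
```

The same 1% budget allows 1 bad event out of 100 but 1,000 out of 100,000, which is why the volume of valid events matters as much as the percentage.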
Perfect: 100% of web requests have 0ms latency all the time!
SLA: 90% of web requests have latency <500ms for the
month… or customer gets money back.
SLO: 95% of web requests have latency <500ms over a rolling
month.
SLI: percentage of web requests with latency <500ms
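To show how the three levels relate, here is a minimal Python sketch that computes the latency SLI from a list of request latencies and compares it with the SLO (95%) and SLA (90%) from the slide. The function names and the sample data are hypothetical, and a real setup would use a rolling month of requests rather than a short list.

```python
THRESHOLD_MS = 500   # from the slide: latency < 500ms counts as good
SLO_TARGET = 95.0    # internal objective, percent
SLA_TARGET = 90.0    # contractual floor, percent


def latency_sli(latencies_ms, threshold_ms=THRESHOLD_MS):
    """Percentage of requests faster than the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good * 100 / len(latencies_ms)


def status(sli):
    """Classify an SLI value against the SLO and SLA targets."""
    if sli >= SLO_TARGET:
        return "meeting SLO"
    if sli >= SLA_TARGET:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Made-up sample: 20 requests, 2 of them slow.
sample = [120] * 18 + [650, 900]
sli = latency_sli(sample)
print(f"SLI = {sli:.1f}% -> {status(sli)}")
```

The gap between the SLO and the SLA is the point: the team gets a warning at 95% while there is still room before the customer is owed money at 90%.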
• Instance downtime occurs
• Datadog picks up that the instance is down for the SLI calculation (metric) and automatically tracks that the SLO is now impacted (monitor)
• PagerDuty fires an alert that the uptime SLO has been breached
• Gremlin Downtime SLO Scenario is run on staging
Revising and Revisiting
Photo by Jonathan Kemper on Unsplash
These are internal tools!
You can change them if they
no longer work for you!
Use Chaos
Engineering to test
out new features
and focus on your
SLOs
Development Staging Production
Working with Upstream Dependencies
Do your dependencies publish their own SLOs?
Can you defensively code around bad performance?
Do you need to explore alternatives?
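One common way to code defensively around a slow dependency is a timeout with a fallback value. The following is a minimal sketch using only the Python standard library; `call_with_fallback`, `slow_dependency`, and the "cached data" fallback are hypothetical names for illustration, not part of any tool mentioned in this deck.

```python
import concurrent.futures
import time

# Shared pool so a slow call does not block the caller on shutdown.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_fallback(func, timeout_s, fallback):
    """Run func in a worker thread; return fallback if it is slow or fails."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only helps if the call has not started yet
        return fallback
    except Exception:
        return fallback


def slow_dependency():
    """Stand-in for an upstream call that is having a bad day."""
    time.sleep(0.2)
    return "fresh data"


result = call_with_fallback(slow_dependency, timeout_s=0.05, fallback="cached data")
print(result)
```

The timeout should sit below your own latency SLO threshold, otherwise a slow dependency still burns your error budget before the fallback kicks in.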
Use Chaos
Engineering to
validate your
dependencies
Unplanned Work
Incidents can indicate that work needs to be done
Your SLOs and error budgets are part of your postmortem discussion
Revisit and prioritize work based on the outcomes of a major incident
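For the postmortem discussion, it can help to express an incident as a share of the error budget. The sketch below assumes a time-based availability SLO over a 30-day window; the function names and the 20-minute incident are hypothetical examples, not figures from the deck.

```python
from fractions import Fraction


def budget_minutes(slo_percent, window_days=30):
    """Total allowed downtime, in minutes, for an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - Fraction(str(slo_percent))) / 100


def budget_consumed_percent(incident_minutes, slo_percent, window_days=30):
    """Share of the window's error budget used by one incident."""
    return float(incident_minutes / budget_minutes(slo_percent, window_days) * 100)


# Hypothetical incident: 20 minutes of downtime against a 99.9% SLO.
total = budget_minutes("99.9")
used = budget_consumed_percent(20, "99.9")
print(f"Budget: {float(total):.1f} min, incident used {used:.1f}% of it")
```

A single incident that eats a large share of the budget is a strong signal for reprioritizing reliability work over new features.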
Lifecycle
• Research user behavior
• Measure and monitor for reliability
and performance
• Set goals, write SLIs, establish
SLOs
• Work to keep SLOs in the green
• Verify SLOs and error budgets in
incident postmortems
• Adjust SLOs to new business
requirements
Summary
SLIs prioritize the User Experience
SLOs translate “good” vs. “bad” experience into a quantitative goal
Error Budgets tell your team where you stand
They all feed back into the work prioritization process
You can change them when they no longer work for you
Resources
Talks at SLOConf: https://www.sloconf.com/talks
Google’s SRE Books are available online: https://sre.google/books/
Implementing SLOs:
https://www.oreilly.com/library/view/implementing-service-level/9781492076803/
Gremlin Free: https://www.gremlin.com/buttons/
Sign up for a PagerDuty trial at https://pagerduty.com/sign-up
Gremlin Certified Chaos Engineering Professional certification:
https://www.gremlin.com/certification
Julie’s random
slides
Moving to the cloud
Verify host failure, autoscaling rules, and memory.
Migrating to microservices
Validate that each new service can fail independently.
Protect against cascading failures and knock-on effects.
Adopting Kubernetes
The devil is in the details. Have you configured everything correctly?
Are you running one large cluster?
Find your monitoring gaps, improve signal to noise
“We’ll get paged if that breaks”, until you don’t.
A false sense of security is worse than nothing.
Train your teams
We run fire drills and train firefighters and first responders.
Are you investing in your operations teams?

Addo reducing trauma in organizations with SLOs and chaos engineering
