Addo reducing trauma in organizations with SLOs and chaos engineering

Reducing Trauma in
Organizations with SLOs and
Chaos Engineering
Mandi Walls
DevOps Advocate
PagerDuty
@Lnxchk
Julie Gunderson
Sr. Reliability Advocate
Gremlin
@julie_gund

Introduction
Centering user experience is key
to prioritizing what will best serve
the users and increase
engagement.
We do this with qualitative practices
like Full Service Ownership and
quantitative practices like SLIs and
SLOs.

Measuring the Cost of Downtime
Cost = R + E + C + ( B + A )
During the Outage
R = Revenue Lost
E = Employee Productivity
After the Outage
C = Customer Chargebacks
(SLA Breaches)
Unquantifiable
B = Brand Defamation
A = Employee Attrition
Amazon is estimated to lose $13.22MM/hour or $220,000/min
Your company? Average is $300,000/hour
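
The cost formula above can be sketched as a small calculator. This is only an illustration of the arithmetic: the class name, the helper `lost_revenue`, and every dollar figure in the usage lines are hypothetical, and B and A default to zero because the slide treats them as unquantifiable.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    """Cost = R + E + C + (B + A), using the slide's symbols."""
    revenue_lost: float              # R, during the outage
    employee_productivity: float     # E, during the outage
    customer_chargebacks: float      # C, after the outage (SLA breaches)
    brand_defamation: float = 0.0    # B, unquantifiable; an estimate at best
    employee_attrition: float = 0.0  # A, unquantifiable; an estimate at best

    def total(self) -> float:
        return (self.revenue_lost
                + self.employee_productivity
                + self.customer_chargebacks
                + (self.brand_defamation + self.employee_attrition))


def lost_revenue(hourly_revenue: float, outage_minutes: float) -> float:
    """Revenue lost (R) for an outage of the given length, assuming a flat hourly rate."""
    return hourly_revenue * outage_minutes / 60.0


# Hypothetical 30-minute outage at a flat $300,000/hour.
cost = OutageCost(
    revenue_lost=lost_revenue(300_000, 30),
    employee_productivity=20_000,   # assumed figure
    customer_chargebacks=5_000,     # assumed figure
)
print(f"Estimated cost: ${cost.total():,.0f}")
```

Applying `lost_revenue(13_220_000, 1)` to the Amazon estimate gives roughly $220,333 per minute, which matches the rounded figure on the slide.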

What are SLAs, SLIs,
and SLOs?

SLIs and SLOs
Indicators - our metrics
Objectives - our goals for those metrics

Product
Development
Capacity Planning
Testing & Release Procedures
Post-incident Analysis
Incident Response
Monitoring & Observability E.g. SLOs & SLIs
E.g. Blameless Postmortems
E.g. Canary Deployments
E.g. Error Budgets

The Users are the Point
What is important to the users?
How do you know?
Is it different for different parts of
the application?

Focus on What Users Care About
News Stream
Loads Fast
on Scroll
What else?
No Missing
Images
Center
Module
Loads First
No Errors on
Main Page
Fast Load
Time

Translate User Experience to Useful Metrics
Photo by Luke Chesser on Unsplash

How do we know if our
SLO/SLIs are working as
expected?

We inject failure
proactively to validate
SLOs/SLIs.

Prerequisites for
setting SLIs and SLOs

Telemetry
• Monitoring - keep track of what you know
• Logging - scan for errors after the fact
• Tracing - follow the user through the service ecosystem
• Observability - the baseline characteristic
Photo by Mikail McVerry on Unsplash

Dependency Mapping
Success of your SLOs
depends on the SLOs of your
service's upstream
dependencies
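
One way to see why upstream SLOs matter is to multiply availabilities. The sketch below assumes every dependency is a hard (serial) dependency and that failures are independent, which real systems only approximate; the function name and the example numbers are hypothetical.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """Best-case availability of a service that needs all of its dependencies.

    Each entry is a fraction between 0 and 1 (0.999 means 99.9%).
    Assumes hard dependencies with independent failures.
    """
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability must be between 0 and 1, got {a}")
    return math.prod(availabilities)


# Hypothetical: our own service plus two upstream dependencies, each at 99.9%.
ceiling = serial_availability([0.999, 0.999, 0.999])
print(f"Best achievable availability: {ceiling * 100:.3f}%")
```

Under these assumptions, three components at 99.9% each cannot jointly promise 99.9%, so an SLO set without looking at the dependency map may be unreachable from day one.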

Creating SLIs, SLOs,
and Error Budgets

[SLI][SLO][t]
SLI = (good/valid) * 100
Error budget (eb) = 100 - SLO

Example
Valid Events   Good Events   SLO        Error Budget   Allowable Bad Events
100            99            99%        1%             1
1,000          999           99.9%      0.1%           1
1,000          990           99%        1%             10
10,000         9,999         99.99%     0.01%          1
100,000        99,000        99%        1%             1,000
100,000        99,999        99.999%    0.001%         1
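
The arithmetic behind the example table can be reproduced with a few lines of Python. This is a sketch, not a reference implementation: the function names are hypothetical, and the SLO is passed as a string so that `Decimal` avoids floating-point rounding on values like 99.999.

```python
from decimal import Decimal


def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = (good / valid) * 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events * 100 / valid_events


def error_budget_percent(slo_percent: str) -> Decimal:
    """Error budget = 100 - SLO, in percent."""
    return Decimal(100) - Decimal(slo_percent)


def allowable_bad_events(valid_events: int, slo_percent: str) -> int:
    """How many of the valid events may be bad before the SLO is missed."""
    budget_fraction = error_budget_percent(slo_percent) / Decimal(100)
    return int(valid_events * budget_fraction)  # truncate partial events


# (valid events, SLO) pairs taken from the table.
for valid, slo in [(100, "99"), (1000, "99.9"), (1000, "99"),
                   (10_000, "99.99"), (100_000, "99"), (100_000, "99.999")]:
    print(valid, slo, error_budget_percent(slo), allowable_bad_events(valid, slo))
```

The same SLO buys very different amounts of room depending on traffic: 99% allows 1 bad event in 100 but 1,000 bad events in 100,000.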

Perfect: 100% of web requests have 0ms latency all the time!

SLA: 90% of web requests have latency <500ms for the
month… or customer gets money back.

SLO: 95% of web requests have latency <500ms over a rolling
month.
SLI: proportion of web requests with latency <500ms
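
As a rough sketch of how the latency SLO above could be evaluated, the following computes the SLI from a list of request latencies and compares it with the 95% objective. The function names and the sample data are hypothetical; a real setup would read these values from a monitoring system over the rolling month.

```python
def latency_sli_percent(latencies_ms: list[float], threshold_ms: float = 500.0) -> float:
    """Percentage of requests that finished faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good * 100 / len(latencies_ms)


def meets_slo(sli_percent: float, slo_percent: float = 95.0) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli_percent >= slo_percent


# Hypothetical window: 19 fast requests and 1 slow one.
window = [120.0] * 19 + [900.0]
sli = latency_sli_percent(window)
print(f"SLI = {sli:.1f}%, SLO met: {meets_slo(sli)}")
```

With one slow request out of twenty, the SLI sits exactly at 95%, so the objective is met with no error budget left for the rest of the window.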

Gremlin Downtime SLO Scenario is run on staging
Instance downtime occurs
Datadog picks up that the instance is down for the SLI calculation (metric) and automatically tracks that the SLO is now impacted (monitor)
PagerDuty fires an alert that the uptime SLO has been breached

Revising and Revisiting
Photo by Jonathan Kemper on Unsplash
These are internal tools!
You can change them if they
no longer work for you!

Use Chaos
Engineering to test
out new features
and focus on your
SLOs

Development Staging Production

Working with Upstream Dependencies
Do your dependencies publish their own SLOs?
Can you defensively code around bad performance?
Do you need to explore alternatives?
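
One common way to code defensively around a slow or failing dependency is a timeout with a fallback value. The sketch below uses only the Python standard library; `call_with_fallback`, `slow_dependency`, and the timing values are hypothetical, and a production service would more likely rely on its HTTP client's timeout settings or a circuit breaker.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for dependency calls (the size is an arbitrary assumption).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, timeout_s: float, fallback):
    """Run fn in a worker thread; return fallback if it is too slow or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker keeps running in the background; we just stop waiting for it.
        return fallback
    except Exception:
        return fallback


def slow_dependency() -> str:
    time.sleep(0.5)  # simulate an upstream service having a bad day
    return "fresh data"


result = call_with_fallback(slow_dependency, timeout_s=0.05, fallback="cached data")
print(result)
```

A chaos experiment that injects latency into the dependency is a direct way to confirm that the fallback path really runs and that the user-facing SLI holds while it does.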

Use Chaos
Engineering to
validate your
dependencies

Unplanned Work
Incidents can indicate that work needs to be done
Your SLOs and error budgets are part of your postmortem discussion
Revisit and prioritize work based on the outcomes of a major incident

Lifecycle
• Research user behavior
• Measure and monitor for reliability
and performance
• Set goals, write SLIs, establish
SLOs
• Work to keep SLOs in the green
• Verify SLOs and error budgets in
incident postmortems
• Adjust SLOs to new business
requirements

Summary
SLIs prioritize the User Experience
SLOs turn “good” vs. “bad” experience into a quantitative goal
Error Budgets tell your team where you stand
They all feed back into the work prioritization process
You can change them when they no longer work for you

Resources
Talks at SLOConf: https://www.sloconf.com/talks
Google’s SRE Books are available online: https://sre.google/books/
Implementing SLOs:
https://www.oreilly.com/library/view/implementing-service-level/9781492076803/
Gremlin Free: https://www.gremlin.com/buttons/
Sign up for a PagerDuty trial at https://pagerduty.com/sign-up
Gremlin Certified Chaos Engineering Professional certification:
https://www.gremlin.com/certification

Julie’s random
slides

Moving to the cloud
Verify host failure, autoscaling rules, and memory.
Migrating to microservices
Validate that each new service can fail independently.
Protect against cascading failures and knock-on effects.
Adopting Kubernetes
The devil is in the details. Have you configured everything correctly?
Are you running one large cluster?
Find your monitoring gaps, improve signal to noise
“We’ll get paged if that breaks”, until you don’t.
A false sense of security is worse than nothing.
Train your teams
We run fire drills and train firefighters and first responders.
Are you investing in your operations teams?
