Reliable Observability at Scale
Error Budgets for 1,000+
Observability Practitioner’s Summit
11/18/2019
#observabilitysummit @phredmoyer
ZSRE
Error Budgets
Hi, I’m Fred
SLOgician
Thinks about SLOs, SLIs, Error Budgets
Observability Hacker
TSDBs, StatsD, Prometheus, Histograms
Software Engineer (SRE)
15+ yrs C, Perl, Ruby, Go, Python, blabla
Dad
Two kids, needs more sleep/coffee
Agenda
ZENDESK ARCHITECTURE FLYOVER
ERROR BUDGET DEMOCRATIZATION
TOOLING, APPROACH, IMPLEMENTATION
THE HARD THING ABOUT EBS & SLOS
QUESTIONS
ZENDESK ARCHITECTURE FLYOVER
[Architecture diagram, built up over several slides. Components: CDN, Proxy, RoR, μSVC. Products: Support, Guide, Chat, Explore, Talk, Sell.]
ERROR BUDGET DEMOCRATIZATION
A BRIEF HISTORY OF SLOs/SLIs/EBs
2016 2018 2019
SRECON EUR: Developing Effective SLIs and SLOs
BLOG: Latency SLOs Done Right
OPS K8SCON: Latency SLOs Done Right
@LIZTHEGREY TALK: Effective SLOs
SRECON US: Latency SLOs Done Right
BAYLISA: Practical SLOs with EBs
SRECON APAC: Latency SLOs Done Right
SRECON DUB: Several SLO/EB Talks
SCALE17X: Latency SLOs Done Right
SLIs
Delineates ‘Good’ vs ‘Bad’ Requests
EXAMPLE SLIS
95th percentile home page latency over 5 minutes < 500ms
Home page request response code != 5xx
Home page request served in < 100ms
[Metric Identifier] [Operator] [Metric Value]
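The [Metric Identifier] [Operator] [Metric Value] formula is regular enough for code to split apart. A minimal sketch, assuming the operator is one of `<`, `<=`, `>`, `>=`, `==`, `!=` and the metric value is the final token. The `SLI` class and `parse_sli` function are hypothetical names for illustration, not part of any tooling shown in the talk.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SLI:
    """Hypothetical container for the three parts of the SLI formula."""
    metric_identifier: str
    operator: str
    metric_value: str


# Two-character operators come first so "<=" is not read as "<".
_SLI_PATTERN = re.compile(
    r"^(?P<ident>.+?)\s*(?P<op><=|>=|==|!=|<|>)\s*(?P<value>\S+)$"
)


def parse_sli(text: str) -> SLI:
    """Split an SLI sentence into identifier, operator, and value."""
    match = _SLI_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an SLI: {text!r}")
    return SLI(match["ident"], match["op"], match["value"])


sli = parse_sli("Home page request served in < 100ms")
print(sli.metric_identifier, "|", sli.operator, "|", sli.metric_value)
```

All three example SLIs parse with this pattern, including the one whose identifier itself contains "over 5 minutes".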
SLOs
Binding target for SLIs
SLO = #goodreqs / #totalreqs
+ Time range
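The ratio only becomes an SLO once a time range bounds it. A minimal sketch, assuming each request is available as a `(timestamp, is_good)` pair. The function name `good_ratio` and the in-memory list are illustrative assumptions; a real system would query a TSDB instead.

```python
from datetime import datetime, timedelta


def good_ratio(requests, now, window):
    """Return #goodreqs / #totalreqs for requests inside the time range.

    requests: iterable of (timestamp, is_good) pairs (hypothetical shape).
    Returns None when no requests fall inside the window.
    """
    start = now - window
    in_window = [is_good for ts, is_good in requests if start < ts <= now]
    if not in_window:
        return None  # no traffic, so the ratio is undefined
    return sum(in_window) / len(in_window)


now = datetime(2019, 11, 18, 12, 0)
requests = [
    (now - timedelta(hours=1), True),
    (now - timedelta(hours=2), False),
    (now - timedelta(days=2), False),  # outside a 24 hour window
]
print(good_ratio(requests, now, timedelta(hours=24)))
```

The same request data yields different ratios for different windows, which is why the time range belongs in the definition.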
EXAMPLE SLOS
99% of 95th percentile home page latency over 5 minutes < 500ms over the trailing month
99% of home page request response code != 5xx over last 7 days
95% of home page requests served in < 100ms over last 24 hours
[Success Objective] [SLI] [Period]
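The [Success Objective] [SLI] [Period] formula can also be split by code. A minimal sketch, assuming the objective is written as "N% of" and the period follows the last " over " in the sentence. The `SLO` class and `parse_slo` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical container for the three parts of the SLO formula."""
    success_objective: float  # as a fraction, e.g. 0.99
    sli: str
    period: str


def parse_slo(text: str) -> SLO:
    """Split an SLO sentence into success objective, SLI, and period."""
    objective, sep, rest = text.strip().partition("% of ")
    if not sep:
        raise ValueError(f"no success objective in: {text!r}")
    # The SLI may itself contain " over " (e.g. "over 5 minutes"),
    # so the period is taken from the LAST occurrence.
    sli, sep, period = rest.rpartition(" over ")
    if not sep:
        raise ValueError(f"no period in: {text!r}")
    return SLO(float(objective) / 100, sli, period)


slo = parse_slo("99% of home page request response code != 5xx over last 7 days")
print(slo)
```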
Nobody’s Perfect
Error Budget = 1-SLO
Success Objective = 99%
Error Budget = 1 - 0.99 = 0.01 = 1%
EXAMPLE EBS
Allow 1% failure of 95th percentile home page latency over 5 minutes < 500ms over the trailing month
Allow 1% failure of home page request response code != 5xx over last 7 days
Allow 5% failure of home page requests served in < 100ms over last 24 hours
[Error Budget] [SLI] [Period]
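An error budget becomes actionable once it is expressed as a number of requests that may fail in the period. A minimal sketch of that arithmetic, assuming request counts for the period are already known. The function names and the sample numbers are illustrative, not figures from the talk.

```python
def error_budget(success_objective: float) -> float:
    """Error Budget = 1 - SLO, with the objective given as a fraction."""
    return 1.0 - success_objective


def budget_remaining(success_objective: float, total_reqs: int, bad_reqs: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = error_budget(success_objective) * total_reqs
    if allowed_bad <= 0:
        raise ValueError("no error budget to spend")
    return 1.0 - bad_reqs / allowed_bad


# Hypothetical day: 1,000 home page requests, 20 slower than 100ms,
# measured against the 95% objective (5% error budget, 50 requests).
print(round(budget_remaining(0.95, total_reqs=1000, bad_reqs=20), 3))
```

Results are rounded before printing because `1 - 0.95` is not exact in binary floating point.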
Keys to Error Budget Democratization
Real world examples that are easy to reference
Formulas that can be parsed by humans and code
Be explicit; small details make big differences
TOOLING, APPROACH, IMPLEMENTATION
TOOLING
Lots of teams; lots of tools
Metrics: Prometheus / StatsD => Datadog
Logs: JSON => [ ELK, Datadog, AWS ]
APM: Datadog
Network: [ Datadog, ThousandEyes ]
Distributed Tracing: WIP
StatsD - not just for servers
Measuring service performance is (mostly) easy
Client apps are more difficult
Disconnects
Caching (CDN, Proxy)
Large browser & device variance
Logs, Traces, Metrics
Conway’s Law; experts for each ‘pillar’
Democratize Expertise
#ask-sre
Reliability Champions
`Observability 101`
`Hands On With Datadog`
APPROACH
Metrics for SLIs
Lies, Darn Lies, and Percentiles
Easy to get the math wrong
Missing the X Factor - Sample Volume
Many vendors have bugs in percentile tools
Can’t aggregate them (well, most of them)
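The sample-volume problem shows up as soon as percentiles from two sources are averaged. A minimal sketch using a nearest-rank percentile and two made-up hosts: one busy and fast, one nearly idle and slow. The host data is invented for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


host_a = [100] * 1000  # 1,000 requests at 100ms
host_b = [1000] * 10   # 10 requests at 1,000ms

# Averaging per-host percentiles ignores how many samples each host saw.
averaged_p95 = (percentile(host_a, 95) + percentile(host_b, 95)) / 2

# The true fleet-wide percentile needs the raw samples.
combined_p95 = percentile(host_a + host_b, 95)

print(averaged_p95, combined_p95)
```

The averaged figure lands at 550ms while the true fleet-wide p95 is 100ms, because the ten slow requests are under 1% of the traffic.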
Metrics for SLIs
Counters
Easy to understand
Easy to implement
Easy to aggregate
Easy to get the math right
Metrics for SLIs
Latency SLIs via counters
Request time < 500ms
Count ’em up, divide by total reqs
Add success objective and time range for SLO
99% of request times < 500ms over trailing week
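A counter-based latency SLI needs only two counters per source, and those counters add across sources without losing information. A minimal sketch, assuming in-process counters stand in for whatever the metrics pipeline provides. The host names, latencies, and function names are invented for illustration.

```python
from collections import Counter

THRESHOLD_MS = 500  # from the SLI: request time < 500ms


def record(counters: Counter, latency_ms: float) -> None:
    """Count every request, and count it as good if under the threshold."""
    counters["total"] += 1
    if latency_ms < THRESHOLD_MS:
        counters["good"] += 1


def sli(counters: Counter) -> float:
    """Good requests divided by total requests."""
    return counters["good"] / counters["total"]


host_a, host_b = Counter(), Counter()
for ms in (120, 480, 510):
    record(host_a, ms)
for ms in (90, 700, 300, 450, 200):
    record(host_b, ms)

# Aggregation is plain addition, so sample volume is never lost.
fleet = host_a + host_b
print(fleet["good"], fleet["total"], sli(fleet))
print("SLO met:", sli(fleet) >= 0.99)
```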
IMPLEMENTATION
Metrics for SLIs
Flexible Latency SLIs
Histogram based
# reqs 100-200ms, 200-300ms, etc
One time series for each latency band
zen.app.request.sli{path:/foo;bin:gt_500_le_600}
Metrics for SLIs
Flexible Latency SLIs
10..20...100ms
100..200...1,000ms
1,000..1,500...10,000ms
10,000..15,000...60,000ms
Latency == 547ms, metric tag `le_600`, `gt_500_le_600`
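Mapping a latency to its band is a lookup against the sorted band edges. A minimal sketch, assuming the edges follow the four ranges listed above (steps of 10, 100, 500, and 5,000 ms) and that upper edges are inclusive, as the `le_` prefix suggests. The treatment of latencies under 10ms and over 60,000ms is an assumption, since the slide does not specify it, and `bin_tags` is a hypothetical name.

```python
from bisect import bisect_left

# Upper band edges in ms, taken from the four ranges on the slide.
EDGES = (
    list(range(10, 100, 10))            # 10, 20, ... 90
    + list(range(100, 1000, 100))       # 100, 200, ... 900
    + list(range(1000, 10000, 500))     # 1,000, 1,500, ... 9,500
    + list(range(10000, 60001, 5000))   # 10,000, 15,000, ... 60,000
)


def bin_tags(latency_ms: float):
    """Return (cumulative tag, band tag) for a latency in milliseconds."""
    i = bisect_left(EDGES, latency_ms)  # index of first edge >= latency
    if i == len(EDGES):
        # Assumption: an overflow band above the largest edge.
        return ("le_inf", f"gt_{EDGES[-1]}")
    upper = EDGES[i]
    lower = EDGES[i - 1] if i > 0 else 0  # assumption: first band starts at 0
    return (f"le_{upper}", f"gt_{lower}_le_{upper}")


print(bin_tags(547))
```

For 547ms this reproduces the two tags from the slide, `le_600` and `gt_500_le_600`.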
Metrics for SLIs
Flexible Latency SLIs
Low errors per latency band
Not as precise as HDR Histograms
Possible cardinality expansion issues
Can implement on any monitoring vendor or TSDB
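The flexibility comes from choosing the latency threshold after the data is collected: any band edge can serve as the SLI cutoff. A minimal sketch, assuming per-band request counts have already been fetched into a dict keyed by band tags of the form `gt_<lower>_le_<upper>`. The counts and the `latency_sli` name are invented for illustration. Because bands use inclusive upper edges, this computes "latency <= threshold".

```python
import re

_BAND = re.compile(r"^gt_(\d+)_le_(\d+)$")


def latency_sli(band_counts: dict, threshold_ms: int) -> float:
    """Fraction of requests in bands whose upper edge is <= threshold_ms.

    The threshold should coincide with a band edge; a threshold inside a
    band counts that whole band as bad, which errs on the strict side.
    """
    good = total = 0
    for tag, count in band_counts.items():
        match = _BAND.match(tag)
        if match is None:
            raise ValueError(f"unrecognized band tag: {tag!r}")
        upper = int(match.group(2))
        total += count
        if upper <= threshold_ms:
            good += count
    if total == 0:
        raise ValueError("no requests recorded")
    return good / total


counts = {
    "gt_100_le_200": 70,
    "gt_400_le_500": 20,
    "gt_500_le_600": 8,
    "gt_1000_le_1500": 2,
}
print(latency_sli(counts, 500), latency_sli(counts, 600))
```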
The hard thing about Error Budgets and SLOs
[Diagram, built up over several slides. Components: CDN, Proxy, RoR, μSVC, annotated with SLI_1, SLI_2, SLI_3, SLI_4. Products: Search, Chat, Guide, Talk.]
Need for low variance increases
Real Users
Varied usage patterns
iPads, phones, laptops
LTE, Fiber, DSL, 3G
Focus on the users!
Different SLOs/EBs for Different Folks
99% of home page requests < 500ms over...
5 minutes - NOC / SRE
1 hour - Product Engineers
1 week - Product Managers
1 month - VPs
1 quarter - CXOs
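The same 99% objective implies a very different absolute budget depending on the window each audience watches. A minimal sketch, assuming steady traffic and treating a month as 30 days and a quarter as 90 days. Both durations, the traffic rate, and the function name are assumptions for illustration.

```python
from datetime import timedelta

# Audience -> time range, from the list above.
AUDIENCE_WINDOWS = {
    "NOC / SRE": timedelta(minutes=5),
    "Product Engineers": timedelta(hours=1),
    "Product Managers": timedelta(weeks=1),
    "VPs": timedelta(days=30),   # assumption: 1 month = 30 days
    "CXOs": timedelta(days=90),  # assumption: 1 quarter = 90 days
}


def allowed_bad_requests(requests_per_second, window, objective_pct=99):
    """Requests that may miss the SLI in the window at a steady rate."""
    total = requests_per_second * window.total_seconds()
    return total * (100 - objective_pct) / 100


# Hypothetical steady rate of 100 requests per second.
for audience, window in AUDIENCE_WINDOWS.items():
    print(f"{audience}: {allowed_bad_requests(100, window):,.0f}")
```

A short window gives on-call staff a small budget that reacts quickly, while a long window smooths brief incidents into a trend suited to planning.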
Keys to Error Budgets at Scale
Give everyone a formula to follow for SLIs/SLOs/EBs
Use simple tools that can deliver rich results
Use latency bands (histograms) for duration data
Measure SLIs as close to the client as possible
Use EBs with appropriate time ranges for audiences
Thank you
