Reliable observability at scale: Error Budgets for 1,000+

Reliable Observability at Scale
Error Budgets for 1,000+
Observability Practitioner’s Summit
11/18/2019
#observabilitysummit @phredmoyer

Hi, I’m Fred
SLOgician
Thinks about SLOs, SLIs, Error Budgets
Observability Hacker
TSDBs, StatsD, Prometheus, Histograms
Software Engineer (SRE)
15+ yrs C, Perl, Ruby, Go, Python, blabla
Dad
Two kids, needs more sleep/coffee

ZENDESK ARCHITECTURE FLYOVER
ERROR BUDGET DEMOCRATIZATION
TOOLING, APPROACH, IMPLEMENTATION
THE HARD THING ABOUT EBS & SLOS
QUESTIONS

[Architecture diagram, built up across slides. Components: Proxy, CDN, RoR, μSVC. Products added in sequence: Support, Guide, Chat, Explore, Talk, Sell.]

A BRIEF HISTORY OF SLOs/SLIs/EBs
[Timeline: 2016, 2018, 2019]
SRECON EUR: Developing Effective SLIs and SLOs
BLOG: Latency SLOs Done Right
OPS K8SCON: Latency SLOs Done Right
@LIZTHEGREY TALK: Effective SLOs
SRECON US: Latency SLOs Done Right
BAYLISA: Practical SLOs with EBs
SRECON APAC: Latency SLOs Done Right
SRECON DUB: Several SLO/EB Talks
SCALE17X: Latency SLOs Done Right

SLIs
Delineate ‘Good’ vs ‘Bad’ Requests

EXAMPLE SLIS
95th percentile home page latency over 5 minutes < 500ms
Home page request response code != 5xx
Home page request served in < 100ms
[Metric Identifier] [Operator] [Metric Value]
e.g. [95th percentile home page latency over 5 minutes] [<] [500ms]
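The [Metric Identifier] [Operator] [Metric Value] formula can be read as a predicate that sorts each measurement into good or bad. A minimal sketch, assuming a hypothetical metric name and a small operator table that are not from the slides:

```python
import operator
from dataclasses import dataclass

# Comparison operators an SLI may use (an illustrative set, not from the talk).
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


@dataclass(frozen=True)
class SLI:
    metric: str    # metric identifier (hypothetical name below)
    op: str        # key into OPERATORS
    value: float   # metric value to compare against

    def is_good(self, observed: float) -> bool:
        """Return True when the observed measurement satisfies the SLI."""
        return OPERATORS[self.op](observed, self.value)


# "Home page request served in < 100ms"
latency_sli = SLI("home_page.request.latency_ms", "<", 100)

print(latency_sli.is_good(87))    # a fast request counts as good
print(latency_sli.is_good(140))   # a slow request counts as bad
```

An SLI such as "response code != 5xx" compares against a class of codes rather than one value, so it would need a slightly richer predicate than this sketch provides.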

SLOs
Binding target for SLIs

SLO = #goodreqs / #totalreqs + Time range

EXAMPLE SLOS
99% of 95th percentile home page latency over 5 minutes < 500ms over the trailing month
99% of home page request response code != 5xx over last 7 days
95% of home page requests served in < 100ms over last 24 hours
[Success Objective] [SLI] [Period]
e.g. [99% of] [95th percentile home page latency over 5 minutes < 500ms] [over the trailing month]

Nobody’s Perfect
Error Budget = 1-SLO

Success Objective == 99%
Error Budget = 1-0.99 == 1%
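The arithmetic above can be carried one step further, from a percentage to a number of requests that are allowed to fail. A minimal sketch, assuming hypothetical request volumes; `Decimal` is used so that 1 - 0.99 comes out as exactly 0.01 rather than a floating-point approximation:

```python
from decimal import Decimal


def error_budget(success_objective: Decimal) -> Decimal:
    """Error Budget = 1 - SLO, as a fraction of all requests."""
    return Decimal(1) - success_objective


def allowed_bad_requests(total_requests: int, success_objective: Decimal) -> int:
    """Number of requests that may fail before the budget is spent."""
    return int(total_requests * error_budget(success_objective))


def budget_remaining(total_requests: int, bad_requests: int,
                     success_objective: Decimal) -> Decimal:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = total_requests * error_budget(success_objective)
    if allowed == 0:
        # Assumption for this sketch: with no budget at all, any failure overspends it.
        return Decimal(0) if bad_requests == 0 else Decimal(-1)
    return (allowed - bad_requests) / allowed


objective = Decimal("0.99")
print(error_budget(objective))                           # 0.01
print(allowed_bad_requests(1_000_000, objective))        # 10000
print(budget_remaining(1_000_000, 2_500, objective))     # 0.75
```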

EXAMPLE EBS
Allow 1% failure of 95th percentile home page latency over 5 minutes < 500ms over the trailing month
Allow 1% failure of home page request response code != 5xx over last 7 days
Allow 5% failure of home page requests served in < 100ms over last 24 hours
[Error Budget] [SLI] [Period]
e.g. [Allow 1% failure of] [95th percentile home page latency over 5 minutes < 500ms] [over the trailing month]

Keys to Error Budget Democratization
Real world examples that are easy to reference
Formulas that can be parsed by humans and code
Be explicit; small details make big differences
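One way to keep a formula readable by both humans and code is to store its three parts as data and render the sentences from them. A minimal sketch; the class and field names are illustrative, and whole-number percentages are an assumption made to keep the example short:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLODefinition:
    objective_pct: int   # success objective as a whole percent, e.g. 99
    sli: str             # the SLI, written out in full
    period: str          # the time range, e.g. "over last 7 days"

    def as_slo(self) -> str:
        """[Success Objective] [SLI] [Period]"""
        return f"{self.objective_pct}% of {self.sli} {self.period}"

    def as_error_budget(self) -> str:
        """[Error Budget] [SLI] [Period], where the budget is 100% minus the objective."""
        return f"Allow {100 - self.objective_pct}% failure of {self.sli} {self.period}"


definition = SLODefinition(
    objective_pct=99,
    sli="home page request response code != 5xx",
    period="over last 7 days",
)
print(definition.as_slo())
print(definition.as_error_budget())
```

The same record can feed dashboards, alert rules, and documentation, so the wording people read and the numbers code evaluates cannot drift apart.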

TOOLING, APPROACH, IMPLEMENTATION

TOOLING

Lots of teams; lots of tools
Metrics: Prometheus / StatsD => Datadog
Logs: JSON => [ ELK, Datadog, AWS ]
APM: Datadog
Network: [ Datadog, ThousandEyes ]
Distributed Tracing: WIP

StatsD - not just for servers
Measuring service performance is (mostly) easy
Client apps are more difficult
Disconnects
Caching (CDN, Proxy)
Large browser & device variance

Logs, Traces, Metrics
Conway’s Law; experts for each ‘pillar’
Democratize Expertise
#ask-sre
Reliability Champions
`Observability 101`
`Hands On With Datadog`

APPROACH

Metrics for SLIs
Lies, Darn Lies, and Percentiles
Easy to get the math wrong
Missing the X Factor - Sample Volume
Many vendors have bugs in percentile tools
Can’t aggregate them (well, most of them)

Metrics for SLIs
Counters
Easy to understand
Easy to implement
Easy to aggregate
Easy to get the math right
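The aggregation point can be illustrated with two hosts that carry very different request volumes. A minimal sketch using synthetic latencies (not data from the talk) and a nearest-rank percentile: averaging per-host percentiles ignores sample volume, while good and total counters simply add.

```python
def p95(samples):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def good_and_total(samples, threshold_ms=500):
    """Count requests under the threshold, plus all requests."""
    good = sum(1 for s in samples if s < threshold_ms)
    return good, len(samples)


# Synthetic latencies in ms: a busy, mostly fast host and a quiet, slow host.
host_a = [100] * 99 + [1000]
host_b = [900] * 10

# Percentiles: averaging the per-host values does not give the fleet-wide value.
averaged_p95 = (p95(host_a) + p95(host_b)) / 2
true_p95 = p95(host_a + host_b)
print(averaged_p95, true_p95)   # 500.0 900

# Counters: per-host counts sum to the fleet-wide counts.
good_a, total_a = good_and_total(host_a)
good_b, total_b = good_and_total(host_b)
good, total = good_a + good_b, total_a + total_b
print(good, total, good / total)   # 99 110 0.9
```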

Metrics for SLIs
Latency SLIs via counters
Request time < 500ms
Count ’em up, divide by total reqs
Add success objective and time range for SLO
99% of request times < 500ms over trailing week
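The counting step can be sketched in a few lines. This is an illustration only: the counter names are hypothetical, the latencies are made up, and the time range is assumed to be applied by whatever system stores and queries the counters.

```python
from collections import Counter

THRESHOLD_MS = 500   # SLI: request time < 500ms
OBJECTIVE = 0.99     # SLO: 99% of requests


def record(counters, latency_ms):
    """Count every request, and separately count the ones that met the SLI."""
    counters["total"] += 1
    if latency_ms < THRESHOLD_MS:
        counters["good"] += 1


def sli_ratio(counters):
    """Good requests divided by total requests (1.0 when there is no traffic)."""
    total = counters["total"]
    return counters["good"] / total if total else 1.0


counters = Counter()
for latency_ms in [120, 480, 510, 95, 730, 300, 499, 500, 210, 60]:
    record(counters, latency_ms)

ratio = sli_ratio(counters)
slo_met = ratio >= OBJECTIVE
print(counters["good"], counters["total"], ratio, slo_met)   # 7 10 0.7 False
```

A request at exactly 500ms counts as bad here because the SLI uses a strict less-than.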

IMPLEMENTATION

Metrics for SLIs
Flexible Latency SLIs
Histogram based
# reqs 100-200ms, 200-300ms, etc
One time series for each latency band
zen.app.request.sli{path:/foo;bin:gt_500_le_600}

Metrics for SLIs
10..20...100ms
100..200...1,000ms
1,000..1,500...10,000ms
10,000..15,000...60,000ms
Latency == 547ms, metric tag `le_600`, `gt_500_le_600`
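Mapping a latency to its band tag can be sketched with a sorted list of upper bounds and a binary search. The band edges follow the four ranges listed above; the tags for values at or below 10ms and above 60,000ms are assumptions, since the slides do not show those cases.

```python
from bisect import bisect_left

# Upper bounds in ms: 10ms steps to 100, 100ms steps to 1,000,
# 500ms steps to 10,000, then 5,000ms steps to 60,000.
BOUNDS = (
    list(range(10, 100, 10))
    + list(range(100, 1000, 100))
    + list(range(1000, 10000, 500))
    + list(range(10000, 60001, 5000))
)


def band_tag(latency_ms):
    """Return the tag of the latency band containing latency_ms."""
    i = bisect_left(BOUNDS, latency_ms)      # first bound >= latency_ms
    if i == len(BOUNDS):
        return f"gt_{BOUNDS[-1]}"            # assumed overflow band
    lower = BOUNDS[i - 1] if i > 0 else 0    # assumed lower edge of the first band
    return f"gt_{lower}_le_{BOUNDS[i]}"


def series_name(path, latency_ms):
    """Format a series in the style shown on the slide."""
    return f"zen.app.request.sli{{path:{path};bin:{band_tag(latency_ms)}}}"


print(band_tag(547))               # gt_500_le_600
print(series_name("/foo", 547))    # zen.app.request.sli{path:/foo;bin:gt_500_le_600}
```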

Metrics for SLIs
Low errors per latency band
Not as precise as HDR Histograms
Possible cardinality expansion issues
Can implement on any monitoring vendor or TSDB
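Because each band is its own counter, a latency SLI for any threshold that falls on a band edge can be computed after the fact by summing bands. A minimal sketch with made-up counts; the tag format matches the `gt_500_le_600` style above, and a band with no upper bound is assumed to be an overflow band that always counts as bad.

```python
from typing import Mapping


def good_fraction(band_counts: Mapping[str, int], threshold_ms: int) -> float:
    """Fraction of requests in bands whose upper bound is <= threshold_ms."""
    total = sum(band_counts.values())
    if total == 0:
        return 1.0   # assumption: no traffic counts as fully good
    good = 0
    for tag, count in band_counts.items():
        _, sep, upper = tag.partition("_le_")
        if sep and int(upper) <= threshold_ms:
            good += count
    return good / total


# Made-up request counts per latency band.
band_counts = {
    "gt_100_le_200": 700,
    "gt_400_le_500": 250,
    "gt_500_le_600": 40,
    "gt_900_le_1000": 10,
}

print(good_fraction(band_counts, 500))    # 0.95
print(good_fraction(band_counts, 200))    # 0.7
print(good_fraction(band_counts, 1000))   # 1.0
```

A threshold that falls inside a band cannot be answered exactly; the uncertainty is limited to the requests counted in that one band, which is why narrow bands keep the error low.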

The hard thing about Error Budgets and SLOs

[Architecture diagram, built up across slides. Components: Proxy, CDN, RoR, μSVC. Products: Search, Chat, Guide, Talk. Measurement points SLI_1 through SLI_4 are marked on the diagram.]
Need for low variance increases

Real Users
Varied usage patterns
iPads, phones, laptops
LTE, Fiber, DSL, 3G

[Architecture diagram repeated: Proxy, CDN, RoR, μSVC with SLI_1 through SLI_4]
Focus on the users!

Different SLOs/EBs for Different Folks
99% of home page requests < 500ms over...
5 minutes - NOC / SRE
1 hour - Product Engineers
1 week - Product Managers
1 month - VPs
1 quarter - CXOs
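The same counters can answer each audience by changing only the trailing window. A minimal sketch over synthetic per-minute (good, total) buckets; the data, the three windows chosen, and the function names are illustrative assumptions.

```python
OBJECTIVE = 0.99

# Trailing windows in minutes, labelled with the audience from the slide.
WINDOWS = {
    "5 minutes - NOC / SRE": 5,
    "1 hour - Product Engineers": 60,
    "1 week - Product Managers": 7 * 24 * 60,
}


def attainment(buckets, window_minutes):
    """Good / total over the most recent window_minutes buckets."""
    recent = buckets[-window_minutes:]
    good = sum(g for g, _ in recent)
    total = sum(t for _, t in recent)
    return good / total if total else 1.0


# Synthetic week of per-minute buckets, newest last: every minute is perfect
# except the last five, where 10% of requests are slow.
week_minutes = 7 * 24 * 60
buckets = [(1000, 1000)] * (week_minutes - 5) + [(900, 1000)] * 5

report = {name: attainment(buckets, minutes) for name, minutes in WINDOWS.items()}
for name, value in report.items():
    status = "met" if value >= OBJECTIVE else "missed"
    print(f"{name}: {value:.5f} ({status})")
```

With this data the five-minute view misses the 99% objective while the hour and week views still meet it, which is why each audience gets a window matched to the decisions it makes.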

Keys to Error Budgets at Scale
Give everyone a formula to follow for SLIs/SLOs/EBs
Use simple tools that can deliver rich results
Use latency bands (histograms) for duration data
Measure SLIs as close to the client as possible
Use EBs with appropriate time ranges for audiences

Thank you
