From SLO to GOTY

From SLOs to GOTY
How observability and service-level objectives
can help get you to “Game of the Year”
Charity Majors
CTO of Honeycomb

@mipsytipsy

“Your System Is Broken”

@mipsytipsy
engineer/cofounder/CTO
https://charity.wtf
“Second Life Is Not A Game” — Linden Lab,
incessantly

Your imagination
Your tools & telemetry
Compelling game design / play
Fast, glitch-free experience

Monitoring
Observability
+ SLOs

Averages are bullshit.
The 99th percentile is bullshit.
99.999% is bullshit.
Every individual player experience counts.
Observability
Any ONE player who can’t log in can start
a shitstorm on the forums.

“My system has four nines”: this statement can be true, and yet:
• Everybody who logged into the game today with state saved on an unresponsive shard can think you are 100% down
• Requests to the /login endpoint could be timing out for everybody in a particular region
• Upserts to /payment could be failing upon registration
• Certain Android device types could be silently dropping push notifications
• Game state saves could be pointed at a read replica instead of the writable primary
Observability
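A minimal sketch of the "four nines" point above, assuming each request is recorded as an event with a hypothetical `shard` field and an `ok` flag (both names are illustrative and not from the talk). The aggregate success rate looks healthy while one shard is failing every request; you only see it by breaking the data down along a high-cardinality field.

```python
from collections import defaultdict


def success_rate_by(events, field):
    """Return {field value: fraction of successful events} for one field."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for e in events:
        totals[e[field]] += 1
        if e["ok"]:
            good[e[field]] += 1
    return {key: good[key] / totals[key] for key in totals}


# Hypothetical data: 99,990 successful requests spread over 50 healthy shards,
# plus 10 requests that all failed on one unresponsive shard.
events = [{"shard": f"shard-{i % 50}", "ok": True} for i in range(99_990)]
events += [{"shard": "shard-unresponsive", "ok": False} for _ in range(10)]

overall = sum(e["ok"] for e in events) / len(events)
per_shard = success_rate_by(events, "shard")

print(f"overall success rate: {overall:.4%}")
print(f"worst shard: {min(per_shard, key=per_shard.get)}")
```

The same breakdown works for any other field on the event (region, device type, endpoint), which is why the events need to carry those fields in the first place.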

Without observability, your team will resort to
guessing and iterating blindly, without evidence,
and you will struggle to connect feedback loops
that result in a fast, glitch-free experience.
Observability is the hidden link between
engineering experience and player
experience.
Observability lets you inspect cause and
effect at a granular level. Observability
enables engineers to have ownership
over the lifecycle of their software.

Can you understand what’s happening inside your games, just by asking
questions from the outside? Can you debug your code and reconstruct
any user’s experience using the output of your tools?
Can you understand new scenarios without shipping new code?
o11y for game developers:
If you can’t see it, you can’t improve it

Game devs were some of the first to run up
against the limits of low-cardinality tools
https://www.eveonline.com/news/view/introducing-quasar
• Complex, highly distributed architectures
• Designed and developed by a multitude of teams
• Played across thousands of device types
• Enormous concurrency issues and thundering
herds

First Principles of Observability
• High cardinality
• High dimensionality
• Based on arbitrarily-wide structured events
• …with span IDs, to support tracing
• Exploratory, open-ended investigation of raw events
• No indexes, schemas, or pre-aggregation
• Bundles the full context of the request across network hops

https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/
Observability IS NOT
• Achievable with metrics-based tools (Prometheus, Datadog, etc.)
• Compatible with write-time aggregation
• The same thing as monitoring
• Anything whatsoever to do with “pillars”

The fundamental building block of
observability tools is the arbitrarily-wide
structured data blob, or ‘canonical log line’,
which can have hundreds of k/v pairs.
The fundamental building block of monitoring tools is the metric.
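As a rough illustration of an arbitrarily-wide structured event, the sketch below builds one dictionary per request and emits it as a single JSON line. The function names (`handle_login`, `emit`) and every field name are assumptions made up for this example; a real event would carry many more key/value pairs.

```python
import json
import time
import uuid


def handle_login(player_id, region, shard, device_type):
    """Handle one (pretend) login request and return a single wide event."""
    start = time.monotonic()
    # One dict per request; every field here is an illustrative assumption.
    event = {
        "timestamp": time.time(),
        "service": "login",
        "endpoint": "/login",
        "trace_id": uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "player_id": player_id,  # high-cardinality field
        "region": region,
        "shard": shard,
        "device_type": device_type,
    }
    # Real work would happen here, adding fields as they become known.
    event["status_code"] = 200
    event["duration_ms"] = (time.monotonic() - start) * 1000.0
    return event


def emit(event):
    """Write the event as one JSON line (a 'canonical log line')."""
    print(json.dumps(event, sort_keys=True))


emit(handle_login("player-123", "eu-west", "shard-7", "android"))
```

By contrast, a metric would keep only something like a counter for `/login` requests and discard the player, shard, and device context.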

If we rely on metrics, logs, and post-hoc
monitoring, we will find most of our problems
via customer reporting. This sucks.
Complexity is exploding everywhere,
but our tools were designed
for predictable worlds
Most bugs will never turn up in any staging
environment or be caught by our test
suites. Many bugs aren’t even bugs! —
they’re simply user interactions!

We need to shift our focus away from writing
infinity tests and monitoring checks and trying
to predict what will break, or checking for the
same error states again and again.
We need to embrace the fact that we all test in production,
and give ourselves the tools to do this well.
Instrument code for observability, shrink the deploy cycle to
a few minutes, ship one mergeset by one developer at a
time, and look at our code in production.

The solution:
Write code with instrumentation
Run your systems with SLOs
Tighten up your feedback loops with deploys

O.D.D.
Observability-Driven Development
Write code with instrumentation
• Kickstart the virtuous cycle of “you build it, you own it”
• By instrumenting your code as you write it
• Never accept a PR unless you can explain how to tell if it breaks
• Watch your code go out as it deploys
• Observe your code through the lens of your instrumentation, asking:
• “Is it working as intended? Does anything else look weird?”

Tight feedback loops with fast deploys
Fifteen minutes or bust.
One mergeset by one dev at a time
Many times a day

Service-Level Objectives
In search of a common language:
Management: How broken is too broken?
Engineering: What does good enough mean?
Operations: Combatting alert fatigue

Service Level Indicator = Good events / Eligible events
Eligible: “Had an HTTP status code”
Good: “… that was a 200, and was served in under 500 ms”
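The ratio above can be sketched in a few lines, assuming each request is an event with hypothetical `status_code` and `duration_ms` fields (the field names and the sample data are illustrative). Events with no HTTP status code are not eligible and are left out of both the numerator and the denominator.

```python
def service_level_indicator(events, max_ms=500.0):
    """Return good events / eligible events, or None if nothing is eligible."""
    # Eligible: the event had an HTTP status code.
    eligible = [e for e in events if e.get("status_code") is not None]
    if not eligible:
        return None
    # Good: the status was 200 and the request was served in under max_ms.
    good = [
        e for e in eligible
        if e["status_code"] == 200 and e["duration_ms"] < max_ms
    ]
    return len(good) / len(eligible)


sample = [
    {"status_code": 200, "duration_ms": 120.0},   # eligible and good
    {"status_code": 200, "duration_ms": 800.0},   # eligible, too slow
    {"status_code": 500, "duration_ms": 90.0},    # eligible, error
    {"status_code": None, "duration_ms": 10.0},   # not eligible
]
print(service_level_indicator(sample))
```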

(Honeycomb SLOs)
We ALWAYS store incoming user data: 99.99% (~4.3 minutes)
Default dashboards USUALLY load in <1 sec: 99.9% (~45 minutes)
Queries OFTEN return in <10 sec: 99% (7.3 hours)
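To see where time figures like these come from, here is a small sketch that converts an SLO target into the amount of "bad" time it permits. The 30-day window is an assumption made for this example; the slide does not state its window, and its figures are in the range you would expect for a window of roughly one month.

```python
def allowed_bad_minutes(target_percent, window_days=30):
    """Minutes of failure a target permits over the window (the error budget)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


for target in (99.99, 99.9, 99.0):
    minutes = allowed_bad_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes")
```

Each extra nine shrinks the budget by a factor of ten, which is why the target should be chosen per user-facing promise rather than applied uniformly.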

Monitoring Checks, Symptom-based alerts
• A disk is 89% full on one of your database primaries
• CPU saturation on your export cluster is at 90% average
• 5% of requests to a partner API are returning HTTP 500
• The queue of iOS push notifications has been filling up and not
draining for the past 15 minutes
Instead of alerting on hundreds or thousands of symptom-
based monitoring checks, alert only on a few precious SLOs
that directly reflect user pain. Less noise, better service.
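One possible way to alert on an SLO instead of on individual checks is to page only when a meaningful share of the error budget has been spent. The sketch below is an assumption-laden illustration, not a description of any particular tool: the function names, the 80% threshold, and the event counts are all hypothetical.

```python
def budget_consumed(good, eligible, target):
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if eligible == 0:
        return 0.0
    allowed_bad = eligible * (1.0 - target)
    actual_bad = eligible - good
    if allowed_bad == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return float("inf") if actual_bad else 0.0
    return actual_bad / allowed_bad


def should_alert(good, eligible, target, threshold=0.8):
    """Page a human only when the budget is mostly gone (threshold is assumed)."""
    return budget_consumed(good, eligible, target) >= threshold


# Hypothetical counts for the current SLO window, with a 99.9% target.
print(should_alert(good=99_950, eligible=100_000, target=0.999))
print(should_alert(good=99_900, eligible=100_000, target=0.999))
```

A full disk or a hot CPU would only page someone here if it made enough requests fail or slow down to eat into the budget.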

Status of an SLO
How have we done?
SLO examples
…

Service-Level Objectives +
Observability…for easy debugging

You have an observable system
when your team can quickly and reliably diagnose
any new behavior with no prior knowledge.
Observability begins with
rich instrumentation, putting you in
constant conversation with your code
It brings everyone up to the level
of the best debugger in each area
With SLOs, it is how you get to a
fast, glitch-free experience

This is how you get a jump on
glitches.
Instrument your code for observability.
Tighten up your feedback loops by deploying a single mergeset by a single
engineer at a time, very fast, many times a day (fifteen minutes or less!)
Replace your floods of symptom alerts with a few well-chosen
SLOs
Watch your code as it goes out.
Find the errors before your users do.

