Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems
Austin Parker, Principal Developer Advocate at Lightstep
Who Am I?
Austin Parker
Principal Developer Advocate
@austinlparker
austin@lightstep.com
Part 1: A Critique
The Conventional Wisdom
● Observing microservices is hard
● Google and Facebook solved this (right???)
● They used Metrics, Logging, and Distributed Tracing…
● So we should, too.
The Three Pillars of Observability
- Metrics
- Logging
- Distributed Tracing
Metrics
Logging
Tracing
Fatal Flaws
A word nobody knew in 2015…
Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”)…
… but cardinality is the catch.
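To make that concrete, here is a minimal sketch (all tag names and counts below are hypothetical) of how tag dimensions multiply into distinct timeseries:

# Hypothetical tag dimensions on a single latency metric.
# Every unique combination of tag values becomes its own timeseries,
# so cardinality is the product of the per-tag value counts.
tag_values = {
    "service": 300,          # microservices
    "endpoint": 50,          # routes per service
    "status_code": 10,
    "customer_id": 100_000,  # high-cardinality tag: the "word nobody knew in 2015"
}

cardinality = 1
for tag, count in tag_values.items():
    cardinality *= count

print(f"distinct timeseries for one metric: {cardinality:,}")
# -> distinct timeseries for one metric: 15,000,000,000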
Logging Data Volume: a reality check
transaction rate
x all microservices
x cost of net+storage
x weeks of retention
-----------------------
way too much $$$$
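A back-of-the-envelope version of that multiplication; every number below is an illustrative assumption, not a measurement:

# Mirrors the slide's formula:
# transaction rate x all microservices x cost of net+storage x weeks of retention.
transactions_per_sec  = 2_000
services_per_txn      = 25        # each request touches ~25 microservices
log_bytes_per_service = 1_000     # ~1 KB of logs per service per request
retention_weeks       = 4
cost_per_gb_month     = 0.50      # blended net + storage cost, assumed

seconds_per_week = 7 * 24 * 3600
bytes_retained = (transactions_per_sec * services_per_txn *
                  log_bytes_per_service * retention_weeks * seconds_per_week)
gb_retained = bytes_retained / 1e9

print(f"{gb_retained:,.0f} GB retained -> ${gb_retained * cost_per_gb_month:,.0f}/month")
# -> roughly 121,000 GB retained -> roughly $60,000/month, for logs alone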
The Life of Transaction Data: Dapper
Stage                       | Overhead affects…          | Retained
Instrumentation executed    | App                        | 100.00%
Buffered within app process | App                        | 0.10%
Flushed out of process      | App                        | 0.10%
Centralized regionally      | Regional network + storage | 0.10%
Centralized globally        | WAN + storage              | 0.01%
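The 0.10% rows come from an up-front (head-based) sampling decision. A minimal sketch of that idea, with an assumed 1-in-1,000 rate and hypothetical function names:

import random

SAMPLE_RATE = 0.001  # keep ~1 in 1,000 traces, as in the 0.10% rows above

def start_trace():
    """Head-based sampling: the keep/drop decision is made up front,
    before anything is known about how the request will turn out."""
    sampled = random.random() < SAMPLE_RATE
    return {"trace_id": random.getrandbits(64), "sampled": sampled}

def record_span(trace, name, duration_ms):
    # Instrumentation always executes (100% of the app overhead)...
    if not trace["sampled"]:
        return  # ...but unsampled spans are dropped before buffering/flushing
    buffer_span(trace["trace_id"], name, duration_ms)

def buffer_span(trace_id, name, duration_ms):
    print(f"buffered span {name} ({duration_ms}ms) for trace {trace_id:x}")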
Fatal Flaws: A Review
                                  | Logs | Metrics | Dist. Traces
TCO scales gracefully             | –    | ✓       | ✓
Accounts for all data (unsampled) | ✓    | ✓       | –
Immune to cardinality             | ✓    | –       | ✓
Data vs UI: Metrics
Data vs UI: Logs
Data vs UI: Traces
Metrics, Logs, and Traces are
Just Data,
… not a feature or use case.
Part 2: A New Scorecard for Observability
Mental Model: Goals and Activities
● Goals: how our services perform in the eyes of their consumers
● Activities: what we (as operators) actually do to further our goals
Quick Vocab Refresher: SLIs
“SLI” = “Service Level Indicator”
TL;DR: An SLI is an indicator of health that a service’s consumers would care about…
… not an indicator of its inner workings.
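A hedged sketch of what a consumer-facing SLI can look like in practice: the fraction of requests that both succeeded and came back fast enough. The request data and the 500 ms threshold are hypothetical.

# A consumer-facing SLI: what fraction of requests were "good" from the
# caller's point of view (succeeded AND met the latency threshold)?
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 480},
    {"status": 500, "latency_ms": 90},
    {"status": 200, "latency_ms": 1500},
]

LATENCY_THRESHOLD_MS = 500

good = sum(1 for r in requests
           if r["status"] < 500 and r["latency_ms"] <= LATENCY_THRESHOLD_MS)
sli = good / len(requests)
print(f"SLI: {sli:.0%} of requests were good")  # -> SLI: 50% of requests were good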
Observability: 2 Fundamental Goals
Gradually improving an SLI (days, weeks, months…)
Rapidly restoring an SLI (NOW!!!!)
Reminder: “SLI” = “Service Level Indicator”
Observability: 2 Fundamental Activities
1. Detection: measuring SLIs precisely
2. Refinement: reducing the search space for plausible explanations
An interlude about stats frequency
Scorecard: Detection
1. Specificity:
- Cost of cardinality ($ per tag value)
- Stack support (mobile/web platforms, managed services, “black-box OSS infra” like Kafka/Cassandra)
2. Fidelity:
- Correct stats!!! (global p95, p99)
- High stats frequency (stats sampling frequency, in seconds)
3. Freshness (lag from real-time, in seconds)
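Why “correct stats” is on the list: per-host percentiles cannot be averaged into a global percentile. A sketch with synthetic latencies:

import random

random.seed(42)

def p95(values):
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]

# Synthetic latencies: host B is much slower but handles far less traffic.
host_a = [random.gauss(100, 10) for _ in range(9_000)]   # fast, busy host
host_b = [random.gauss(900, 50) for _ in range(1_000)]   # slow, idle host

avg_of_p95s = (p95(host_a) + p95(host_b)) / 2
global_p95  = p95(host_a + host_b)

print(f"average of per-host p95s: {avg_of_p95s:.0f} ms")  # roughly 550 ms -- wrong
print(f"true global p95:          {global_p95:.0f} ms")   # roughly 900 ms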
Why “Refinement”?
# of things your users actually care about
  ≪  (# of microservices) × (# of failure modes)
(e.g., 100 services × 20 failure modes = 2,000 hypotheses, vs. a handful of user-visible symptoms)
Must reduce the search space!
The Refinement Process
Discover Variance → Explain Variance → Deploy Fix
Histograms vs “p99”
Scorecard: Refinement
Identifying Variance:
- Cardinality ($ per tag value)
- Robust stats (histograms (see prev slide))
- Retention horizons for plausible queries (time duration)
Explaining variance:
- Correct stats!!! (global p95, p99)
- “Suppress the messengers” of microservice failures
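“Robust stats (histograms)” because, unlike p95/p99 values, histograms merge cleanly across services and hosts. A minimal sketch with assumed fixed bucket bounds and hypothetical helper names:

from bisect import bisect_left

# Fixed latency buckets (ms); the upper bounds are illustrative assumptions.
BOUNDS = [10, 25, 50, 100, 250, 500, 1000, 2500, float("inf")]

def to_histogram(latencies):
    counts = [0] * len(BOUNDS)
    for v in latencies:
        counts[bisect_left(BOUNDS, v)] += 1
    return counts

def merge(histograms):
    # Histograms merge by bucket-wise addition -- something p95/p99 values can't do.
    return [sum(col) for col in zip(*histograms)]

def approx_percentile(counts, q):
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BOUNDS, counts):
        running += count
        if running >= target:
            return bound  # upper bound of the bucket containing the quantile

service_a = to_histogram([12, 40, 95, 120, 310])
service_b = to_histogram([800, 900, 1200, 30])
print("global ~p99 bucket (ms):", approx_percentile(merge([service_a, service_b]), 0.99))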
Wrapping Up...
(first, a hint at my perspective)
A fun game! (“Observability Whack-a-Mole”)
Design your own observability system:
❏ High-throughput
❏ High-cardinality
❏ Lengthy retention window
❏ Unsampled
Choose three
The Life of Trace Data: Dapper
Stage                       | Overhead affects…          | Retained
Instrumentation executed    | App                        | 100.00%
Buffered within app process | App                        | 0.10%
Flushed out of process      | App                        | 0.10%
Centralized regionally      | Regional network + storage | 0.10%
Centralized globally        | WAN + storage              | 0.01%
The Life of Trace Data: Dapper vs. Other Approaches
Stage                       | Overhead affects…          | Retained (Dapper) | Retained (other approaches)
Instrumentation executed    | App                        | 100.00%           | 100.00%
Buffered within app process | App                        | 0.10%             | 100.00%
Flushed out of process      | App                        | 0.10%             | 100.00%
Centralized regionally      | Regional network + storage | 0.10%             | 100.00%
Centralized globally        | WAN + storage              | 0.01%             | “fancy”
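One plausible reading of “fancy”: keep everything until the whole trace is visible, then decide what to retain (tail-based sampling). The policy below is a hypothetical illustration, not any particular vendor’s rule:

def keep_trace(trace):
    """Tail-based sampling: the keep/drop decision happens after the whole
    trace has been centralized, so it can use what actually happened."""
    has_error = any(span["status"] >= 500 for span in trace["spans"])
    is_slow   = trace["duration_ms"] > 1000
    is_rare   = trace["root_operation"] in {"checkout", "password_reset"}
    # Keep all interesting traces, plus a small random slice of the rest.
    return has_error or is_slow or is_rare or (hash(trace["trace_id"]) % 100 == 0)

trace = {
    "trace_id": "9f86d081884c7d65",
    "root_operation": "checkout",
    "duration_ms": 220,
    "spans": [{"status": 200}, {"status": 200}],
}
print(keep_trace(trace))  # True: rare-but-important operation, kept despite being fast and healthy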
An Observability Scorecard
Detection
- Specificity: cardinality cost, stack coverage
- Fidelity: correct stats, high stats frequency
- Freshness: ≤ 5 seconds
Refinement
- Identifying variance: cardinality cost, correct stats, hi-fi histograms, retention horizons
- “Suppress the messengers”
LightStep: Observability with context
Automatic deployment and regression detection
System and service diagrams
Real-time and historical root cause analysis
Correlations
Custom alerting
Easy setup with no vendor lock-in
No cardinality limitations, really
Q&A
Get Started Today
go.lightstep.com/trial
Extra Slides
Ideal Measurement: Robust
Ideal Measurement: High-Dimensional
Ideal Refinement: Real-time
Must be able to test and eliminate hypotheses quickly
Actual data must be ≤10s fresh
UI / API latency must be very low
Ideal Refinement: Global
Ideal Refinement: Context-Rich
We can’t expect humans to know what’s normal
