Three Pillars with Zero Answers
A New Observability Scorecard
March 5, 2020
● Katia Bazzi
● Software Engineer @ LightStep
● katia@lightstep.com
● lightstep.com
Part I
A Critique
Observing microservices is hard
Google and Facebook solved this (right???)
They used Metrics, Logging, and Distributed Tracing…
So we should, too.
The Conventional Wisdom
The Three Pillars of Observability
- Metrics
- Logging
- Distributed Tracing
Metrics!
Logging!
Tracing!
Fatal Flaws
A word nobody knew in 2015…
Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”) …
… but cardinality is where the costs bite.
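To make the cardinality point concrete, here is a minimal sketch (Python; the tag names and value counts are invented for illustration) of how each added dimension multiplies the number of distinct timeseries a metrics backend has to store:

```python
# Illustrative only: tag names and value counts are made up, not from the talk.
# Each added dimension multiplies the number of distinct timeseries a metrics
# backend must store (and often bill for) per metric name.
from math import prod

tag_value_counts = {
    "service":     50,       # microservices emitting the metric
    "endpoint":    40,       # routes per service
    "status_code": 10,
    "region":      5,
    "customer_id": 100_000,  # the "high-cardinality" tag that breaks the budget
}

distinct_timeseries = prod(tag_value_counts.values())
print(f"{distinct_timeseries:,} distinct timeseries for one metric name")
# -> 10,000,000,000
```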
Logging Data Volume: a reality check
transaction rate
x all microservices
x cost of net+storage
x weeks of retention
-----------------------
way too much $$$$
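A back-of-envelope version of that multiplication, using assumed rates and prices rather than figures from the talk:

```python
# Back-of-envelope only: every number below is an assumption, not a quote.
transactions_per_sec  = 2_000   # peak transaction rate
services_per_txn      = 20      # microservices each transaction touches
log_bytes_per_service = 1_000   # ~1 KB of log lines per service hop
retention_weeks       = 4
usd_per_gb_month      = 0.50    # blended network + storage cost (assumed)

seconds_per_week = 60 * 60 * 24 * 7
bytes_per_week   = (transactions_per_sec * seconds_per_week
                    * services_per_txn * log_bytes_per_service)
retained_tb      = bytes_per_week * retention_weeks / 1e12
monthly_usd      = retained_tb * 1_000 * usd_per_gb_month  # TB -> GB

print(f"~{retained_tb:,.0f} TB retained, roughly ${monthly_usd:,.0f}/month")
# -> ~97 TB retained, roughly $48,000/month, before any indexing or replication
```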
The Life of Transaction Data: Dapper
Stage                        | Overhead affects…          | Retained
Instrumentation executed     | App                        | 100.00%
Buffered within app process  | App                        |   0.10%
Flushed out of process       | App                        |   0.10%
Centralized regionally       | Regional network + storage |   0.10%
Centralized globally         | WAN + storage              |   0.01%
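The 0.1% comes from Dapper-style head-based sampling: the keep/drop decision is made once at the root of a trace and propagated downstream, so almost all span data is discarded in-process. A rough illustrative sketch, not Dapper's actual code:

```python
# Illustrative sketch of Dapper-style head-based sampling, not Dapper's code.
# The keep/drop decision is made once at the root of the trace and propagated,
# so ~99.9% of span data is discarded in-process before it costs anything
# beyond the instrumentation itself.
import random

SAMPLE_RATE = 1 / 1024  # on the order of Dapper's default for busy services

def start_trace() -> dict:
    return {
        "trace_id": random.getrandbits(64),
        "sampled": random.random() < SAMPLE_RATE,  # decided once, at the root
    }

def record_span(ctx: dict, name: str, duration_ms: float) -> None:
    if not ctx["sampled"]:
        return  # dropped in-process: no buffering, flushing, or network cost
    flush_to_collector(ctx["trace_id"], name, duration_ms)

def flush_to_collector(trace_id: int, name: str, duration_ms: float) -> None:
    # stand-in for the real out-of-process reporting path
    print(f"trace {trace_id:016x}: {name} took {duration_ms:.1f} ms")

ctx = start_trace()
record_span(ctx, "GET /checkout", 212.0)  # usually a no-op; ~1 in 1024 prints
```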
                                        | Logs | Metrics | Dist. Traces
TCO scales gracefully                   |  –   |    ✓    |      ✓
Accounts for all data (i.e., unsampled) |  ✓   |    ✓    |      –
Immune to cardinality                   |  ✓   |    –    |      ✓
Fatal Flaws: A Review
Data vs UI: Metrics
Data vs UI: Logs
Data vs UI: Traces
Metrics, Logs, and Traces are
Just Data,
… not a feature or use case.
Part II
A New Scorecard
for Observability
Mental Model: “Goals” and “Activities”
Goals: how our services perform in the eyes of their consumers
Activities: what we (as operators) actually do to further our goals
“SLI” = “Service Level Indicator”
TL;DR: An SLI is an indicator of health that a service’s consumers would care about.
… not an indicator of its inner workings.
Quick Vocab Refresher: SLIs
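A concrete, hypothetical example of an SLI in code; the endpoint, threshold, and window are invented for illustration:

```python
# Hypothetical example: an SLI for a checkout endpoint, defined as the fraction
# of requests in the window that succeeded and finished under 500 ms.
# Endpoint, threshold, and window length are invented for illustration.
def checkout_latency_sli(requests) -> float:
    """requests: iterable of (latency_ms, http_status) for, say, a 5-minute window."""
    requests = list(requests)
    if not requests:
        return 1.0
    good = sum(1 for latency_ms, status in requests
               if status < 500 and latency_ms < 500)
    return good / len(requests)

print(checkout_latency_sli([(120, 200), (480, 200), (800, 200), (95, 503)]))
# -> 0.5: two of four requests were "good" in the consumer's eyes.
# Note: GC pause time or queue depth are NOT SLIs; consumers only feel
# those indirectly, through latency and errors.
```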
Observability: Two Fundamental Goals
- Gradually improving an SLI (days, weeks, months…)
- Rapidly restoring an SLI (NOW!!!!)
Reminder: “SLI” = “Service Level Indicator”
1. Detection: measuring SLIs precisely
2. Refinement: reducing the search space for plausible explanations
Observability: Two Fundamental Activities
An interlude about stats frequency
Specificity:
- Cost of cardinality ($ per tag value)
- Stack support (mobile/web platforms, managed services, “black-box OSS infra” like Kafka/Cassandra)
Fidelity:
- Correct stats!!! (global p95, p99)
- High stats frequency (stats sampling frequency, in seconds)
Freshness (lag from real-time, in seconds)
Scorecard: Detection
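The fidelity and freshness rows are easy to motivate with a small example (made-up latencies): a short spike that is obvious at per-second resolution nearly vanishes in 60-second rollups.

```python
# Made-up latencies: a 5-second incident inside 5 minutes of per-second data.
import numpy as np

latency_ms = np.full(300, 100.0)   # 5 minutes of per-second measurements
latency_ms[120:125] = 2_000.0      # a 5-second spike to 2000 ms

per_second_peak = latency_ms.max()                        # 2000 ms, obvious
per_minute_avgs = latency_ms.reshape(5, 60).mean(axis=1)  # 60-second rollups
print(per_second_peak, per_minute_avgs.max())
# -> 2000.0 vs ~258 ms: at 60-second stats frequency the incident is smeared
#    into a modest bump, and you also wait up to a minute to see it at all.
```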
# of things your users actually care about
# of microservices
# of failure modes
Must reduce the search space!
Why “Refinement”?
The Refinement Process
Discover Variance
Explain Variance
Deploy Fix
Histograms vs “p99”
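This is the “correct stats” point from the Detection scorecard: per-host p95/p99 values cannot simply be averaged into a global percentile. An illustrative sketch with synthetic latencies:

```python
# Synthetic latencies, illustrative only. Two hosts serve the same endpoint
# with very different distributions; averaging their per-host p99s is not
# the global p99.
import numpy as np

rng = np.random.default_rng(0)
host_a = rng.exponential(scale=50.0,  size=100_000)   # fast host, p99 ~ 230 ms
host_b = rng.exponential(scale=500.0, size=100_000)   # slow host, p99 ~ 2300 ms

averaged_p99 = (np.percentile(host_a, 99) + np.percentile(host_b, 99)) / 2
global_p99   = np.percentile(np.concatenate([host_a, host_b]), 99)

print(f"average of per-host p99s: {averaged_p99:7.0f} ms")
print(f"true global p99:          {global_p99:7.0f} ms")
# Here the averaged figure materially understates the true tail (~1270 vs
# ~1960 ms). Only a mergeable representation (histograms, sketches, or the
# raw events) yields correct global percentiles.
```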
Scorecard: Refinement
Identifying Variance:
- Cardinality ($ per tag value)
- Robust stats (histograms; see previous slide)
- Retention horizons for plausible queries (time duration)
Explaining variance:
- Correct stats!!! (global p95, p99)
- “Suppress the messengers” of microservice failures
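A toy sketch of “explaining variance” with tags; the data, tag names, and release versions are hypothetical. Group latencies by each dimension and see which one separates slow requests from fast ones (real systems compare full distributions, not a single summary statistic):

```python
# Hypothetical data and tag names, for illustration only.
from collections import defaultdict
from statistics import median

# (latency_ms, tags) pairs; in practice these come from traces or events.
requests = [
    (42,  {"release": "v1.7.2", "region": "us-east"}),
    (39,  {"release": "v1.7.2", "region": "us-east"}),
    (45,  {"release": "v1.7.2", "region": "eu-west"}),
    (41,  {"release": "v1.7.2", "region": "eu-west"}),
    (910, {"release": "v1.7.3", "region": "us-east"}),
    (930, {"release": "v1.7.3", "region": "us-east"}),
    (870, {"release": "v1.7.3", "region": "eu-west"}),
    (905, {"release": "v1.7.3", "region": "eu-west"}),
]

def latency_by_tag(requests, tag):
    """Median latency per distinct value of the given tag."""
    groups = defaultdict(list)
    for latency_ms, tags in requests:
        groups[tags[tag]].append(latency_ms)
    return {value: median(lats) for value, lats in groups.items()}

print(latency_by_tag(requests, "release"))
# {'v1.7.2': 41.5, 'v1.7.3': 907.5}  -> the "release" tag explains the variance
print(latency_by_tag(requests, "region"))
# {'us-east': 476.0, 'eu-west': 457.5} -> regions look alike; not the culprit
```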
Wrapping up…
(first, a hint at my perspective)
A fun game!
Design your own observability system:
High-throughput
High-cardinality
Lengthy retention window
Unsampled
Choose three.
(“Observability Whack-a-Mole”)
The Life of Trace Data: Dapper
Stage                        | Overhead affects…          | Retained
Instrumentation executed     | App                        | 100.00%
Buffered within app process  | App                        |   0.10%
Flushed out of process       | App                        |   0.10%
Centralized regionally       | Regional network + storage |   0.10%
Centralized globally         | WAN + storage              |   0.01%
(Review)
The Life of Trace Data: Other Approaches
Stage                        | Overhead affects…          | Retained
Instrumentation executed     | App                        | 100.00%
Buffered within app process  | App                        | 100.00%
Flushed out of process       | App                        | 100.00%
Centralized regionally       | Regional network + storage | 100.00%
Centralized globally         | WAN + storage              | “fancy”
Refinement
- Identifying variance: cardinality cost, correct stats, hi-fi histograms, retention horizons
- “Suppress the messengers”
Detection
- Specificity: cardinality cost, stack coverage
- Fidelity: correct stats, high stats frequency
- Freshness: ≤ 5 seconds
An Observability Scorecard
● Automatic deployment and regression detection
● System and service diagrams
● Real-time and historical root cause analysis
● Correlations
● Custom alerting
● Easy Setup with no vendor lock-in
● No cardinality limitations, really
LightStep: Observability with context
Get Started Today
lightstep.com/play
