The Hurricane's Butterfly: Debugging pathologically performing systems
1.
The Hurricane’s Butterfly
Debugging pathologically performing systems
Bryan Cantrill
CTO
bryan@joyent.com
@bcantrill
2.
Debugging system failure
• Failures are easiest to debug when they are explicit and fatal
• A system that fails fatally stops: it ceases to make forward
progress, leaving behind a snapshot of its state — a core dump
• Unfortunately, not all problems are like this…
• A broad class of problems are non-fatal: the system continues
to operate despite having failed, often destroying evidence
• Worst of all are those non-fatal failures that are also implicit
3.
Implicit, non-fatal failure
• The most difficult, time-consuming bugs to debug are those in
which the system failure is unbeknownst to the system itself
• The system does the wrong thing or returns the wrong result or
has pathological side effects (e.g., resource leaks)
• Of these, the gnarliest class are those failures that are not
strictly speaking failure at all: the system is operating correctly,
but is failing to operate in a timely or efficient fashion
• That is, it just… sucks
4.
The stack of abstraction
• Our software systems are built as stacks of abstraction
• These stacks allow us to stand on the shoulders of history — to
reuse components without rebuilding them
• We can do this because of the software paradox: software is
both information and machine, exhibiting properties of both
• Our stacks are higher and run deeper than we can see or know:
software is silent and opaque; the nature of abstraction is to
seal us from what runs beneath!
• They run so deep as to challenge our definition of software…
5.
The Butterflies
• When the stack of abstraction performs pathologically, its power
transmogrifies to peril: layering amplifies performance
pathologies but hinders insight
• Work amplifies as we go down the stack
• Latency amplifies as we go up the stack
• Seemingly minor issues in one layer can cascade into systemic
pathological performance
• These are the butterflies that cause hurricanes
8.
Butterfly III: Kernel page-table isolation
Data courtesy Scaleway, running a PHP workload with KPTI patches for Linux. Thank you Edouard Bonlieu and team!
9.
The Hurricane
• With pathologically performing systems, we are faced with
Leventhal’s Conundrum: given a hurricane, find the butterflies!
• This is excruciatingly difficult:
• Symptoms are often far removed from root cause
• There may not be a single root cause but several
• The system is dynamic and may change without warning
• Improvements to the system are hard to model and verify
• Emphatically, this is not “tuning” — it is debugging
10.
Performance debugging
• When we think of it as debugging, we can stop pretending that
understanding (and rectifying) pathological system performance
is rote or mechanical — or easy
• We can resist the temptation to be guided by folklore: just
because someone heard about something causing a problem
once doesn’t mean it’s the problem now!
• We can resist the temptation to change the system before
understanding it: just as you wouldn’t (or shouldn’t!) debug by
just changing code, you shouldn’t debug a pathologically
performing system by randomly altering it!
11.
How do we debug?
• To debug methodically, we must resist the temptation to form quick
hypotheses, focusing instead on questions and observations
• Iterating between questions and observations gathers the facts
that will constrain future hypotheses
• These facts can be used to disconfirm hypotheses!
• How do we ask questions?
• How do we make observations?
12.
Asking questions
• For performance debugging, the initial question formulation is
particularly challenging: where does one start?
• Resource-centric methodologies like the USE Method
(Utilization/Saturation/Errors) can be excellent starting points…
• But keep these methodologies in their context: they provide
initial questions to ask — they are not recipes for debugging
arbitrary performance pathologies!
13.
Making observations
• Questions are answered through observation
• The observability of the system is paramount
• If the system cannot be observed, one is reduced to guessing,
making changes, and drawing inferences
• If it must be said, drawing inferences based only on change is
highly flawed: correlation does not imply causation!
• To be observable, systems must be instrumentable: they must
be able to be altered to emit a datum when a desired condition holds
14.
Observability through instrumentation
• Static instrumentation modifies source to provide semantically
relevant information, e.g., via logging or counters
• Dynamic instrumentation allows the system to be changed while
running to emit data, e.g., DTrace or OpenTracing (see the sketch below)
• Both mechanisms of instrumentation are essential!
• Static instrumentation provides the observations necessary for
early question formulation…
• Dynamic instrumentation answers deeper, ad hoc questions
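As a minimal sketch of dynamic instrumentation with DTrace (the write(2) probe and the 1 MB threshold are arbitrary choices for illustration), the running system can be made to emit a datum only when a condition of interest holds, with no rebuild or restart:

    /* emit a datum only when a write larger than 1 MB occurs */
    syscall::write:entry
    /arg2 > 1048576/
    {
        printf("%s (pid %d) issued a %d-byte write\n", execname, pid, arg2);
    }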
15.
Aside: Monitoring vs. observability
• Monitoring is an essential operational activity that can indicate a
pathologically performing system and provide initial questions
• But monitoring alone is often insufficient to completely debug a
pathologically performing system, because the questions that it
can answer are limited to that which is monitored
• As we increasingly deploy developed systems rather than
received ones, it is a welcome (and unsurprising!) development
to see the focus of monitoring expand to observability!
16.
Aggregation
• When instrumenting a system, the system itself can become
overwhelmed by the overhead of the instrumentation
• Aggregation is essential for scalable, non-invasive
instrumentation, and is a first-class primitive in (e.g.) DTrace
(see the sketch below)
• But aggregation also eliminates important dimensions of data,
especially with respect to time; some questions may only be
answered with disaggregated data!
• Use aggregation for performance debugging — but also
understand its limits!
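For example, a minimal DTrace sketch of aggregation (the choice of read(2) latency is purely illustrative): instead of emitting a line per event, the kernel keeps a power-of-two histogram of latency keyed by program name, printed when tracing stops:

    syscall::read:entry
    {
        self->ts = timestamp;    /* per-thread start time */
    }

    syscall::read:return
    /self->ts/
    {
        /* aggregate latency into a histogram keyed by program name */
        @latency[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
    }

Note that the histogram has already collapsed the time dimension: it cannot say whether the slow reads were spread evenly or clustered in a single burst.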
17.
Visualization
• The visual cortex is unparalleled at detecting patterns
• The value of visualizing data is not merely providing answers,
but also (and especially) provoking new questions
• Our systems are so large, complicated and abstract that there is
not one way to visualize them, but many
• The visualization of systems and their representations is an
essential skill for performance debugging!
18.
Visualization: Gnuplot
• Graphs are terrific — so much so that we should not restrict
ourselves to the captive graphs found in bundled software!
• An ad hoc plotting tool is essential for performance debugging,
and Gnuplot is an excellent (if idiosyncratic) one (see the sketch below)
• Gnuplot is easily combined with workhorses like awk or perl
• That Gnuplot is an essential tool helps to set expectations
around performance debugging tools: they are not magicians!
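As a small, hypothetical sketch of that workflow, suppose awk has already reduced raw output to a two-column file of seconds and operations per second; a few lines of Gnuplot will graph it:

    # plot a two-column file (seconds, ops/sec) produced by, e.g., awk
    set xlabel "seconds"
    set ylabel "operations per second"
    plot "iops.txt" using 1:2 with lines title "observed throughput"

The file name and column layout are assumptions for the example; the point is that ad hoc plotting should be cheap enough to do in the middle of an investigation.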
21.
Visualization: Statemaps
• Especially when trying to understand interplay between different
entities, it can be useful to visualize their state over time
• Time is the critical element here!
• We are experimenting with statemaps, whereby state transitions
are instrumented (e.g., with DTrace, as sketched below) and then visualized
• This is not necessarily a new way of visualizing the system
(e.g., early thread debuggers often showed thread state over
time), but with a new focus on post hoc visualization
• Primordial implementation: https://github.com/joyent/statemap
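As a hedged sketch of the kind of instrumentation involved (this is not the statemap tool's actual input format), DTrace's sched provider can emit timestamped per-thread state transitions that can later be rendered over time:

    /* record when each thread transitions on and off CPU */
    sched:::on-cpu
    {
        printf("%d %d %s on-cpu\n", timestamp, tid, execname);
    }

    sched:::off-cpu
    {
        printf("%d %d %s off-cpu\n", timestamp, tid, execname);
    }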
27.
The hurricane’s butterfly
• Finding the source(s) of pathologically performing systems must
be thought of as debugging — albeit the hardest kind
• Debugging isn’t about making guesses; it’s about asking
questions and answering them with observations
• We must enshrine observability to assure debuggability!
• Debugging rewards persistence, grit, and resilience more than
intuition or insight — it is more perspiration than inspiration!
• We must have the faith that our systems are — in the end —
purely synthetic; we can find the hurricane’s butterfly!