Put some SRE
in your microservices
Hard-won lessons from the world of SRE
can be applied to the nascent world of microservices.
The many faces of
Theo Schlossnagle
@postwait
CEO Circonus
The nature of the problem
Software Sucks
Once you’ve run software at scale,
you have a deep understanding of
how it is all tied together with
loose string and hope.
All software will fail, but
good software
fails well
• Consider the phrase:
“Have you used X in anger?”
Never undervalue grace in failure.
Rule 𝛌1: Crash landings should be both
fast and controlled.
What it means to
fail quickly & safely
• The scope of failure should
collapse completely.
• The time to failure should be
measured in small multiples of
normal service time
• Nothing outside the scope of
failure should be impacted.
https://www.youtube.com/watch?v=5SL1A2d2e7M
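One way to read "time to failure in small multiples of normal service time" is as a per-call deadline. The sketch below is only an illustration under assumptions: the 50 ms service time, the multiple of 3, and the names `call_with_deadline` and `FailedFast` are hypothetical and not from the talk.

```python
import concurrent.futures

NORMAL_SERVICE_TIME_S = 0.05  # assumed typical latency of the dependency
DEADLINE_MULTIPLE = 3         # the "small multiple" of normal service time


class FailedFast(Exception):
    """Raised when a call does not answer within its time budget."""


# A small shared pool; calls run here so the caller can stop waiting.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, *args):
    """Run fn(*args), but give up after a small multiple of normal service time."""
    deadline_s = NORMAL_SERVICE_TIME_S * DEADLINE_MULTIPLE
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a thread that is already running cannot be interrupted.
        future.cancel()
        raise FailedFast(f"no answer within {deadline_s:.3f}s") from None
```

The caller gets a fast, explicit failure it can handle, rather than hanging for as long as the dependency does. Note that the abandoned work still occupies a pool thread until it finishes, so the pool size bounds how far the failure can spread.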
Autopsies: not just for medicine.
Rule 𝛌2: Post-mortems are
fundamental.
Pragmatic analysis is required to
understand failure’s
true nature
• Post-mortem analysis is critical
• Stack traces
• Forensic logs
• Images (cores, dumps, etc.)
The difference between a shock and electrocution is real.
Rule 𝛌3: Use circuit breakers.
Circuit breakers are designed to
avoid
cascading failure
• it’s not all about,
especially with microservices
• protect yourselves and others
• circuit breakers of many types
• timing
• queue depth
• concurrency
http://melissaomarkham.com
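A minimal sketch of a breaker that combines two of the trip conditions listed above (consecutive failures, which can include timeouts, and a concurrency cap). The class name, thresholds, and the simplified half-open behaviour are assumptions for illustration, not the speaker's implementation.

```python
import threading
import time


class CircuitOpen(Exception):
    """Raised instead of calling a dependency that is considered unhealthy."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=5.0,
                 max_concurrent=10, clock=time.monotonic):
        self.max_failures = max_failures      # consecutive failures before tripping
        self.reset_after_s = reset_after_s    # cool-down before a trial call
        self.max_concurrent = max_concurrent  # cap on calls in flight
        self.clock = clock                    # injectable for testing
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at = None
        self._in_flight = 0

    def call(self, fn, *args):
        with self._lock:
            if self._opened_at is not None:
                if self.clock() - self._opened_at < self.reset_after_s:
                    raise CircuitOpen("dependency marked unhealthy")
                # Cool-down elapsed: allow a trial call; one more failure re-opens.
                self._opened_at = None
                self._failures = self.max_failures - 1
            if self._in_flight >= self.max_concurrent:
                raise CircuitOpen("too many calls in flight")
            self._in_flight += 1
        try:
            result = fn(*args)
        except Exception:
            with self._lock:
                self._in_flight -= 1
                self._failures += 1
                if self._failures >= self.max_failures:
                    self._opened_at = self.clock()
            raise
        with self._lock:
            self._in_flight -= 1
            self._failures = 0
        return result
```

While the breaker is open, callers fail immediately and the struggling dependency receives no traffic, which is what keeps one failure from cascading. A queue-depth breaker would follow the same shape, with the length of a pending-work queue in place of `_in_flight`.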
You cannot understand what you cannot measure.
Rule 𝛌4: Behavior is complex.
Understand it.
Don’t measure to assess availability;
measure to understand
Build robust models of behavior
Understand performance changes
Don’t use averages
Don’t use percentiles alone
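To illustrate why a single average or percentile can mislead, here is a small sketch with made-up latencies: 900 requests at 10 ms and 100 at 500 ms. The data and the power-of-two bucketing in `log2_histogram` are assumptions chosen for illustration; any histogram that preserves the shape of the distribution makes the same point.

```python
import math
from collections import Counter
from statistics import mean, median

# Hypothetical latencies in milliseconds: a fast mode and a slow mode.
latencies_ms = [10.0] * 900 + [500.0] * 100

avg = mean(latencies_ms)    # 59.0, a latency that no request actually had
mid = median(latencies_ms)  # 10.0, which hides the slow mode entirely


def log2_histogram(samples):
    """Count samples into power-of-two buckets, keyed by each bucket's lower bound."""
    return Counter(2 ** math.floor(math.log2(s)) for s in samples if s > 0)


hist = log2_histogram(latencies_ms)
# hist == {8: 900, 256: 100}: both modes are visible at once.
```

The histogram keeps the whole distribution, so a shift in either mode shows up as a change in bucket counts even when the average or one chosen percentile barely moves.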
It’s easy to demand perfection; it’s also stupid.
Rule 𝛌5: Have a failure budget.
Avoiding failure is simply impossible;
expect and manage
failure
• use failure budgets
• set expectations reasonably
• define and reward successes on
improvement and competency,
not just uptime.
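One common way to make a failure budget concrete is to derive it from a success-rate target. The sketch below assumes a request-based budget; the function name `failure_budget`, its fields, and the 99.9% figure in the usage note are illustrative assumptions, not numbers from the talk.

```python
def failure_budget(slo_target, total_requests, failed_requests):
    """Compare failures so far against what the target allows for this period.

    slo_target is the fraction of requests expected to succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed": allowed,           # failures the target tolerates
        "spent": failed_requests,     # failures observed so far
        "remaining": remaining,       # negative once the budget is blown
        "exhausted": remaining < 0,
    }
```

With a 99.9% target over 1,000,000 requests, the budget is roughly 1,000 failed requests; a team that has spent 250 of them still has room to ship risky changes, while one that has spent 1,500 does not.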
Justice should be blind; operations should not.
Rule 𝛌6: Instrumentation &
Observability have no equals.
For every “I wonder what X is right now?”
in production,
you must have answers
DTrace
eBPF
Instrument code for observability
https://www.pinterest.com/pin/441775044670412234/
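DTrace and eBPF observe a running system from the outside; instrumenting the code itself is the complementary approach. The sketch below shows one minimal in-process form of it, a decorator that keeps live counters per function. The `STATS` registry and the `instrumented` decorator are hypothetical names, and the counters are not thread-safe as written.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process registry, readable at any time from a debug endpoint.
STATS = defaultdict(
    lambda: {"calls": 0, "errors": 0, "in_flight": 0, "total_s": 0.0}
)


def instrumented(fn):
    """Record call count, error count, in-flight count, and total time for fn."""
    name = fn.__name__

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        stats = STATS[name]
        stats["calls"] += 1
        stats["in_flight"] += 1
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise
        finally:
            stats["in_flight"] -= 1
            stats["total_s"] += time.perf_counter() - start

    return wrapper
```

Decorating a handler with `@instrumented` means "how many calls are in flight right now, and how many have failed?" can be answered by reading `STATS` instead of guessing. A production version would likely record latencies into a histogram rather than a running total.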
Thank you.

Applying SRE techniques to micro service design
