reliability
appropriate
sustainably
@otterbook
Johanna Pung/CC-BY-SA-3.0
 Hire only coders
 Have an SLA for your service
 Measure and report performance against SLA
 Use Error Budgets and gate launches on them
 Common staffing pool for SRE and DEV
 Excess Ops work overflows to DEV team
 Cap SRE operational load at 50%
 Share 5% of ops work with DEV team
 Oncall teams at least 8 people, or 6x2
 Maximum of 2 events per oncall shift
 Post mortem for every event
 Post mortems are blameless and focus on process and technology, not people
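The practice "Use Error Budgets and gate launches on them" can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it assumes the budget is counted in failed requests over a measurement window, and the function names and sample numbers are hypothetical.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Budget left after subtracting the bad events already observed."""
    return error_budget(slo_target, total_events) - bad_events


def launch_allowed(slo_target: float, total_events: int, bad_events: int) -> bool:
    """Gate a launch: allow it only while some error budget remains."""
    return budget_remaining(slo_target, total_events, bad_events) > 0


# Hypothetical window: 1,000,000 requests against a 99.9% SLO,
# which tolerates roughly 1,000 failed requests.
healthy = launch_allowed(0.999, 1_000_000, bad_events=400)     # budget left
overspent = launch_allowed(0.999, 1_000_000, bad_events=1200)  # budget gone
```

Counting the budget in events keeps the gate a simple comparison; a time-based budget (minutes of downtime per window) would follow the same shape.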
SLO monitor decide
RELIABILITY
FRESHNESS
CORRECTNESS
THROUGHPUT
AVAILABILITY
DURABILITY
QUALITY
COVERAGE
LATENCY
 … as measured at the load balancer
 … as measured at the client
 … as reported in the server log
 … as determined by the app
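Where an indicator is measured changes the number you get. The sketch below is illustrative only: it assumes an availability SLI defined as good requests divided by total requests (the slides do not state a formula), and the `Observation` record and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One request as seen from a single vantage point (hypothetical record)."""
    request_id: int
    succeeded: bool


def availability_sli(observations):
    """Fraction of observed requests that succeeded."""
    if not observations:
        raise ValueError("no observations")
    good = sum(1 for o in observations if o.succeeded)
    return good / len(observations)


# Hypothetical sample: clients send 10 requests; requests 3, 7 and 9 fail
# from the client's point of view.
client_view = [Observation(i, i not in (3, 7, 9)) for i in range(10)]

# Request 9 never reached the load balancer (say, a network failure),
# so the load balancer saw only 9 requests, 2 of which failed.
lb_view = [Observation(i, i not in (3, 7)) for i in range(9)]

client_sli = availability_sli(client_view)  # 7 good out of 10
lb_sli = availability_sli(lb_view)          # 7 good out of 9
```

The load balancer reports a higher availability than the clients experienced because it cannot count requests it never received, which is why an SLI specification should say where it is measured.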
Objectives



[Figure: three charts, each scaled from 0% to 100%, marked SLO: 70%, SLO: 90%, and SLO: 60%]
+plan/policy
exceed your error budget?
exhaust your error budget?
Make a plan.
Follow the plan.
For more info about SRE:
http://aka.ms/intro-sre-tlv18
David N. Blank-Edelman
Senior Cloud Ops Advocate
@otterbook
dnb@microsoft.com
/in/dnblankedelman

Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsDays Tel Aviv 2018

Editor's Notes

  • #4 Three crucial words: reliability, appropriate, and sustainably. Reliability is the central concern.
  • #5 First, why reliability (switch to next slide)
  • #6 This slide intentionally blank with a white background. This is what a PHP app looks like when it fails.
  • #7 This is what a Java app can look like when it fails.
  • #8 This slide intentionally blank with a white background. Back to the PHP app. Showing this because you can put great effort and resources into building the greatest app, but if it isn’t up, it is of no value. This is why reliability is a primary property to strive for.
  • #9 Now the word appropriate. Except in rare cases, 100% reliability is not the right goal. This differs from our previous operational thinking. It is not the right goal because it 1) is often unreachable (if your dependencies or the intermediate path are not 100% reliable, you can’t be either), 2) is really expensive, 3) leaves you with no slack (error budgets), and 4) makes everything reactive (since it can only go downhill from there). If you are at 100% reliability, how can you have the slack to change things?
  • #10 And finally: sustainably. SRE, like DevOps, recognizes that it is crucial that an operations practice takes the people into account (for example, with on-call rotations and the type of work they are expected to do). You can’t make reliable systems out of burned-out people. Sustainability is about the type of work as well. Spending all the time on tickets, frequent on-call pages, and other interrupt-driven work pulls time and attention away from projects that increase reliability.
  • #11 Image licensed from The Noun Project (https://thenounproject.com/search/?q=tug%20of%20war&i=1870929), receipt available upon request
  • #12 Source: https://commons.wikimedia.org/wiki/File:Evolution-des-wissens.jpg
  • #16 All graphics on this page licensed from The Noun Project (https://thenounproject.com), receipts available upon request
  • #39 Note: this graphic (and all of the other “comic”-like graphics in this presentation) has been licensed from studiostoks (http://www.studioks.ru) via creativemarket (https://creativemarket.com/studiostoks) under their standard license. Receipts available upon request.
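Note #9 argues that a service cannot be 100% reliable when its dependencies are not. A minimal sketch of that arithmetic follows. It assumes every request needs all dependencies and that their failures are independent, so availabilities multiply; the function name and the 99.9% figures are hypothetical.

```python
from math import prod


def serial_availability(dependency_availabilities):
    """Best-case availability of a service that needs every dependency.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(dependency_availabilities)


# Three hypothetical dependencies, each available 99.9% of the time.
composite = serial_availability([0.999, 0.999, 0.999])
print(f"{composite:.4%}")  # prints 99.7003%
```

Even before the service's own failures are counted, three 99.9% dependencies cap it at about 99.7%, which is why a 100% target is usually out of reach.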