About SRE –
and how (not) to apply it
Björn “Beorn” Rabenstein, Production Engineer, SoundCloud Ltd.
Berlin | November 20–21, 2018
BEFORE IT WAS COOL
FAILING AT SRE
Also episode #260 of the Changelog:
https://changelog.com/podcast/260
Scale   | ~1B users           | ~100M users
Scale   | ~10k engineers      | ~100 engineers
Scale   | engineers > services | engineers < services
Culture | it’s complicated    | it’s complicated
SRE is what happens
when you ask a software
engineer to design an
operations team.
Ben Treynor, Google Inc.
Symptom-based alerting
Operational underload
Everything you always wanted to know about SRE
but were afraid to ask
DevOps vs. SRE
Dickerson's hierarchy of service reliability
Error budgets
On-call
ProdEng at SoundCloud
Less than 50% ops work
Postmortems
Less than 50% ops work.
“Solve production problems with software. It’s all just software.”
[Figure: operational load (e.g. pages) versus traffic * complexity]
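As a rough illustration of the "less than 50% ops work" rule, the following sketch computes the share of a team's logged time that went to ops work and checks it against the cap. The `WeekLog` record, the hour figures, and the function names are all hypothetical and not from the talk; how a team actually classifies and tracks its hours is an assumption here.

```python
from dataclasses import dataclass

OPS_WORK_CAP = 0.5  # the "less than 50% ops work" rule from the slide


@dataclass
class WeekLog:
    """Hypothetical record of how one team spent a week, in hours."""
    ops_hours: float          # pages, tickets, manual interventions
    engineering_hours: float  # project work, e.g. automating ops tasks away


def ops_fraction(logs):
    """Share of all logged hours that went to ops work."""
    ops = sum(week.ops_hours for week in logs)
    total = ops + sum(week.engineering_hours for week in logs)
    if total == 0:
        raise ValueError("no hours logged")
    return ops / total


def over_cap(logs, cap=OPS_WORK_CAP):
    """True when ops work has reached or exceeded the cap."""
    return ops_fraction(logs) >= cap


# Made-up sample data for three weeks.
sample = [WeekLog(18, 22), WeekLog(25, 15), WeekLog(12, 28)]
print(f"ops share: {ops_fraction(sample):.1%}, over cap: {over_cap(sample)}")
```

A single heavy week (the second one in the sample) does not trip the check; only the aggregate over the chosen window does. Which window to use is a policy decision this sketch leaves open.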
Error budgets
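A minimal sketch of the arithmetic behind an error budget, assuming a request-based availability SLO: the budget is whatever the SLO leaves over, that is, (1 - SLO target) times the number of requests in the window. The function names and the numbers are illustrative assumptions, not taken from the slides.

```python
def error_budget(slo_target, total_requests):
    """Number of requests that may fail in the window without breaking the SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_requests)
    return 1 - failed_requests / budget


# Illustrative numbers: a 99.9% SLO over one million requests.
budget = error_budget(0.999, 1_000_000)
left = budget_remaining(0.999, 1_000_000, failed_requests=400)
print(f"budget: about {budget:.0f} failed requests, remaining: {left:.0%}")
```

While the remaining fraction is positive, there is room to take risks such as shipping changes; once it goes negative, the usual reading is that reliability work takes priority.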
Dickerson’s hierarchy of service reliability
Site Reliability Engineering – How Google Runs Production Systems, B. Beyer et al. (ed.), O’Reilly 2016, p104
DevOps vs. SRE
“True” DevOps is if there are no separate dev and
ops teams anymore, and not even designated dev
or ops roles within a team.
Björn Rabenstein & Matthias Rampke, SoundCloud Ltd.
DevOps is a set of practices intended to reduce the
time between committing a change to a system and
the change being placed into normal production,
while ensuring high quality.
Len Bass, Ingo Weber, Liming Zhu: DevOps: A Software Architect's Perspective.
DevOps is if you have CI/CD and run containers.
J. Random Manager: During the last strategy meeting.
CALMS (John Willis, Damon Edwards, Jez Humble)
Culture
Automation
Lean (management or continuous improvement)
Metrics
Sharing
[The first SRE book] explicitly
references Culture, Automation,
Metrics, and Sharing alongside
anecdotes about Google’s journey to
continuously improve.
Andrew Clay Shafer: The Site Reliability Workbook.
I cringe when I hear someone say “SRE versus
DevOps.”
Andrew Clay Shafer: Foreword II.
Ultimately, I know DevOps when I see it and I
see SRE at Google, in theory and practice, as one
of the most advanced implementations.
Andrew Clay Shafer: Foreword II.
The principles from the first SRE book align so
well with what I always imagined DevOps to be,
and the practices are insightful, even when they
aren’t 100% applicable outside of Google.
Andrew Clay Shafer: Foreword II.
On-call
● Minimal size of an on-call rotation:
○ 6 if following the sun.
○ 8 otherwise.
● Minimal size of a dedicated SRE team: 8.
● Feasible percentage of all engineers in SRE: 5%? 10%?
● Number of SRE teams SoundCloud could afford: 1.
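The staffing arithmetic in the list above can be sketched as follows. The minimum team size of 8 and the 5% to 10% range come from the slide, and the figure of roughly 100 engineers comes from the scale comparison earlier in the deck. The function name and the rounding (whole engineers, whole teams) are assumptions made for illustration.

```python
MIN_ROTATION_FOLLOW_THE_SUN = 6  # from the slide
MIN_ROTATION_SINGLE_SITE = 8     # from the slide ("8 otherwise")
MIN_SRE_TEAM_SIZE = 8            # from the slide


def affordable_sre_teams(total_engineers, sre_percent):
    """Whole dedicated SRE teams that fit into the given share of engineers."""
    sre_headcount = total_engineers * sre_percent // 100  # whole engineers only
    return sre_headcount // MIN_SRE_TEAM_SIZE


# Roughly 100 engineers, at the two percentages the slide asks about.
for percent in (5, 10):
    teams = affordable_sre_teams(100, percent)
    print(f"{percent}% of 100 engineers -> {teams} dedicated SRE team(s)")
```

Under these assumptions the 10% end of the range yields the single team the slide mentions, while 5% does not even fill one team of 8.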
Take your own pager!
It’s still “SRE in spirit”…
In Site Reliability Engineering, we did not make it sufficiently
clear that product development teams in Google own their
service by default. SRE is neither available nor warranted for
the bulk of services, although SRE principles still inform how
services are managed throughout Google.
Chapter 1: How SRE relates to DevOps
Production Engineering (ProdEng)
I have been waiting for this book ever since I left Google’s
enchanted castle. It is the gospel I am preaching to my peers
at work.
Beorn’s praise for the 1st SRE book
Finally, this volume and its predecessor are not intended to
be gospel. Please don’t treat them that way. Even after all
these years, we’re still finding conditions and cases that
cause us to tweak (or in some cases, replace) previously
firmly held beliefs.
Preface of the Site Reliability Workbook
https://github.com/beorn7/talks

Björn Rabenstein - About SRE – and how (not) to apply it - Codemotion Berlin 2018
