Björn Rabenstein - About SRE – and how (not) to apply it - Codemotion Berlin 2018

About SRE –
and how (not) to apply it
Björn “Beorn” Rabenstein, Production Engineer, SoundCloud Ltd.
Berlin | November 20–21, 2018

BEFORE IT WAS COOL
FAILING AT SRE

Also episode #260 of the Changelog:
https://changelog.com/podcast/260

Scale: ~1B users vs. ~100M users
Scale: ~10k engineers vs. ~100 engineers
Scale: engineers > services vs. engineers < services
Culture: it’s complicated vs. it’s complicated

SRE is what happens
when you ask a software
engineer to design an
operations team.
Ben Treynor, Google Inc.

Everything you always wanted to know about SRE
but were afraid to ask
● Symptom-based alerting
● Operational underload
● DevOps vs. SRE
● Dickerson's hierarchy of service reliability
● Error budgets
● On-call
● ProdEng at SoundCloud
● Less than 50% ops work
● Postmortems

“Solve production problems with software. It’s all just software.”
[Plot; axis labels: traffic * complexity, operational load (e.g. pages)]

Dickerson’s hierarchy of service reliability
Site Reliability Engineering – How Google Runs Production Systems, B. Beyer et al. (eds.), O’Reilly 2016, p. 104

“True” DevOps is if there are no separate dev and
ops teams anymore, and not even designated dev
or ops roles within a team.
Björn Rabenstein & Matthias Rampke, SoundCloud Ltd.

DevOps is a set of practices intended to reduce the
time between committing a change to a system and
the change being placed into normal production,
while ensuring high quality.
Len Bass, Ingo Weber, Liming Zhu: DevOps: A Software Architect's Perspective.

DevOps is if you have CI/CD and run containers.
J. Random Manager: During the last strategy meeting.

CALMS (John Willis, Damon Edwards, Jez Humble)
Culture
Automation
Lean (management or continuous improvement)
Metrics
Sharing

[The first SRE book] explicitly
references Culture, Automation,
Metrics, and Sharing alongside
anecdotes about Google’s journey to
continuously improve.
Andrew Clay Shafer: The Site Reliability Workbook.

I cringe when I hear someone say “SRE versus
DevOps.”
Andrew Clay Shafer: Foreword II.

Ultimately, I know DevOps when I see it and I
see SRE at Google, in theory and practice, as one
of the most advanced implementations.

The principles from the first SRE book align so
well with what I always imagined DevOps to be,
and the practices are insightful, even when they
aren’t 100% applicable outside of Google.

● Minimal size of an on-call rotation:
○ 6 if following the sun.
○ 8 otherwise.
● Minimal size of a dedicated SRE team: 8.
● Feasible percentage of all engineers in SRE: 5%? 10%?
● Number of SRE teams SoundCloud could afford: 1.
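
The arithmetic behind the last bullet can be sketched as follows. This is only an illustration under stated assumptions: the function name `affordable_sre_teams` is hypothetical, the headcounts (~100 and ~10k engineers) are the rough figures from the scale comparison earlier in the deck, and the 5% and 10% shares are the two guesses from the list above.

```python
MIN_SRE_TEAM_SIZE = 8  # from the list above: minimal size of a dedicated SRE team


def affordable_sre_teams(total_engineers: int, sre_percent: int,
                         min_team_size: int = MIN_SRE_TEAM_SIZE) -> int:
    """Return how many full-size SRE teams fit into the SRE headcount.

    Hypothetical helper for illustration; integer percentages keep the
    arithmetic exact.
    """
    sre_headcount = total_engineers * sre_percent // 100
    return sre_headcount // min_team_size


if __name__ == "__main__":
    # ~100 engineers, the rough SoundCloud-scale figure from the deck.
    print(affordable_sre_teams(100, 5))   # 5 SREs  -> 0 full teams
    print(affordable_sre_teams(100, 10))  # 10 SREs -> 1 full team
    # ~10k engineers, for comparison.
    print(affordable_sre_teams(10_000, 5))
    print(affordable_sre_teams(10_000, 10))
```

With roughly 100 engineers, even the generous 10% share yields a single team of 8, which is consistent with the "1" on the slide; at 5% not even one full team fits.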

Take your own pager!
It’s still “SRE in spirit”…

In Site Reliability Engineering, we did not make it sufficiently
clear that product development teams in Google own their
service by default. SRE is neither available nor warranted for
the bulk of services, although SRE principles still inform how
services are managed throughout Google.
Chapter 1: How SRE relates to DevOps

Production Engineering (ProdEng)

I have been waiting for this book ever since I left Google’s
enchanted castle. It is the gospel I am preaching to my peers
at work.
Beorn’s praise for the 1st SRE book

Finally, this volume and its predecessor are not intended to
be gospel. Please don’t treat them that way. Even after all
these years, we’re still finding conditions and cases that
cause us to tweak (or in some cases, replace) previously
firmly held beliefs.
Preface of the Site Reliability Workbook

https://github.com/beorn7/talks
