Björn Rabenstein, Production Engineer at SoundCloud
SRE, DevOps, Google, and you
Site Reliability Engineering (SRE) was originally conceived internally at Google. By now, it has become public knowledge via various channels like conferences or books. But how can you apply SRE principles in your organization, given that you are not Google and cannot just blindly do everything exactly as Google does? And how does SRE relate to DevOps, which you might or might not have indulged in already? The speaker has seen both sides, with many years working as an SRE at Google and later as a Production Engineer at SoundCloud, a much smaller startup running many services using a highly innovative tech stack and a radical DevOps approach. Let’s dive into questions of culture and scale and come up with some helpful pointers on how you can learn from the giant without losing your own way.
Björn Rabenstein is a Production Engineer at SoundCloud and a Prometheus developer. Previously, Björn was a Site Reliability
Engineer at Google and a number cruncher for science.
15. “True” DevOps is if there are no separate dev and
ops teams anymore, and not even designated dev
or ops roles within a team.
Björn Rabenstein & Matthias Rampke, SoundCloud Ltd.
16. DevOps is a set of practices intended to reduce the
time between committing a change to a system and
the change being placed into normal production,
while ensuring high quality.
Len Bass, Ingo Weber, Liming Zhu: DevOps: A Software Architect's Perspective.
DevOps is if you have CI/CD and run containers.
J. Random Manager: During the last strategy meeting.
19. CALMS (John Willis, Damon Edwards, Jez Humble)
Culture
Automation
Lean (management or continuous improvement)
Metrics
Sharing
[The first SRE book] explicitly
references Culture, Automation,
Metrics, and Sharing alongside
anecdotes about Google’s journey to
continuously improve.
Andrew Clay Shafer: The Site Reliability Workbook.
20. I cringe when I hear someone say “SRE versus
DevOps.”
Andrew Clay Shafer: Foreword II.
21. Ultimately, I know DevOps when I see it and I
see SRE at Google, in theory and practice, as one
of the most advanced implementations.
Andrew Clay Shafer: Foreword II.
22. The principles from the first SRE book align so
well with what I always imagined DevOps to be,
and the practices are insightful, even when they
aren’t 100% applicable outside of Google.
Andrew Clay Shafer: Foreword II.
25. ● Minimal size of an on-call rotation:
○ 6 if following the sun.
○ 8 otherwise.
● Minimal size of a dedicated SRE team: 8.
● Feasible percentage of all engineers in SRE: 5%? 10%?
● Number of SRE teams SoundCloud could afford: 1.
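The sizing argument on this slide can be sketched as back-of-the-envelope arithmetic. The minimum sizes come from the slide; the headcount of 100 engineers in the usage note and the function name `affordable_sre_teams` are hypothetical and chosen only for illustration, not actual SoundCloud figures.

```python
# Minimum sizes quoted on the slide.
MIN_ROTATION_FOLLOW_THE_SUN = 6
MIN_ROTATION_SINGLE_SITE = 8
MIN_SRE_TEAM_SIZE = 8


def affordable_sre_teams(total_engineers: int, sre_percent: int,
                         min_team_size: int = MIN_SRE_TEAM_SIZE) -> int:
    """Return how many full-size dedicated SRE teams the SRE headcount allows.

    total_engineers and sre_percent are inputs you supply yourself;
    the slide only suggests 5% to 10% as a feasible share.
    """
    if total_engineers < 0:
        raise ValueError("total_engineers must not be negative")
    if not 0 <= sre_percent <= 100:
        raise ValueError("sre_percent must be between 0 and 100")
    if min_team_size <= 0:
        raise ValueError("min_team_size must be positive")
    # Integer arithmetic: partial engineers and partial teams do not count.
    sre_headcount = total_engineers * sre_percent // 100
    return sre_headcount // min_team_size


if __name__ == "__main__":
    # Hypothetical organization of 100 engineers (an assumption, not from the talk).
    for percent in (5, 10):
        print(percent, affordable_sre_teams(100, percent))
```

Under that assumed headcount, a 5% share does not fill even one team of 8, and a 10% share fills exactly one, which is the kind of constraint the slide describes for a smaller organization.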
28. SRE is what happens
when you ask a software
engineer to design an
operations team.
Ben Treynor, Google Inc.
“True” DevOps is if there are
no separate dev and ops
teams anymore, and not
even designated dev or ops
roles within a team.
Björn Rabenstein & Matthias Rampke,
SoundCloud Ltd.
29. Take your own pager! It’s still “SRE in spirit”:
● Automate operations as far as possible.
● On-call rotations of the right size.
● Strictly (much) less than 50% ops work.
● Metrics everywhere.
● Effective self-regulation of features vs. stability.
● …
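The "strictly less than 50% ops work" rule above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a team tracks ops hours and total hours per period; the function names and the hour-based bookkeeping are hypothetical, not something the talk prescribes.

```python
def ops_share(ops_hours: float, total_hours: float) -> float:
    """Return the fraction of working time spent on operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= ops_hours <= total_hours:
        raise ValueError("ops_hours must be between 0 and total_hours")
    return ops_hours / total_hours


def within_ops_cap(ops_hours: float, total_hours: float,
                   cap: float = 0.5) -> bool:
    """True only if the ops share is strictly below the cap (50% by default)."""
    return ops_share(ops_hours, total_hours) < cap


if __name__ == "__main__":
    # Hypothetical week: 15 of 40 hours spent on ops work.
    print(within_ops_cap(15, 40))
```

The comparison is strict on purpose: a team sitting at exactly 50% already fails the check, in line with the slide's "strictly (much) less".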
30. In Site Reliability Engineering, we did not make it sufficiently
clear that product development teams in Google own their
service by default. SRE is neither available nor warranted for
the bulk of services, although SRE principles still inform how
services are managed throughout Google.
Chapter 1: How SRE relates to DevOps
31. Production Engineering (ProdEng)
● In charge of Kubernetes & Prometheus.
● Leading the postmortem review.
● Identify and tackle cross-domain failures, champion holistic
system stability, foster exchange of knowledge and good
practices throughout the organization.
● Opt-in consulting: systems design, production reviews (but
not “readiness” or “launch” review – no formal power to block
anything).
32. I have been waiting for this book ever since I left Google’s
enchanted castle. It is the gospel I am preaching to my peers
at work.
Beorn’s praise for the 1st SRE book
Finally, this volume and its predecessor are not intended to
be gospel. Please don’t treat them that way. Even after all
these years, we’re still finding conditions and cases that
cause us to tweak (or in some cases, replace) previously
firmly held beliefs. SRE is a journey as much as it is a
discipline.
Preface of the Site Reliability Workbook