The Two Sides of Google Infrastructure for Everyone Else
Gareth Rushgrove, Puppet
@garethr
Introduction
A strange format for a talk
This is a debate
I’ll be debating both sides
Taking opposing viewpoints on
the same issue, as a way of
exploring it in-depth
The talk is split into two parts:
a For part and an Against part
I’d like to explore:
- Technical practice evolution
- How we adopt software
- The organisational context
This house believes…
Successful companies will look
like Google in the future, so we
should adopt Google-like
software and practices today
Important disclaimer
I’ve never worked for Google
For
You’re probably:
1 Struggling with distributed systems
2 Missing out on machine learning
3 Wondering how to scale operations
Google have a 10+ year head start
Google publish research that
influences our industry
MapReduce
Chubby
Borg
Google releases (and inspires)
software we use
Go
GFS = HDFS
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Omega = Mesos
Blaze = Pants or Buck
FlumeJava = Crunch
Logsaver = Scribe or Flume
Millwheel = Storm or Samza?
Borgmon/Monarch = Graphite
Dapper = Zipkin
2014 from @avibryant, @joshwills, @skamille, @marius, @wickman
We have a term for this: #GIFEE
Google Infrastructure for
Everyone Else
Distributed systems are hard
Building your own in-house
framework is likely a waste of time
From Adrian Colyer, Accel, https://speakerdeck.com/acolyer/making-sense-of-it-all
Kubernetes is the 3rd generation
of Google's cluster management
software
The Kubernetes API provides
primitives that make doing the
right thing easier
- Orchestration
- Logging
- Configuration
- Self-healing
- Storage
- Load balancing
- Service discovery
- Scaling
- Batch workloads
- Lots more
Exposed via a modern API
Machine learning is going
to be massive
“Soon We Won’t Program
Computers. We’ll Train
Them Like Dogs”
TensorFlow is an open source
software library for numerical
computation
…
- Nearest neighbour
- Linear regression
- Recurrent neural networks
- Multilayer perceptron
- Lots more
Introductory ML docs
“How do I do devops?”
Everyone ever
Google explain how they work too
“SRE: Have software engineers
do operations”
Dan Luu, ex Google
http://danluu.com/google-sre-book/
Dev SRE Ops
From http://web.devopstopologies.com/ by Matthew Skelton
The familiar:
- Capacity planning
- Performance
- Change management
- Monitoring
The unfamiliar:
- Error budget
- Strong software engineering skills
- 50% operations work cap
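The two numeric items in this list, the error budget and the 50% cap on operations work, come down to simple arithmetic. The sketch below is a minimal illustration, assuming the common framing that an error budget is the gap between 100% and an availability target. The 99.9% target, the 30-day window and all function names are assumptions made for the example, not figures from the talk or from Google.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability target.

    `slo` is a fraction such as 0.999 (an assumed example target).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after observed downtime; negative means overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def over_ops_cap(ops_hours: float, total_hours: float,
                 cap: float = 0.5) -> bool:
    """True if operations work exceeds the cap (50% by default)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours > cap


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 50 minutes of downtime against that budget leaves it overspent.
    print(budget_remaining(0.999, 50.0) < 0)
    # 25 hours of operations work in a 40-hour week is over a 50% cap.
    print(over_ops_cap(25, 40))
```

A team with budget remaining can spend it on risky releases; a team that has overspent slows down until the window rolls over.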
A growing ecosystem
Friendly vendors
More friendly vendors
Even more nice vendors
Summing up
For
“infrastructure” is shifting to a
higher level of abstraction
It’s fine to just be a consumer
You should be standing on the
shoulders of giants
You should be standing on the
shoulders of Google
Against
Your organisation doesn’t
look like Google
YOUR
ORGANISATION
DOESN’T LOOK
LIKE GOOGLE
Could your organisation
look like Google?
How many employees do you
have? Google have about 60,000
What proportion of your
organisation are software
engineers or operations?
50 percent?
Based on the Google annual report December 2014
How much do you pay
software engineers?
Data from Glassdoor, June 2016, based on 14k salaries
The $3million engineer?
Build your own chips?
Could your organisation
really look like Google?
“So much of the information in
the SRE book makes PERFECT
sense if you’re Google”
John Vincent, Ops Hero
The reality outside Google
<1% of US workers are software
engineers or programmers
US Bureau of Labor Statistics 2002. 1,069,000 jobs in working age population of 185 million
Strategic vendor relationships
Different application
constraints as well as different
organisational constraints
“Goal of SRE team isn’t zero
outages – SRE and product devs
are incentive aligned to spend the
error budget to get maximum
feature velocity”
Dan Luu, ex Google
http://danluu.com/google-sre-book/
What if you’re operating an air
traffic control system or a nuclear
power station? Your goal is
probably closer to zero outages
John Vincent SRE review
“bringing a software engineering
perspective to a problem isn’t
always the best or right solution”
John Vincent, Ops Hero
Many of Google’s conclusions to
operations problems are not unique
“Innovation happens elsewhere”
applies as much to Google as to
other organisations
Summing up
Against
“If a human operator needs to touch
your system during normal
operations, you have a bug. The
definition of normal changes as
your systems grow”
Carla Geisser, Google SRE
What is normal for Google
may not be suitable for
your organisation
“Your startup with a single-purpose
application does not have the
luxury of having your operations
team say I’m sorry you’re over
your error budget”
John Vincent, Ops Hero
Conclusions
If all you take away is…
Who votes…
For
Who votes…
Against
Who thinks it’s the wrong question?
Context is king
“The Overwhelming power
of context”
Charity Majors, Ops Person Extraordinaire
The technology we run, and how
we run it, are interlinked
The field of Sociotechnical
Systems suggests that all human
systems include both a technical
system and a social system
https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution
Better outcomes are usually
obtained by a reciprocal process
of joint optimization, through
which both the technical system
and the social system change
https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution
“Containers will not fix your
broken culture”
Bridget Kromhout, World's nicest Ops Person
“Awesome culture will not fix your
broken containers”
Me, paraphrasing Bridget
We are all collectively evolving the
practice of operations
Keep sharing, because it’s a
pretty amazing ride
Questions
And thanks for listening
