Humans By The Hundred
DevOps Days Ohio 2015
$ whoami
SRE Manager at Yelp
CWRU Alum
Pittsburgh native
<3 web operations
Just a dude
Yelp’s Mission:
Connecting people with great
local businesses.
Yelp Stats:
As of Q3 2015
89M   32   71%   90M
Growth
Growth means embracing change.
Growth means embracing DevOps.
DevOps: someone having your back.
Dogma
In the Beginning...
Deployment: the early days
Get a few people together in Slack/IRC/etc.
Merge up the code
Run the tests
Poke at it in stage
Cross your fingers
Things get slower...
Tests take longer to run
More hosts = longer downloads, bounces
More developers = more eyeballs
More features = more code
The Problem:
The Problem: Humans Are Fallible
“…oh @$#&”
The Problem, With Math
Assume:
Every change has a chance of success: 98%
That means no test failures, no reverts, etc.
Every deploy has a number of changes: n
Any failure in the pipeline invalidates the deploy
Let’s figure out the probability of a
successful deployment: p
The Problem, With Math
Only you
p = 98%
You and a friend
p = .98 * .98 = 96%
You and nine co-workers
p = .98 * .98 * .98 * … * .98 = 82%
The Problem, With Math
p = (.98)^n
The Problem, With Math
p = (.98)^n
exponential decay!
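A quick way to check these numbers is to compute p directly. This is a minimal sketch, assuming (as the slides do) that every change succeeds independently with the same 98% probability; the function name is just illustrative.

```python
def deploy_success_probability(n, per_change_success=0.98):
    """Probability that all n changes in one deploy succeed.

    Assumes each change succeeds independently with the same probability,
    and that any single failure invalidates the whole deploy.
    """
    return per_change_success ** n


# Deploy sizes taken from the slides and speaker notes: 1, 2, 10, and 20 changes.
for n in (1, 2, 10, 20):
    print(f"{n:>2} changes: {deploy_success_probability(n):.0%}")
```

With ten changes the deploy succeeds about 82% of the time, and with twenty it drops below 70%.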
This doesn’t scale!
More developers = more changes
More changes = longer deploys
Longer deploys = less time to develop
Less time to develop = slower to iterate
Mitigating Exponential Decay
p = (.98)^n
Mitigating Exponential Decay
p = (.98)^n
Making it harder to screw up
Write more tests
Write better tests
Get better code reviews
Get better infrastructure
Switch programming languages
Use better tools
Just write better software and stop
making mistakes!
PROBLEM SOLVED
The Real World
Testing builds confidence in our changes
Testing does not protect you from failure
Better tools, tests, and infrastructure can
raise our success rates
Mitigating Exponential Decay
p = (.98)^n
Mitigating Exponential Decay
p = (.98)^n
Service-Oriented Architecture
Large monolith → smaller services
Services communicate over network
Usually HTTP, but you can do RPC, SOAP, etc.
Service = independent code base
Independent deployments
Service-Oriented Architecture
Benefits
Smaller code bases = upper bound to n
Failure domains become isolated
Technology independence
Federated responsibility
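To see why an upper bound on n matters, compare one deploy of 20 changes against four independent deploys of 5 changes each. This is a rough sketch under the same assumptions as the earlier math (independent changes, 98% success each); the split into four services of five changes is a made-up example, and it ignores the possibility that moving to SOA lowers the per-change success rate.

```python
def deploy_success_probability(n, per_change_success=0.98):
    """Probability that a deploy containing n independent changes succeeds."""
    return per_change_success ** n


def expected_changes_shipped(changes_per_deploy, deploys, per_change_success=0.98):
    """Expected number of changes that ship on the first attempt.

    Each deploy either ships all of its changes or none of them, and
    deploys are assumed to be independent of each other.
    """
    p = deploy_success_probability(changes_per_deploy, per_change_success)
    return deploys * changes_per_deploy * p


# Hypothetical split: the same 20 changes, as one monolith deploy
# or as four service deploys of five changes each.
monolith = expected_changes_shipped(changes_per_deploy=20, deploys=1)
services = expected_changes_shipped(changes_per_deploy=5, deploys=4)

print(f"monolith: {monolith:.1f} of 20 changes ship on the first try")
print(f"services: {services:.1f} of 20 changes ship on the first try")
```

A failed service deploy only sends that service's five changes back to square one, so more of the total work gets out the door.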
Service-Oriented Architecture
Drawbacks
Everything becomes decoupled
Function calls start looking like HTTP requests
Versioning can be a nightmare
Tracking dependencies is hard
Data consistency becomes challenging
End-to-end testing becomes hard(er), if not
impossible
SOA scales people, not code.
Conquering SOA
With the monolith, it’s easy to focus on
mean time between failures (MTBF)
Conquering SOA
In a SOA, focus on mean time to recovery
(MTTR)
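One way to compare the two metrics is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below uses made-up numbers (100 hours between failures, 1 hour to recover) that are not from the talk; it only shows how each lever moves the result.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: fail every 100 hours, take 1 hour to recover.
baseline = availability(mtbf_hours=100, mttr_hours=1)

# Lever 1: fail half as often (double MTBF).
longer_mtbf = availability(mtbf_hours=200, mttr_hours=1)

# Lever 2: recover twice as fast (halve MTTR).
faster_recovery = availability(mtbf_hours=100, mttr_hours=0.5)

print(f"baseline:        {baseline:.4f}")
print(f"double MTBF:     {longer_mtbf:.4f}")
print(f"halve MTTR:      {faster_recovery:.4f}")
```

Under this formula, halving recovery time buys the same availability as doubling the time between failures, and fast recovery is something a quick, safe deployment process can deliver directly.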
Conquering SOA
Fail fast
Anticipate failure
Leverage iteration speed to recover fast
Conquering SOA
Treat everything as distributed
Pick a size that works for you
micro
macro
somewhere in between
Size doesn’t have to be uniform!
PROBLEM SOLVED
Spreading the Love
The Problem: Humans Specialize
Ops Deputies
Developers ‘deputized’ to do operations
Elevated privileges
Tackle infrastructure needs for their team
Contribute improvements to shared infra
Become first-hop for operations questions
DEVOPS
Reinventing the Shuttle
Build vs. Buy
Composition
[Diagram: your new thing, with the gaps labeled "You fill in these bits"]
PaaSTA = Glue
Define interfaces between components
Makes it easy to swap out later
Expect change
Minimize code, keep it simple
Think about EOL when you start
Your company is changing, and so will your needs
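As an illustration of "define interfaces between components," here is a minimal sketch of glue code written against an interface instead of a specific tool. Every name in it (Scheduler, VendorAScheduler, VendorBScheduler, deploy) is hypothetical and is not taken from PaaSTA or any real product.

```python
from typing import Protocol


class Scheduler(Protocol):
    """Hypothetical interface: anything that can launch a service."""

    def launch(self, service: str, instances: int) -> str:
        ...


class VendorAScheduler:
    """Stand-in for an off-the-shelf component (made up for this example)."""

    def launch(self, service: str, instances: int) -> str:
        return f"vendor-a launched {instances}x {service}"


class VendorBScheduler:
    """A later replacement that satisfies the same interface."""

    def launch(self, service: str, instances: int) -> str:
        return f"vendor-b launched {instances}x {service}"


def deploy(scheduler: Scheduler, service: str, instances: int = 3) -> str:
    """The glue: it depends only on the interface, not on a vendor."""
    return scheduler.launch(service, instances)


print(deploy(VendorAScheduler(), "web"))
print(deploy(VendorBScheduler(), "web"))
```

Because the glue only knows about the interface, swapping one component for another later does not require rewriting the glue itself.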
Parting Words
Embrace DevOps
DevOps culture, not just the technology!
Do what’s right for you
Don’t let dogma rule! Tailor your approaches to the
talent around you.
Plan for change
Bit rot is real. Plan how you’re going to deal with it!
@YelpEngineering
YelpEngineers
engineeringblog.yelp.com
github.com/yelp
yelp.com/careers
THANK YOU

Editor's Notes

  • #2 this is what the speaker notes will look like
  • #3 4.5 years at Yelp, 80 people -> hundreds. Just going to talk about what I’ve learned along the way.
  • #4 For this talk to make sense, we have to also talk about what Yelp is. Connect people w/ great local businesses
  • #5 Approx. 83 million UMVs via mobile. More than 83 million reviews contributed since inception. Approx. 68% of all searches on Yelp came from mobile (mobile web & app). Yelp is present across 32 countries.
  • #6 I’m here to talk about growth, and specifically growth in people
  • #7 You might be growing your budding startup (I should warn you that I’m from the internet. This deck is mostly silly images.)
  • #8 It might be a merger or acquisition
  • #9 Maybe you’re expanding into new markets
  • #10 More developers = more improvements, features, products
  • #11 DevOps = cultural appreciation of change, not fear
  • #12 Infra engineer = agent of flexibility. DevOps provides tools and flexibility in how we manage and define infra.
  • #13 Infrastructure is what allows our software and products to exist. Infra provides flexibility to the software it supports.
  • #14 that infrastructure also scales people!
  • #15 and that infrastructure can be just as hard to change as your products, if not harder
  • #16 This is where DevOps changes from convenient to critical. Embrace the fundamental culture shift that spawned it.
  • #17 DevOps is a collaborative philosophy at its core I like to think about DevOps as being about someone having your back. Allies just as much as collaborators. This is absolutely critical if you want to survive growth.
  • #18 The world is a messy place. The right answer isn’t always clear. There are always a dozen counter-examples to any decent-sounding rule.
  • #19 Teams are made of very different kinds of people Every combination is unique and has its own chemistry
  • #20 In a growing company, dogma kills. That’s the only real advice I have. You could say I have a dogma against dogmas. Dogma is anti-DevOps.
  • #21 This is how most projects, companies, etc. start: single code repo, maybe a server or two, and one or a couple of developers. The entire project is easy to conceptualize. It fits in your head. This is true for things like configuration management, too! A small number of people makes overhead easy to cope with.
  • #23 And then ship it! Dump the code into production, probably restart everything. Click around, make sure stuff looks good. Maybe you’ve even got error monitoring! This works for a while, and it’s all you need when you get started.
  • #24 But then time passes, and the monolith grows. You add features. You add developers to make those features.
  • #25 As you add code and scale out your org + infrastructure, things naturally take longer. What was once a 10 minute deploy process might take closer to 30 minutes… or 45 minutes… or maybe even an hour! ...but that’s not a big deal, right?
  • #26 And here we run into a problem.
  • #27 HUMANS SCREW UP
  • #28 As you grow, you’re doing more stuff. More code, features, tests, changes. More stuff means more chances to screw up, which we do, because we are humans. And when you screw up, it means back to square one… new build, new test run, new deployment. ...and everyone has lost as much time as it takes to get this far. And they’ll have to invest it all again to get their code out!
  • #33 This starts looking pretty grim even around 10 branches, and that’s assuming a (generous) 98% success rate! At 20 branches we’re below 70%!
  • #35 So… how can we do better?
  • #36 Well, we can try to improve this number...
  • #37 Make it harder to screw up! Decrease the chances of failure. This is where almost all teams start focusing their efforts first.
  • #38 Here are some common ways people try to make screwing up more difficult.
  • #39 It’s easy!
  • #41 This doesn’t work in the real world. At the end of the day, we’re still human. We’re people! We make mistakes! We just spent a long time talking about how we’re fallible. Why would this be any less true of the systems we create to prevent us from making mistakes?
  • #42 In reality, doing all those things does help
  • #43 But at the end of the day, you need more to scale an org. We want the asymptotic solutions, not the constant factor.
  • #44 ...and of course, as computing professionals, you’ve all probably been writhing in your seats, trying to tell me to do this first
  • #45 We tackle this asymptotic factor with SOA. Split up large code bases into smaller ones that can be developed independently and communicate over common interfaces. How you size this is up to you. Don’t fall prey to the hype of “microservices” if it doesn’t make sense for you.
  • #48 It is a lot harder to do SOA than a monolith, and it can decrease your rate of success dramatically! It takes a lot of effort and discipline to get it right. However, it’s very difficult to obtain the advantages it provides any other way.
  • #51 Embrace the idea that failures will happen, and be ready for them! In a world like this, you need to safeguard your deployment process. It’s a problem if it gets slow, because it’s your out when you screw up.
  • #55 In an SOA, sprawl is easy. Technology becomes diverse. Infrastructure is no longer big, common, and shared. Teams tailor it to their uses and needs.
  • #56 Trying to support this all with a single team doesn’t work. Before you know it, the infra is chasing you. The operator:developer ratio needed to support this increases *dramatically*.
  • #57 You need to share the responsibility! This also means sharing the authority for these systems.
  • #58 Not everyone cares about infrastructure! ...and that’s just fine! But there will always be some people who are interested in what happens behind the scenes. Leverage them!
  • #60 Programs like this are tunable, depending on your needs and your size. You could opt to only have a few
  • #61 Or you could give it to absolutely everyone! It just depends on how your company is structured, the nature of your products, and how specialized you need parts of your team to be. Just get the people who are interested involved!
  • #62 This is one of our biggest manifestations of DevOps. We empower people to have our back, and in turn we have theirs.
  • #63 If you don’t have a structure like this, and you’re a growing company, you’re headed towards having this much bandwidth between you and the rest of the organization. And that org is changing!
  • #64 Having a program like Ops Deputies bought us high-bandwidth transit to the entire organization. We can use this to promote values we traditionally care about, while also staying in touch with planned change. This is the fabric of collaboration. This is also how you minimize MTTR.
  • #65 Reinventing the space shuttle
  • #66 Build vs. Buy. Classic tradeoff. Often examined as a binary choice: you’ll pick one or the other.
  • #67 But, like we talked about before: the world is a messy, ambiguous place
  • #68 In reality, build vs. buy is more of a spectrum
  • #69 ...and you have a lot of freedom to move around in that spectrum as it makes sense to!
  • #70 We can navigate this spectrum using composition. This is kind of the same as the Unix philosophy: tools and systems should do one thing well. Build a solution out of that.
  • #71 Thinking about build vs. buy as binary fundamentally fails to acknowledge how diverse the world is
  • #72 This component shouldn’t be a big box that requires you to conform to it. Either you buy software that doesn’t quite fit, or you re-invent wheels.
  • #73 Instead, if you piece together the functionality you need from software that does it well, all you have to fill in is the glue. Nobody else can provide that for you! You know your organization! It’s much easier to maintain than writing it all yourself. It’s a puzzle, and the contour of your org will change.
  • #74 We love this pattern at Yelp. It’s how we’ve approached big system design in the last few years, and it has worked out well. Our emerging flagship of this is PaaSTA: Platform as a Service, Totally Awesome!