38. ● Encourage participation
● We make it easy to find code
● Metrics make it easier to make decisions
Make Feedback/Contributions Easy
Icons by Freepik CC 3.0
39. ● I'm out of icons
● Make the right thing the easiest thing to do
● Use the right processes & tooling
● Keep track of how it's working
Conclusion
Approx. 90 million UMVs via mobile
More than 102 million reviews contributed since inception
Approx. 70% of all searches on Yelp came from mobile (mobile web & app)
Yelp is present across 32 countries
We're about 10% of engineering
SREs, network engineers, DBAs
Work with engineers
Write tools
Share info
Devs w/ extra authority, responsibility
Fight manual processes to get stuff done (GSD), spread load, improve runbooks, moar automation
Other deputies: releng, i18n, web, mobile, Splunk
AND OUR ENGINEERING TEAMS ROCK TOO
A lot of smart cookies.
We're just all so nice
We make decisions together
But we created a storm, a storm of pages
I'll cover two situations
Manual investigation
Had to track people down and tell them they did something wrong
(I know it's arrogant to quote your own tweet)
When you're a large organization
Lots of interactions are happening for the first time
And when you don't know someone, you make assumptions
You jump to conclusions
I've never met Alice before, but I know she's a dev
And I've never met Barbara before, but I know she's an SRE
We fall back on old stereotypes
Devs as cowboys (or cowgirls), breaking things
They don't care about CPU, memory, disk, network
Ops as the Police
Did you fill out form 147?
You can't do that, have you thought about the BLAHBLAHBLAH?
You need new features
You need to fix bugs
Some changes are even for performance or Ops benefit!
You can't make money if the site is down
Or get new customers. All you get is angry tweets.
And people want to do the right thing.
No one wants things to break.
So, what happened?
Initial set of hosts, tiny /tmp, big homedirs, huge scratch partition
Lots of things log to /tmp by default
Things got a bit snug as we got more devs
Ops couldn't really do much more than look for stuff to clean up and ask people about their stuff (roughly the hunt sketched below)
Confusion around disabled users: interns, are they coming back?
Do folks want their stuff? .bash_history
Do they have some rando cron job running?
True story: we cleaned up someone's homedir, and found that the web cluster check called into a script in their home directory.
Deleting stuff is scary, yo!
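A rough sketch of what that manual hunt amounts to: walk /tmp, flag big files nobody has touched in a while, and map each one back to an owner you can go ask (or to a UID with no user, which is its own clue). The path, size, and age thresholds here are made up for illustration.

#!/usr/bin/env python
# Sketch: find big, stale files on a shared partition and who to ask about
# them. ROOT and the thresholds are illustrative, not our real values.
import os
import pwd
import time

ROOT = "/tmp"
MIN_SIZE = 100 * 1024 * 1024   # only bother with files over ~100 MB
MAX_AGE = 14 * 86400           # untouched for two weeks

now = time.time()
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.lstat(path)
        except OSError:
            continue   # vanished or unreadable; skip it
        if st.st_size < MIN_SIZE or now - st.st_mtime < MAX_AGE:
            continue
        try:
            owner = pwd.getpwuid(st.st_uid).pw_name
        except KeyError:
            owner = "uid %d (no such user?)" % st.st_uid
        print("%6d MB  %-12s  %s" % (st.st_size // 2 ** 20, owner, path))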
What they think they do:
Coding, testing, pushing
Just tryin' to close tickets
Don't know they're causing risk
Who here is on an oncall rotation?
On the count of 3, make the sound your phone makes when you get paged
No one told me where to put my stuff, or how to clean it up.
Ops need to keep things running, build new infra
NOTE BELOW: we should auto-ticket queries we kill?!
Next time: someone wanted to know why the DB couldn't fix this itself; explain that it's complicated.
What they think they do:
Coding, testing, pushing
Just tryin' to close tickets
No advance notice
OK, who here is on a DBA pager rotation?
You rock!
What sound does your phone make?
This didn't happen in dev, tests, or staging
I would have stopped this if I could
The DB code is abstracted, and copy-paste makes this easy to perpetuate
Fighting is for luchadores, ninjas, and vikings
And bad feelings can spread pretty easily
Someone who's never interacted with devs or ops may think that they're jerks!
Instead, let's figure this out and make things better!
Patched some of our tools to write to larger partitions by default
Wrote tools to auto-clean up
Warn users of a machine when it starts to get overloaded
Sensu alerts at non-critical levels create tickets & tag heavy users (a check like the one sketched below)
Great incentives to move to new machines
White-glove moving service; some folks may just be told they have to move
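For the curious, a Sensu check is just a script whose exit code maps to OK/warning/critical, so the non-critical disk alert is roughly this sketch; a handler turns the warning into a ticket and tags the heavy users. The partition and thresholds are illustrative.

#!/usr/bin/env python
# Sketch of a Sensu-style disk check: exit 1 (warning, file a ticket) well
# before exit 2 (critical, page). PARTITION and thresholds are illustrative.
import os
import sys

PARTITION = "/home"
WARN_PCT = 80
CRIT_PCT = 95

vfs = os.statvfs(PARTITION)
pct_used = 100.0 * (1 - float(vfs.f_bavail) / vfs.f_blocks)
msg = "%s is %.0f%% full" % (PARTITION, pct_used)

if pct_used >= CRIT_PCT:
    print("CheckDisk CRITICAL: " + msg)
    sys.exit(2)
if pct_used >= WARN_PCT:
    print("CheckDisk WARNING: " + msg)
    sys.exit(1)
print("CheckDisk OK: " + msg)
sys.exit(0)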
GQC, get better data into dev
We query Anemometer for new slow queries & ticket them (sketched after this list)
pt-kill for long-running transactions, EVERYWHERE
Graphs that monitor killed things have push annotations
We ticket, then page when we kill too many
Splunk log monitors for these client errors, and alerts (IRC, ticket) when they're up
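Roughly what the auto-ticketing amounts to, as a sketch: pull unreviewed slow queries out of Anemometer's backing database and file a Jira issue for each. The table and column names assume the standard pt-query-digest review schema, and the hosts, Jira URL, project key, and credentials are placeholders.

#!/usr/bin/env python
# Sketch: ticket new, unreviewed slow queries from Anemometer's review table.
# Table/column names assume the pt-query-digest review schema; the hosts,
# Jira project, and credentials below are placeholders.
import pymysql
import requests

JIRA_URL = "https://jira.example.com"
AUTH = ("dba-bot", "not-a-real-password")

conn = pymysql.connect(host="anemometer-db", user="report",
                       password="...", db="slow_query_log")
cur = conn.cursor()
cur.execute("""
    SELECT checksum, fingerprint, first_seen
      FROM global_query_review
     WHERE reviewed_by IS NULL
       AND first_seen > NOW() - INTERVAL 1 DAY""")

for checksum, fingerprint, first_seen in cur.fetchall():
    issue = {"fields": {
        "project": {"key": "DBA"},
        "issuetype": {"name": "Task"},
        "summary": "New slow query %s" % checksum,
        "description": "First seen %s:\n{noformat}%s{noformat}"
                       % (first_seen, fingerprint),
    }}
    requests.post(JIRA_URL + "/rest/api/2/issue",
                  json=issue, auth=AUTH).raise_for_status()
    # Mark it reviewed so the next run doesn't file a duplicate.
    cur.execute("UPDATE global_query_review"
                "   SET reviewed_by = %s, reviewed_on = NOW()"
                " WHERE checksum = %s", ("auto-ticketer", checksum))
conn.commit()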
Nothing sucks more than being surprised by bad news. Being responsible people, we don't want to be told we've done something wrong. It feels bad.
And this goes both ways - devs don't like being told they did something wrong, ops don't like getting paged.
Let people know in advance of the fire
We fail tests when we think a query will be gross (sketched below).
We announce in the motd when a dev box is overloaded.
We announce alert warning states in IRC.
Some alerts file Jira tickets, and some page as a last resort.
Email is bad.
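The "fail the test before the fire" check looks roughly like this sketch: EXPLAIN the query inside a test and fail on a full table scan or a huge row estimate. The connection fixture, threshold, and example query are illustrative, not our actual harness.

# Sketch of a pytest-style guard: EXPLAIN a query and fail the build if
# MySQL plans a full table scan or expects to examine too many rows.
# The connection details, threshold, and example query are illustrative.
import pymysql
import pymysql.cursors
import pytest

MAX_ROWS_EXAMINED = 10000

@pytest.fixture
def cursor():
    conn = pymysql.connect(host="dev-db", user="test", password="...",
                           db="yelp_test")
    yield conn.cursor(pymysql.cursors.DictCursor)
    conn.close()

def assert_not_gross(cursor, query, params=()):
    cursor.execute("EXPLAIN " + query, params)
    for row in cursor.fetchall():
        # type == 'ALL' means a full table scan in MySQL's EXPLAIN output.
        assert row["type"] != "ALL", "full table scan: %r" % row
        assert (row["rows"] or 0) <= MAX_ROWS_EXAMINED, \
            "examines too many rows: %r" % row

def test_recent_reviews_query(cursor):
    assert_not_gross(
        cursor,
        "SELECT id FROM reviews WHERE business_id = %s"
        " ORDER BY time_created DESC LIMIT 20",
        (12345,))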
These tools need to help people *before* a problem arises
More importantly, they let people handle it themselves
Automation means you get logs!
We have data on our Sensu alerts
We have data on our Jira tickets
Visualize that data
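Since the auto-filed tickets live in Jira, a trend line is one JQL search away; this sketch counts tickets per day for one project. The Jira URL, project key, label, and credentials are all placeholders.

#!/usr/bin/env python
# Sketch: count auto-filed Jira tickets per day via the search API, as a
# crude trend line. URL, project, label, and credentials are placeholders.
import collections
import requests

JIRA_URL = "https://jira.example.com"
AUTH = ("readonly-bot", "not-a-real-password")
JQL = "project = DBA AND labels = auto-filed"

counts = collections.Counter()
start = 0
while True:
    resp = requests.get(JIRA_URL + "/rest/api/2/search",
                        params={"jql": JQL, "fields": "created",
                                "startAt": start, "maxResults": 100},
                        auth=AUTH)
    resp.raise_for_status()
    data = resp.json()
    for issue in data["issues"]:
        counts[issue["fields"]["created"][:10]] += 1   # bucket by date
    start += len(data["issues"])
    if not data["issues"] or start >= data["total"]:
        break

for day in sorted(counts):
    print("%s %4d %s" % (day, counts[day], "#" * counts[day]))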
The deputies program brings a little bit of ops into each team
We make it easy to find things and make adjustments
Having metrics makes it easy to make decisions
"Reality is Broken",concept, Nachas, meaning the pride in seeing others succeed after guidance/mentoring.
There is a similar pride in seeing others use tools you've written, or even watching someone make them better!