38. ● Encourage participation
● We make it easy to find code
● Metrics make it easier to make decisions
Make Feedback/Contributions Easy
Icons by Freepik CC 3.0
39. ● I'm out of icons
● Make the right thing the easiest thing to do
● Use the right processes & tooling
● Keep track of how it's working
Conclusion
Approx. 90 million UMVs via mobile
More than 102 million reviews contributed since inception
Approx. 70% of all searches on Yelp came from mobile (mobile web & app)
Yelp is present across 32 countries
We're about 10% of engineering
SREs, network engineers, DBAs
Work with engineers
Write tools
Share info
Devs w/ extra authority, responsibility
Fight manual processes to get stuff done (GSD), spread load, improve runbooks, moar automation
Other deputies: releng, i18n, web, mobile, Splunk
AND OUR ENGINEERING TEAMS ROCK TOO
A lot of smart cookies.
We're just all so nice
We make decisions together
But we created a storm, a storm of pages
I'll cover two situations
Manual investigation
Had to track people down and tell them they did something wrong
(I know it's arrogant to quote your own tweet)
When you're a large organization
Lots of interactions are happening for the first time
And when you don't know someone, you make assumptions
You jump to conclusions
I've never met Alice before, but I know she's a dev
And I've never met Barbara before, but I know she's an SRE
We fall back on old stereotypes
Devs as cowboys (or cowgirls), breaking things
They don't care about CPU, memory, disk, network
Ops as the Police
Did you fill out form 147?
You can't do that, have you thought about the BLAHBLAHBLAH?
You need new features
You need to fix bugs
Some changes are even for performance or Ops benefit!
You can't make money if the site is down
Or get new customers. All you get is angry tweets.
And people want to do the right thing.
No one wants things to break.
So, what happened?
Initial set of hosts, tiny /tmp, big homedirs, huge scratch partition
Lots of things log to /tmp by default
Things got a bit snug as we got more devs
Ops couldn't really do much more than look for stuff to clean up and ask people about their stuff (roughly the hunt sketched below)
Confusion around disabled users: interns, are they coming back?
Do folks want their stuff? .bash_history
Do they have some rando cron job running?
True story: we cleaned up someone's homedir, and found that the web cluster check called into a script in their home directory.
Deleting stuff is scary, yo!
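A rough sketch of what that manual hunt amounts to: walk /tmp, flag big files nobody has touched in a while, and map each one back to an owner you can go ask (or to a UID with no user, which is its own clue). The path, size, and age thresholds here are made up for illustration.

#!/usr/bin/env python
# Sketch: find big, stale files on a shared partition and who to ask about
# them. ROOT and the thresholds are illustrative, not our real values.
import os
import pwd
import time

ROOT = "/tmp"
MIN_SIZE = 100 * 1024 * 1024   # only bother with files over ~100 MB
MAX_AGE = 14 * 86400           # untouched for two weeks

now = time.time()
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.lstat(path)
        except OSError:
            continue   # vanished or unreadable; skip it
        if st.st_size < MIN_SIZE or now - st.st_mtime < MAX_AGE:
            continue
        try:
            owner = pwd.getpwuid(st.st_uid).pw_name
        except KeyError:
            owner = "uid %d (no such user?)" % st.st_uid
        print("%6d MB  %-12s  %s" % (st.st_size // 2 ** 20, owner, path))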
What they think they do:
Coding, testing, pushing
Just tryin' to close tickets
Don't know they're causing risk
Who here is on an oncall rotation?
On the count of 3, make the sound your phone makes when you get paged
No one told me where to put my stuff, or how to clean it up.
Ops need to keep things running, build new infra
NOTE BELOW: we should auto-ticket queries we kill?!
Next time: someone wanted to know why the DB couldn't fix this itself; explain that it's complicated.
What they think they do:
Coding, testing, pushing
Just tryin' to close tickets
No advance notice
OK, who here is on a DBA pager rotation?
You rock!
What sound does your phone make?
This didn't happen in dev, tests, or staging
I would have stopped this if I could
The DB code is abstracted, and copy-paste makes this easy to perpetuate
Fighting is for luchadores, ninjas, and vikings
And bad feelings can spread pretty easily
Someone who's never interacted with devs or ops may think that they're jerks!
Instead, let's figure this out and make things better!
Patched some of our tools to write to larger partitions by default
Wrote tools to auto-clean up
Warn users of a machine when it starts to get overloaded
Sensu alerts at non-critical levels create tickets & tag heavy users (a check like the one sketched below)
Great incentives to move to new machines
White-glove moving service; some folks may just be told they have to move
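For the curious, a Sensu check is just a script whose exit code maps to OK/warning/critical, so the non-critical disk alert is roughly this sketch; a handler turns the warning into a ticket and tags the heavy users. The partition and thresholds are illustrative.

#!/usr/bin/env python
# Sketch of a Sensu-style disk check: exit 1 (warning, file a ticket) well
# before exit 2 (critical, page). PARTITION and thresholds are illustrative.
import os
import sys

PARTITION = "/home"
WARN_PCT = 80
CRIT_PCT = 95

vfs = os.statvfs(PARTITION)
pct_used = 100.0 * (1 - float(vfs.f_bavail) / vfs.f_blocks)
msg = "%s is %.0f%% full" % (PARTITION, pct_used)

if pct_used >= CRIT_PCT:
    print("CheckDisk CRITICAL: " + msg)
    sys.exit(2)
if pct_used >= WARN_PCT:
    print("CheckDisk WARNING: " + msg)
    sys.exit(1)
print("CheckDisk OK: " + msg)
sys.exit(0)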
GQC, get better data into dev
We query Anemometer for new slow queries & ticket them (sketched after this list)
pt-kill for long-running transactions, EVERYWHERE
Graphs that monitor killed things have push annotations
We ticket, then page when we kill too many
Splunk log monitors for these client errors, and alerts (IRC, ticket) when they're up
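Roughly what the auto-ticketing amounts to, as a sketch: pull unreviewed slow queries out of Anemometer's backing database and file a Jira issue for each. The table and column names assume the standard pt-query-digest review schema, and the hosts, Jira URL, project key, and credentials are placeholders.

#!/usr/bin/env python
# Sketch: ticket new, unreviewed slow queries from Anemometer's review table.
# Table/column names assume the pt-query-digest review schema; the hosts,
# Jira project, and credentials below are placeholders.
import pymysql
import requests

JIRA_URL = "https://jira.example.com"
AUTH = ("dba-bot", "not-a-real-password")

conn = pymysql.connect(host="anemometer-db", user="report",
                       password="...", db="slow_query_log")
cur = conn.cursor()
cur.execute("""
    SELECT checksum, fingerprint, first_seen
      FROM global_query_review
     WHERE reviewed_by IS NULL
       AND first_seen > NOW() - INTERVAL 1 DAY""")

for checksum, fingerprint, first_seen in cur.fetchall():
    issue = {"fields": {
        "project": {"key": "DBA"},
        "issuetype": {"name": "Task"},
        "summary": "New slow query %s" % checksum,
        "description": "First seen %s:\n{noformat}%s{noformat}"
                       % (first_seen, fingerprint),
    }}
    requests.post(JIRA_URL + "/rest/api/2/issue",
                  json=issue, auth=AUTH).raise_for_status()
    # Mark it reviewed so the next run doesn't file a duplicate.
    cur.execute("UPDATE global_query_review"
                "   SET reviewed_by = %s, reviewed_on = NOW()"
                " WHERE checksum = %s", ("auto-ticketer", checksum))
conn.commit()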
Nothing sucks more than being surprised by bad news. Being responsible people, we don't want to be told we've done something wrong. It feels bad.
And this goes both ways - devs don't like being told they did something wrong, ops don't like getting paged.
Let people know in advance of the fire
We fail tests when we think a query will be gross (sketched below).
We announce in the motd when a dev box is overloaded.
We announce alert warning states in IRC.
Some alerts file Jira tickets, and some page as a last resort.
Email is bad.
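The "fail the test before the fire" check looks roughly like this sketch: EXPLAIN the query inside a test and fail on a full table scan or a huge row estimate. The connection fixture, threshold, and example query are illustrative, not our actual harness.

# Sketch of a pytest-style guard: EXPLAIN a query and fail the build if
# MySQL plans a full table scan or expects to examine too many rows.
# The connection details, threshold, and example query are illustrative.
import pymysql
import pymysql.cursors
import pytest

MAX_ROWS_EXAMINED = 10000

@pytest.fixture
def cursor():
    conn = pymysql.connect(host="dev-db", user="test", password="...",
                           db="yelp_test")
    yield conn.cursor(pymysql.cursors.DictCursor)
    conn.close()

def assert_not_gross(cursor, query, params=()):
    cursor.execute("EXPLAIN " + query, params)
    for row in cursor.fetchall():
        # type == 'ALL' means a full table scan in MySQL's EXPLAIN output.
        assert row["type"] != "ALL", "full table scan: %r" % row
        assert (row["rows"] or 0) <= MAX_ROWS_EXAMINED, \
            "examines too many rows: %r" % row

def test_recent_reviews_query(cursor):
    assert_not_gross(
        cursor,
        "SELECT id FROM reviews WHERE business_id = %s"
        " ORDER BY time_created DESC LIMIT 20",
        (12345,))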
These tools need to help people *before* a problem arises
More importantly, they let people handle it themselves
Automation means you get logs!
We have data on our Sensu alerts
We have data on our Jira tickets
Visualize that data
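Since the auto-filed tickets live in Jira, a trend line is one JQL search away; this sketch counts tickets per day for one project. The Jira URL, project key, label, and credentials are all placeholders.

#!/usr/bin/env python
# Sketch: count auto-filed Jira tickets per day via the search API, as a
# crude trend line. URL, project, label, and credentials are placeholders.
import collections
import requests

JIRA_URL = "https://jira.example.com"
AUTH = ("readonly-bot", "not-a-real-password")
JQL = "project = DBA AND labels = auto-filed"

counts = collections.Counter()
start = 0
while True:
    resp = requests.get(JIRA_URL + "/rest/api/2/search",
                        params={"jql": JQL, "fields": "created",
                                "startAt": start, "maxResults": 100},
                        auth=AUTH)
    resp.raise_for_status()
    data = resp.json()
    for issue in data["issues"]:
        counts[issue["fields"]["created"][:10]] += 1   # bucket by date
    start += len(data["issues"])
    if not data["issues"] or start >= data["total"]:
        break

for day in sorted(counts):
    print("%s %4d %s" % (day, counts[day], "#" * counts[day]))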
The deputies program brings a little bit of ops into each team
We make it easy to find things and make adjustments
Having metrics makes it easy to make decisions
"Reality is Broken",concept, Nachas, meaning the pride in seeing others succeed after guidance/mentoring.
There is a similar pride in seeing others use tools you've written, or even watching someone make them better!