A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

DevOpsDays Tel Aviv 2017

Published in: Technology

  1. A Culture Of Automation. Joe Smith, November 14, 2017
  2. Joe Smith, Operations Engineer, Slack Application Operations Team ● Build/Run core systems responsible for Slack ● CDNs, Edge Regions, Web tier, Websockets, development workflow, etc. ● Previously: ○ Tech Lead, Aurora/Mesos SRE at Twitter ○ Internal Technology Resident at Google
  3. Audience: Folks with the desire to build resilient systems. These roles may have different names, but share that goal ● Reliability Engineering ● Operations ● DevOps ● Site Reliability Engineering ● Production Engineering ● Systems Engineering
  4. Agenda: 1. Running Production Services 2. Runbooks 3. Automation
  5. Running Production Services
  6. Two Pizza Rule: A team should be small enough to share two large pizzas
  7. “We will take the site down today! 💥” – No one when they wake up
  8. ● Careful Planning and Procedures ● Extensive Documentation ● Good Communication Strategies
  9. Planning ● Some components are prioritized for speed while others are meant to be canaried and analyzed ● Changes need to be staged to coordinate with each other ● Give teams the tools and visibility they need to make improvements and understand impact ● Identify your rollback strategy ahead of time
  10. Documentation ● Each code or procedure change should be paired with an update to easily-readable text ● Help your teammates and yourself weeks from now when you need to understand how things work. ● Do not just describe how systems are structured, explain why they are built that way! ● Additional context can inform future decisions
  11. Communication ● Change Management - Coordinating release schedules can be difficult ● Launch Channel - Announce changes, link to more details in a feature-specific Slack channel for the change ● Add links to commits, code reviews, threads in Slack, mailing list posts, and StackOverflow questions ● This enables your team to benefit from the research you've done
  12. ● Careful Planning and Procedures ● Extensive Documentation ● Good Communication Strategies
  13. Growing Pains ● Unexpected changes, forced roll-forwards ● Outdated Runbooks ● Missed Notifications
  14. Good Problems to Have: As the team grows, it's no longer possible to understand everything that's happening at once. The scope of work is also increasing!
  15. Runbooks
  16. “Checklists for commonly repeated operational tasks.” – Slack, Runbook README
  17. 1. Location 2. Format 3. Contents
  18. Location ● Google Docs ○ Good Formatting, Mobile Apps, external service ● Wiki ○ Web Interface, Track Changes ● Markdown in git repo (paired with GitHub) ○ Formatting, offline support, normal Pull Request flow
  19. Markdown in Git Repo ● Track changes across revisions ● Optional peer review ● Link to relevant sections ● Clone repo for offline support
  20. Runbook Template (thanks to my teammate Megan!) ● ApplicationServer ○ README.md ○ standard_actions.md ○ other_actions.md ○ alerts/ ■ box_failure.md ■ some_alert_name.md
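The template layout on the slide above could be stamped out for each new service with a small script. This is only a sketch: the function name `scaffold_runbook`, the `runbooks` root directory, and the placeholder heading written into each file are assumptions for illustration, not part of the talk.

```python
from pathlib import Path

# File layout copied from the runbook template slide.
TEMPLATE_FILES = [
    "README.md",
    "standard_actions.md",
    "other_actions.md",
    "alerts/box_failure.md",
    "alerts/some_alert_name.md",
]


def scaffold_runbook(root: Path, service: str) -> list[Path]:
    """Create the template tree for `service` under `root`.

    Files that already exist are left untouched. Returns the paths created.
    """
    created = []
    for relative in TEMPLATE_FILES:
        path = root / service / relative
        if path.exists():
            continue  # never overwrite a runbook someone has already written
        path.parent.mkdir(parents=True, exist_ok=True)
        # Placeholder heading (hypothetical); real content is written by the team.
        path.write_text(f"# {service}: {path.stem}\n")
        created.append(path)
    return created
```

Calling `scaffold_runbook(Path("runbooks"), "ApplicationServer")` inside a git checkout would create the five files, ready to be filled in and sent through the normal Pull Request flow.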
  21. README.md
  22. standard_actions.md
  23. other_actions.md
  24. alert.md
  25. box_failure.md
  26. Example
  27. Content: This is not the place for Design Documentation. These are highly-actionable, succinct descriptions of next steps.
  28. Automation
  29. "Test until fear turns to boredom." – JUnit FAQ, http://junit.sourceforge.net/doc/faq/faq.htm#tests_6
  30. "Automate once fear turns to boredom" – Ancient SRE Corollary
  31. Beyond Runbooks ● Turn a manual checklist into a testable, repeatable set of steps anyone can run ● Anytime you discover a sharp edge or workaround, this can be codified in the tool ● Reduce sections of "but if this happens, check this dashboard and then do one of three things"
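One way to picture a checklist becoming "a testable, repeatable set of steps" is to give each checklist line a check and an action. The sketch below is an illustration under assumptions: the `Step` and `run_checklist` names are hypothetical, and a real tool would call out to actual systems instead of taking plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One line of a former runbook checklist."""
    name: str
    is_done: Callable[[], bool]  # check: is the system already in the desired state?
    run: Callable[[], None]      # action: move the system toward that state


def run_checklist(steps: list[Step]) -> list[str]:
    """Run each step in order and return a log of what happened.

    Steps that are already satisfied are skipped, so re-running is safe.
    If a step's check still fails after its action, the run stops there
    so a human can take over.
    """
    log = []
    for step in steps:
        if step.is_done():
            log.append(f"skip: {step.name}")
            continue
        step.run()
        if not step.is_done():
            log.append(f"failed: {step.name}")
            break
        log.append(f"done: {step.name}")
    return log
```

Because every step carries its own check, a newly discovered sharp edge can be added as one more `Step`, and the whole list can be exercised in a unit test with fake checks and actions.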
  32. The Tooling Workflow: Initial Steps. This process can evolve over a long time and generally improve things. ● One person has all the knowledge in their head ● That person writes down everything they know in a runbook ● Someone sees an annoying or complicated piece and writes a small script to be run instead for a tiny part of the process. The next jump will be the most difficult part!
  33. The Tooling Workflow: Maintenance ● It feels great to have written your first tool! ● You may be lucky and have no bugs ● Most likely there are some edge cases; that is okay and expected! ● Take some time to figure out what went wrong and how to make things better.
  34. The Tooling Workflow: Completion ● Later on, another part of the process can be added in and the documentation further updated ● Over time, the runbook becomes "Run this tool we wrote, send bugs to the authors" ● Finally, there is no longer an entry! The tool is run automatically, or the system itself is able to solve that problem
  35. "The value of humans is to execute Judgement, the value of computers is to execute instructions" – Aron, teammate at Slack
  36. Runbook to Automated Workflow: 1. Brain 2. Runbook 3. Start of automation 4. Automation evolution (safeguards) 5. Self-contained Tool 6. Fully Automated
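The "safeguards" stage in the progression above might look something like the following. The specific safeguards (a dry-run default, a cap on hosts per run, halting on the first unhealthy host) are assumptions chosen for illustration, as are the `rolling_restart` name and its parameters; the talk does not prescribe them.

```python
from typing import Callable, Iterable


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_hosts: int = 5,
    dry_run: bool = True,
) -> list[str]:
    """Restart hosts one at a time and return the hosts actually restarted.

    Safeguards (all assumptions for this sketch):
      * dry_run defaults to True, so the default invocation changes nothing
      * max_hosts caps how many hosts a single run may touch
      * the run halts at the first host that is unhealthy after its restart
    """
    hosts = list(hosts)
    if len(hosts) > max_hosts:
        raise ValueError(f"refusing to touch {len(hosts)} hosts (limit {max_hosts})")
    restarted = []
    for host in hosts:
        if dry_run:
            print(f"[dry-run] would restart {host}")
            continue
        restart(host)
        restarted.append(host)
        if not is_healthy(host):
            # Judgement stays with a human: stop instead of guessing.
            raise RuntimeError(f"{host} unhealthy after restart; stopping")
    return restarted
```

Passing `restart` and `is_healthy` in as functions keeps the safeguards testable without touching real machines, which fits the idea of the tool gradually absorbing what used to be runbook caveats.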
  37. Process in Code ● Using libraries like fabric, pychef, and boto3 can ease automation ● When there are issues, the code can be reviewed for process changes, git history can be consulted, etc. ● No more "I forgot that was changed and followed the old process!" ● Each time someone submits an improvement or workflow tweak, that will always be useful from now on!
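As a small illustration of process living in code, a runbook line such as "check the EC2 console for failed status checks" could become a function. This is a hedged sketch, not something shown in the talk: the function name `unhealthy_instances` is hypothetical, the client is passed in so the function can be tested with a stand-in, and pagination of results is ignored for brevity.

```python
def unhealthy_instances(ec2_client) -> list[str]:
    """Return IDs of instances whose instance or system status check is not "ok".

    `ec2_client` is expected to behave like boto3.client("ec2"); the only
    call made is describe_instance_status, which by default reports on
    running instances. Instances that are still initializing are also
    returned, since their checks are not yet "ok".
    """
    response = ec2_client.describe_instance_status()
    bad = []
    for status in response["InstanceStatuses"]:
        checks = (
            status["InstanceStatus"]["Status"],
            status["SystemStatus"]["Status"],
        )
        if any(check != "ok" for check in checks):
            bad.append(status["InstanceId"])
    return bad
```

With boto3 installed and credentials configured, `unhealthy_instances(boto3.client("ec2"))` would run the same check every time, and any later tweak to the process shows up in git history instead of in someone's memory.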
  38. Thank You! For more information go to: slack.com/jobs
  39. Joe Smith, Operations Engineer, Slack Application Operations Team ● Come build the future of work! ○ https://slack.com/careers/641062/senior-site-reliability-engineer ● Please reach out and say hi! ○ @Yasumoto on Twitter ● Tools Scaffold ○ https://github.com/Yasumoto/tools