Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Service Ownership @Slack

215 views

Published on

Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2Ffp4js.

Holly Allen talks about the bumps and scrapes, triumphs and pitfalls of Slackโ€™s journey from a centralized ops team to development teams that own the full lifecycle of their systems. Filmed at qconsf.com.

Holly Allen is a leader in Service Engineering at Slack, with SRE, Safety Engineering, and Storage in her portfolio. She is tireless in her efforts to make Slack the software reliable and scalable, and Slack the company a delightful place to work. Prior to Slack, she worked at startups, DreamWorks Animation, and was Director of Engineering at 18F, a civic tech startup in the US government.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Service Ownership @Slack

  1. 1. Service Ownership Learn Faster Holly Allen Service Engineering @hollyjallen
  2. 2. InfoQ.com: News & Community Site โ€ข Over 1,000,000 software developers, architects and CTOs read the site world- wide every month โ€ข 250,000 senior developers subscribe to our weekly newsletter โ€ข Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) โ€ข Post content from our QCon conferences โ€ข 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building โ€ข 96 deep dives on innovative topics packed as downloadable emags and minibooks โ€ข Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ slack-devops
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. Holly Allen Software development and leadership for 18 years
  5. 5. @hollyjallen,#QConSF Nov 2018
  6. 6. @hollyjallen,#QConSF Nov 2018
  7. 7. @hollyjallen,#QConSF Nov 2018
  8. 8. Software! ๐Ÿ˜ @hollyjallen,#QConSF Nov 2018
  9. 9. @hollyjallen,#QConSF Nov 2018
  10. 10. S L O W ๐Ÿ˜ฉ @hollyjallen,#QConSF Nov 2018
  11. 11. @hollyjallen,#QConSF Nov 2018
  12. 12. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  13. 13. Toyota Production System @hollyjallen,#QConSF Nov 2018
  14. 14. @hollyjallen,#QConSF Nov 2018
  15. 15. @hollyjallen,#QConSF Nov 2018
  16. 16. @hollyjallen,#QConSF Nov 2018
  17. 17. @hollyjallen,#QConSF Nov 2018
  18. 18. @hollyjallen,#QConSF Nov 2018
  19. 19. โ€œโ€ Kaizen Continuous Improvement @hollyjallen,#QConSF Nov 2018
  20. 20. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  21. 21. @hollyjallen,#QConSF Nov 2018
  22. 22. @hollyjallen,#QConSF Nov 2018
  23. 23. @hollyjallen,#QConSF Nov 2018
  24. 24. โ€œโ€ Executive dedication to learning @hollyjallen,#QConSF Nov 2018
  25. 25. โ€œโ€ High Trust Teams @hollyjallen,#QConSF Nov 2018
  26. 26. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  27. 27. @hollyjallen,#QConSF Nov 2018
  28. 28. Slack launched February 2014 @hollyjallen,#QConSF Nov 2018 ๐Ÿš€
  29. 29. @hollyjallen,#QConSF Nov 2018 5 Years Grew to 13+ million weekly active users, with active sessions of 10+ hours a day
  30. 30. @hollyjallen,#QConSF Nov 2018 5 Years From 10 to 15,000 servers In 25 cloud data centers world-wide
  31. 31. @hollyjallen,#QConSF Nov 2018 5 Years From 8 to 1,200 people In 9 offices world-wide
  32. 32. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  33. 33. @hollyjallen,#QConSF Nov 2018
  34. 34. โ€œโ€ โœ… Continuous Deployment โœ… Experiment Frameworks โœ… User Research @hollyjallen,#QConSF Nov 2018
  35. 35. Something didn't scale... @hollyjallen,#QConSF Nov 2018
  36. 36. Centralized Operations @hollyjallen,#QConSF Nov 2018 ๐Ÿ˜ญ
  37. 37. โ€œโ€ Who should be responsible for the management, monitoring and operation of a production application? @hollyjallen,#QConSF Nov 2018
  38. 38. โ€œโ€ Centralized Operations Division of Labor @hollyjallen,#QConSF Nov 2018
  39. 39. Devs Features Scale Architecture @hollyjallen,#QConSF Nov 2018 Ops Cloud Infra Deployment Monitoring
  40. 40. โ€œโ€ Ops is getting the pages @hollyjallen,#QConSF Nov 2018
  41. 41. โ€œโ€ Product Development grew faster than Operations, A lot faster @hollyjallen,#QConSF Nov 2018
  42. 42. 20 Product Developers @hollyjallen,#QConSF Nov 2018 1 Ops Engineer
  43. 43. โ€œโ€ How can operations reliably reach the developers when there's a problem? @hollyjallen,#QConSF Nov 2018
  44. 44. โ€œโ€ "Call Maude, she knows how this works" @hollyjallen,#QConSF Nov 2018
  45. 45. Devs I've never been on-call before, this is scary! @hollyjallen,#QConSF Nov 2018 Ops Now I know I can find a developer when I need to.
  46. 46. โ€œโ€ Ops is getting the pages first pages Ultra-senior devs on-call @hollyjallen,#QConSF Nov 2018
  47. 47. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  48. 48. โ€œโ€ How can operations reliably reach the developers when there's a problem? @hollyjallen,#QConSF Nov 2018
  49. 49. Most devs go on-call Fall 2017 @hollyjallen,#QConSF Nov 2018 ๐Ÿ“Ÿ
  50. 50. โ€œโ€ Kaizen Continuous Improvement @hollyjallen,#QConSF Nov 2018
  51. 51. โ€œโ€ "Wait, I'm on-call now?" @hollyjallen,#QConSF Nov 2018
  52. 52. Devs I'm glad I'm only on call a few times a year @hollyjallen,#QConSF Nov 2018 Ops I'll be able to reach a search engineer if I need to.
  53. 53. โ€œโ€ Learn by Doing @hollyjallen,#QConSF Nov 2018
  54. 54. โ€œโ€ On-call 3 times a year ๐Ÿค” @hollyjallen,#QConSF Nov 2018
  55. 55. โ€œโ€ Ops is getting the pages first pages Ultra-senior devs on-call Seven One dev rotations @hollyjallen,#QConSF Nov 2018
  56. 56. โ€œโ€ Continuous Deployment 100+ prod deploys a day @hollyjallen,#QConSF Nov 2018
  57. 57. โ€œโ€ What Changed? @hollyjallen,#QConSF Nov 2018
  58. 58. โ€œโ€ @hollyjallen,#QConSF Nov 2018
  59. 59. โ€œโ€ @hollyjallen,#QConSF Nov 2018
  60. 60. โ€œโ€ Page the dev @hollyjallen,#QConSF Nov 2018
  61. 61. Devs I don't understand this part of the code @hollyjallen,#QConSF Nov 2018 Ops These are the machine alerts I'm seeing
  62. 62. โ€œโ€ Human Routers @hollyjallen,#QConSF Nov 2018
  63. 63. โ€œโ€ "Call Andy, he knows how this works" @hollyjallen,#QConSF Nov 2018
  64. 64. โ€œโ€ Postmortems weren't a great place for learning @hollyjallen,#QConSF Nov 2018
  65. 65. โ€œโ€ Can we catch problems earlier? @hollyjallen,#QConSF Nov 2018
  66. 66. โ€œโ€ @hollyjallen,#QConSF Nov 2018
  67. 67. โ€œโ€ @hollyjallen,#QConSF Nov 2018
  68. 68. โ€œโ€ @hollyjallen,#QConSF Nov 2018
  69. 69. โ€œโ€ Investing in tech to make detection and remediation faster @hollyjallen,#QConSF Nov 2018
  70. 70. Reorg! Fall 2017 @hollyjallen,#QConSF Nov 2018 Operations is out Service Engineering is in
  71. 71. โ€œโ€ How can Slack ensure that developers know when there's a problem? @hollyjallen,#QConSF Nov 2018
  72. 72. โ€œโ€ Centralized Operations Service Ownership @hollyjallen,#QConSF Nov 2018
  73. 73. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  74. 74. โ€œโ€ "We are the toolsmith and specialists. We empower Service Ownership" @hollyjallen,#QConSF Nov 2018
  75. 75. Devs Features Reliability Performance Postmortems @hollyjallen,#QConSF Nov 2018 Service Cloud Platform Observability tools Service Discovery Define best practice
  76. 76. I joined Slack in February 2018 @hollyjallen,#QConSF Nov 2018 ๐Ÿ‘‹
  77. 77. โ€œโ€ How to empower development teams to improve service reliability? @hollyjallen,#QConSF Nov 2018
  78. 78. Define service health and operational maturity @hollyjallen,#QConSF Nov 2018 โ€ข At least one alerting health metric, like latency or throughput
  79. 79. โ€œโ€ Send metrics to Prometheus Observability team is here to help! ๐Ÿ”ฎ @hollyjallen,#QConSF Nov 2018
  80. 80. Define service health and operational maturity @hollyjallen,#QConSF Nov 2018 โ€ข Team should be on-call ready โ€ข At least 4, preferably 6 engineers participating to make it sustainable โ€ข 24/7 or during the weekday, depending on the service
  81. 81. Define service health and operational maturity @hollyjallen,#QConSF Nov 2018 โ€ข Runbooks for standard actions and troubleshooting โ€ข Central location in our code repository โ€ข Up to date and useable by any engineer
  82. 82. Define service health and operational maturity @hollyjallen,#QConSF Nov 2018 โ€ข Paging alerts should link to the runbook โ€ข Make responding to an page easy โ€ข Practice incident response
  83. 83. โ€œโ€ Incident Lunch โ›‘ @hollyjallen,#QConSF Nov 2018
  84. 84. Site Reliability Engineers @hollyjallen,#QConSF Nov 2018 โ€ข Devops generalists โ€ข Emotional intelligence โ€ข Mentoring โ€ข Ambassadors โ€ข Operational maturity
  85. 85. โ€œโ€ SRE embedded in dev teams @hollyjallen,#QConSF Nov 2018
  86. 86. โ€œโ€ @hollyjallen,#QConSF Nov 2018 Devs OpsSRE
  87. 87. Devs Um, where are the SREs? @hollyjallen,#QConSF Nov 2018 SREs I'm over here doing operational tasks
  88. 88. โ€œโ€ SRE Ops is still getting the first pages @hollyjallen,#QConSF Nov 2018
  89. 89. โ€œโ€ How do we lower operational burden on the SREs? @hollyjallen,#QConSF Nov 2018
  90. 90. โ€œโ€ Plan: Send paging alerts to the development teams @hollyjallen,#QConSF Nov 2018
  91. 91. Devs We need training @hollyjallen,#QConSF Nov 2018 SREs We're going to plan this out perfectly
  92. 92. @hollyjallen,#QConSF Nov 2018
  93. 93. โ€œโ€ Host level alerts Hundreds of them @hollyjallen,#QConSF Nov 2018
  94. 94. โ€œโ€ Test with the users @hollyjallen,#QConSF Nov 2018
  95. 95. @hollyjallen,#QConSF Nov 2018
  96. 96. Everything was fine! @hollyjallen,#QConSF Nov 2018 ๐Ÿ’ช
  97. 97. โ€œโ€ Empowered Continuous Improvement @hollyjallen,#QConSF Nov 2018
  98. 98. โ€œโ€ @hollyjallen,#QConSF Nov 2018 Devs OpsSRE
  99. 99. โ€œโ€ How do we test our understanding of how Slack will fail? @hollyjallen,#QConSF Nov 2018
  100. 100. โ€œโ€ "Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail." @hollyjallen,#QConSF Nov 2018
  101. 101. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  102. 102. Success Metrics @hollyjallen,#QConSF Nov 2018 โ€ข Increased engineer confidence โ€ข Validate reliability improvements โ€ข Learn something new โ€ข Practice incident response
  103. 103. โ€œโ€ Centralized Operations Service Ownership @hollyjallen,#QConSF Nov 2018
  104. 104. โ€œโ€ How do we ensure the teams are being alerted, instead of skillsets? @hollyjallen,#QConSF Nov 2018
  105. 105. โ€œโ€ How do we make postmortems a place for learning again? @hollyjallen,#QConSF Nov 2018
  106. 106. โ€œโ€ How do we make sure a capable incident commander is available for all incidents? @hollyjallen,#QConSF Nov 2018
  107. 107. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  108. 108. @hollyjallen,#QConSF Nov 2018
  109. 109. @hollyjallen,#QConSF Nov 2018
  110. 110. โ€œโ€ Copy the questions Not the answers @hollyjallen,#QConSF Nov 2018
  111. 111. โ€œโ€ Change is possible You don't have to be ready @hollyjallen,#QConSF Nov 2018
  112. 112. โ€œโ€ Empowered Continuous Improvement @hollyjallen,#QConSF Nov 2018
  113. 113. @hollyjallen,#QConSF Nov 2018 MeasureDesign Learn
  114. 114. โ€œโ€ Learn Faster @hollyjallen,#QConSF Nov 2018
  115. 115. Holly Allen Service Engineering @hollyjallen Thank You
  116. 116. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ slack-devops

ร—