All Day DevOps: Calling Out A Terrible On-Call System

Back when our team was small, all the devs participated in a single on-call rotation. As our team started to grow, that single rotation became problematic. Eventually, the team was so big that people were going on-call every 2-3 months. This may seem like a dream come true, but in reality, it was far from it. Because shifts were so infrequent, devs did not get the on-call experience they needed to know how to handle on-call issues confidently. Morale began to suffer and on-call became something everyone dreaded.

We knew the system had to change if we wanted to continue growing and not lose our developer talent, but the question was how? Despite all of the developers working across a single application with no clearly defined lines of ownership, we devised a plan that broke our single rotation into 3 separate rotations. This allowed teams to take on-call ownership over smaller pieces of the application while still working across all of it. These individual rotations paid off in many different ways.

With a new sense of on-call ownership, the dev teams began improving alerting and monitoring for their respective systems. The improved alerting led to faster incident response because the monitoring was better and each team was more focused on a smaller piece of the system. In addition, having 3 devs on-call at once meant no one ever felt alone, because there were always 2 other people on-call with them. Finally, cross-team communication and awareness also drastically improved with the new system.

  1. 1. TRACK: SITE RELIABILITY ENGINEERING NOVEMBER 12, 2020 Molly Struve Calling Out a Terrible On-Call System
  2. 2. @molly_struve Calling Out a Terrible On-Call System Hi! My name is Molly Struve and before I get started I want to point out that my Twitter handle is in the lower right hand corner of all of the slides. Molly underscore Struve. I have already tweeted out this slide deck so if you would like to follow along…
  3. 3. @molly_struve Calling Out a Terrible On-Call System head over there and click on the link. Welcome to Calling Out a Terrible On-Call System! I am the Lead Site Reliability Engineer for the community software provider Forem which is what the technical blogging platform dev.to is built on. Being a…
  4. 4. @molly_struve Site Reliability Engineer Site Reliability Engineer means I am one of those weird people that thrives on being on-call. The adrenaline rush of having to figure out a bug as quickly as possible really gets me going. But I’m pretty positive the vast majority of engineers are not like myself. Raise your hand if you…
  5. 5. @molly_struve 5 hate being on-call or in the past have had a horrible on-call experience? I can’t see any of you through my computer, but I am sure the majority of you have your hands up. On-call is a necessity to support the applications we build but…
  6. 6. @molly_struve 6 it SHOULD NOT, I repeat, it should NOT make people miserable. If your engineers are miserable during on-call then you have a problem. I am here today to give you some suggestions and strategies you can use to help you fix this common problem. All of these strategies….
  7. 7. @molly_struve 7 I am about to share unfortunately didn’t just hit me while I was sleeping one night. To figure all of this out I had to live through one of those terrible on-call systems and that experience showed me first hand the toll a broken system can take on all of those involved. Here is the story of a terrible on-call system in the making!
  8. 8. @molly_struve 8 In the beginning… In the beginning, I was working on a small engineering team and everyone participated in a…
  9. 9. @molly_struve 9 👩💻 👩💻 👨💻 👩💻 👨💻 single on-call rotation. Every dev on the team was in the rotation and each dev would go on-call for…
  10. 10. @molly_struve 10 👩💻 👩💻 👨💻 👩💻 👨💻 1 week one week at a time. When we first started the rotation, the team had 5 devs on it and it worked great! Everyone was very experienced with the application and with being on call because everyone did it relatively often. However, as the years went by….
  11. 11. @molly_struve 11 👩💻 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👨💻 1 week The team started to grow. Despite the team growth we still stuck with this single rotation. Eventually…
  12. 12. @molly_struve 12 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 1 week the team got so big that people were going on-call once every 3-4 months. Being on-call once every few months may seem like a dream come true, but in reality, it’s far from it. This giant single rotation was making…
  13. 13. @molly_struve 13 😱 ☹ 😭 😣 😡 😱 😡😞 😖 😣 😭 😬 😬 ☹ 1 week All of the devs miserable for a variety of reasons. For starters this large, single rotation meant….
  14. 14. @molly_struve 14 😱 ☹ 😭 😣 😡 😱 😡😞 😖 😣 😭 😬 😬 ☹ Infrequent Shifts Infrequent on-call shifts. As I mentioned, devs were going on-call once every few months. Because on-call shifts were so infrequent, devs were not able to get the experience and practice they needed to know how to handle on-call issues effectively. In addition, the code base…
  15. 15. @molly_struve 15 😱 ☹ 😭 😣 😡 😱 😡😞 😖 😣 😭 😬 😬 ☹ Growing, complex codebase had grown tremendously and was vastly more complex than when we had started. There were so many things being developed at once that when a problem arose, there was a solid chance the on-call dev knew nothing about it or the code that was causing it. And what happens when an alarm goes off and you have no idea what to do…
  16. 16. @molly_struve 16 You panic! And who can blame you? We have all been there. When you have to fix something you know nothing about, it’s terrifying! When the devs would panic, they would turn to the people they knew could likely fix the problem the fastest and that was…
  17. 17. @molly_struve 17 Site Reliability Engineering Team The Site Reliability Engineering team. Of course, the devs were right in their assumption, usually the Site Reliability team could fix the problem the fastest, but relying on a small set of people for everything doesn’t scale. Constantly having to jump in and help….
  18. 18. @molly_struve 18 Site Reliability Engineering Team with on-call issues quickly began to drain a lot of the Site Reliability team's time and resources. Essentially, the team began to act as if they were on-call 24/7. The constant bombardment of questions and requests…
  19. 19. @molly_struve 19 Site Reliability Engineering Team 😩 😩 😩 Began to burn out the Site Reliability team. Besides having a burned-out and inefficiently used Site Reliability team, another problem…
  20. 20. @molly_struve 20 😱 ☹ 😭 😣 😡 😱 😡😞 😖 😣 😭 😬 😬 ☹ with this giant single on-call rotation was that developers felt like they had
  21. 21. @molly_struve 21 😱 ☹ 😭 😣 😡 😱 😡😞 😖 😣 😭 😬 😬 ☹ No Ownership no ownership over the code they were responsible for while on-call. One person would write code and another person would be the one debugging it if it broke. The app was so big that there was no way anyone could have a sense of ownership over the production code since there was way too much of it. This…
  22. 22. @molly_struve 22 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 1 week Giant, seemingly innocuous On-Call rotation might seem harmless enough but what it leads to is…
  23. 23. @molly_struve 23 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 Infrequent Shifts Infrequent On-Call Shifts which means less on-call experience and practice for developers. Lack of experience being on call leads to…
  24. 24. @molly_struve 24 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 Infrequent Shifts Panicked Devs Panicked Devs who have no clue how to handle issues when they arise. When those panicked and stressed out devs need constant help that leads to…
  25. 25. @molly_struve 25 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 Infrequent Shifts Panicked Devs Burned Out Site Reliability Team A burned out Site Reliability Team. No one in this entire on-call rotation situation was happy. To top it all off…
  26. 26. @molly_struve 26 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 No Ownership There was no feeling of ownership for anyone over the code they were supporting. As many of you know, lack of ownership leads to ambivalence at best. All of..
  27. 27. @molly_struve 27 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 Infrequent Shifts Panicked Devs Burned Out Site Reliability Team No Ownership These problems began adding up and eventually, it got …
  28. 28. @molly_struve 28 so bad that we knew something had to change. Now, before I tell you all about our solution I first want to briefly cover…
  29. 29. @molly_struve 29 Team Organization How the engineering team was organized at the time so you have some context about how we ended up with the solution we did. In the engineering department at the time there were…
  30. 30. @molly_struve 30 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 Team Organization 3 separate dev teams. Each team had 5-7 devs on it…
  31. 31. @molly_struve 31 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💼 👩💼 👨💼 Team Organization plus a manager. Each team had its own set of projects but all of the teams worked across one…
  32. 32. @molly_struve 32 One Monolithic Application 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💼 👩💼 👨💼 single, monolithic Rails application. Unlike other apps that might have very separate backend components owned by individual teams, there were no clear or obvious lines of ownership within…
  33. 33. @molly_struve 33 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💼 👩💼 👨💼 One Monolithic Application this single, monolithic Rails application. This would prove to be the biggest hurdle when it came to fixing this broken on-call system. Now that you have a little background about the team organization, let’s get to the good stuff…
  34. 34. @molly_struve 34 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 The Solution The solution to fixing this terrible and broken on-call system. First and foremost, we knew we had to break up…
  35. 35. @molly_struve 35 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 This giant single rotation if we wanted to continue growing. Despite all of these developers working….
  36. 36. @molly_struve 36 👩💻 👨💻 👩💻 👨💻 👨💻 👩💻 👩💻👩💻 👨💻 👨💻 👩💻 👩💻 👩💻 👩💻 One Monolithic Application across one monolithic application, we decided to break the single rotation into 3, one rotation…
  37. 37. @molly_struve 37 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💼 👩💼 👨💼 for each of the 3 dev teams. Having 3 small rotations led..
  38. 38. @molly_struve 38 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💼 👩💼 👨💼 More Frequent Shifts to more frequent on-call shifts, which meant more practice and experience for those handling on-call. As backward as it may sound, being on-call on a regular cadence is a benefit because devs become…
  39. 39. @molly_struve 39 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💻 👨💻 👩💻 👩💻 👨💻 👨💼 👩💼 👨💼 More Frequent Shifts More Comfortable a lot more comfortable with it and are able to really figure out a strategy that works best for them. So the first strategy we implemented…
  40. 40. @molly_struve Overhauling On-Call 40 1 2 3 4 5 6 when overhauling this on-call system was…
  41. 41. @molly_struve Overhauling On-Call 41 1 2 3 4 3 Smaller On-Call Rotations 5 6 to split our giant rotation into 3 smaller on-call rotations. Those 3 On-Call rotations solved the problem of shift frequency but that still left the biggest problem of all…
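To put rough numbers on that shift-frequency change, here is a minimal sketch in plain Ruby (the app in the talk is a Rails monolith, so Ruby fits; the rotation sizes and the 1-week shift length are the approximate figures from the slides):

    # Rough on-call frequency math. The rotation sizes and the 1-week shift
    # length are the approximate numbers from the talk; plug in your own.
    SHIFT_LENGTH_WEEKS = 1

    def weeks_between_shifts(rotation_size)
      rotation_size * SHIFT_LENGTH_WEEKS
    end

    puts "One rotation of 15 devs: a shift every #{weeks_between_shifts(15)} weeks (roughly every 3-4 months)"
    puts "Three rotations of 5:    a shift every #{weeks_between_shifts(5)} weeks (roughly once a month)"

The point of the split is that middle ground: shifts frequent enough that devs build real on-call experience, without putting any one person on-call constantly.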
  42. 42. @molly_struve 42 Application Ownership Application ownership. Who supports what? How do you define what team is going to cover what code? When it all boils down to it, no one..
  43. 43. @molly_struve 43 wants to support something they don't feel like they own. To accomplish this, we chose to split up the on-call…
  44. 44. @molly_struve 44 Application Ownership Team 1 Team 2 Team 3 application ownership amongst the 3 dev teams. Even though I am about to breeze through this split I want to be clear, this did not happen overnight. During this process there were a lot of meetings and planning and collaborating …
  45. 45. @molly_struve 45 Team 1 Team 2 Team 3 Site Reliability Between the Site Reliability team and the dev teams to figure out the best and most logical way to split up the components of our monolithic application. I really want to highlight that this was not the Site Reliability team calling the shots and handing over the “assignments” to the dev teams. We wanted this whole process of…
  46. 46. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Splitting up the application ownership to be as collaborative as possible because we knew that was going to give us the highest chance of succeeding. We first started by splitting up our…
  47. 47. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Background workers. Team 1
  48. 48. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Data Processing Workers got the Data processing workers. Team 2…
  49. 49. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Data Processing Workers Overnight Reporting Workers Got the Overnight reporting workers and finally Team 3…
  50. 50. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Data Processing Workers Overnight Reporting Workers User Communication Workers Got the Client communication workers. The next thing we needed to split up were our…
  51. 51. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Data Processing Workers Overnight Reporting Workers User Communication Workers Service Alerts Service Alerts. When I say service alerts here I am referring to alerts that were set up within the monitoring system to monitor things like databases, infrastructure, and code performance. Before it was a single person staying on top of all of these different alerts. With this new system we decided to split them up as well. We gave…
  52. 52. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Data Processing Workers Overnight Reporting Workers User Communication Workers Service Alerts Redis and Worker Queue Alerts Team 1 the Redis and Worker queue alerts. We gave…
  53. 53. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Data Processing Workers Overnight Reporting Workers User Communication Workers Service Alerts Redis and Worker Queue Alerts Elasticsearch and API Alerts Team 2 the Elasticsearch and API Alerts. And finally we gave…
  54. 54. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Data Processing Workers Overnight Reporting Workers User Communication Workers Service Alerts Elasticsearch and API Alerts Redis and Worker Queue Alerts MySQL and Page Load Alerts Team 3 the MySQL and Page load alerts. Now that existing service alerts and our background workers were split up, the last thing to split up was the
  55. 55. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Service Alerts Application Code Data Processing Workers Overnight Reporting Workers User Communication Workers Redis and Worker Queue Alerts Elasticsearch and API Alerts MySQL and Page Load Alerts Application components/Code. We were running a Rails application so this involved splitting up things like Models and Controllers within the codebase. We started by giving…
  56. 56. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Service Alerts Application Code Data Processing Workers Overnight Reporting Workers User Communication Workers Redis and Worker Queue Alerts Elasticsearch and API Alerts MySQL and Page Load Alerts Data Processing Code All the Data processing Code to Team 1. We figured this would pair well with the background workers they were also assigned. We gave…
  57. 57. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Service Alerts Application Code Data Processing Workers Overnight Reporting Workers User Communication Workers Redis and Worker Queue Alerts Elasticsearch and API Alerts MySQL and Page Load Alerts Data Processing Code Reporting and Emailing Code Team 2 the emailing and reporting code which paired well with their overnight workers. And finally we gave..
  58. 58. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Service Alerts Application Code Data Processing Workers Overnight Reporting Workers User Communication Workers Redis and Worker Queue Alerts Elasticsearch and API Alerts MySQL and Page Load Alerts User and App Alert Code Data Processing Code Reporting and Emailing Code Team 3 the User and in App Alerting Code which paired well with their user communication workers. Once..
  59. 59. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Service Alerts Application Code Data Processing Workers Overnight Reporting Workers User Communication Workers Redis and Worker Queue Alerts Elasticsearch and API Alerts MySQL and Page Load Alerts User and App Alert Code Data Processing Code Reporting and Emailing Code the lines had been drawn, we stressed to each of the dev teams that despite doing our best to balance the code equally we might still have to move things around. This showed the devs that we were fully invested in making sure this new on-call rotation was …
  60. 60. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Service Alerts Application Code Data Processing Workers Overnight Reporting Workers User Communication Workers Redis and Worker Queue Alerts Elasticsearch and API Alerts MySQL and Page Load Alerts User and App Alert Code Data Processing Code Reporting and Emailing Code fair and better for everyone. I know I got in the weeds a bit here by breaking this all down but I wanted to get a little specific with how we split up our application so that hopefully it can give you…
  61. 61. @molly_struve Splitting App Ownership Team 1 Team 2 Team 3 Background Workers Service Alerts Application Code Data Processing Workers Overnight Reporting Workers User Communication Workers Redis and Worker Queue Alerts Elasticsearch and API Alerts MySQL and Page Load Alerts User and App Alert Code Data Processing Code Reporting and Emailing Code some ideas about how you might go about splitting up ownership in a single application where lines might not be clearly drawn. And with that…
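Pulled together in one place, the split described above looks roughly like this. The Ruby hash below is just one illustrative way to write the mapping down; the team names and groupings are the ones from the slides, and where you actually record the mapping (a wiki page, a config file, or something else) is up to you:

    # On-call ownership map, summarizing the split from the slides.
    ON_CALL_OWNERSHIP = {
      "Team 1" => {
        workers: "Data processing workers",
        alerts:  "Redis and worker queue alerts",
        code:    "Data processing code"
      },
      "Team 2" => {
        workers: "Overnight reporting workers",
        alerts:  "Elasticsearch and API alerts",
        code:    "Reporting and emailing code"
      },
      "Team 3" => {
        workers: "User communication workers",
        alerts:  "MySQL and page load alerts",
        code:    "User and in-app alert code"
      }
    }.freeze

    ON_CALL_OWNERSHIP["Team 3"][:alerts] # => "MySQL and page load alerts"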
  62. 62. @molly_struve Overhauling On-Call 62 1 2 3 4 3 Smaller On-Call Rotations Split Up Application Ownership 5 6 Splitting up the application ownership slides into spot 2 in our overhauling on-call list. Now when it comes to instilling a feeling of ownership another big obstacle is constantly…
  63. 63. @molly_struve 63 Changing Code Changing code. Having 15 devs meant we could turn out a lot of features, but then the question became: how would teams stay on top of the code they were responsible for while on-call, and the changes being made to it? For this we…
  64. 64. @molly_struve 64 CODEOWNERS took advantage of GitHub’s CODEOWNERS file. The CODEOWNERS file lives in…
  65. 65. @molly_struve 65 .github/CODEOWNERS the .github directory of your application. This file allows you to specify who or what teams in your organization own a file. Here is an example of…
  66. 66. @molly_struve 66 /*.md @org/team-1 /app/controllers/reporting/ @org/team-2 /app/workers/data_processing/ @org/team-1 /config/database.yml @org/team-3 CODEOWNERS a CODEOWNERS file. As you can see, you can assign…
  67. 67. @molly_struve 67 /*.md @org/team-1 /app/controllers/reporting/ @org/team-2 /app/workers/data_processing/ @org/team-1 /config/database.yml @org/team-3 CODEOWNERS types of files to a team or person. You can assign…
  68. 68. @molly_struve 68 /*.md @org/team-1 /app/controllers/reporting/ @org/team-2 /app/workers/data_processing/ @org/team-1 /config/database.yml @org/team-3 CODEOWNERS entire directories to a team or person. Or you can assign…
  69. 69. @molly_struve 69 /*.md @org/team-1 /app/controllers/reporting/ @org/team-2 /app/workers/data_processing/ @org/team-1 /config/database.yml @org/team-3 CODEOWNERS just a single file to a team or person. Once you have..
  70. 70. @molly_struve 70 /*.md @org/team-1 /app/controllers/reporting/ @org/team-2 /app/workers/data_processing/ @org/team-1 /config/database.yml @org/team-3 CODEOWNERS this file in place, when any file in your app directory is updated in a Pull Request, the owner of the file can automatically be tagged for review. This allowed the 3 teams to work across the entire codebase while also staying on top of what was changing in the components they were responsible for during On-Call. Using…
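For reference, here is the example from the slides written out as it would appear in the .github/CODEOWNERS file. The @org/team-N handles are placeholders for your own GitHub organization and team names:

    # .github/CODEOWNERS
    # Each line is a file pattern followed by one or more owners;
    # the last matching pattern takes precedence.

    # A file-type pattern:
    /*.md                          @org/team-1

    # Entire directories:
    /app/controllers/reporting/    @org/team-2
    /app/workers/data_processing/  @org/team-1

    # A single file:
    /config/database.yml           @org/team-3

When a pull request touches any of these paths, GitHub automatically requests a review from the matching team, which is what let each team keep working across the whole codebase while still seeing every change to the pieces they covered on-call.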
  71. 71. @molly_struve Overhauling On-Call 71 1 2 3 4 3 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file 6 A CODEOWNERS file slips into the 3rd spot in our overhauling on-call strategy list. With the application components split up and a CODEOWNERS file to support and empower that ownership feeling, next…
  72. 72. @molly_struve Overhauling On-Call 72 1 2 3 4 3 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file 6 on our list was to make sure every team and every single person on each team was completely comfortable with the application components they had been given ownership over. To do this the SRE team…
  73. 73. @molly_struve 73 On-Call Training Sessions Hosted mini on-call training sessions. During these sessions we sat down…
  74. 74. @molly_struve 74 👨💻 👨💻 👩💻 👩💻 👨💻👩💼 👨💻 👨💻 👩💻 👩💻 👨💻👨💼 👨💻 👨💻 👩💻 👩💻 👨💼 👨💻 with each dev team to thoroughly review the…
  75. 75. @molly_struve 75 👨💻 👨💻 👩💻 👩💻 👨💻👩💼 👨💻 👨💻 👩💻 👩💻 👨💻👨💼 Code ✅ Workers ✅ Alerts ✅ 👨💻 👨💻 👩💻 👩💻 👨💼 👨💻 code, workers, and alerts they were responsible for covering during on-call. During these…
  76. 76. @molly_struve On-Call Training Sessions on-call training sessions we went over things like…
  77. 77. @molly_struve On-Call Training Sessions • Common issues common issues that might pop up. For example, when this alert goes off usually it means xyz is broken with this piece of the code. We also took the time to…
  78. 78. @molly_struve On-Call Training Sessions • Common issues • Code Functionality Dive into all of the code functionality. We made sure every person on every team knew exactly what each piece of code they covered did. And last but not least, we made sure each team understood…
  79. 79. @molly_struve On-Call Training Sessions • Common issues • Code Functionality • Larger Application Impact How their components impacted the rest of the application. For example, if say Redis went down, how did that affect the rest of the application as a whole. These On-Call training sessions gave devs…
  80. 80. @molly_struve 80 Confidence a lot more confidence in their ability to handle on-call situations because they now had a clear picture of what they were responsible for and how to handle it. Even though they hadn’t built some of the code themselves, they had an understanding of exactly how it all worked. Hosting…
  81. 81. @molly_struve Overhauling On-Call 81 1 2 3 4 3 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file On-Call Training Sessions 6 On-call training sessions takes the 4th spot in our overhauling on-call list. As I mentioned earlier, the purpose of these training sessions was to not only educate the devs about the code they were supporting, but also to give them confidence. Another confidence booster for devs who are on-call is…
  82. 82. @molly_struve 82 On-Call Support On-call support. What exactly do I mean by this? When a person is paged they aren’t always going to have all of the answers. Sometimes they need help and support from someone else to figure out the problem. Originally…
  83. 83. @molly_struve 83 On-Call Support Site Reliability the Site Reliability team acted as support for the on-call dev. If the on-call dev had questions or needed help they would talk to the Site Reliability team member that was on-call that week. The problem with this approach was that our Site Reliability team only…
  84. 84. @molly_struve 84 On-Call Support 🤓🤓 🤓 Had 3 people on it at the time and when you have 3 people trying to support 15+, it’s…
  85. 85. @molly_struve 85 On-Call Support 😥😕 😫 It’s not going to end well or be scalable! Our Site Reliability team got burned out pretty quick being the constant support system for the on-call devs. Not to mention, it cut into our time to do our own projects. With…
  86. 86. @molly_struve 86 👨💻 👨💻 👩💻 👩💻 👨💻👩💼 👨💻 👨💻 👩💻 👩💻 👨💻👨💼 👨💻 👨💻 👩💻 👩💻 👨💼 👨💻 the new system, each…
  87. 87. @molly_struve 87 👨💻 👨💻 👩💻 👩💻 👨💻👩💼 👨💻 👨💻 👩💻 👩💻 👨💻👨💼 👨💻 👨💻 👩💻 👩💻 👨💼 👨💻 dev that is on-call..
  88. 88. @molly_struve 88 👨💻 👨💻 👩💻 👩💻 👨💻👩💼 👨💻 👨💻 👩💻 👩💻 👨💻👨💼 👨💻 👨💻 👩💻 👩💻 👨💼 👨💻 acts as support for the others. If anyone finds themselves overwhelmed or stuck on an issue, they have two people they can reach out to for help. Having a support system like this is crucial for crafting an on-call system that is comfortable for everyone. No one wants to feel alone when they are on-call, so ensuring that…
  89. 89. @molly_struve Overhauling On-Call 89 1 2 3 4 3 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file On-Call Training Sessions On-Call Support System 6 On-Call has a solid support system in place is crucial. The last improvement we made to our system that was welcomed by everyone was that we…
  90. 90. @molly_struve 90 Focused On-Call Responsibilities Focused the responsibilities for the On-Call devs. With our…
  91. 91. @molly_struve The old system… 91 😦 old system the lone on-call dev was responsible for….
  92. 92. @molly_struve The old system… 92 😦 Technically debugging and fixing the problem Technically debugging and fixing the problem which we know is a very big task in itself. In addition, they were responsible…
  93. 93. @molly_struve The old system… 93 😦 Technically debugging and fixing the problem Setting a status page if needed For Setting a status page if needed. And last, it was their job to handle…
  94. 94. @molly_struve The old system… 94 😦 Technically debugging and fixing the problem Communicating the problem to the rest of the team Setting a status page if needed communicating the problem to the rest of the team. Needless to say the duties of the on-call dev were WAY overloaded. With the new system…
  95. 95. @molly_struve The new system… 95 😦 Technically debugging and fixing the problem Communicating the problem to the rest of the team Setting a status page if needed the ONLY responsibility an on-call dev had was debugging and fixing the problem. Narrowing the scope was crucial to…
  96. 96. @molly_struve The new system… 96 😊 Technically debugging and fixing the problem Communicating the problem to the rest of the team Setting a status page if needed improving the on-call experience. It allowed the devs to focus on what they did best, fixing the technical problem at hand. The responsibility of setting the status page was moved…
  97. 97. @molly_struve The new system… 97 😊 Technically debugging and fixing the problem Setting a status page if we need it Communicating the problem to the rest of the team to the support team. This made sense to us because the support team is the closest to the customer and, therefore, is the best equipped to communicate any problems. When an incident occurred, the support team was notified and was responsible for determining if a status page or any customer communication was needed. The responsibility of…
  98. 98. @molly_struve The new system… 98 😊 Technically debugging and fixing the problem Setting a status page if we need it Communicating the problem to the rest of the team Communicating the problem internally was then moved to the manager of the on-call dev’s team. If updates needed to be spread internally across the tech organization during an incident, the on-call dev’s manager was responsible for doing it. Narrowing the scope of the on-call responsibilities was…
  99. 99. @molly_struve Overhauling On-Call 99 1 2 3 4 3 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file On-Call Training Sessions On-Call Support System 6 Narrowing On-Call Responsibility Scope The last piece of the puzzle when it came to overhauling this terrible on-call system. I am sure many of you are thinking, “That sounds great, but what does an on-call system like that get me? How can my team.. .
  100. 100. @molly_struve Overhauling On-Call 100 1 2 3 4 3 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file On-Call Training Sessions On-Call Support System 6 Narrowing On-Call Responsibility Scope benefit from implementing some of these strategies?!” Earlier, I touched lightly on some of the benefits, but now I want to really drive them home and talk about…
  101. 101. @molly_struve The Payoff the Payoff of overhauling this terrible on-call system. The first big payoff was…
  102. 102. @molly_struve 102 Improved Alerting Improved alerting. Originally the Site Reliability team had set up all the alerting and monitoring tools. However, once we split up the alerts and handed them over to each of the 3 dev teams, the dev teams took them and ran. Because each team felt a renewed sense of ownership over their alerts they started to improve and build on them. Not only did they make more alerts, …
  103. 103. @molly_struve 103 Improved Alerting but they improved the accuracy of the existing ones. The improved alerts, in turn, led to happier on-call developers because there were fewer false positives and alerts were tweaked to flag problems sooner, before they became bigger issues. Improved alerting…
  104. 104. @molly_struve 1 2 3 4 Improved Alerting 5 104 The Payoff Wasn’t the only payoff we saw after overhauling the system. As I briefly mentioned, there was…
  105. 105. @molly_struve 105 Sense of Ownership A renewed sense of ownership among all of the devs. Even though one team would edit the code that another team supported, there was still a keen sense of ownership for the supporting team. The supporting team acted as the domain experts over their section of code. The key strategy for ensuring this sense of ownership was…
  106. 106. @molly_struve 106 CODEOWNERS using the CODEOWNERS file. The CODEOWNERS file ensured that the supporting team was always aware and could sign off on any changes made to the code they supported. In addition, splitting up the code between the 3 teams meant each team had…
  107. 107. @molly_struve 107 Manageable Code Chunks A manageable chunk of code that they could actually learn and support, unlike before where every dev had to support the entire codebase which was way too much for any single person to handle. Shrinking…
  108. 108. @molly_struve 108 Manageable Code Chunks down the code that each dev was responsible for along with keeping them updated on any changes to that code gave devs that…
  109. 109. @molly_struve 1 2 3 4 Improved Alerting 5 Sense Of Ownership 109 The Payoff Sense of ownership again, and that sense of ownership made them excited to support their code. Another benefit of the new system was…
  110. 110. @molly_struve 110 Faster Incident Response Faster incident response time. Incident response time improved for a couple of reasons. For one, with 3 devs on-call at once and each one of them focusing on a smaller piece of the application, they could…
  111. 111. @molly_struve 111 Identify Problems Faster Identify problems faster. As I also mentioned before, each team took time to improve their own alerts so that the alerts would notify them of problems before they turned into…
  112. 112. @molly_struve 112 Identify Problems Faster major issues. This decreased incident response times and even helped prevent some incidents altogether. In addition to identifying problems sooner, debugging and figuring out the root cause of problems…
  113. 113. @molly_struve 113 Debugging is Quicker became quicker as well because teams were intimately familiar with their alerts and the pieces of code they owned. When a problem arose, each team could debug it much more efficiently than before. Faster…
  114. 114. @molly_struve 1 2 3 4 Improved Alerting 5 Sense Of Ownership Faster Incident Response 114 The Payoff incident response is always the goal of any Site Reliability team and to be able to achieve it with a new on-call system was pretty awesome. Another payoff of this new system was that the person who was on-call was…
  115. 115. @molly_struve 115 Never Alone Never alone. Having 3 devs on-call at once meant that none of the devs were ever alone when they were on-call. If things started to fall apart in one section of the application, the dev that owned that section knew there were two others available to…
  116. 116. @molly_struve 116 Never Alone help if they needed it. Being On-call can be stressful, but knowing that there is always someone easily accessible to help can do wonders for a dev’s confidence. Ensuring that no one is ever alone may seem like a small positive, but I want to add..
  117. 117. @molly_struve 117 Never Alone this was the most requested attribute of an on-call system from the devs. Before starting this overhaul process, I spoke with a few devs to get a feel for what they wanted out of the new system and at…
  118. 118. @molly_struve 118 Never Alone the very top of the list was having help and support when on-call. Don’t underestimate how much a multiple dev on-call system can improve the on-call experience for your devs. Devs…
  119. 119. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response 119 The Payoff Never being alone while on call takes the 4th spot in our list of benefits from overhauling this broken on-call system. The last benefit we discovered with the new system was…
  120. 120. @molly_struve 120 Better Cross-Team Communication better cross-team communication. As I stated before, each of the 3 dev teams worked across an entire single application. This meant teams were often changing the code that another team was responsible for during on-call. Having the CODEOWNERS file ensured that the on-call team was alerted to those changes. This..
  121. 121. @molly_struve 121 Better Cross-Team Communication not only allowed for a good technical review process but it kept each of the teams up to date on what the other teams were working on and doing. Team communication during incidents also improved. For example, let's say Team 3 got…
  122. 122. @molly_struve 122 High MySQL Load Alert a high MySQL load alert. A ton of the code touched the MySQL database so it was not a guarantee that Team 3 also owned the application code causing the problem. However, because Team 3 was experienced with the MySQL alert…
  123. 123. @molly_struve 123 High MySQL Load Alert they knew how to triage it quickly and efficiently. Once Team 3 had triaged it and identified the code that might be the issue, they would communicate and work with the on-call owner for that code to determine what the problem was. This additional…
  124. 124. @molly_struve 124 Better Cross-Team Communication team communication helped teams stay current with each other's work while also ensuring that code domain experts were able to review and help with code before it got pushed to production. And with that…
  125. 125. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response Better Cross-Team Communication 125 The Payoff The list of payoffs of overhauling this terrible on-call system is complete. At the top…
  126. 126. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response Better Cross-Team Communication 126 The Payoff Improved alerting. Any small Site Reliability team knows that any outside help you can get with your alerting and monitoring systems is hugely appreciated and benefits everyone. Devs got a renewed…
  127. 127. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response Better Cross-Team Communication 127 The Payoff sense of ownership making them more enthusiastic about their on-call responsibilities…
  128. 128. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response Better Cross-Team Communication 128 The Payoff Incident response got faster thanks to the improved alerting and in depth knowledge of each team over their on-call components. On-call devs were
  129. 129. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response Better Cross-Team Communication 129 The Payoff never alone when they were on call giving them peace of mind and confidence. And finally…
  130. 130. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response Better Cross-Team Communication 130 The Payoff cross-team communication improved which benefited the entire technical organization. I think we can agree that these..
  131. 131. @molly_struve 1 2 3 4 Improved Alerting Never Alone 5 Sense Of Ownership Faster Incident Response Better Cross-Team Communication 131 The Payoff Benefits would benefit all of our teams and to get them from an on-call system is huge. If these are the benefits that your team is looking for and your devs are struggling within your on-call system then consider overhauling it…
  132. 132. @molly_struve Overhauling On-Call 132 1 2 3 4 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file On-Call Training Sessions On-Call Support System 6 Narrow On-Call Responsibility Scope With the help of these 6 strategies. Smaller rotations to increase on-call frequency. Split up application ownership so devs can once again feel like they own what they are supporting. Use…
  133. 133. @molly_struve Overhauling On-Call 133 1 2 3 4 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file On-Call Training Sessions On-Call Support System 6 Narrow On-Call Responsibility Scope a CODEOWNERS file to further instill that sense of ownership for your devs. Host on-call training sessions for your teams and ensure that those who are on-call always have a support system. Finally…
  134. 134. @molly_struve Overhauling On-Call 134 1 2 3 4 Smaller On-Call Rotations Split Up Application Ownership 5 Use a CODEOWNERS file On-Call Training Sessions On-Call Support System 6 Narrow On-Call Responsibility Scope keep the responsibilities for your on-call devs in check so they can focus on what they do best, fixing technical problems. On-call…
  135. 135. @molly_struve On-Call Shouldn’t Suck is something that many people in this industry dread and it shouldn't be that way. If people are dreading on-call then something is broken with your system. Sure, everyone at some point will get that late night or weekend page that is a pain, but that pain…
  136. 136. @molly_struve On-Call Shouldn’t Suck shouldn't be the norm. If on-call makes people want to pull their hair out ALL the time, then you have a problem that you need to fix. I hope this talk has given you some ideas to help you improve your own on-call system so that it can benefit everyone. Thank you!
  137. 137. TRACK: SITE RELIABILITY ENGINEERING THANK YOU TO OUR SPONSORS
