All Day DevOps: Calling Out A Terrible On-Call System - Molly Struve
Back when our team was small, all the devs participated in a single on-call rotation. As our team started to grow, that single rotation became problematic. Eventually, the team was so big that people were going on-call every 2-3 months. This may seem like a dream come true, but in reality, it was far from it. Because shifts were so infrequent, devs did not get the on-call experience they needed to know how to handle on-call issues confidently. Morale began to suffer and on-call became something everyone dreaded.
We knew the system had to change if we wanted to continue growing and not lose our developer talent, but the question was how? Despite all of the developers working across a single application with no clearly defined lines of ownership, we devised a plan that broke our single rotation into 3 separate rotations. This allowed teams to take on-call ownership over smaller pieces of the application while still working across all of it. These individual rotations paid off in many different ways.
With a new sense of on-call ownership, the dev teams began improving alerting and monitoring for their respective systems. The improved alerting led to faster incident response because each team was focused on a smaller, better-monitored piece of the system. In addition, having 3 devs on-call at once meant no one ever felt alone, because there were always 2 other people on-call with you. Finally, cross-team communication and awareness also drastically improved with the new system.
Deja vu Security CEO Adam Cecchetti was invited to present the keynote speech at this year's (sold-out!) Hushcon in Seattle. Rich in humorous anecdotes and practical analysis, Test For Echo explores the relationship between time, ken, and the future of computer security.
'10 Great but now Overlooked Tools' by Graham Thomas - TEST Huddle
The idea for this presentation came directly from EuroSTAR 2011. On the bus back to the conference centre after the Gala Dinner, a discussion started about industry luminaries who turn up at conferences and give presentations which roughly say, "Don't do all the stuff that I told you to do 5 years ago! Do this stuff now." But, but, but . . . .
As we got talking I realised how many simple, effective tools I no longer used, either because they had been overlooked and forgotten, falling into disuse, or because modern methods claim not to need them and so deem them redundant. I wondered if any of them were worth looking at again. Starting with my trusty flowcharting template, I realised it is a great tool which I have overlooked for far too long!
Here is my list of 10 great but now overlooked tools:
• Flowcharts
• Prototypes
• Project Plans
• Mind Maps
• Tools we already have at our disposal like ....
• Aptitude Tests
• Hexadecimal Calculators
• Desk Checking
• Data Dictionaries and Workbenches
This is my list of really useful tools that I think are overlooked. In the webinar I will outline each tool, why I think it was great, and what we are missing out on by not using it.
And it naturally follows that if there are some tools we have overlooked, then there are also some tools that we should get rid of! I will identify some.
Hopefully this webinar will give you a different perspective on the tools you use for testing, point out some tools that may be improved upon or plainly discarded, and help you think about the tools you currently use and perhaps view them in a different light.
Future of software development - Danger of Oversimplification - Jon Ruby
A talk given at the Servoy World conference (https://servoy.com/servoyworld2017/) on some perspectives for the future of the software development industry.
Graham Thomas - Software Testing Secrets We Dare Not Tell - EuroSTAR 2013 - TEST Huddle
EuroSTAR Software Testing Conference 2013 presentation on Software Testing Secrets We Dare Not Tell by Graham Thomas.
See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
A new hope for 2023? What developers must learn next - Steve Poole
Over the last ten years, we’ve seen cybercrime accelerate beyond all comprehension, and its impact on our society and economies has grown relentlessly. It’s taken a long time for the world to act, but finally, we’re coming together to resist this uniquely 21st-century evil.
At the heart of the resistance are developers. Whatever role you have, whatever programming language or software you use - the battle is at your door.
In this session, we’ll brief you on the state of the situation and what you can do to be more prepared: we’ll look at the bad guys and how they operate, examine recent legal and government responses and, most importantly, see how the software industry is working together to create the tools, frameworks and education needed to help us all become the developers we need to be.
Feedback loops between tooling and culture - Chris Winters
A discussion of how the tools technologists create impact culture, and how culture impacts those tools. Not really a standalone presentation, but hopefully useful.
Deja vu Security - Adam Cecchetti - Security is a Snapshot in Time - BSidesPDX ... - adamdeja
As the air gap between our daily lives and the Internet continues to shrink, the security of our personal data and devices grows in importance. We are facing the daily threat of putting 2000s-era computers bolted to toasters online while expecting them to defend against 2017-capable attackers. This talk will explore the continuing trend of IoT, discuss how we’ve been here before, and lay out strategies for keeping pace with attackers in the future, focusing on enumerating this risk, the challenges involved, and possible solutions.
First, we will examine the history of how we got here, and what it means to say “security is a snapshot in time.” We then introduce the idea of shared ken – the range of one’s knowledge or sight – and how it impacts security. Third, we discuss the influence of data as code, the meta game, and secrecy as a way of mastering impact and ken.
This talk will allow attendees to walk away with:
• A holistic view of the history of computer security and how it impacts them today
• The importance of extending the range of collective vision to reduce blind spots
• Practical advice for BSiders to grow their mindset and improve their impact
Adam is a founding partner and Chief Executive Officer at Deja vu Security. He is dedicated to leadership and relentless innovation in Deja’s products and services. Previously he has led teams conducting application and hardware penetration tests for Fortune 500 technology firms. Adam is a contributing author to multiple security books, benchmarks, tools, and DARPA research projects. Adam holds a degree in Computer Science and a Master’s from Carnegie Mellon University in Information Networking.
Whether you are a big, sprawling MNC or a sleek, sexy start-up, zombie software will quickly invade your product platform. This deck is meant to start a conversation on how our industry can fight the zombies.
Thierry de Pauw - Feature Branching considered Evil - Codemotion Milan 2018 - Codemotion
With DVCSs branch creation became very easy, but it comes at a certain cost. Long living branches break the flow of the software delivery process, impacting stability and throughput. The session explores why teams are using feature branches, what problems are introduced by using them and what techniques exist to avoid them altogether. It explores exactly what's evil about feature branches, which is not necessarily the problems they introduce - but rather, the real reasons why teams are using them. After the session, you'll understand a different branching strategy and how it relates to CI/CD.
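One technique commonly offered as the alternative to long-lived feature branches is keeping incomplete work on the mainline behind a feature flag. The abstract does not name the speaker's preferred strategy, so this is only a minimal illustrative sketch in plain Ruby; the `FeatureFlags` class, the `:new_pricing` flag, and `CheckoutService` are all hypothetical names.

```ruby
# Minimal feature-flag sketch: both code paths live on the mainline,
# so half-finished work can be merged continuously instead of sitting
# on a long-lived branch. All names here are hypothetical.
class FeatureFlags
  def initialize(enabled = {})
    @enabled = enabled
  end

  def enabled?(name)
    @enabled.fetch(name, false)
  end
end

class CheckoutService
  def initialize(flags)
    @flags = flags
  end

  # The flag decides at runtime which pricing path runs; flipping it
  # requires no merge and no redeploy of a divergent branch.
  def total(items)
    if @flags.enabled?(:new_pricing)
      items.sum { |i| (i[:price] * (1 - i.fetch(:discount, 0))).round(2) }
    else
      items.sum { |i| i[:price] }
    end
  end
end
```

The trade-off is that dead flags must be cleaned up once a path wins, but the integration pain of merging a months-old branch disappears.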
Agile is a 4 letter word - dev nexus 2020 - Jen Krieger
Based on a wide variety of surveys taken over recent years, many companies are transitioning to something that looks like Agile, whether they use that term or not. However, that transition doesn’t necessarily mean implementations have been done while respecting the Agile Manifesto and the principles behind it.
When going into the development of a software product, a common source of mistakes is underestimating the complexity that lies behind an idea, as well as the clutter created by the massive number of available technologies. This presentation explains a possible way to deal with such issues.
Engineers tend to start most of the technology startups. While this gives them an inherent advantage as far as engineering the product goes, it also tends to put them at a disadvantage when it comes to designing (non-technically) and commercializing the product.
This slide deck takes up the key concepts from product management (PdM) that apply to startup-mode products. This is not a case for having Product Managers on board; 80% of startups don’t need a dedicated PM.
Towards the end, it introduces the funky concept of Product Entropy.
Putting Devs On-Call: How to Empower Your Team - VictorOps
A main tenet of DevOps is bridging the gap between the Dev team and the Ops team. One way to accomplish this is to include devs in the on-call rotation. While this may sound difficult, it’s not impossible to do…as our guide demonstrates.
We profile four companies that have successfully transitioned their dev team to being on-call and their stories can provide examples for how you too can do it.
OSDC 2019 | Feature Branching considered Evil by Thierry de Pauw - NETWAYS
With DVCSs, branch creation became very easy, but it comes at a certain cost. Long living branches break the flow of the software delivery process, impacting stability and throughput. The session explores why teams are using feature branches, what problems are introduced by using them and what techniques exist to avoid them altogether. It explores exactly what’s evil about feature branches, which is not necessarily the problems they introduce – but rather, the real reasons why teams are using them. After the session, you’ll understand a different branching strategy and how it relates to CI/CD.
Slides from my DevOpsExpo London talk "From oops to NoOps".
They tell you in these conferences that DevOps is not about tools, but about culture. And they are partially right. I am going to tell you that it’s not only about culture or tools but also abstractions.
It is a lot about how you see software and its value. About our mental model of what software is: how it runs, evolves, and interacts with the other facets of an enterprise.
We used to view software as code. As a state of code. Now we think about software as change, as a flow. A dynamic system where people, machines, and processes interact continuously.
At Platform.sh we spend a bunch of time asking ourselves not “How do you build?” - or even “How do you build consistently?” - but rather “What does it mean to consistently build in a world where change is good?” A world that lets you push security fixes into production as soon as they’re available because you don’t want to be an Equifax but you do want stability.
In this presentation, I will go over what we think software is and why having the right ideas about software will help you get your culture right and your tooling aligned, as well as gain in productivity, and general happiness and well-being.
Similar to LeadDev NYC 2022: Calling Out a Terrible On-call System (20)
Elasticsearch 5 and Bust (RubyConf 2019) - Molly Struve
Breaking stuff is part of being a developer, but that never makes it any easier when it happens to you. The Elasticsearch outage of 2017 was the biggest outage our company has ever experienced. We drifted between full-blown downtime and degraded service for almost a week. However, it taught us a lot about how we can better prepare and handle upgrades in the future. It also bonded our team together and highlighted the important role teamwork and leadership plays in high-stress situations. The lessons learned are ones that we will not soon forget. In this talk, I will share those lessons and our story in hopes that others can learn from our experiences and be better prepared when they execute their next big upgrade.
Creating a Scalable Monitoring System That Everyone Will Love - ADDO - Molly Struve
A year ago, my company's monitoring setup was a disaster! We had 6 different monitoring tools sending alerts all over the place. In this talk, I will share how we overhauled our entire monitoring system and created a single, centralized, easy to use system that fits all of our needs. Not only does it fit our needs, but because it is so simple to use, developers have bought into the system and are actively helping to improve it as well.
Creating a Scalable Monitoring System That Everyone Will Love (Velocity Conf) - Molly Struve
A year ago, my company's monitoring setup was a disaster! We had 6 different monitoring tools sending alerts all over the place. In this talk, I will share how we overhauled our entire monitoring system and created a single, centralized, easy to use system that fits all of our needs. Not only does it fit our needs, but because it is so simple to use, developers have bought into the system and are actively helping to improve it as well.
Cache is King: Get the Most Bang for Your Buck From RubyMolly Struve
Sometimes your fastest queries can cause the most problems. I will take you beyond the slow query optimization and instead zero in on the performance impacts surrounding the quantity of your datastore hits. Using real world examples dealing with datastores such as Elasticsearch, MySQL, and Redis, I will demonstrate how many fast queries can wreak just as much havoc as a few big slow ones. With each example I will make use of the simple tools available in Ruby to decrease and eliminate the need for these fast and seemingly innocuous datastore hits.
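The abstract does not spell out which "simple tools available in Ruby" it means, but one standard pattern it alludes to is replacing many fast per-record datastore hits with a single bulk fetch plus a local hash cache. Here is a minimal sketch of that idea in plain Ruby; `ClientCache` and the `bulk_fetch` interface are hypothetical names, standing in for whatever Elasticsearch, MySQL, or Redis client you actually use.

```ruby
# Local caching sketch: N lookups cost at most 1 datastore round trip
# for the ids not yet seen, instead of N individual queries.
# ClientCache and bulk_fetch are hypothetical names.
class ClientCache
  def initialize(datastore)
    @datastore = datastore # must respond to #bulk_fetch(ids) => { id => record }
    @cache = {}
  end

  def fetch_all(ids)
    missing = ids.reject { |id| @cache.key?(id) }
    @cache.merge!(@datastore.bulk_fetch(missing)) unless missing.empty?
    ids.map { |id| @cache[id] }
  end
end
```

Each individual query here is already fast; the win comes purely from cutting the quantity of round trips, which is exactly the failure mode the talk describes.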
Everyone wants their Elasticsearch cluster to index and search faster, but optimizing both and finding the balance between the two can be tricky. At Kenna Security, we use Elasticsearch to store over 3 billion vulnerabilities for our clients. All that data needs to be quickly accessible so clients can assess their cyber security risk. At the same time the data is constantly changing. On average, we update 200+ million documents a day which means indexing speed is also a top priority.
In the early days our cluster could barely keep up. Nodes would fall over constantly, indexing queues would get backed up for days, and searches timed out about 50% of the time. Fixing all of these issues did not happen overnight. However, with a lot of testing, tweaking, and a few “OH crap!” moments we were able to build a stable, 21 node cluster that now meets all of our indexing and searching demands. In this talk I will share the insights we gained and the strategies we used to scale our cluster and hopefully that advice will save others some time and frustration as they grow their own.
Final project report on grocery store management system.pdf - Kamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner, and customers increasingly wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing the various products available, enables registered users to purchase desired products instantly using the Paytm and UPI payment processors (Instant Pay), and also lets them place orders using the Cash on Delivery (Pay Later) option. The project provides easy access for Administrators and Managers to view orders placed using the Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of technologies must be studied and understood. These include multi-tiered architecture, server- and client-side scripting techniques, implementation technologies, programming languages (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. The objective of this project is to develop a basic website where a consumer is provided with a shopping cart, and to learn about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Cosmetic shop management system project report.pdf - Kamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it is tough to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. The system includes various function programs to carry out the above-mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system deals with the automation of the general workflow and administration process of the shop. The main processes of the system focus on customer requests, where the system is able to search for the most appropriate products and deliver them to the customers. It helps the employees quickly identify the cosmetic products that have reached the minimum quantity, keeps track of the expiry date for each cosmetic product, and helps the employees find the rack number in which a product is placed. It is also a faster and more efficient way of working.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
LeadDev NYC 2022: Calling Out a Terrible On-call System
1. Calling Out a Terrible
On-Call System
April 6, 2022
Hi! My name is Molly Struve and I want to welcome you to Calling Out a Terrible On-Call System! Currently I am a Site Reliability Engineer at Netflix, but the story I want to share with you today is from my time at a previous startup. Being a…
2. Site Reliability
Engineer
@molly_struve
Site Reliability Engineer means I am one of those weird people who thrives on being on-call. The adrenaline rush of having to figure out a bug as quickly as possible really gets me going. But I’m pretty positive the vast majority of engineers are not like me. Raise your hand if you…
3. @molly_struve
hate being on-call or in the past have had a horrible on-call experience? Ah yes, MANY of you! On-call is a necessity to support the applications we build but…
4. @molly_struve
it SHOULD NOT, I repeat, it should NOT make people miserable. If your engineers are miserable during on-call, then you have a problem. I am here today to give you some suggestions and strategies you can use to help you fix this common problem. All of these strategies…
5. @molly_struve
I am about to share unfortunately didn’t just hit me while I was sleeping one night. To figure all of this out I had to live through one of those terrible on-call systems, and that experience showed me firsthand the toll a broken system can take on everyone involved. Here is the story of a terrible on-call system in the making!
8. @molly_struve
👩‍💻 👩‍💻 👨‍💻 👩‍💻 👨‍💻 — 1 week shifts
one week at a time. When we first started the rotation, the team had 5 developers on it and it worked great! Everyone was very experienced with the application and with being on-call because everyone did it relatively often. However, as the years went by…
13. @molly_struve
😱 ☹️ 😭 😣 😡 😞 😖 😬 — Growing, complex codebase
the codebase had grown tremendously and was vastly more complex than when we had started. There were so many things being developed at once that when a problem arose, there was a solid chance the on-call developer knew nothing about it or the code that was causing it. And what happens when an alarm goes off and you have no idea what to do…
14. @molly_struve
You panic! And who can blame you? We have all been there. When you have to fix something you know nothing about, it’s terrifying! When the developers would panic, they would turn to the people they knew could likely fix it the fastest, and that was…
15. @molly_struve
Site Reliability
Engineering Team
The Site Reliability Engineering team. Of course the devs were right in their assumption, usually the Site Reliability team could fix the problem the fastest, but the Site Reliability team only had 3 people on it and
relying on a small set of people for everything doesn’t scale. Constantly having to jump in and help…
16. @molly_struve
Site Reliability
Engineering Team
with on-call issues quickly began to drain a lot of the team's time and resources. Essentially, the SRE team began to act as if they were on-call 24/7. The constant bombardment of questions and requests…
19. @molly_struve
😱 ☹️ 😭 😣 😡 😞 😖 😬 — No Ownership
no ownership over the code they were responsible for while on-call. One person would write the code and another person would be the one debugging it if it broke. The app was so big that there was no way anyone could have a sense of ownership over the production code, since there was just too much of it. This…
27. @molly_struve
Team
Organization
How the engineering team was organized at the time so you have some context about how we ended up with the solution we did. In the engineering department at the time there were…
30. @molly_struve
One Monolithic
Application
👨‍💻 👩‍💻 ×15 (three teams of five) with managers 👨‍💼 👩‍💼 👨‍💼
single monolithic application. Unlike other apps that might have very separate backend components owned by individual teams, there were no clear or obvious lines of ownership within…
31. @molly_struve
One Monolithic Application
this single monolithic application. This would prove to be the biggest hurdle when it came to fixing this terrible on-call system. Now that you have a little background on the team organization, let’s get to the good stuff…
36. @molly_struve
More Frequent Shifts
to more frequent on-call shifts, which meant more practice and experience for those handling on-call. As backward as it may sound, being on-call on a regular cadence is a benefit because developers become…
37. @molly_struve
More Frequent Shifts → More Comfortable
a lot more comfortable with it and are able to really figure out a strategy that works best for them. So the first strategy we implemented…
39. Overhauling On-Call
1. Smaller On-Call Rotations
to split our giant rotation into 3 smaller on-call rotations. Those 3 smaller rotations solved the problem of shift frequency, but that still left the biggest problem of all…
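The shift-frequency arithmetic behind this split can be sketched in a few lines. The 15-developer and three-teams-of-five numbers come from the talk itself; the helper function is purely illustrative:

```python
# In a weekly round-robin rotation, a developer's gap between on-call
# shifts equals the number of people in the rotation.
def weeks_between_shifts(rotation_size: int) -> int:
    return rotation_size

# One giant rotation of 15 devs: each dev is on-call roughly every
# 15 weeks -- too infrequently to build any on-call muscle memory.
big_rotation_gap = weeks_between_shifts(15)
assert big_rotation_gap == 15

# Three rotations of 5 devs each: every dev is on-call every 5 weeks,
# with 3 devs (one per rotation) on-call at any given time.
small_rotation_gap = weeks_between_shifts(5)
assert small_rotation_gap == 5
```

The trade-off is explicit in the numbers: each developer is on-call three times as often, but over a third of the surface area, and with two peers on-call alongside them.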
42. @molly_struve
Application
Ownership
Team 1 Team 2
Team 3
to split up the on-call application ownership amongst the 3 developer teams. Even though I am about to breeze through this split, I want to be clear: this did not happen overnight. During this process there were a lot of meetings, planning, and collaborating…
43. @molly_struve
Team 1 Team 2
Team 3
Site
Reliability
Between the Site Reliability team and the developer teams to figure out the best and most logical way to split up the components of our monolithic application. I really want to highlight that this was not the Site
Reliability team calling the shots and handing over the “assignments” to the…
44. @molly_struve
Team 1 Team 2
Team 3
Site
Reliability
developer teams. We wanted this whole process to be as collaborative as possible because we knew that was going to give us the highest chance of succeeding. These application components may be specific for this
45. @molly_struve
Splitting App Ownership
Team 1 Team 2 Team 3
situation but I want to call them out in hopes that it might spark some ideas for how you could split up application ownership amongst multiple teams when clear lines might be hard to define. We first started by splitting up
our…
46. @molly_struve
Background Workers
background workers. Our app did a lot of async processing and had a lot of background workers, so we figured those would be good to divide up. Team 1…
47. @molly_struve
got the Data Processing Workers. Team 2…
48. @molly_struve
got the Overnight Reporting Workers, and finally Team 3…
49. @molly_struve
got the User Communication Workers. The next thing we needed to split up were our…
50. @molly_struve
Service Alerts
service alerts. When I say service alerts here, I am referring to alerts that were set up within our existing monitoring system to watch things like our databases and systems. Before, it was a single person staying on top of all of them. With this new system we decided to split them up as well. We gave…
51. @molly_struve
Team 1 the Redis and Worker Queue alerts. We gave…
52. @molly_struve
Team 2 the Elasticsearch and API alerts. And finally we gave…
53. @molly_struve
Team 3 the MySQL and Page Load alerts. Now that the existing service alerts and our background workers were split up, the last thing to split up was the…
54. @molly_struve
Application Code
application components and code. We were running a single monolithic Rails application, so this involved splitting up things like models and controllers within the codebase. We started by giving…
55. @molly_struve
all the Data Processing code to Team 1. We figured this would pair well with the background workers they were also assigned. We gave…
56. @molly_struve
Team 2 the Reporting and Emailing code, which paired well with their overnight workers. And finally we gave…
57. @molly_struve
Team 3 the User and in-app Alert code, which paired well with their user communication workers. Once…
58. @molly_struve
Splitting App Ownership
Team 1: Data Processing Workers | Redis and Worker Queue Alerts | Data Processing Code
Team 2: Overnight Reporting Workers | Elasticsearch and API Alerts | Reporting and Emailing Code
Team 3: User Communication Workers | MySQL and Page Load Alerts | User and App Alert Code
Once the lines had been drawn, we stressed to each of the developer teams that despite doing our best to balance the code equally, we might still have to move things around. This showed the developers that we were fully invested in making sure this new on-call rotation was fair and better for everyone. As I mentioned earlier, I wanted…
59. @molly_struve
to get a little specific here with how we split up our application, so that hopefully it can give you some ideas about how you might go about splitting up ownership in a single application where lines might not be clearly drawn. And with that…
60. Overhauling On-Call
1. Smaller On-Call Rotations
2. Split Up Application Ownership
Splitting up the application ownership slides into spot 2 in our overhauling on-call list. Now, when it comes to instilling a feeling of ownership, another big obstacle is constantly…
61. @molly_struve
Changing Code
changing code. Having 15 developers meant we could turn out a lot of features, but then the question became: how did teams stay on top of the code they were responsible for when on-call as it changed? For this we…
68. @molly_struve
CODEOWNERS
/*.md @org/team-1
/app/controllers/reporting/ @org/team-2
/app/workers/data_processing/ @org/team-1
/config/database.yml @org/team-3
With this file in place, when any file in your app directory is updated in a pull request, the owners of the file are automatically tagged for review. This allowed the 3 teams to work across the entire codebase while also staying on top of what was changing in the components they were responsible for during on-call. Using…
69. Overhauling On-Call
1. Smaller On-Call Rotations
2. Split Up Application Ownership
3. Use a CODEOWNERS File
A CODEOWNERS file slips into the 3rd spot in our overhauling on-call strategy list. With the application components split up and a CODEOWNERS file to support and empower that ownership feeling, next…
70. Overhauling On-Call
1. Smaller On-Call Rotations
2. Split Up Application Ownership
3. Use a CODEOWNERS File
on our list was to make sure every team, and every single person on each team, was completely comfortable with the application components they had been given ownership over. To do this the SRE team…
75. @molly_struve
On-Call Training Sessions
• Common issues
common issues that might pop up. For example: when this alert goes off, it usually means xyz is broken in this piece of the code. We also took the time to…
76. @molly_struve
On-Call Training Sessions
• Common issues
• Code Functionality
dive into all of the code functionality. We made sure every person on every team knew exactly what each piece of code they covered did. And last but not least, we made sure each team understood…
77. @molly_struve
On-Call Training Sessions
• Common issues
• Code Functionality
• Larger Application Impact
how their components impacted the rest of the application. For example, if Redis went down, how did that affect the rest of the application? These on-call training sessions gave devs…
78. @molly_struve
Confidence
a lot more confidence in their ability to handle on-call situations because they now had a clear picture of what they were responsible for and how to handle it. Even though they hadn’t built some of the code themselves,
they had an understanding of exactly how it all worked. Hosting…
79. Overhauling On-Call
1. Smaller On-Call Rotations
2. Split Up Application Ownership
3. Use a CODEOWNERS File
4. On-Call Training Sessions
On-call training sessions take the 4th spot in our overhauling on-call list. As I mentioned earlier, the purpose of these training sessions was not only to educate the developers about the code they were supporting, but also to give them confidence. Another confidence booster for developers who were on-call was…
80. @molly_struve
On-Call Support
Having on-call support. What exactly do I mean by this? When a person is paged they aren’t always going to have all of the answers. Sometimes they need help and support from someone else to figure out the
problem. Originally…
81. @molly_struve
On-Call Support
Site Reliability
the Site Reliability team acted as support for the on-call developer. If the on-call developer had questions or needed help they would talk to the Site Reliability team member that was on-call that week. The problem
with this approach was that our Site Reliability team, as I mentioned earlier, only…
83. @molly_struve
On-Call Support
😥 😕 😫
It’s not going to end well! Our Site Reliability team got burned out pretty quickly being the constant support system for the on-call developers. With…
86. @molly_struve
👨‍💻 👩‍💻 ×3 teams, each with a manager 👨‍💼
acts as support for the others. If any developer finds themselves overwhelmed or stuck on an issue, they have two people they can reach out to for help. Having a support system like this is crucial for crafting an on-call system that is comfortable for everyone. No one wants to feel alone when they are on-call, so ensuring that the…
87. Overhauling On-Call
1. Smaller On-Call Rotations
2. Split Up Application Ownership
3. Use a CODEOWNERS File
4. On-Call Training Sessions
5. On-Call Support System
on-call developer has a solid support system in place is crucial. The last improvement we made to our system that was welcomed by everyone was that we…
90. @molly_struve
😦 Technically debugging and fixing the problem
Technically debugging and fixing the problem, which we know is a pretty big ask in itself. In addition, they were responsible…
91. @molly_struve
😦 Technically debugging and fixing the problem | Setting a status page if needed
for setting a status page if needed. And last, it was their job to handle…
92. @molly_struve
😦 Technically debugging and fixing the problem | Setting a status page if needed | Communicating the problem to the rest of the team
communicating the problem to the rest of the team. Needless to say, the duties of the on-call developer were WAY overloaded. With the new system…
93. @molly_struve
😦 Technically debugging and fixing the problem
the ONLY responsibility an on-call developer had was debugging and fixing the problem. Narrowing the scope was crucial to…
94. @molly_struve
😊 Technically debugging and fixing the problem
improving the on-call experience. It allowed the developers to focus on what they did best: fixing the technical problem at hand. The responsibility of setting the status page was moved…
95. @molly_struve
😊 Setting a status page if we need it → support team
to the support team. This made sense to us because the support team is the closest to the customer and, therefore, best equipped to communicate any problems. When an incident occurred, the support team was notified and was responsible for determining if a status page or any customer communication was needed. The responsibility of…
96. @molly_struve
😊 Communicating the problem to the rest of the team → on-call developer’s manager
communicating the problem internally was then moved to the manager of the on-call developer’s team. If updates needed to be spread across the tech organization during an incident, the on-call developer’s manager was responsible for doing it. Narrowing the scope of the on-call responsibilities was…
97. Overhauling On-Call
1. Smaller On-Call Rotations
2. Split Up Application Ownership
3. Use a CODEOWNERS File
4. On-Call Training Sessions
5. On-Call Support System
6. Narrow On-Call Responsibility Scope
the last piece of the puzzle when it came to overhauling this terrible on-call system. I am sure many of you are thinking, “That sounds great, but what does an on-call system like that get me? How can my team benefit from implementing some of these strategies?!” I have touched lightly on some of the benefits, but now I want to really dive into them and talk about…
99. The Payoff
1. Improved Alerting
Improved alerting. Originally the Site Reliability team had set up all the alerting tools. However, once we split up the alerts and handed them over to each of the 3 developer teams, the teams took them and ran. Because each team felt a renewed sense of ownership over their alerts, they started to improve and build on them. Not only did they make more alerts,…
100. The Payoff
but they improved the accuracy of the existing ones. The improved alerts in turn led to happier on-call developers because there were fewer false positives, and alerts were tweaked to fire on problems sooner, before they became a bigger issue. Improved alerting wasn’t the only payoff we saw after overhauling the system. As I briefly mentioned, there was…
101. @molly_struve
Sense of
Ownership
A renewed sense of ownership among all of the developers. Even though one team would edit the code that another team supported, there was still a keen sense of ownership for the supporting team. The supporting
team acted as the domain experts over the code they owned when on-call. The key strategy for ensuring this sense of ownership was…
102. @molly_struve
CODEOWNERS
using the CODEOWNERS file. The CODEOWNERS file ensured that the supporting team was always aware and could sign off on any changes made to the code they supported. In addition, splitting up the code
between the 3 teams meant each team had…
103. @molly_struve
Manageable
Code Chunks
a manageable chunk of code that they could actually learn and support, unlike before, when every developer had to support the entire codebase, which was way too much for any single person to handle. Shrinking…
105. The Payoff
1. Improved Alerting
2. Sense of Ownership
sense of ownership again, and that sense of ownership made them excited to support their on-call code. Another benefit of the new system was…
106. The Payoff
1. Improved Alerting
2. Sense of Ownership
3. Faster Incident Response
Faster incident response time. Hallelujah! Incident response time improved for a couple of reasons. For one, with 3 developers on-call at once and each of them focusing on a smaller piece of the application, they could…
108. @molly_struve
Identify Problems Faster
identify problems faster and catch major issues earlier. This decreased incident response times and even helped prevent some incidents altogether. In addition to identifying problems sooner, debugging and figuring out what had triggered a problem…
109. @molly_struve
Identify Problems Faster
became quicker because teams were intimately familiar with their alerts and the pieces of code they owned. When a problem arose, the team could debug it much more efficiently than before. Faster…
110. The Payoff
incident response is always the goal of any Site Reliability team, and to be able to achieve it with a new on-call system was pretty awesome. Another payoff of this new system was that the person who was on-call was…
111. The Payoff
1. Improved Alerting
2. Sense of Ownership
3. Faster Incident Response
4. Never Alone
Never alone. Having 3 developers on-call at once means that none of the developers are ever alone when they are on-call. If things started to fall apart in one section of the application, the developer that owned that section knew there were two others available to…
112. The Payoff
help if they needed it. Being on-call can be stressful, but knowing that there is always someone easily accessible to help can do wonders for a developer’s confidence. Ensuring that no one is ever alone may seem like a small positive, but I want to add…
113. The Payoff
this was the most requested attribute of an on-call system from the developers. Before starting this overhaul process, I spoke with a few developers to get a feel for what they wanted out of the new system, and at…
114. The Payoff
the very top of the list was having help and support when on-call. Don’t underestimate how much a multiple-developer on-call system can improve the on-call experience. Developers…
115. The Payoff
Never being alone while on-call takes the 4th spot in our list of benefits of overhauling our on-call system. The last benefit we discovered with the new system was…
116. The Payoff
1. Improved Alerting
2. Sense of Ownership
3. Faster Incident Response
4. Never Alone
5. Better Cross-Team Communication
better cross-team communication. As I stated before, each of the 3 developer teams worked across the entire application. This meant teams were often changing the code that another team was responsible for during on-call. Having the CODEOWNERS file ensured that the on-call team was alerted to those changes. This…
117. The Payoff
not only allowed for a good technical review, but it also kept each of the teams up to date on what the other teams were working on. And with that…
118. The Payoff
the list of payoffs of overhauling this terrible on-call system is complete. At the top…
119. The Payoff
Improved alerting. Any small Site Reliability team knows that any outside help you can get with your alerting and monitoring systems is hugely appreciated and benefits everyone. Developers got a renewed…
120. The Payoff
sense of ownership, making them more enthusiastic about their on-call responsibilities…
121. The Payoff
Incident response got faster thanks to the improved alerting and each team’s in-depth knowledge of their on-call components. On-call developers were…
122. The Payoff
never alone when they were on-call, giving them peace of mind and confidence. And finally…
123. The Payoff
cross-team communication improved, which benefited the entire technical organization. I think we can agree that these…
124. The Payoff
benefits would help all of our teams, and to get them all from an on-call system is huge. If these are the benefits your team is looking for and developers are struggling within your on-call system, then consider overhauling it…
125. Overhauling On-Call
1. Smaller On-Call Rotations
2. Split Up Application Ownership
3. Use a CODEOWNERS File
4. On-Call Training Sessions
5. On-Call Support System
6. Narrow On-Call Responsibility Scope
with the help of these 6 strategies. Smaller rotations to increase on-call frequency. Split up application ownership so developers can once again feel like they own what they are supporting. Use…
126. Overhauling On-Call
a CODEOWNERS file to further instill that sense of ownership for your developers. Host on-call training sessions for your teams and ensure those who are on-call always have a support system. Finally, keep the responsibilities for your on-call developers in check so they can focus on what they do best: fixing technical problems. On-call…
127. @molly_struve
On-Call Shouldn’t Suck
is something that many people in this industry dread, and it shouldn’t be that way. If people are dreading on-call, then something is broken in your system. Sure, everyone at some point will get that late-night or weekend page that is a pain, but that pain…
128. @molly_struve
On-Call Shouldn’t Suck
shouldn’t be the norm. If on-call makes people want to pull their hair out ALL the time, then you have a problem that needs to be fixed. I hope this talk has given you some ideas to help you improve your own on-call system so that it can help your developers thrive.