2. Joe Smith
Operations Engineer, Slack Application Operations Team
● Build and run core systems that power Slack
● CDNs, edge regions, web tier, WebSockets, development workflow, etc.
● Previously:
○ Tech Lead, Aurora/Mesos SRE at Twitter
○ Internal Technology Resident at Google
3. Folks with the desire to build resilient systems
These roles may have different names, but share the above goal
● Reliability Engineering
● Operations
● DevOps
● Site Reliability Engineering
● Production Engineering
● Systems Engineering
Audience
6. Two Pizza Rule
A team should be small enough to be fed by two large pizzas
7. “We will take the site down today! 💥”
– No one when they wake up
8. ● Careful Planning and Procedures
● Extensive Documentation
● Good Communication
Strategies
9. Planning
● Some components are prioritized for speed while others are meant to be
canaried and analyzed
● Changes need to be staged to coordinate with each other
● Give teams the tools and visibility they need to make improvements and
understand impact
● Identify your rollback strategy ahead of time
10. Documentation
● Each code or procedure change should be paired with an update to easily-readable text
● Help your teammates and yourself weeks from now when you need to
understand how things work.
● Do not just describe how systems are structured; explain why they are built
that way!
● Additional context can inform future decisions
11. Communication
● Change Management - Coordinating release schedules can be difficult
● Launch Channel - Announce changes and link to more details in a
feature-specific Slack channel
● Add links to commits, code reviews, threads in Slack, mailing list posts, and
StackOverflow questions
● This enables your team to benefit from the research you've done
12. ● Careful Planning and Procedures
● Extensive Documentation
● Good Communication
Strategies
14. Good Problems to Have
As the team grows, it's no longer possible to understand everything that's happening at once.
The scope of work is also increasing!
18. Location
● Google Docs
○ Good formatting, mobile apps, hosted on an external service
● Wiki
○ Web Interface, Track Changes
● Markdown in git repo (paired with GitHub)
○ Formatting, offline support, normal Pull Request flow
19. Markdown in Git Repo
● Track changes across revisions
● Optional peer review
● Link to relevant sections
● Clone repo for offline support
31. Beyond Runbooks
● Turn a manual checklist into a testable, repeatable set of steps anyone can
run
● Anytime you discover a sharp edge or workaround, this can be codified in the
tool
● Reduce sections of "but if this happens, check this dashboard and then do
one of three things"
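To make that last bullet concrete, here is a minimal Python sketch of codifying one such branch. The Graphite endpoint, metric name, and thresholds are all hypothetical placeholders, not Slack's actual setup:

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical metrics endpoint and metric naming; substitute your own.
GRAPHITE_URL = "https://graphite.example.com/render"

def triage_error_rate(service: str) -> str:
    """Codify 'check this dashboard, then do one of three things'."""
    resp = requests.get(
        GRAPHITE_URL,
        params={"target": f"{service}.errors.rate", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    # Graphite's JSON render format: [{"target": ..., "datapoints": [[value, ts], ...]}]
    rate = resp.json()[0]["datapoints"][-1][0] or 0.0
    if rate > 100:
        return "page-oncall"      # severe: a human needs to exercise judgement
    if rate > 10:
        return "restart-service"  # the known remediation for this range
    return "no-action"            # within normal bounds
```

Now the "three things" are explicit, reviewable, and testable instead of living in prose.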
32. The Tooling Workflow
Initial Steps
This process can evolve over a long time, steadily improving things.
● One person has all the knowledge in their head
● That person writes down everything they know in a runbook
● Someone sees an annoying or complicated piece and writes a small script
to be run instead for a tiny part of the process (sketched below)
The next jump will be the most difficult part!
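As an example of that first small script, here is a sketch that might replace a manual "ssh to each host, restart the service, wait, repeat" step. The service name and the 30-second pause are hypothetical:

```python
#!/usr/bin/env python3
"""Tiny first script: automate one annoying step of the runbook."""
import subprocess
import sys
import time

SERVICE = "webapp"  # hypothetical service name

def restart(host: str) -> None:
    # The same command the runbook told a human to type, now repeatable
    subprocess.run(
        ["ssh", host, f"sudo systemctl restart {SERVICE}"],
        check=True,  # stop the rollout if any host fails
    )

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(f"Restarting {SERVICE} on {host}...")
        restart(host)
        time.sleep(30)  # let the host rejoin the pool before moving on
```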
33. The Tooling Workflow
Maintenance
● It feels great to have written your first tool!
● You may be lucky and have no bugs
● Most likely there are some edge cases, and that is okay and expected!
● Take some time to figure out what went wrong and how to make things better.
34. The Tooling Workflow
Completion
● Later on, another part of the process can be added in and the documentation
further updated
● Over time, the runbook becomes "Run this tool we wrote; send bugs to the
authors"
● Finally, there is no longer an entry! The tool is run automatically, or the
system itself is able to solve that problem
35. “The value of humans is to execute judgement; the value of computers is to execute instructions.”
– Aron, teammate at Slack
37. Process in Code
● Using libraries like fabric, pychef, and boto3 can ease automation (see the sketch after this list)
● When there are issues, the code can be reviewed for process changes, git
history can be consulted, etc
● No more "I forgot that was changed and followed the old process!"
● Each time someone submits an improvement or workflow tweak, it will
benefit everyone from then on!
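As a hedged illustration of "process in code" with boto3, here is a sketch that turns "find the running web hosts" from a wiki instruction into reviewable, version-controlled code. The `role` tag convention is a hypothetical example:

```python
import boto3  # pip install boto3; credentials come from the environment

def web_tier_instance_ids(role_tag="web"):
    """Return running EC2 instances tagged with the (hypothetical) role tag."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:role", "Values": [role_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    return [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
```

When the tagging scheme changes, the change lands as a reviewed commit, and git history explains why.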
39. Joe Smith
Operations Engineer, Slack Application Operations Team
● Come build the future of work!
○ https://slack.com/careers/641062/senior-site-reliability-engineer
● Please reach out and say hi!
○ @Yasumoto on Twitter
● Tools Scaffold
○ https://github.com/Yasumoto/tools
Editor's Notes
I do believe that folks in these roles all have different approaches, and that is excellent.
There is a focus of continuity with Operations, and deep architectural expertise with Site Reliability Engineering.
I’d love to talk to you all afterward to hear your approaches to building out teams!
We're going to break this talk into 3 pieces, and I would love to talk about any of them for hours
But we only have time for one talk this year, so let's see what we'd like to focus on
Who here considers themselves an experienced Operations Engineer or SRE?
Who has written at least a few hundred lines of Python?
Who has published your own Python distribution to the Cheeseshop (aka PyPI)?
This is going to set the stage for priorities and approaches based on years running large web services.
The focus is primarily around working with a larger group of people— a team that has gone beyond the “two pizza rule”
Once you go beyond this number of engineers, your problems begin to center around organization and collaboration instead of technology (usually!)
There are a few universal best practices which we can use to inform how we structure our work.
Different approaches to doing this
For this talk, we're going to consider three pillars of a great operations team
These are aspects of rolling out a new tool, procedure, service, or automation
^ Consider strategies for dealing with mistakes vs. just "do a rollback"
^ Is everyone familiar with Rollback strategies?
^ Whenever you make a change in production, you should consider how you are going to roll back if something breaks!
^ DO NOT come up with your rollback strategy under pressure after things break!
^ Use the _feature-flag_ approach: Hide chef/puppet changes behind a per-node attribute
^ Deploy processes should be the same for a roll-forward as for a roll-back
^ Specify which version number should be running and structure the config to match
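A minimal sketch of that idea in Python; the config format and names here are hypothetical, not Slack's actual deploy tooling:

```python
# Roll-forward and roll-back are the same operation: the deploy tool
# only converges on whatever version and flags are pinned in config.
PINNED = {
    "webapp": {
        "version": "1.2.3",      # to roll back, edit this line and redeploy
        "use_new_cache": False,  # feature-flag style: flip per-node to roll out
    },
}

def deploy(service):
    target = PINNED[service]["version"]
    # Same code path whether the target is newer or older than what's
    # running; no separate, rarely-exercised rollback script to go stale.
    print(f"Converging {service} to version {target}")
```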
These written pieces of text will generally be of two types. The first is design documentation, describing architecture and function.
The second, which we will dig into, is called a “Runbook”
These are descriptions of how to run a service. They contain important information about how to spin the service up from scratch, how to repair components when they fail, and, most importantly, a corresponding entry for each page/alert that goes to an engineer.
Documentation is written in the past, and consumed in the present.
“Communication” is real-time: this means information is conveyed at the same time it is received.
Any questions about bringing these together?
So these are the ideal, but unfortunately it isn't always the case!
Let’s talk about corresponding issues we can run into
^ These all get more difficult over time
^ More people, more difficult to communicate
Here’s where we get to a really fun part of the talk
I’ve been on a team where our runbook was one really long wiki page, and now a team with much more structured information.
I’m biased, but I think this approach is much better!
A little effort has helped make oncall and remediation much easier.
Even if you don’t follow anything else I suggest here— make time explicitly for your team to improve documentation!
First, consider whether what you're writing is really a runbook. If you're documenting an architecture or a workflow that's expected to become muscle memory (like how Logstash works or our Chef workflow) then your file belongs in the ops directory.
Runbooks are for repeatable processes. We are all required to refer to the runbook each and every time we execute a process documented in one, so that changes to runbooks will reliably take effect.
We’re going to talk about three different components to good runbooks.
Runbooks should be organized into directories in the runbooks directory named descriptively and broadly. For example, message_server as a top level directory, with descriptive files within. Within those files there should be a first-level heading reiterating the name of the file (some_alert_name.md becomes "# Some Alert Name") and many second-level headings, following the template laid out in the example directory.
There should also be a README.md file with a basic explanation of the service, related services, contacts/channels for that service, and related links/graphs.
Now remember, when someone realizes something is wrong, this will likely be the first place they go to get the lay of the land. This should be generally helpful information.
Each process documented should present, as prose, any context that may be required to decide if use of this runbook is appropriate or otherwise set the stage for execution. Then an ordered list should be presented. Use "1. " as the prefix for every element in the list to keep diffs tidier; when the Markdown is rendered the numbers will be sequential. Finally, any parting considerations should be presented, again as prose.
It is very important that Markdown be used precisely when writing runbooks. In the heat of the moment, consistent formatting adds clarity and helps us move quickly and confidently.
Specify where a command is to be run before specifying the command ("From staging:", "From your chef-repo:", etc.)
Link to graphs, God pages, and other specific resources required during execution of the runbook
Include actions like updating the status site, sending particular messages in Slack, deploying the site, enabling notifications, and so on lest they be forgotten or their omission confuse the operator executing your runbook
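Since these conventions are mechanical, they can be checked mechanically. Here is a minimal sketch of a linter for them, assuming the runbooks/ directory layout described above:

```python
#!/usr/bin/env python3
"""Lint runbooks for the conventions described in this talk."""
import pathlib
import re
import sys

def lint_runbook(path):
    problems = []
    lines = path.read_text().splitlines()
    if not lines or not lines[0].startswith("# "):
        problems.append("missing first-level '# ' heading")
    for number, line in enumerate(lines, start=1):
        # Every ordered-list item should use the "1. " prefix to keep diffs tidy
        match = re.match(r"^\s*(\d+)\. ", line)
        if match and match.group(1) != "1":
            problems.append(f"line {number}: use '1. ' for every list item")
    return problems

if __name__ == "__main__":
    failed = False
    for markdown_file in pathlib.Path("runbooks").rglob("*.md"):
        for problem in lint_runbook(markdown_file):
            failed = True
            print(f"{markdown_file}: {problem}")
    sys.exit(1 if failed else 0)
```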
This is an example of something that gives some context— it clearly points the reader to the next services to investigate. Following along with what Corey said this morning...
It could be improved in a few ways, most notably by helpfully linking to runbooks on how to diagnose whether memcache, databases, and other downstreams are healthy!
This is something that an experienced team member can read and have no issues with. Someone new on the team may need to burn time getting help from a teammate before isolating the root cause.
Ensure you are creating something that can be acted upon by a teammate at 2am. Otherwise expect to get woken up to chip in!
So here’s my favorite part! To start we’re going to talk about one of my favorite maxims, followed by an ancient proverb.
This is the best rule of thumb we have for ensuring we are directing efforts appropriately
Once something is so standard that you know you can handle it, and you have seen the edge cases, it's going to be boring.
Don’t let yourself or your team stay working on the same issue for too long.
This is the perfect opportunity to put in some extra effort to make sure it is automated.
Automatically detect which procedures to follow
As we saw, that example runbook was helpful, but sort of passed the buck when it came to further troubleshooting.
A potential tool might automatically query health metrics for downstream services to help identify any problems
Let’s walk through what that procedure might look like.
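A minimal sketch of such a tool; the downstream names and health endpoints are hypothetical placeholders:

```python
import requests  # pip install requests

# Hypothetical internal health endpoints for each downstream
DOWNSTREAMS = {
    "memcache": "http://memcache-lb.internal:8080/healthz",
    "mysql": "http://mysql-proxy.internal:8080/healthz",
}

def triage_downstreams():
    """Instead of 'check whether memcache and the databases look healthy',
    the tool checks for you and points at the suspect."""
    for name, url in DOWNSTREAMS.items():
        try:
            healthy = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            healthy = False
        print(f"{name}: {'healthy' if healthy else 'INVESTIGATE'}")
```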
Once you have defined the logic for a process, it can be written out in a language for the future
^ That language could also happen to be understood by a machine!
Humans can also read code, but machines can execute explicit instructions.
^ We have found that investing the time to write tools as we understand a process has been worth the effort
^ We are able to work on much more interesting problems, and not repeatedly wake up for the same issues again and again
This can be a large amount of work, and the payoff isn’t always immediate.
It can be slow-going to write something which is resilient enough to be run by someone not familiar with both the process and the tool.
The payoff will be well worth it, however!
I’m hopeful I’ve motivated you not only to invest in your runbooks, but also to look into improvements beyond just text.
Using machines to do some work for us will allow our brains to focus on bigger and more impactful challenges.
If you agree, disagree, or have any questions, I’d love to hear it!
Thank you!