2. Joe Smith
Operations Engineer, Slack Application Operations Team
● Build and run core systems that power Slack
● CDNs, edge regions, web tier, WebSockets, development workflow, etc.
● Previously:
○ Tech Lead, Aurora/Mesos SRE at Twitter
○ Internal Technology Resident at Google
3. Folks with the desire to build resilient systems
These roles may have different names, but share the above goal
● Reliability Engineering
● Operations
● DevOps
● Site Reliability Engineering
● Production Engineering
● Systems Engineering
Audience
6. Two Pizza Rule
A team should be small enough to be fed by two large pizzas
7. “We will take the site down today! 💥”
– No one when they wake up
8. ● Careful Planning and Procedures
● Extensive Documentation
● Good Communication
Strategies
9. Planning
● Some components are prioritized for speed while others are meant to be
canaried and analyzed
● Changes need to be staged to coordinate with each other
● Give teams the tools and visibility they need to make improvements and
understand impact
● Identify your rollback strategy ahead of time
10. Documentation
● Each code or procedure change should be paired with an update to easily-readable text
● Help your teammates and yourself weeks from now when you need to
understand how things work.
● Do not just describe how systems are structured; explain why they are built
that way!
● Additional context can inform future decisions
11. Communication
● Change Management - Coordinating release schedules can be difficult
● Launch Channel - Announce changes and link to more details in a
feature-specific Slack channel
● Add links to commits, code reviews, threads in Slack, mailing list posts, and
StackOverflow questions
● This enables your team to benefit from the research you've done
12. ● Careful Planning and Procedures
● Extensive Documentation
● Good Communication
Strategies
14. Good Problems to Have
As the team grows, it's no longer possible to understand everything that's happening at once.
The scope of work is also increasing!
18. Location
● Google Docs
○ Good formatting, mobile apps, hosted on an external service
● Wiki
○ Web Interface, Track Changes
● Markdown in git repo (paired with GitHub)
○ Formatting, offline support, normal Pull Request flow
19. Markdown in Git Repo
● Track changes across revisions
● Optional peer review
● Link to relevant sections
● Clone repo for offline support
31. Beyond Runbooks
● Turn a manual checklist into a testable, repeatable set of steps anyone can
run
● Anytime you discover a sharp edge or workaround, this can be codified in the
tool
● Reduce sections of "but if this happens, check this dashboard and then do
one of three things"
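To make that last bullet concrete, here is a minimal Python sketch of codifying one such branch. The Graphite endpoint, metric name, and thresholds are all hypothetical placeholders, not Slack's actual setup:

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical metrics endpoint and metric naming; substitute your own.
GRAPHITE_URL = "https://graphite.example.com/render"

def triage_error_rate(service: str) -> str:
    """Codify 'check this dashboard, then do one of three things'."""
    resp = requests.get(
        GRAPHITE_URL,
        params={"target": f"{service}.errors.rate", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    # Graphite's JSON render format: [{"target": ..., "datapoints": [[value, ts], ...]}]
    rate = resp.json()[0]["datapoints"][-1][0] or 0.0
    if rate > 100:
        return "page-oncall"      # severe: a human needs to exercise judgement
    if rate > 10:
        return "restart-service"  # the known remediation for this range
    return "no-action"            # within normal bounds
```

Now the "three things" are explicit, reviewable, and testable instead of living in prose.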
32. The Tooling Workflow
Initial Steps
This process can evolve over a long time, steadily improving things.
● One person has all the knowledge in their head
● That person writes down everything they know in a runbook
● Someone sees an annoying or complicated piece and writes a small script
to be run instead for a tiny part of the process (sketched below)
The next jump will be the most difficult part!
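As an example of that first small script, here is a sketch that might replace a manual "ssh to each host, restart the service, wait, repeat" step. The service name and the 30-second pause are hypothetical:

```python
#!/usr/bin/env python3
"""Tiny first script: automate one annoying step of the runbook."""
import subprocess
import sys
import time

SERVICE = "webapp"  # hypothetical service name

def restart(host: str) -> None:
    # The same command the runbook told a human to type, now repeatable
    subprocess.run(
        ["ssh", host, f"sudo systemctl restart {SERVICE}"],
        check=True,  # stop the rollout if any host fails
    )

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(f"Restarting {SERVICE} on {host}...")
        restart(host)
        time.sleep(30)  # let the host rejoin the pool before moving on
```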
33. The Tooling Workflow
Maintenance
● It feels great to have written your first tool!
● You may be lucky and have no bugs
● Most likely there are some edge cases, and that is okay and expected!
● Take some time to figure out what went wrong and how to make things better.
34. The Tooling Workflow
Completion
● Later on, another part of the process can be added in and the documentation
further updated
● Over time, the runbook becomes "Run this tool we wrote; send bugs to the
authors"
● Finally, there is no longer an entry! The tool is run automatically, or the
system itself is able to solve that problem
35. “The value of humans is to execute judgement; the value of computers is to execute instructions.”
– Aron, teammate at Slack
37. Process in Code
● Using libraries like fabric, pychef, and boto3 can ease automation (see the sketch after this list)
● When there are issues, the code can be reviewed for process changes, git
history can be consulted, etc
● No more "I forgot that was changed and followed the old process!"
● Each time someone submits an improvement or workflow tweak, it will
benefit everyone from then on!
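As a hedged illustration of "process in code" with boto3, here is a sketch that turns "find the running web hosts" from a wiki instruction into reviewable, version-controlled code. The `role` tag convention is a hypothetical example:

```python
import boto3  # pip install boto3; credentials come from the environment

def web_tier_instance_ids(role_tag="web"):
    """Return running EC2 instances tagged with the (hypothetical) role tag."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:role", "Values": [role_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    return [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
```

When the tagging scheme changes, the change lands as a reviewed commit, and git history explains why.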
39. Joe Smith
Operations Engineer, Slack Application Operations Team
● Come build the future of work!
○ https://slack.com/careers/641062/senior-site-reliability-engineer
● Please reach out and say hi!
○ @Yasumoto on Twitter
● Tools Scaffold
○ https://github.com/Yasumoto/tools
Editor's Notes
I do believe that folks in these roles all have different approaches, and that is excellent.
There is a focus of continuity with Operations, and deep architectural expertise with Site Reliability Engineering.
I’d love to talk to you all afterward to hear your approaches to building out teams!
We're going to break this talk into 3 pieces, and I would love to talk about any of them for hours
But we only have time for one talk this year, so let's see what we'd like to focus on
Who here considers themselves an experienced Operations Engineer or SRE?
Who has written at least a few hundred lines of Python?
Who has published your own Python distribution to the Cheeseshop (aka PyPI)?
This is going to set the stage for priorities and approaches based on years running large web services.
The focus is primarily around working with a larger group of people— a team that has gone beyond the “two pizza rule”
Once you go beyond this number of engineers, your problems begin to center around organization and collaboration instead of technology (usually!)
There are a few universal best practices which we can use to inform how we structure our work.
Different approaches to doing this
For this talk, we're going to consider three pillars of a great operations team
These are aspects of rolling out a new tool, procedure, service, or automation
^ Consider strategies for dealing with mistakes vs. just "do a rollback"
^ Is everyone familiar with Rollback strategies?
^ Whenever you make a change in production, you should consider how you are going to roll back if something breaks!
^ DO NOT come up with your rollback strategy under pressure after things break!
^ Use the _feature-flag_ approach: Hide chef/puppet changes behind a per-node attribute
^ Deploy processes should be the same for a roll-forward as for a roll-back
^ Specify which version number should be running and structure the config to match
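A minimal sketch of that idea in Python; the config format and names here are hypothetical, not Slack's actual deploy tooling:

```python
# Roll-forward and roll-back are the same operation: the deploy tool
# only converges on whatever version and flags are pinned in config.
PINNED = {
    "webapp": {
        "version": "1.2.3",      # to roll back, edit this line and redeploy
        "use_new_cache": False,  # feature-flag style: flip per-node to roll out
    },
}

def deploy(service):
    target = PINNED[service]["version"]
    # Same code path whether the target is newer or older than what's
    # running; no separate, rarely-exercised rollback script to go stale.
    print(f"Converging {service} to version {target}")
```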
These written pieces of text will generally be of two types. The first is design documentation, describing architecture and function.
The second, which we will dig into, is called a “Runbook”
These are descriptions of how to run a service. They contain important information about how to spin the service up from scratch, how to repair components when they fail, and, most importantly, a corresponding entry for each page/alert that goes to an engineer.
Documentation is written in the past, and consumed in the present.
“Communication” is real-time: this means information is conveyed at the same time it is received.
Any questions about bringing these together?
So these are the ideal, but unfortunately it isn't always the case!
Let’s talk about corresponding issues we can run into
^ These all get more difficult over time
^ More people, more difficult to communicate
Here’s where we get to a really fun part of the talk
I’ve been on a team where our runbook was one really long wiki page, and now a team with much more structured information.
I’m biased, but I think this approach is much better!
A little effort has helped make oncall and remediation much easier.
Even if you don’t follow anything else I suggest here— make time explicitly for your team to improve documentation!
First, consider whether what you're writing is really a runbook. If you're documenting an architecture or a workflow that's expected to become muscle memory (like how Logstash works or our Chef workflow) then your file belongs in the ops directory.
Runbooks are for repeatable processes. We are all required to refer to the runbook each and every time we execute a process documented in one, so that changes to runbooks will reliably take effect.
We’re going to talk about three different components to good runbooks.
Runbooks should be organized into directories in the runbooks directory named descriptively and broadly. For example, message_server as a top level directory, with descriptive files within. Within those files there should be a first-level heading reiterating the name of the file (some_alert_name.md becomes "# Some Alert Name") and many second-level headings, following the template laid out in the example directory.
There should also be a README.md file with a basic explanation of the service, related services, contacts/channels for that service, and related links/graphs.
Now remember, when someone realizes something is wrong, this will likely be the first place they go to get the lay of the land. This should be generally helpful information.
Each process documented should present, as prose, any context that may be required to decide if use of this runbook is appropriate or otherwise set the stage for execution. Then an ordered list should be presented. Use "1. " as the prefix for every element in the list to keep diffs tidier; when the Markdown is rendered the numbers will be sequential. Finally, any parting considerations should be presented, again as prose.
It is very important that Markdown be used precisely when writing runbooks. In the heat of the moment, consistent formatting adds clarity and helps us move quickly and confidently.
Specify where a command is to be run before specifying the command ("From staging:", "From your chef-repo:", etc.)
Link to graphs, God pages, and other specific resources required during execution of the runbook
Include actions like updating the status site, sending particular messages in Slack, deploying the site, enabling notifications, and so on lest they be forgotten or their omission confuse the operator executing your runbook
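Since these conventions are mechanical, they can be checked mechanically. Here is a minimal sketch of a linter for them, assuming the runbooks/ directory layout described above:

```python
#!/usr/bin/env python3
"""Lint runbooks for the conventions described in this talk."""
import pathlib
import re
import sys

def lint_runbook(path):
    problems = []
    lines = path.read_text().splitlines()
    if not lines or not lines[0].startswith("# "):
        problems.append("missing first-level '# ' heading")
    for number, line in enumerate(lines, start=1):
        # Every ordered-list item should use the "1. " prefix to keep diffs tidy
        match = re.match(r"^\s*(\d+)\. ", line)
        if match and match.group(1) != "1":
            problems.append(f"line {number}: use '1. ' for every list item")
    return problems

if __name__ == "__main__":
    failed = False
    for markdown_file in pathlib.Path("runbooks").rglob("*.md"):
        for problem in lint_runbook(markdown_file):
            failed = True
            print(f"{markdown_file}: {problem}")
    sys.exit(1 if failed else 0)
```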
This is an example of something that gives some context— it clearly points the reader to the next services to investigate. Following along with what Corey said this morning...
It could be improved in a few ways, most notably by helpfully linking to runbooks on how to diagnose whether memcache, databases, and other downstreams are healthy!
This is something that an experienced team member can read and have no issues with. Someone new on the team may need to burn time getting help from a teammate before isolating the root cause.
Ensure you are creating something that can be acted upon by a teammate at 2am. Otherwise expect to get woken up to chip in!
So here’s my favorite part! To start we’re going to talk about one of my favorite maxims, followed by an ancient proverb.
This is the best rule of thumb we have for ensuring we are directing efforts appropriately
Once something is so standard that you know you can handle it, and you have seen the edge cases, it's going to be boring.
Don’t let yourself or your team stay working on the same issue for too long.
This is the perfect opportunity to put in some extra effort to make sure it is automated.
Automatically detect which procedures to follow
As we saw, that example runbook was helpful, but sort of passed the buck when it came to further troubleshooting.
A potential tool might automatically query health metrics for downstream services to help identify any problems
Let’s walk through what that procedure might look like.
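A minimal sketch of such a tool; the downstream names and health endpoints are hypothetical placeholders:

```python
import requests  # pip install requests

# Hypothetical internal health endpoints for each downstream
DOWNSTREAMS = {
    "memcache": "http://memcache-lb.internal:8080/healthz",
    "mysql": "http://mysql-proxy.internal:8080/healthz",
}

def triage_downstreams():
    """Instead of 'check whether memcache and the databases look healthy',
    the tool checks for you and points at the suspect."""
    for name, url in DOWNSTREAMS.items():
        try:
            healthy = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            healthy = False
        print(f"{name}: {'healthy' if healthy else 'INVESTIGATE'}")
```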
Once you have defined the logic for a process, it can be written out in a language for the future
^ That language could also happen to be understood by a machine!
Humans can also read code, but machines can execute explicit instructions.
^ We have found that investing the time to write tools as we understand a process has been worth the effort
^ We are able to work on much more interesting problems, and not repeatedly wake up for the same issues again and again
This can be a large amount of work, and the payoff isn’t always immediate.
It can be slow-going to write something which is resilient enough to be run by someone not familiar with both the process and the tool.
The payoff will be well worth it, however!
I’m hopeful I’ve motivated you not only to invest in your runbooks, but also to look into improvements beyond just text.
Using machines to do some work for us will allow our brains to focus on bigger and more impactful challenges.
If you agree, disagree, or have any questions, I’d love to hear it!
Thank you!