SlideShare a Scribd company logo
1 of 39
Joe Smith
November 14, 2017
1
A Culture Of Automation
Joe Smith
Operations Engineer, Slack Application Operations Team
● Build/Run core systems responsible for Slack
● CDNs, Edge Regions, Web tier, Websockets, development workflow, etc
● Previously:
○ Tech Lead, Aurora/Mesos SRE at Twitter
○ Internal Technology Resident at Google
Folks with the desire to build
resilient systems
These roles may have different names, but
share the above goal
● Reliability Engineering
● Operations
● DevOps
● Site Reliability Engineering
● Production Engineering
● Systems Engineering
Audience
Agenda
1. Running Production Services
2. Runbooks
3. Automation
Running Production Services
Two Pizza Rule
A team should be sized to share
around two large pizzas
“”
“We will take the site down today! 💥”
– No one when they wake up
● Careful Planning and Procedures
● Extensive Documentation
● Good Communication
Strategies
Planning
● Some components are prioritized for speed while others are meant to be
canaried and analyzed
● Changes need to be staged to coordinate with each other
● Give teams the tools and visibility they need to make improvements and
understand impact
● Identify your rollback strategy ahead of time
Documentation
● Each code or procedure change should be paired with an update to easily-
readable text
● Help your teammates and yourself weeks from now when you need to
understand how things work.
● Do not just describe how systems are structured, explain why they are built
that way!
● Additional context can inform future decisions
Communication
● Change Management - Coordinating release schedules can be difficult
● Launch Channel - Announce changes, link to more details in a feature-
specific Slack channel for the change
● Add links to commits, code reviews, threads in Slack, mailing list posts, and
StackOverflow questions
● This enables your team to benefit from the research you've done
● Careful Planning and Procedures
● Extensive Documentation
● Good communication
Strategies
● Unexpected changes, forced roll-forwards
● Outdated Runbooks
● Missed Notifications
Growing Pains
Good Problems to Have
As the team grows, it's no
longer possible to understand
everything that's happening at
once.
The scope of work is also
increasing!
Runbooks
“”
“Checklists for commonly
repeated operational tasks.”
– Slack, Runbook README
1. Location
2. Format
3. Contents
Location
● Google Docs
○ Good Formatting, Mobile Apps, external service
● Wiki
○ Web Interface, Track Changes
● Markdown in git repo (paired with Github)
○ Formatting, offline support, normal Pull Request flow
Markdown in Git Repo
● Track changes across revisions
● Optional peer review
● Link to relevant sections
● Clone repo for offline support
Runbook Template
(thanks to my teammate Megan!)
● ApplicationServer
○ README.md
○ standard_actions.md
○ other_actions.md
○ alerts/
■ box_failure.md
■ some_alert_name.md
README.md
standard_actions.md
other_action.md
alert.md
box_failure.md
Example
Content
This is not the place for Design
Documentation.
These are highly-actionable,
succinct descriptions of next
steps.
Automation
“”
"Test until fear turns
to boredom."
– JUnit FAQ,
http://junit.sourceforge.net/doc/faq/faq.htm#tests_6
“”
"Automate once fear
turns to boredom"
– Ancient SRE Corrollary
Beyond Runbooks
● Turn a manual checklist into a testable, repeatable set of steps anyone can
run
● Anytime you discover a sharp edge or workaround, this can be codified in the
tool
● Reduce sections of "but if this happens, check this dashboard and then do
one of three things"
The Tooling Workflow
Initial Steps
This process can evolve over a long time and generally improve things.
● One person has all the knowledge in their head
● That person writes down everything they know in a runbook
● Someone sees an annoying or complicated piece and writes a small script
to be run instead for a tiny part of the process
The next jump will be the most difficult part!
The Tooling Workflow
Maintenance
● It feels great to have written your first tool!
● You may be lucky and have no bugs
● Most likely there are some edge cases- that is okay and expected!
● Take some time to figure out what went wrong and how to make things better.
The Tooling Workflow
Completion
● Later on, another part of the process can be added in and the documentation
further updated
● Over time- the runbook becomes "Run this tool we wrote, send bugs to the
authors"
● Finally- there is no longer an entry! The tool is run automatically, or the
system itself is able to solve that problem
“”
"The value of humans is to
execute Judgement, the value
of computers is to execute
instructions"
– Aron, teammate at Slack
Runbook to Automated Workflow
1. Brain
2. Runbook
3. Start of automation
4. Automation evolution (safeguards)
5. Self-contained Tool
6. Fully Automated
Process in Code
● Using libraries like fabric, pychef, and boto3 can ease automation
● When there are issues, the code can be reviewed for process changes, git
history can be consulted, etc
● No more "I forgot that was changed and followed the old process!"
● Each time someone submits an improvement or workflow tweak, that will
always be useful from now on!
Thank You!
38
For more information go to: slack.com/jobs
Joe Smith
Operations Engineer, Slack Application Operations Team
● Come build the future of work!
○ https://slack.com/careers/641062/senior-site-reliability-engineer
● Please reach out and say hi!
○ @Yasumoto on Twitter
● Tools Scaffold
○ https://github.com/Yasumoto/tools

More Related Content

What's hot

TDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOps
TDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOpsTDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOps
TDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOpsBert Jan Schrijver
 
Automated Performance Testing
Automated Performance TestingAutomated Performance Testing
Automated Performance TestingLars Thorup
 
JUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsJUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsBert Jan Schrijver
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsBert Jan Schrijver
 
Random thoughts and dev practices / advices to build a great product
Random thoughts and dev practices / advices to build a great productRandom thoughts and dev practices / advices to build a great product
Random thoughts and dev practices / advices to build a great productGuillaume POTIER
 
The Clash Between Devops and Quality Assurance
The Clash Between Devops and Quality AssuranceThe Clash Between Devops and Quality Assurance
The Clash Between Devops and Quality AssuranceWebcsonsultsEU
 
Bootstrapping Quality
Bootstrapping QualityBootstrapping Quality
Bootstrapping QualityMichael Roufa
 
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...DevOpsDays Tel Aviv
 
Dev ops is more than CI+CD tools
Dev ops is more than CI+CD toolsDev ops is more than CI+CD tools
Dev ops is more than CI+CD toolsSudipta Lahiri
 
Agile Mindset and Its Implications - My Understanding
Agile Mindset and Its Implications - My UnderstandingAgile Mindset and Its Implications - My Understanding
Agile Mindset and Its Implications - My UnderstandingNitin Bhide
 
JavaLand 2022 - Software architecture in a DevOps world
JavaLand 2022 - Software architecture in a DevOps worldJavaLand 2022 - Software architecture in a DevOps world
JavaLand 2022 - Software architecture in a DevOps worldBert Jan Schrijver
 
Lightning talk how to edit the Silverstripe CMS docs
Lightning talk how to edit the Silverstripe CMS docsLightning talk how to edit the Silverstripe CMS docs
Lightning talk how to edit the Silverstripe CMS docsMichaelPritchard21
 
Marko Berković
Marko BerkovićMarko Berković
Marko BerkovićCodeFest
 
JUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disasterJUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disasterBert Jan Schrijver
 
Programming Sessions KU Leuven - Session 01
Programming Sessions KU Leuven - Session 01Programming Sessions KU Leuven - Session 01
Programming Sessions KU Leuven - Session 01Rafael Camacho Dejay
 
130511 stop wasting_your_time
130511 stop wasting_your_time130511 stop wasting_your_time
130511 stop wasting_your_timeHenning Blohm
 
10 skills developers should invest in for 2014
10 skills developers should invest in for 201410 skills developers should invest in for 2014
10 skills developers should invest in for 2014Pakorn Weecharungsan
 

What's hot (20)

TDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOps
TDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOpsTDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOps
TDC 2021 - Better software, faster: Principles of Continuous Delivery and DevOps
 
Automated Performance Testing
Automated Performance TestingAutomated Performance Testing
Automated Performance Testing
 
JUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsJUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systems
 
OpenNTF Essentials
OpenNTF EssentialsOpenNTF Essentials
OpenNTF Essentials
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systems
 
Random thoughts and dev practices / advices to build a great product
Random thoughts and dev practices / advices to build a great productRandom thoughts and dev practices / advices to build a great product
Random thoughts and dev practices / advices to build a great product
 
Develop 4 Developers
Develop 4 DevelopersDevelop 4 Developers
Develop 4 Developers
 
The Clash Between Devops and Quality Assurance
The Clash Between Devops and Quality AssuranceThe Clash Between Devops and Quality Assurance
The Clash Between Devops and Quality Assurance
 
Bootstrapping Quality
Bootstrapping QualityBootstrapping Quality
Bootstrapping Quality
 
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
 
Dev ops is more than CI+CD tools
Dev ops is more than CI+CD toolsDev ops is more than CI+CD tools
Dev ops is more than CI+CD tools
 
Agile Mindset and Its Implications - My Understanding
Agile Mindset and Its Implications - My UnderstandingAgile Mindset and Its Implications - My Understanding
Agile Mindset and Its Implications - My Understanding
 
JavaLand 2022 - Software architecture in a DevOps world
JavaLand 2022 - Software architecture in a DevOps worldJavaLand 2022 - Software architecture in a DevOps world
JavaLand 2022 - Software architecture in a DevOps world
 
Lightning talk how to edit the Silverstripe CMS docs
Lightning talk how to edit the Silverstripe CMS docsLightning talk how to edit the Silverstripe CMS docs
Lightning talk how to edit the Silverstripe CMS docs
 
Marko Berković
Marko BerkovićMarko Berković
Marko Berković
 
Developer disciplines
Developer disciplinesDeveloper disciplines
Developer disciplines
 
JUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disasterJUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disaster
 
Programming Sessions KU Leuven - Session 01
Programming Sessions KU Leuven - Session 01Programming Sessions KU Leuven - Session 01
Programming Sessions KU Leuven - Session 01
 
130511 stop wasting_your_time
130511 stop wasting_your_time130511 stop wasting_your_time
130511 stop wasting_your_time
 
10 skills developers should invest in for 2014
10 skills developers should invest in for 201410 skills developers should invest in for 2014
10 skills developers should invest in for 2014
 

Similar to A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Version Control, Writers, and Workflows
Version Control, Writers, and WorkflowsVersion Control, Writers, and Workflows
Version Control, Writers, and Workflowsstc-siliconvalley
 
Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Dan Cundiff
 
Code Management Workshop
Code Management WorkshopCode Management Workshop
Code Management WorkshopSameh El-Ashry
 
Building a custom cms with django
Building a custom cms with djangoBuilding a custom cms with django
Building a custom cms with djangoYann Malet
 
Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Bruno Capuano
 
Agile Software Development
Agile Software DevelopmentAgile Software Development
Agile Software DevelopmentAhmet Bulut
 
Become a Better Developer with Debugging Techniques for Drupal (and more!)
Become a Better Developer with Debugging Techniques for Drupal (and more!)Become a Better Developer with Debugging Techniques for Drupal (and more!)
Become a Better Developer with Debugging Techniques for Drupal (and more!)Acquia
 
Devops Devops Devops, at Froscon
Devops Devops Devops, at FrosconDevops Devops Devops, at Froscon
Devops Devops Devops, at FrosconKris Buytaert
 
Sharing our best secrets: Design a distributed system from scratch
Sharing our best secrets: Design a distributed system from scratchSharing our best secrets: Design a distributed system from scratch
Sharing our best secrets: Design a distributed system from scratchAdelina Simion
 
Introduction to Scrum
Introduction to ScrumIntroduction to Scrum
Introduction to ScrumRichie Rump
 
Agile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_pptAgile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_pptHitesh Kumar
 
Distributed teams
Distributed teamsDistributed teams
Distributed teamsKush Shah
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectivelyAshutosh Agarwal
 
Practical automation for beginners
Practical automation for beginnersPractical automation for beginners
Practical automation for beginnersSeoweon Yoo
 
Services, tools & practices for a software house
Services, tools & practices for a software houseServices, tools & practices for a software house
Services, tools & practices for a software houseParis Apostolopoulos
 

Similar to A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017 (20)

Version Control, Writers, and Workflows
Version Control, Writers, and WorkflowsVersion Control, Writers, and Workflows
Version Control, Writers, and Workflows
 
Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014
 
Code Management Workshop
Code Management WorkshopCode Management Workshop
Code Management Workshop
 
Building a custom cms with django
Building a custom cms with djangoBuilding a custom cms with django
Building a custom cms with django
 
Usable Software Design
Usable Software DesignUsable Software Design
Usable Software Design
 
Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?
 
The Road to DITA
The Road to DITAThe Road to DITA
The Road to DITA
 
Agile Software Development
Agile Software DevelopmentAgile Software Development
Agile Software Development
 
Adamson "Blueprint for Managing Your Project"
Adamson "Blueprint for Managing Your Project"Adamson "Blueprint for Managing Your Project"
Adamson "Blueprint for Managing Your Project"
 
Become a Better Developer with Debugging Techniques for Drupal (and more!)
Become a Better Developer with Debugging Techniques for Drupal (and more!)Become a Better Developer with Debugging Techniques for Drupal (and more!)
Become a Better Developer with Debugging Techniques for Drupal (and more!)
 
Devops Devops Devops, at Froscon
Devops Devops Devops, at FrosconDevops Devops Devops, at Froscon
Devops Devops Devops, at Froscon
 
Sharing our best secrets: Design a distributed system from scratch
Sharing our best secrets: Design a distributed system from scratchSharing our best secrets: Design a distributed system from scratch
Sharing our best secrets: Design a distributed system from scratch
 
Introduction to Scrum
Introduction to ScrumIntroduction to Scrum
Introduction to Scrum
 
Agile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_pptAgile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_ppt
 
Distributed teams
Distributed teamsDistributed teams
Distributed teams
 
Distributed_teams
Distributed_teamsDistributed_teams
Distributed_teams
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectively
 
Devops For Drupal
Devops  For DrupalDevops  For Drupal
Devops For Drupal
 
Practical automation for beginners
Practical automation for beginnersPractical automation for beginners
Practical automation for beginners
 
Services, tools & practices for a software house
Services, tools & practices for a software houseServices, tools & practices for a software house
Services, tools & practices for a software house
 

More from DevOpsDays Tel Aviv

GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, SaltoGRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, SaltoDevOpsDays Tel Aviv
 
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...DevOpsDays Tel Aviv
 
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...DevOpsDays Tel Aviv
 
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDogPRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDogDevOpsDays Tel Aviv
 
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...DevOpsDays Tel Aviv
 
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGGDevOpsDays Tel Aviv
 
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...DevOpsDays Tel Aviv
 
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityTHE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityDevOpsDays Tel Aviv
 
THE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABELTHE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABELDevOpsDays Tel Aviv
 
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackCONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackDevOpsDays Tel Aviv
 
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, DeveleapSOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, DeveleapDevOpsDays Tel Aviv
 
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...DevOpsDays Tel Aviv
 
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKHHOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKHDevOpsDays Tel Aviv
 
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBHOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBDevOpsDays Tel Aviv
 
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, IcingaFLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, IcingaDevOpsDays Tel Aviv
 
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITYDevOpsDays Tel Aviv
 
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.ioSLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.ioDevOpsDays Tel Aviv
 
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, AuguryONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, AuguryDevOpsDays Tel Aviv
 
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDevOpsDays Tel Aviv
 
KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...
KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...
KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...DevOpsDays Tel Aviv
 

More from DevOpsDays Tel Aviv (20)

GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, SaltoGRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
 
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
 
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
 
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDogPRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
 
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
 
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
 
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
 
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityTHE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
 
THE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABELTHE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABEL
 
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackCONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
 
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, DeveleapSOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
 
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
 
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKHHOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
 
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBHOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
 
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, IcingaFLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
 
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
 
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.ioSLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
 
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, AuguryONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
 
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
 
KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...
KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...
KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, bolds...
 

Recently uploaded

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 

Recently uploaded (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 

A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

  • 1. Joe Smith November 14, 2017 1 A Culture Of Automation
  • 2. Joe Smith Operations Engineer, Slack Application Operations Team ● Build/Run core systems responsible for Slack ● CDNs, Edge Regions, Web tier, Websockets, development workflow, etc ● Previously: ○ Tech Lead, Aurora/Mesos SRE at Twitter ○ Internal Technology Resident at Google
  • 3. Folks with the desire to build resilient systems These roles may have different names, but share the above goal ● Reliability Engineering ● Operations ● DevOps ● Site Reliability Engineering ● Production Engineering ● Systems Engineering Audience
  • 4. Agenda 1. Running Production Services 2. Runbooks 3. Automation
  • 6. Two Pizza Rule A team should be sized to share around two large pizzas
  • 7. “” “We will take the site down today! 💥” – No one when they wake up
  • 8. ● Careful Planning and Procedures ● Extensive Documentation ● Good Communication Strategies
  • 9. Planning ● Some components are prioritized for speed while others are meant to be canaried and analyzed ● Changes need to be staged to coordinate with each other ● Give teams the tools and visibility they need to make improvements and understand impact ● Identify your rollback strategy ahead of time
  • 10. Documentation ● Each code or procedure change should be paired with an update to easily- readable text ● Help your teammates and yourself weeks from now when you need to understand how things work. ● Do not just describe how systems are structured, explain why they are built that way! ● Additional context can inform future decisions
  • 11. Communication ● Change Management - Coordinating release schedules can be difficult ● Launch Channel - Announce changes, link to more details in a feature- specific Slack channel for the change ● Add links to commits, code reviews, threads in Slack, mailing list posts, and StackOverflow questions ● This enables your team to benefit from the research you've done
  • 12. ● Careful Planning and Procedures ● Extensive Documentation ● Good communication Strategies
  • 13. ● Unexpected changes, forced roll-forwards ● Outdated Runbooks ● Missed Notifications Growing Pains
  • 14. Good Problems to Have As the team grows, it's no longer possible to understand everything that's happening at once. The scope of work is also increasing!
  • 16. “” “Checklists for commonly repeated operational tasks.” – Slack, Runbook README
  • 18. Location ● Google Docs ○ Good Formatting, Mobile Apps, external service ● Wiki ○ Web Interface, Track Changes ● Markdown in git repo (paired with Github) ○ Formatting, offline support, normal Pull Request flow
  • 19. Markdown in Git Repo ● Track changes across revisions ● Optional peer review ● Link to relevant sections ● Clone repo for offline support
  • 20. Runbook Template (thanks to my teammate Megan!) ● ApplicationServer ○ README.md ○ standard_actions.md ○ other_actions.md ○ alerts/ ■ box_failure.md ■ some_alert_name.md
  • 27. Content This is not the place for Design Documentation. These are highly-actionable, succinct descriptions of next steps.
  • 29. “” "Test until fear turns to boredom." – JUnit FAQ, http://junit.sourceforge.net/doc/faq/faq.htm#tests_6
  • 30. “” "Automate once fear turns to boredom" – Ancient SRE Corrollary
  • 31. Beyond Runbooks ● Turn a manual checklist into a testable, repeatable set of steps anyone can run ● Anytime you discover a sharp edge or workaround, this can be codified in the tool ● Reduce sections of "but if this happens, check this dashboard and then do one of three things"
  • 32. The Tooling Workflow Initial Steps This process can evolve over a long time and generally improve things. ● One person has all the knowledge in their head ● That person writes down everything they know in a runbook ● Someone sees an annoying or complicated piece and writes a small script to be run instead for a tiny part of the process The next jump will be the most difficult part!
  • 33. The Tooling Workflow Maintenance ● It feels great to have written your first tool! ● You may be lucky and have no bugs ● Most likely there are some edge cases- that is okay and expected! ● Take some time to figure out what went wrong and how to make things better.
  • 34. The Tooling Workflow Completion ● Later on, another part of the process can be added in and the documentation further updated ● Over time- the runbook becomes "Run this tool we wrote, send bugs to the authors" ● Finally- there is no longer an entry! The tool is run automatically, or the system itself is able to solve that problem
  • 35. “” "The value of humans is to execute Judgement, the value of computers is to execute instructions" – Aron, teammate at Slack
  • 36. Runbook to Automated Workflow 1. Brain 2. Runbook 3. Start of automation 4. Automation evolution (safeguards) 5. Self-contained Tool 6. Fully Automated
  • 37. Process in Code ● Using libraries like fabric, pychef, and boto3 can ease automation ● When there are issues, the code can be reviewed for process changes, git history can be consulted, etc ● No more "I forgot that was changed and followed the old process!" ● Each time someone submits an improvement or workflow tweak, that will always be useful from now on!
  • 38. Thank You! 38 For more information go to: slack.com/jobs
  • 39. Joe Smith Operations Engineer, Slack Application Operations Team ● Come build the future of work! ○ https://slack.com/careers/641062/senior-site-reliability-engineer ● Please reach out and say hi! ○ @Yasumoto on Twitter ● Tools Scaffold ○ https://github.com/Yasumoto/tools

Editor's Notes

  1. I do believe that folks in these roles all have different approaches, and that is excellent. There is a focus of continuity with Operations, and deep architectural expertise with Site Reliability Engineering. I’d love to talk to you all afterward to hear your approaches to building out teams!
  2. We're going to break this talk into 3 pieces, and I would love to talk about any of them for hours But we only have time for one talk this year, so let's see what we'd like to focus on Who here considers them an experienced Operations Engineer or SRE? Who has written at least a few hundred lines of Python? Who has published your own python distribution to Cheeseshop (aka PyPI?)
  3. This is going to set the stage for priorities and approaches based on years running large web services. The focus is primarily around working with a larger group of people— a team that has gone beyond the “two pizza rule”
  4. Once you go beyond this number of engineers, your problems begin to center around organization and collaboration instead of technology (usually!)
  5. There are a few universal best practices which we can use to inform how we structure our work.
  6. Different approaches to doing this For this talk, we're going to consider three pillars of a great operations team These are aspects of rolling out a new tool, procedure, service, or automation
  7. ^ Consider strategies for dealing with mistakes vs. just "do a rollback" ^ Is everyone familiar with Rollback strategies? ^ Whenever you make a change in production, you should consider how are you going to rollback if something breaks! ^ DO NOT come up with your rollback strategy under pressure after things break! ^ Use the _feature-flag_ approach: Hide chef/puppet changes behind a per-node attribute ^ Deploy processes should be the same for a roll-forward as a roll-back ^ Specify which version number should be running and structure the config to match
  8. These written pieces of text will be generally of two types: Design Documentation describing architecture and function. The second, which we will dig into, are called “Runbooks” These are descriptions about how to run a service. They contain important information about how to spin the service up from scratch. How to repair components when they fail. And most importantly, a corresponding runbook for each page/alert that goes to an engineer.
  9. Documentation is written in the past, and consumed in the present. “Communication” is real-time: this means information is conveyed at the same time it is received.
  10. Any questions about bringing these together? So these are the ideal, but unfortunately it isn't always the case! Let’s talk about corresponding issues we can run into
  11. ^ These all get more difficult over time ^ More people, more difficult to communicate
  12. Here’s where we get to a really fun part of the talk I’ve been on a team where our runbook was one really long wiki page, and now a team with much more structured information. I’m biased, but I think this approach is much better! A little effort has helped make oncall and remediation much easier. Even if you don’t follow anything else I suggest here— make time explicitly for your team to improve documentation!
  13. First, consider whether what you're writing is really a runbook. If you're documenting an architecture or a workflow that's expected to become muscle memory (like how Logstash works or our Chef workflow) then your file belongs in the ops directory. Runbooks are for reapeatable processes. We are all required to refer to the runbook each and every time we execute a process documented in one so that changes to runbooks will reliably take effect.
  14. We’re going to talk about three different components to good runbooks.
  15. Runbooks should be organized into directories in the runbooks directory named descriptively and broadly. For example, message_server as a top level directory, with descriptive files within. Within those files there should be a first-level heading reiterating the name of the file (some_alert_name.md becomes "# Some Alert Name") and many second-level headings, following the template laid out in the example directory.
  16. There should also be a README.md file with a basic explanation of the service, related services, contacts/channels for that service, and related links/graphs. Now remember, when someone realizes something is wrong, this will likely be the first place they go to get the lay on the land. This should be generally helpful information.
  17. Each process documented should present, as prose, any context that may be required to decide if use of this runbook is appropriate or otherwise set the stage for execution. Then an ordered list should be presented. Use "1. " as the prefix for every element in the list to keep diffs tidier; when the Markdown is rendered the numbers will be sequential. Finally, any parting considerations should be presented, again as prose.
  18. It is very important that Markdown be used precisely when writing runbooks. In the heat of the moment, consistent formatting adds clarity and helps us move quickly and confidently.
  19. Specify where a command is to be run before specifying the command ("From staging:", "From your chef-repo:", etc.) Link to graphs, God pages, and other specific resources required during execution of the runbook Include actions like updating the status site, sending particular messages in Slack, deploying the site, enabling notifications, and so on lest they be forgotten or their omission confuse the operator executing your runbook
  20. This is an example of something that gives some context— it clearly points the reader to the next services to investigate. Following along with what Corey said this morning... It could be improved in a few ways.. Most notably by helpfully linking to runbooks about how to diagnose if memcache, databases, and other downstreams are healthy or not! This is something that an experienced team member can read and have no issues with. Someone new on the team may need to burn time getting help from a teammate before isolating the root cause.
  21. Ensure you are creating something that can be acted upon by a teammate at 2am. Otherwise expect to get woken up to chip in!
  22. So here’s my favorite part! To start we’re going to talk about one of my favorite maxims, followed by an ancient proverb.
  23. This is the best rule of thumb we have for ensuring we are directing efforts appropriately For something that is so standard that you know you can handle it, and have seen the edge cases, it’s going to be boring Don’t let yourself or your team stay working on the same issue for too long. This is the perfect opportunity to put in some extra effort to make sure it is automated.
  24. Automatically detect which procedures to follow As we saw, that example runbook was helpful, but sort of passed the buck when it came to further troubleshooting. A potential tool might automatically query health metrics for downstream services to help identify any problems Let’s walk through what that procedure might look like.
  25. Once you have defined the logic for a process, it can be written out in a language for the future ^ That language could also happen to be understood by a machine! Humans can also read code, but machines can execute explicit instructions.
  26. ^ We have found that investing the time to write tools as we understand a process has been worth the effort ^ We are able to work on much more interesting problems, and not repeatedly wake up for the same issues again and again
  27. This can be a large amount of work, and the payoff isn’t always immediate. It can be slow-going to write something which is resilient enough to be run by someone not familiar with both the process and the tool. The payoff will be well-worth it however!
  28. I’m hopeful I’ve motivated you to not only invest in your runbooks, but also looking into improvements beyond just text. Using machines to do some work for us will allow our brains to focus on bigger and more impactful challenges.
  29. If you agree, disagree, or have any questions, I’d love to hear it! Thank you!