The Virtuous Cycle
Getting Good Things Out of Bad Failures
Joy Scharmen
I’m here to talk about failure.
Why talk about failure?
Failure is amazing.
Failure is our best teacher.
How do we learn from failure?
Have you ever run out of integers in an auto-incrementing primary key column in a database?
Looking at you, ActiveRecord.
Frameworks that default to INT
Assumptions about the size of your database (before it hits production).
Just not thinking about it
I’m just happy that it works at all.
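A minimal sketch of the arithmetic behind this failure, assuming the key column is a signed 32-bit INT (as in MySQL `INT` or PostgreSQL `integer`). The function names and the example numbers are made up for illustration; they are not from the talk.

```python
# Largest value a signed 32-bit INT column can hold.
INT32_MAX = 2**31 - 1  # 2147483647
# Largest value a signed 64-bit BIGINT column can hold, for comparison.
INT64_MAX = 2**63 - 1


def fraction_remaining(current_max_id, column_max=INT32_MAX):
    """Return the share of the key space that is still unused (0.0 to 1.0)."""
    if not 0 <= current_max_id <= column_max:
        raise ValueError("current_max_id must be between 0 and column_max")
    return 1 - current_max_id / column_max


def days_until_exhaustion(current_max_id, ids_per_day, column_max=INT32_MAX):
    """Return whole days left before the column runs out, at a steady insert rate."""
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    return (column_max - current_max_id) // ids_per_day


if __name__ == "__main__":
    # Hypothetical table: highest id is 2 billion, growing by 1 million ids per day.
    print(f"{fraction_remaining(2_000_000_000):.1%} of the INT key space is left")
    print(days_until_exhaustion(2_000_000_000, 1_000_000), "days until exhaustion")
```

A check like this is cheap to run on a schedule, which is one way to notice the problem before production does.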
Oops, I did it again.
We had this happen to us twice.
...then it happened a third time.
And we’re an operations
company.
┬─┬ノ( ゜-゜ノ)
First, consider it more deeply.
Σ(-᷅_-᷄๑)
We fixed one occurrence. It was
simple.
It happened again. Same fix.
Then it happened again.
Obviously what we are doing here isn’t working.
Who here is familiar with
retrospectives?
ret·ro·spec·tive
ˌretrəˈspektiv/
adjective
1. looking back on or dealing with past events or situations.
Who has been to a boring
retrospective?
I have. I’ve run them. Sorry.
A retrospective is the pivot point
between failure and learning.
If it’s boring, no one is learning.
How do we have non-boring
retrospectives?
Create engagement. Prepare!
Don’t force people to watch the sausage being
made.
Before the retrospective:
Choose a facilitator. They should know who was involved and why.
Build a timeline. Gather your facts.
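One way to build the timeline is to merge timestamped entries from each source into a single chronological list. This is only a sketch: the `(timestamp, source, note)` entry shape, the function names, and the sample entries are placeholders invented for illustration, not the tooling described in the talk.

```python
from datetime import datetime, timezone


def build_timeline(*sources):
    """Merge lists of (timestamp, source, note) entries into chronological order."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry[0])


def format_timeline(entries):
    """Render one line per entry, ready to paste into the retrospective agenda."""
    return "\n".join(
        f"{ts.strftime('%H:%M')} UTC [{src}] {note}" for ts, src, note in entries
    )


# Placeholder entries; real ones would come from chat logs, alerts, and so on.
chat_entries = [
    (datetime(2000, 1, 1, 10, 5, tzinfo=timezone.utc), "chat", "placeholder: incident declared"),
]
alert_entries = [
    (datetime(2000, 1, 1, 10, 0, tzinfo=timezone.utc), "alert", "placeholder: first alert"),
]

if __name__ == "__main__":
    print(format_timeline(build_timeline(chat_entries, alert_entries)))
```

Keeping every timestamp in UTC avoids ordering mistakes when the people involved sit in different time zones.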
Use your tools wisely
“We become what we behold. We shape our
tools, and thereafter our tools shape us.”
― Marshall McLuhan
My Tools For
Incident Management: ChatOps, Bot Tools, SitReps
Retrospective Prep: Time, Outreach, Organization
My personal incident management tool belt:
ChatOps
Bot Tools
SitReps
My personal retrospective tool belt:
Time: Block out time.
Outreach: Have roles defined.
Organization: Send out the agenda, including the timeline, the day before the retrospective.
People should show up to a
retrospective with context to
begin a discussion.
Everyone is in the
retrospective. The timeline
is done. How do we start?
Have the most involved
engineer give a brief
summary of what
happened.
Make sure everyone is engaged.
Read the room.
Be compassionate toward your customers.
Talk about customer
impact.
Take note.
Pick the point you want to
start from and dive in.
If you ever get to “human
error”, keep digging.
No, really.
If you ever get to “human
error”, keep digging.
Most Important:
Always Assume Good
Intent
Defensiveness kills
retrospection.
One way you can tell a
retrospective is good:
you have a ridiculous list of
remediation items.
Remediations can be anything from:
“fix typo on line 5”
to
“re-architect the whole platform”
to
“make the speed of light go faster”.
Don’t do every remediation,
and
don’t discount big projects!
What do you do with all of
these remediations?
Bring them to product as well as engineering!
Product can be your best
friend.
Do you have a need? Your customers do
too.
Product is great at getting needs in front of
customers.
Heroku Pipelines
Pipelines is a product that came out of an
engineering need.
Is your fix a small thing you
can add to existing customer
tools?
Engineering should be able to do this with minimal product sign-off.
You can improve your
customers’ experience.
Your customers, your fellow engineers, and your community can benefit from your own needs and hard-won experience.
Back to the story.
Done: Tooling
Done: Process
Next: Automation
Next: Fix inputs
* https://github.com/rails/rails/pull/24962
Every failure is a
chance to learn.
Make those chances count.
Thank you.
Joy Scharmen / @peculiaire / joy@heroku.com
Retrospective Resource Wiki: http://retrospectivewiki.org
Infinite Hows: https://www.oreilly.com/ideas/the-infinite-hows
Heroku Dev Center: https://devcenter.heroku.com
Retrospective Template: https://github.com/peculiaire/incident-lifecycle/blob/master/retrotemplate.md

COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 

Joy Scharmen - The Virtuous Cycle: Getting Good Things Out of Bad Failures

Editor's Notes

  1. Hi, I’m Joy and I’m the SRE director at Heroku. For those of you who aren’t familiar with Heroku, we’re a Platform as a Service. This means we handle a lot of the operations work for the customers who run on our platform. My job is to keep our platforms maximally stable so our customers can sleep easy at night.
  2. I'm here to talk about failure and why I love it, or at least don’t hate it.
  3. Why would I want to talk about failure? Failure is amazing: it can be our best teacher. As an SRE, failure is utterly crucial to me doing my job. Complex systems often fail, and we learn so much more from their failure than from success. A lot of us have probably had this realization. If we didn’t have failure, we’d be out of a job.
  4. So the question today is how do we learn from that failure? How do we learn from that failure in a way that doesn’t make us feel like failures? Let's start with an SRE war story — everyone loves a good war story.
  5. How many of us have ever run out of integers in an auto-incrementing primary key column in a database? The whole database halts because it just ran out of numbers. And it’s usually a critical database. I've seen this failure mode pretty much everywhere I've ever worked as an SRE.
  6. It's pretty embarrassing because seriously -- you just ran out of numbers. It seems really easy to fix but it just keeps cropping up. So what are some of the reasons that this keeps happening?
  7. Commonly used frameworks have defaults that can come back and bite you later.
  8. Assumptions about the size of your database before it hits production. It’s a good problem to have when you’re successful enough that you outgrew your original assumptions. Two billion — that's a lot of numbers.
  9. Or just not thinking about it at all! That's probably the most common reason.
  10. So we had this happen to us twice in two months. That was pretty bad. Then it happened a third time almost a year later. For me, as the head of SRE, seeing this again was pretty painful!
  11. We run a Platform as a Service! Our whole premise is doing operations for our customers so they don’t have to. So how do we fix this problem for real?
  12. First we have to consider it more deeply than we did at the start. If the obvious fix was the long-term fix it wouldn't keep coming up.
  13. It’s simple enough to fix one occurrence of this, just change it to BIGINT. Data starts flowing again, folks go back to business as usual.
  14. When this happened the second time we applied a similar fix, and we also poked around manually at other crucial DBs that might have this problem. We even caught a few before failure that way.
  15. We needed to fix this a lot more systemically. Fortunately there’s a good tool for that!
  16. So who here is familiar with retrospectives? I imagine most people here have been to them, or at least know about them, as a place to reflect on past projects or incidents. One of the main things that SRE instituted at Heroku was retrospectives for all customer-affecting incidents.
  17. If you have been to a retrospective, you probably have been to a boring retrospective. I know I’ve run boring retrospectives. Sorry. I used to think that if you just got the right people in a room together to chat over an incident, things would naturally happen and we’d have a great, engaging conversation and leave with an amazing solution that would fix our problems. Maybe it would also solve world hunger. In reality, when you pop a 1 hour meeting on a bunch of folks’ calendars about an outage with no context, this stuff happens: Some folks don’t show, because they are allergic to calendars, email, and meetings. The ones that do show might be there because they have an axe to grind, or because they feel like they have to defend themselves. Establishing the timeline in the meeting leads to bickering and “well, actually” statements that put everyone who wasn’t in a bad mood into a bad mood. Once everyone is sufficiently miserable, you’re most of the way through your time. You have about 5 minutes to give people some work to do as the cherry on top of the misery sundae. If that doesn’t happen, everyone is bored and tuned out. The engineers are all doing email. The facilitator is doing email. No one’s paying any attention. At the end of the meeting you have some cursory remediation items and if you are lucky some might actually get done.
  18. For a retrospective to be useful, it can’t be boring. A retrospective is the pivot point between failure and learning. If it’s boring, no one is learning and you might as well give everyone back the time in their day they were sitting in the meeting. Putting a bunch of highly-paid engineers in a meeting for an hour in which they don’t learn anything is a waste of time, money, and morale.
  19. One problem we had with the first INT rollover is that we didn’t have a retrospective, because folks thought that they were a waste of time for something so trivial and easily understood. They were trying to avoid a boring time consuming meeting without a clear sense of what value it would have. This makes sense. I avoid boring meetings too. In this case, the problem was deceptive. Had we dug into it the first or even the second time we would have been able to discover that. So how do you have non-boring, useful retrospectives?
  20. One way to create engagement during the retrospective is by preparing for the meeting. Don’t force people to watch the sausage being made. It is excruciating for someone to attend a meeting and then have to figure out the timeline, or to find that you don’t have the right people, or even that you have the wrong people in the room. Retrospectives are a big time commitment we expect people to make and we need to make them count. People should know that when they show up to a retrospective they're actually going to get something good out of it.
  21. The facilitator is the most crucial role in this meeting. The person should familiarize themself with the facts of the incident -- so ideally they are someone who is adjacent to the incident but not a primary responder, because they're going to be talking a lot in the meeting, and they shouldn’t be asking questions of themselves. The facilitator should know who was involved and why they were involved in the incident.
  22. You should also build a timeline. This can be done by the facilitator while they're gathering all the facts for the retrospective. This is really important. When I say build a timeline, I don't mean have everything down to the second of precision and every little tiny detail. It should be an overview. Think of it as a narrative - how would you tell the story of this event? If you were telling a story, you would have a beginning, middle, and end. You’d cover salient points. And you probably wouldn’t be going for microsecond precision.
  23. Any good engineer needs their tools. When I talk about tools, I don’t just mean stuff that you can check into a repo. I mean mental tools as well.
  24. Here’s an overview of the tools I most commonly use to create engaging retrospectives. There’s nothing magical about any of these -- you can use them too. I’ll take you through them.
  25. Why chat? Audio transcriptions are error-prone and time consuming. We run all our incidents, and indeed our day to day communications, in chat. That means everything has a transcript that you can refer back to. People can communicate in parallel -- you don't have to worry about interrupting someone on the voice bridge, and you don’t need someone to transcribe what’s happening on a voice bridge. You can copy and paste commands as needed. I don’t care which type of chat you use, as long as you use chat.
  26. Bot tools include incident management tools built on top of our chat bots. One example is here, where we recorded something for the timeline of this incident. We deploy in chat, and deploys emit chat notifications. Pages alert in chat. We also have incident-management specific tools we wrote that can create notes for building a timeline or questions to follow up on while the incident is ongoing. This makes the gathering information process for the retrospective much easier. It’s also great for transparency and discoverability amongst our engineers.
  27. SitReps (or situation reports) are a common pattern in incident response anywhere. You just want a periodic summary of the situation. This isn't what you're telling customers -- this is what you're telling people internally. You can of course use jargon, you can use acronyms, and you don't need to polish it to the same level as you would customer-facing communications. The goal is to make sure that responders have checkpoints to guide themselves with as they work on the incident, especially as new folks come in. These are also very helpful when you try to understand what happened after an incident -- sitreps give you milestones of what happened and when.
  28. People underestimate the amount of time it takes to run a good retrospective. I'm not just talking about the time that it takes in the meeting. Prior preparation generally shortens the amount of time you all have to spend in a room together. Block out time for yourself to prepare at least one day before the retro is scheduled.
  29. Make sure all key players (including the incident coordinator and the communications people) are available and plan on attending the meeting. If someone crucial can’t attend, either reschedule or have someone who can speak for them show up instead (such as a team member). Make sure you have a note-taker, someone who isn’t a primary responder so they won’t have to talk and take notes simultaneously.
  30. In general, be organized. Send out the agenda, including the timeline, the day before. Make sure the room is booked ahead of time and A/V is working.
  31. When everyone shows up with context, retrospectives can get to the interesting bits faster. Who doesn’t love dissecting a failure in a complex system? I love doing this and I know a lot of us do, because that’s why we’re in SRE.
  32. So everyone is in the retrospective and the timeline is done. How do we start? We set context, we keep it short, and we don't do the litany of timeline reading. Think of telling a story.
  33. Have the most involved engineer give a brief summary of what happened. They should stick to the facts and really take less than five minutes. The goal is to make sure that everyone really orients themselves to what happened. One thing I should say is that a retrospective should happen within a week of the incident. People should still have this relatively fresh in their mind by the time you go to retrospect. Otherwise you're wasting people's time, and you missed a chance to strike while the iron is hot and folks are feeling motivated to tackle remediations.
  34. Once you are actually in the meeting you're going to want to read the room. As a facilitator you need to make sure that everyone is engaged. You yourself need to be a very present and active part of leading the discussion. Don't be the note taker -- make sure someone else is the note taker. You'll need to ask questions of everyone, especially the quiet folks. Some people will want to dominate the conversation and some people will never want to jump in, but that quiet person probably has some really good insights.
  35. You should talk about customer impact! We should have compassion for what our customers felt during the outage. It's not just that you woke up at 3 AM because your database ran out of numbers -- your customer, who might be running a business on your platform and might be on the other side of the world, could have lost some valuable business or some important work, and we need to be aware of that disruption.
  36. Take note of interesting questions, statements, and points of confusion. This gives you jumping off points for deeper conversations. When we’ve established context we can start diving into these things.
  37. Once you have some starting points to start your questioning, dive in. There are various methods you can use to formulate questions for investigation. A lot of people like the 5 whys -- I think that it’s interesting (it was created at Toyota) and very logical for engineers to grasp, but I like more flexible methods. I really like John Allspaw’s Infinite Hows. Asking “why” can frame the conversation in a more blameful way than asking “how”. I don’t think this needs to be prescriptive, though. Simply don’t stop asking questions until you have gotten many layers deep.
  38. Really really important -- if you ever get to human error, keep digging. Your systems are created and operated by humans for humans. Human error is a constant.
  39. I cannot emphasize this enough! You have to work around and with human error. Have you ever heard the phrase “Linux is user-friendly, it's just picky about its friends”? I disagree. Linux is dangerous. Complex and powerful tools can be dangerous. If you can take out your system with a typo your systems are too fragile, because someone is going to make a typo.
  40. If someone skips a step or makes a typo due to exhaustion or inattention, that’s not on the engineer. Always assume good intent. Humans get tired, humans get burnt out, humans get distracted. And humans run your systems. When we build and maintain complex systems we have to develop interfaces for them that are as tolerant as possible to human frailty. The bonus here is that we like working with systems like this. Less friction and stress over using your tools means happier engineers, and happy engineers mean better work. Usable, beautiful tools are an investment in scaling and reliability.
  41. A reason to be very careful about respecting human failings is that we don't want to make people feel defensive. When someone feels that they have to defend themselves, they throw up shields. After that point, you won’t get useful information out of that retrospective. Folks need to feel safe to disclose mistakes they have made. That's how we find out how to fix these gaps in our tools.
  42. One way you can tell a retrospective was good is that in the end you have a ridiculous list of remediation items.
  43. Remediations can range from big and sweeping, to tiny and tactical, to completely absurd. The ridiculous ones mean you made it to the end of the questioning line!
  44. Don’t feel you have to do every remediation that comes out of a retrospective. Give yourself the freedom to think about all the options and narrow them down afterwards. Narrow down what you can commit to only after you’ve been creative. Don’t discount big projects either! That’s the really interesting work. This is where it helps to understand your company’s process for bringing new work into engineering.
  45. All too often we focus on remediations we can do quickly and within one team. We should be thinking more holistically.
  46. Product is often really excited to hear new ideas. It’s their job to think about how to improve customer experience and what new things customers want. SREs are great at finding problems and Product is great at finding solutions.
  47. An example of something that came out of a common need for our engineers and our customers -- Heroku Pipelines. We use this for our own internal deployment flows! A lot of Heroku runs on Heroku. Apps in a pipeline are grouped into “review”, “development”, “staging”, and “production” stages representing different deployment steps in a continuous delivery workflow.
  48. You don’t have to build something huge to be customer facing. A lot of the time SREs think of ourselves as internet plumbers (or janitors) -- no one knows we’re there until something’s broken. That’s valuable! It’s also gratifying to see your work in front of a customer.
  49. Don’t limit yourself to behind the scenes work. Don’t settle for tools that are unpleasant to use. Don’t prevent yourself from bringing up ideas because it will require cross-team or cross-functional collaboration. You can improve your customers’ experience and your own.
  50. Back to our war story. What did we actually do to fix our INT rollover problems?
  51. Well, we added tooling to easily detect rollover conditions and give you a heads up to fix them before your database comes to a halt. There’s a Heroku Postgres tool called pg:diagnose, and it will now alert you when 75% and then 90% of your integer sequence is consumed.
  52. We also added process. There’s a productionization checklist that services should be going through before they hit production. We added an item to ensure sequences are in BIGINT. There’s no reason for us to use integer rather than bigint columns for sequences in Heroku Postgres. https://www.flickr.com/photos/peretzpup/2361847171/
  53. And of course we could and will improve. We’d like to have this check scan our production databases automatically and alert before failure. Then of course, we could give that option to our customers. https://www.flickr.com/photos/nnova/2967902322/
  54. We also are sending pull requests to at least one common open source framework (yes, still looking at you, ActiveRecord) to set better defaults.
  55. Thanks for sticking with me while I explain why I love failure. We’re all going to fail at some point, and operating distributed systems means your odds get much higher. It’s way easier to fail when you remember that every failure is a chance to learn. Make them count!
  56. Some relevant links! I hope these help you. Thank you for your time today.
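
As a rough illustration of the INT rollover in the war story (notes 5 through 13) and the 75% and 90% warnings described in note 51, here is a minimal Python sketch. It is not how pg:diagnose is implemented; the function names, the status labels, and the constant insert rate are assumptions made for the example. The only facts taken from the notes are the two thresholds and the roughly two billion ceiling of a signed 32-bit INT column.

```python
# Illustrative sketch only: this is NOT the pg:diagnose implementation.
# Function names and status labels are hypothetical.

INT_MAX = 2**31 - 1      # largest value a signed 32-bit INT key can hold
BIGINT_MAX = 2**63 - 1   # largest value a signed 64-bit BIGINT key can hold

WARN_AT = 0.75    # first threshold mentioned in note 51
ALERT_AT = 0.90   # second threshold mentioned in note 51


def sequence_usage(last_value: int, max_value: int = INT_MAX) -> float:
    """Return the fraction of the key space already consumed."""
    return last_value / max_value


def classify(last_value: int, max_value: int = INT_MAX) -> str:
    """Map the consumed fraction onto a hypothetical status label."""
    used = sequence_usage(last_value, max_value)
    if used >= ALERT_AT:
        return "alert"
    if used >= WARN_AT:
        return "warn"
    return "ok"


def days_until_exhausted(last_value: int, inserts_per_day: int,
                         max_value: int = INT_MAX) -> float:
    """Estimate remaining days, assuming a constant insert rate."""
    return (max_value - last_value) / inserts_per_day
```

For example, `classify(1_700_000_000)` lands in the warning band for an INT key, while the same value checked against `BIGINT_MAX` is nowhere near either threshold, which is the point of the BIGINT item on the productionization checklist.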