The document discusses how to learn from failures through effective retrospective meetings. It recommends that retrospectives include proper preparation like choosing a facilitator and building a timeline. During the meeting, the most involved engineer should provide context, customer impact should be discussed, and the discussion should focus on process improvements rather than blame. Many potential improvements or "remediations" may be identified. Both engineering and product teams should consider improvements to prevent future issues and improve customer experience. Effective retrospectives can help organizations continuously learn and improve.
48. Is your fix a small thing you can add to existing customer tools? Engineering should be able to do this with minimal product sign-off.
49. You can improve your customers’ experience. Your customers, your fellow engineers, and your community can benefit from your own needs and hard-won experience.
Hi, I’m Joy and I’m the SRE director at Heroku.
For those of you who aren’t familiar with Heroku, we’re a Platform as a Service. This means we handle a lot of the operations work for the customers who run on our platform. My job is to keep our platforms maximally stable so our customers can sleep easy at night.
I'm here to talk about failure and why I love it, or at least don’t hate it.
Why would I want to talk about failure? Failure is amazing — it can be our best teacher. As an SRE failure is utterly crucial to me doing my job. Complex systems often fail and we learn so much more from their failure than from success.
A lot of us have probably had this realization. If we didn’t have failure, we’d be out of a job.
So the question today is how do we learn from that failure? How do we learn from that failure in a way that doesn’t make us feel like failures?
Let's start with an SRE war story — everyone loves a good war story.
How many of us have ever run out of integers in an auto-incrementing primary key column in a database?
The whole database halts because it just ran out of numbers. And it’s usually a critical database.
I've seen this failure mode pretty much everywhere I've ever worked as an SRE.
It's pretty embarrassing because seriously -- you just ran out of numbers. It seems really easy to fix but it just keeps cropping up.
So what are some of the reasons that this keeps happening?
Commonly used frameworks have defaults that can come back and bite you later.
Assumptions about the size of your database before it hits production. It’s a good problem to have when you’re successful enough that you outgrew your original assumptions. Two billion — that's a lot of numbers.
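To put “two billion” in perspective, here’s a back-of-the-envelope sketch (my own arithmetic, not from the talk):

```python
# PostgreSQL's INTEGER is a signed 32-bit value, so an auto-incrementing
# primary key backed by it tops out at 2**31 - 1.
INT_MAX = 2**31 - 1        # 2,147,483,647 -- the "two billion"
BIGINT_MAX = 2**63 - 1     # signed 64-bit: about 9.2 quintillion

# How long does that headroom last at a steady insert rate?
inserts_per_second = 1_000
days_for_int = INT_MAX / inserts_per_second / 86_400
print(f"INT lasts ~{days_for_int:.0f} days at 1k inserts/s")

years_for_bigint = BIGINT_MAX / inserts_per_second / 86_400 / 365
print(f"BIGINT lasts ~{years_for_bigint / 1e6:.0f} million years at the same rate")
```

At a steady thousand inserts a second, INT buys you under a month of headroom; BIGINT effectively never runs out.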
Or just not thinking about it at all! That's probably the most common reason.
So we had this happen to us twice in two months. That was pretty bad.
Then it happened a third time almost a year later. For me as the head of SRE seeing this again was pretty painful!
We run a Platform as a Service! Our whole premise is doing operations for our customers so they don’t have to. So how do we fix this problem for real?
First we have to consider it more deeply than we did at the start. If the obvious fix was the long-term fix it wouldn't keep coming up.
It’s simple enough to fix one occurrence of this, just change it to BIGINT. Data starts flowing again, folks go back to business as usual.
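In PostgreSQL terms, that one-off fix is roughly the following (a sketch using a hypothetical `events` table; note the locking caveat in the comments):

```python
# Sketch of the one-off fix, using a hypothetical table named "events".
# ALTER ... TYPE rewrites the table under an ACCESS EXCLUSIVE lock, so on a
# large, hot table you would plan a maintenance window (or an online
# migration) rather than running it casually.
FIX_SQL = """
ALTER TABLE events ALTER COLUMN id TYPE bigint;
ALTER SEQUENCE events_id_seq AS bigint;  -- PostgreSQL 10+: widen the sequence too
"""
print(FIX_SQL.strip())
```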
When this happened the second time we applied a similar fix, and we also poked around manually at other crucial DBs that might have this problem. We even caught a few before failure that way.
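That manual poking can be approximated with a catalog query. Here’s a hedged sketch (the query and helper are my own, not Heroku’s tooling) against the `pg_sequences` view available in PostgreSQL 10+:

```python
# Hypothetical sketch: list sequences whose counters are approaching the
# 32-bit ceiling. This flags the sequence's position only; you still need to
# confirm the backing column is INTEGER rather than BIGINT.
AT_RISK_SQL = """
SELECT schemaname, sequencename, last_value
FROM pg_sequences
WHERE last_value IS NOT NULL
  AND last_value > 0.75 * 2147483647;
"""

INT_MAX = 2**31 - 1

def fraction_consumed(last_value: int, max_value: int = INT_MAX) -> float:
    """Fraction of the usable key space already handed out."""
    return last_value / max_value

# A sequence sitting at 1.9 billion is well past the danger threshold.
assert fraction_consumed(1_900_000_000) > 0.75
```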
We needed to fix this a lot more systemically. Fortunately there’s a good tool for that!
So who here is familiar with retrospectives? I imagine most people here have been to or at least know about them as a place to reflect on past projects or incidents.
One of the main things that SRE instituted at Heroku was retrospectives for all customer-affecting incidents.
If you have been to a retrospective, you probably have been to a boring retrospective. I know I’ve run boring retrospectives. Sorry.
I used to think that if you just got the right people in a room together to chat over an incident, things would naturally happen and we’d have a great, engaging conversation and leave with an amazing solution that would fix our problems. Maybe it would also solve world hunger.
In reality, when you pop a 1 hour meeting on a bunch of folks’ calendars about an outage with no context, this stuff happens:
Some folks don’t show, because they are allergic to calendars, email, and meetings.
The ones that do show might be there because they have an axe to grind, or because they feel like they have to defend themselves.
Establishing the timeline in the meeting leads to bickering and “well, actually” statements that put everyone who wasn’t in a bad mood into a bad mood.
Once everyone is sufficiently miserable, you’re most of the way through your time. You have about 5 minutes to give people some work to do as the cherry on top of the misery sundae.
If that doesn’t happen, everyone is bored and tuned out. The engineers are all doing email. The facilitator is doing email. No one’s paying any attention. At the end of the meeting you have some cursory remediation items and if you are lucky some might actually get done.
For a retrospective to be useful, it can’t be boring. A retrospective is the pivot point between failure and learning. If it’s boring, no one is learning and you might as well give everyone back the time in their day they were sitting in the meeting.
Putting a bunch of highly-paid engineers in a meeting for an hour in which they don’t learn anything is a waste of time, money, and morale.
One problem we had with the first INT rollover is that we didn’t have a retrospective, because folks thought retrospectives were a waste of time for something so trivial and easily understood. They were trying to avoid a boring, time-consuming meeting without a clear sense of what value it would have.
This makes sense. I avoid boring meetings too. In this case, the problem was deceptive. Had we dug into it the first or even the second time we would have been able to discover that.
So how do you have non-boring, useful retrospectives?
One way to create engagement during the retrospective is by preparing for the meeting. Don’t force people to watch the sausage being made.
It is excruciating for someone to attend a meeting only to watch the timeline get pieced together on the spot, or to find that the right people aren’t in the room and the wrong ones are.
Retrospectives are a big time commitment we expect people to make and we need to make them count. People should know that when they show up to a retrospective that they're actually going to get something good out of it.
The facilitator is the most crucial role in this meeting.
The facilitator should familiarize themselves with the facts of the incident -- ideally they are someone adjacent to the incident but not a primary responder, because they’re going to be talking a lot in the meeting, and they shouldn’t be asking questions of themselves. The facilitator should also know who was involved in the incident and why.
You should also build a timeline. This can be done by the facilitator while they're gathering all the facts for the retrospective. This is really important.
When I say build a timeline, I don't mean have everything down to the second of precision and every little tiny detail. It should be an overview.
Think of it as a narrative - how would you tell the story of this event? If you were telling a story, you would have a beginning, middle, and end. You’d cover salient points. And you probably wouldn’t be going for microsecond precision.
Any good engineer needs their tools. When I talk about tools, I don’t just mean stuff that you can check into a repo. I mean mental tools as well.
Here’s an overview of the tools I most commonly use to create engaging retrospectives. There’s nothing magical about any of these -- you can use them too.
I’ll take you through them.
Why chat? Audio transcriptions are error-prone and time-consuming.
We run all our incidents, and indeed our day to day communications, in chat. That means everything has a transcript that you can refer back to. People can communicate in parallel -- you don't have to worry about interrupting someone on the voice bridge, and you don’t need someone to transcribe what’s happening on a voice bridge. You can copy and paste commands as needed.
I don’t care which type of chat you use, as long as you use chat.
Bot tools include incident management tools built on top of our chat bots. One example is here, where we recorded something for the timeline of this incident.
We deploy in chat, and deploys emit chat notifications. Pages alert in chat. We also have incident-management specific tools we wrote that can create notes for building a timeline or questions to follow up on while the incident is ongoing.
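The actual Heroku tooling isn’t public, so purely as an illustrative sketch (class and command names are made up), a chat-bot timeline command might boil down to something like:

```python
from datetime import datetime, timezone

# Hypothetical sketch of an incident chat-bot command: every note dropped in
# the incident channel gets timestamped, so the retrospective timeline
# largely writes itself.
class IncidentLog:
    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.entries = []

    def note(self, text):
        """Record a timeline note with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append((stamp, text))

    def timeline(self):
        """Render the notes as a draft timeline for the retrospective doc."""
        return "\n".join(f"{stamp}  {text}" for stamp, text in self.entries)

log = IncidentLog("inc-1234")
log.note("primary DB refusing writes: integer sequence exhausted")
log.note("ALTER COLUMN to BIGINT running; writes queued")
print(log.timeline())
```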
This makes the gathering information process for the retrospective much easier. It’s also great for transparency and discoverability amongst our engineers.
SitReps (or situation reports) are a common pattern in incident response anywhere.
You just want a periodic summary of the situation. This isn’t what you’re telling customers -- this is what you’re telling people internally. You can use jargon, you can use acronyms, and you don’t need to polish it to the same level as customer-facing communications.
The goal is to make sure that responders have check points to guide themselves with as they work on the incident, especially as new folks come in.
These are also very helpful when you try to understand what happened after an incident -- sitreps give you milestones of what happened and when.
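There’s no one blessed format, but as a sketch (the field names here are my own invention), a sitrep helper might look like:

```python
from datetime import datetime, timezone

def format_sitrep(status, actions, next_update_min):
    """Hypothetical sketch: render a periodic internal situation report."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M UTC")
    lines = [f"SITREP {stamp}", f"Status: {status}", "In progress:"]
    lines += [f"  - {a}" for a in actions]
    lines.append(f"Next update in {next_update_min} min")
    return "\n".join(lines)

print(format_sitrep(
    "writes failing on primary; root cause identified",
    ["promoting patched replica", "drafting customer notice"],
    30,
))
```

Posted on a regular cadence, each sitrep becomes one of the milestones you lean on when reconstructing the timeline later.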
People underestimate the amount of time it takes to run a good retrospective. I'm not just talking about the time that it takes in the meeting. Prior preparation generally shortens the amount of time you all have to spend in a room together.
Block out time for yourself to prepare at least one day before the retro is scheduled.
Make sure all key players (including the incident coordinator and the communications people) are available and plan on attending the meeting. If someone crucial can’t attend, either reschedule or have someone who can speak for them show up instead (such as a team member).
Make sure you have a note-taker, someone who isn’t a primary responder so they won’t have to talk and take notes simultaneously.
In general, be organized. Send out the agenda, including the timeline, the day before. Make sure the room is booked ahead of time and A/V is working.
When everyone shows up with context retrospectives can get to the interesting bits faster. Who doesn’t love dissecting a failure in a complex system?
I love doing this and I know a lot of us do, because that’s why we’re in SRE.
So everyone is in the retrospective and the timeline is done. How do we start?
We set context, we keep it short, and we don't do the litany of timeline reading. Think of telling a story.
Have the most involved engineer give a brief summary of what happened. They should stick to the facts and take less than five minutes. The goal is to make sure that everyone orients themselves to what happened.
One thing I should say is that a retrospective should happen within a week of the incident. People should still have it relatively fresh in their minds by the time you retrospect. Otherwise you’re wasting people’s time, and you’ve missed the chance to strike while the iron is hot and folks are feeling motivated to tackle remediations.
Once you are actually in the meeting you're going to want to read the room.
As a facilitator you need to make sure that everyone is engaged. You yourself need to be a very present and active part of leading the discussion. Don’t be the note-taker -- make sure someone else is the note-taker.
You'll need to ask questions of everyone, especially the quiet folks. Some people will want to dominate the conversation and some people will never want to jump in but that quiet person probably has some really good insights.
You should talk about customer impact!
We should have compassion for what our customers felt during the outage. It’s not just that you woke up at 3 AM because your database ran out of numbers -- a customer who might be running a business on your platform, maybe on the other side of the world, could have lost valuable business or important work, and we need to be aware of that disruption.
Take note of interesting questions, statements, and points of confusion. This gives you jumping off points for deeper conversations.
When we’ve established context we can start diving into these things.
Once you have some starting points to start your questioning, dive in. There are various methods you can use to formulate questions for investigation.
A lot of people like the 5 whys -- I think that it’s interesting (it was created at Toyota) and very logical for engineers to grasp, but I like more flexible methods. I really like John Allspaw’s Infinite Hows. Asking “why” can frame the conversation in a more blameful way than asking “how”.
I don’t think this needs to be prescriptive, though. Simply don’t stop asking questions until you have gotten many layers deep.
Really really important -- if you ever get to human error, keep digging. Your systems are created and operated by humans for humans. Human error is a constant.
I cannot emphasize this enough! You have to work around and with human error.
Have you ever heard the phrase “Linux is user-friendly, it's just picky about its friends”? I disagree. Linux is dangerous. Complex and powerful tools can be dangerous. If you can take out your system with a typo your systems are too fragile, because someone is going to make a typo.
If someone skips a step or makes a typo due to exhaustion or inattention, that’s not on the engineer.
Always assume good intent. Humans get tired, humans get burnt out, humans get distracted. And humans run your systems.
When we build and maintain complex systems we have to develop interfaces for them that are as tolerant as possible to human frailty. The bonus here is that we like working with systems like this. Less friction and stress over using your tools means happier engineers, and happy engineers mean better work.
Usable, beautiful tools are an investment in scaling and reliability.
A reason to be very careful about respecting human failings is that we don't want to make people feel defensive.
When someone feels that they have to defend themselves, they throw up shields.
After that point, you won’t get useful information out of that retrospective. Folks need to feel safe to disclose mistakes they have made. That's how we find out how to fix these gaps in our tools.
One way you can tell a retrospective was good is that at the end you have a ridiculous list of remediation items.
Remediations can range from big and sweeping, to tiny and tactical, to completely absurd.
Reaching the ridiculous means you made it to the end of the questioning line!
Don’t feel you have to do every remediation that comes out of a retrospective. Give yourself the freedom to think about all the options and narrow them down afterwards. Narrow down what you can commit to only after you’ve been creative.
Don’t discount big projects either! That’s the really interesting work.
This is where it helps to understand your company’s process for bringing new work into engineering.
All too often we focus on remediations we can do quickly and within one team. We should be thinking more holistically.
Product is often really excited to hear new ideas. It’s their job to think about how to improve customer experience and what new things customers want.
SREs are great at finding problems and Product is great at finding solutions.
An example of something that came out of a common need for our engineers and our customers -- Heroku Pipelines. We use this for our own internal deployment flows! A lot of Heroku runs on Heroku.
Apps in a pipeline are grouped into “review”, “development”, “staging”, and “production” stages representing different deployment steps in a continuous delivery workflow.
You don’t have to build something huge to be customer facing. A lot of time SREs think of ourselves as internet plumbers (or janitors) -- no one knows we’re there until something’s broken. That’s valuable!
It’s also gratifying to see your work in front of a customer.
Don’t limit yourself to behind the scenes work. Don’t settle for tools that are unpleasant to use. Don’t prevent yourself from bringing up ideas because it will require cross-team or cross-functional collaboration.
You can improve your customers’ experience and your own.
Back to our war story. What did we actually do to fix our INT rollover problems?
Well, we added tooling to easily detect rollover conditions and give you a heads up to fix them before your database comes to a halt.
There’s a Heroku Postgres tool called pg:diagnose, and it will now alert you when 75% and then 90% of your integer sequence is consumed.
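Those thresholds are easy to mirror in your own monitoring. A minimal sketch (my own code, not the actual pg:diagnose implementation):

```python
from typing import Optional

INT_MAX = 2**31 - 1  # ceiling for a PostgreSQL INTEGER-backed sequence

def rollover_alert(last_value: int, max_value: int = INT_MAX) -> Optional[str]:
    """Return an alert level once a sequence crosses the warning thresholds."""
    used = last_value / max_value
    if used >= 0.90:
        return "red"     # act now: the sequence is nearly exhausted
    if used >= 0.75:
        return "yellow"  # schedule the BIGINT migration
    return None

assert rollover_alert(1_000_000) is None
assert rollover_alert(int(0.80 * INT_MAX)) == "yellow"
assert rollover_alert(int(0.95 * INT_MAX)) == "red"
```

The point of the two levels is lead time: yellow gives you weeks to plan a migration, red means you act before the database halts.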
We also added process. There’s a productionization checklist that services should be going through before they hit production. We added an item to ensure sequences are in BIGINT. There’s no reason for us to use integer rather than bigint columns for sequences in Heroku Postgres.
And of course we could and will improve.
We’d like to have this check scan our production databases automatically and alert before failure. Then of course, we could give that option to our customers.
We are also sending pull requests to at least one common open source framework (yes, still looking at you, ActiveRecord) to set better defaults.
Thanks for sticking with me while I explain why I love failure.
We’re all going to fail at some point, and operating distributed systems makes the odds much higher. Failure is much easier to take when you remember that every failure is a chance to learn.
Make them count!
Some relevant links! I hope these help you.
Thank you for your time today.