Anatomy of
Three Incidents
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup

Anatomy of Three Incidents -- Commonalities and Lessons

The best response to a system outage is not "What did you do?", but "What did we learn?" This session will walk through three system-wide outages at Google, at Stitch Fix, and at WeWork—their incidents, aftermaths, and recoveries. In all cases, many things went right and a few went wrong; also in all cases, because of blameless cultures, we buckled down, learned a lot, and made substantial improvements in the systems for the future. Looking back with the perspective of 20-20 hindsight, all of these incidents were seminal events that changed the focus and trajectory of engineering at each organization. You will leave with a set of actionable suggestions in dealing with customers, engineering teams, and upper management. You will also enjoy a few war stories from the trenches.


  1. Anatomy of Three Incidents
     Randy Shoup
     @randyshoup
     linkedin.com/in/randyshoup
  2. Background
  3. @randyshoup
  4. App Engine Outage - Oct 2012
  5. App Engine Outage - Oct 2012
     http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
  6. App Engine Reliability Fixit
     • Step 1: Identify the Problem
       o All team leads and senior engineers met in a room with a whiteboard
       o Enumerated all known and suspected reliability issues
       o Too much technical debt had accumulated
       o Reliability issues had not been prioritized
       o Identify 8-10 themes
  7. App Engine Reliability Fixit
     • Step 2: Understand the Problem
       o Each theme assigned to a senior engineer to investigate
       o Timeboxed for 1 week
       o After 1 week, all leads came back with
         • Detailed list of issues
         • Recommended steps to address them
         • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
  8. App Engine Reliability Fixit
     • Step 3: Consensus and Prioritization
       o Leads discussed themes and prioritized work
       o Assigned engineers to tasks
  9. App Engine Reliability Fixit
     • Step 4: Implementation and Follow-up
       o Engineers worked on assigned tasks
       o Simple spreadsheet of task status, which engineers updated weekly
       o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
  10. App Engine Reliability Fixit
     • Results
       o 10x reduction in reliability issues
       o Improved team cohesion and camaraderie
       o Broader participation and ownership of the future health of the platform
       o Still remembered several years later
  11. @randyshoup
  12. Stitch Fix – Oct / Nov 2016
     • (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
     • (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
     • (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
     • (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
     • (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack]
     • (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
     • (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
     • (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
     • (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
     • (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
     • (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
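The incident log on slide 12 is easier to reason about once it is grouped by cause. The sketch below is only an illustration: the durations are the approximate figures from the slide converted to minutes, and the names `INCIDENTS`, `summarize_by_cause`, and `SUMMARY` are hypothetical, not anything from the talk.

```python
from collections import defaultdict

# (date, affected system, approximate outage in minutes, cause),
# transcribed from slide 12. Durations are the slide's "~" estimates.
INCIDENTS = [
    ("2016-11-08", "Spectre", 3, "Shared Database"),
    ("2016-11-05", "Spectre", 5, "Shared Database"),
    ("2016-10-25", "All systems", 5, "Shared Database"),
    ("2016-10-24", "All systems", 5, "Shared Database"),
    ("2016-10-21", "All systems", 210, "DDOS attack"),         # ~3 1/2 hours
    ("2016-10-18", "All systems", 3, "Shared Database"),
    ("2016-10-17", "All systems", 20, "Shared Database"),
    ("2016-10-13", "Minx escalation", 120, "Zendesk outage"),  # ~2 hours
    ("2016-10-11", "Label printing", 10, "FedEx outage"),
    ("2016-10-10", "Label printing", 15, "FedEx outage"),
    ("2016-10-10", "All systems", 10, "Shared Database"),
]


def summarize_by_cause(incidents):
    """Return {cause: (incident_count, total_minutes)}."""
    totals = defaultdict(lambda: [0, 0])
    for _date, _system, minutes, cause in incidents:
        totals[cause][0] += 1
        totals[cause][1] += minutes
    return {cause: tuple(pair) for cause, pair in totals.items()}


SUMMARY = summarize_by_cause(INCIDENTS)

if __name__ == "__main__":
    # Most frequent cause first.
    for cause, (count, minutes) in sorted(SUMMARY.items(), key=lambda kv: -kv[1][0]):
        print(f"{cause}: {count} incidents, ~{minutes} minutes")
```

Grouped this way, the shared database accounts for 7 of the 11 incidents (about 51 minutes in total), while the longest single outages came from external causes.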
  13. Database Stability Problems
     • 1. Applications contended on common tables
     • 2. Scalability limited by database connections
     • 3. One application could take down entire company
  14. Stability Retrospective
     • Step 1: Identify the Problem
     • Step 2: Understand the Problem
     • Step 3: Consensus and Prioritization
     • Step 4: Implementation and Follow-Up
     • Results
  15. Stability Solutions
     • 1. Focus on expensive queries
       o Log
       o Eliminate
       o Rewrite
       o Reduce
     • 2. Manage database connections via connection concentrator
     • 3. Stability and Scalability Program
       o Ongoing 25% investment in services migration
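The "Log" step under "Focus on expensive queries" on slide 15 can be sketched as a thin timing wrapper around query execution. This is a minimal illustration under stated assumptions, not the tooling used at Stitch Fix: the slides do not name the database, so `sqlite3` stands in for it here, and `timed_query`, `SLOW_QUERY_THRESHOLD_S`, and the `orders` table are hypothetical.

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("slow_queries")

# Hypothetical threshold; a real value would be tuned per workload.
SLOW_QUERY_THRESHOLD_S = 0.100


def timed_query(conn, sql, params=(), threshold_s=SLOW_QUERY_THRESHOLD_S):
    """Run a query and return (rows, elapsed_seconds).

    Queries that take at least threshold_s seconds are logged so they can
    later be eliminated, rewritten, or reduced.
    """
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if elapsed >= threshold_s:
        log.warning("slow query (%.1f ms): %s", elapsed * 1000, sql)
    return rows, elapsed


# Stand-in database with a hypothetical table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",), ("shipped",), ("open",)],
)

rows, elapsed = timed_query(
    conn, "SELECT id FROM orders WHERE status = ? ORDER BY id", ("open",)
)
```

Once the slow statements are visible in a log, the remaining steps on the slide (eliminate, rewrite, reduce) have a concrete list to work from.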
  16. @randyshoup
  17. Login Issues - 2019
     • Problem: Some members unable to log in
     • Inconsistent representations across different services in the system
     • Over time, simple system interactions grew increasingly complex and convoluted
     • Not enough graceful degradation or automated repair
  18. Login Retrospective
     • Step 1: Identify the Problem
     • Step 2: Understand the Problem
     • Step 3: Consensus and Prioritization
     • Step 4: Implementation and Follow-Up
  19. Login Solutions
     • 1. Clean up user data
       o Find inconsistencies
       o Track inconsistency metrics
       o Identify and fix contributing processes and applications
     • 2. User state machines
       o Define user journeys as explicit state machines
       o Refine and correct via cross-functional feedback
       o Implement state machines in code
     • 3. “Pandora” Program
       o Rewrite core identity system into set of user capabilities
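The "User state machines" item on slide 19 can be made concrete with a small sketch of a user journey expressed as an explicit state machine. The states and transitions below (`INVITED`, `ACTIVE`, `SUSPENDED`, `CLOSED`) are invented for illustration; the talk does not list the actual states or describe how they were implemented.

```python
from enum import Enum


class UserState(Enum):
    # Hypothetical states, chosen only to illustrate the idea.
    INVITED = "invited"
    ACTIVE = "active"
    SUSPENDED = "suspended"
    CLOSED = "closed"


# Every legal transition is listed explicitly; anything else is rejected.
ALLOWED_TRANSITIONS = {
    UserState.INVITED: {UserState.ACTIVE, UserState.CLOSED},
    UserState.ACTIVE: {UserState.SUSPENDED, UserState.CLOSED},
    UserState.SUSPENDED: {UserState.ACTIVE, UserState.CLOSED},
    UserState.CLOSED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the transition table."""


class User:
    def __init__(self, user_id, state=UserState.INVITED):
        self.user_id = user_id
        self.state = state

    def transition(self, new_state):
        """Move to new_state, or raise InvalidTransition if it is not allowed."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise InvalidTransition(
                f"{self.state.name} -> {new_state.name} is not allowed"
            )
        self.state = new_state

    def can_log_in(self):
        # In this sketch only ACTIVE users may log in.
        return self.state is UserState.ACTIVE
```

Keeping the transition table in one place means every service can check a user's state against the same definition, which is one way to avoid the inconsistent representations described on slide 17.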
  20. Common Elements
     • Unintentional, long-term accumulation of small, individually reasonable decisions
     • “Compelling event” catalyzes long-term change
     • Blameless culture makes learning and improvement possible
     • Structured post-incident approach
  21. Lessons
     • Engineering Tradeoffs
     • Compelling Event
     • Driving Improvement
  22. Lessons
     • Engineering Tradeoffs
     • Compelling Event
     • Driving Improvement
  23. Vicious Cycle of Technical Debt
     • Technical Debt
     • “No time to do it right”
     • Quick-and-dirty
  24. “Do you have time to do it twice?”
     “We don’t have time to do it right!”
  25. The more constrained you are on time or resources, the more important it is to get it done the first time.
  26. Negotiating Tradeoffs
     • Scope
     • Time
     • Quality
  27. Virtuous Cycle of Investment
     • Solid Foundation
     • Confidence
     • Faster and Better
     • Quality Investment
  28. Lessons
     • Engineering Tradeoffs
     • Compelling Event
     • Driving Improvement
  29. During the Incident
     • Focus on restoring service
       o Everything else is secondary, and should wait
     • Shield the team
     • Clear, structured communication
       o Even when there is nothing to report!
     https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  30. After the Incident
     • Blameless postmortem
     • Identify and understand the contributing factors
     • Action items and Learnings
     • Follow Up!
     https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  31. Psychological Safety
     • Team is safe for interpersonal risk-taking
     • “Being able to show and employ one’s self without fear of negative consequences”
     • More important than any other factor in team success
  32. “Finally we can prioritize fixing that broken system!”
  33. Inclusive Decisionmaking
     • Make better business decisions 87% of the time
     • Make decisions 2x faster with 1/2 the meetings
     • Deliver 60% better business results
     Cloverpop Inclusive Decisionmaking study, 2016
     As we improve diversity, decisionmaking improves
  34. Lessons
     • Engineering Tradeoffs
     • Compelling Event
     • Driving Improvement
  35. Frame the Problem: Quality and reliability are business concerns
  36. Use Common Currency
     • Time
     • Money
     • People
  37. 15 Million
     “Never let a good crisis go to waste.”
  38. “Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.”
     -- John Allspaw, Adaptive Capacity Labs
  39. Improvement Budget
     • Explicit resource investment
       o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts)
     • Retain autonomy, Provide transparency
       o Making these decisions is exactly why they hired you
  40. Lessons
     • Engineering Tradeoffs
     • Compelling Event
     • Driving Improvement
  41. Incident Response Patterns
     • Incident Roles
     • Incident Triggers
     • On-Call Rotation and Onboarding
     • Incident Command Training
     • Incident Communication Plan
     • Periodic Incident Updates
     • Shared Incident State Doc
     • Incident Call Recording
     • Incident Swarming
     • Local / Global Incident Reviews
     • Post-Review Improvement Items
     https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  42. Thank you!
     @randyshoup
     linkedin.com/in/randyshoup
     medium.com/@randyshoup
