Learning from Learnings: Anatomy of Three Incidents

The best response to a system outage is not "What did you do?" but "What did we learn?" This session walks through three system-wide outages, at Google, Stitch Fix, and WeWork: their incidents, aftermaths, and recoveries. In all cases many things went right and a few went wrong; and in all cases, because of blameless cultures, we buckled down, learned a lot, and made substantial improvements to the systems for the future. Looking back with 20/20 hindsight, all of these incidents were seminal events that changed the focus and trajectory of engineering at each organization. You will leave with a set of actionable suggestions for dealing with customers, engineering teams, and upper management. You will also enjoy a few war stories from the trenches.

  1. Learning from Learnings: Anatomy of Three Incidents
     Randy Shoup (@randyshoup, linkedin.com/in/randyshoup)
  2. Background @randyshoup
  3. App Engine Outage, Oct 2012
  4. App Engine Outage, Oct 2012: http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
  5. App Engine Reliability Fixit
     • Step 1: Identify the Problem
       o All team leads and senior engineers met in a room with a whiteboard
       o Enumerated all known and suspected reliability issues
       o Too much technical debt had accumulated
       o Reliability issues had not been prioritized
       o Identified 8-10 themes
  6. App Engine Reliability Fixit
     • Step 2: Understand the Problem
       o Each theme assigned to a senior engineer to investigate
       o Timeboxed for 1 week
       o After 1 week, all leads came back with:
         • Detailed list of issues
         • Recommended steps to address them
         • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
  7. App Engine Reliability Fixit
     • Step 3: Consensus and Prioritization
       o Leads discussed themes and prioritized work
       o Assigned engineers to tasks
  8. App Engine Reliability Fixit
     • Step 4: Implementation and Follow-up
       o Engineers worked on assigned tasks
       o Simple spreadsheet of task status, which engineers updated weekly
       o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
  9. App Engine Reliability Fixit
     • Results
       o 10x reduction in reliability issues
       o Improved team cohesion and camaraderie
       o Broader participation and ownership of the future health of the platform
       o Still remembered several years later
  10. Stitch Fix Outages, Oct / Nov 2016
      • (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
      • (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
      • (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
      • (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
      • (10/21/2016) All systems unavailable for ~3 ½ hours [DDoS attack]
      • (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
      • (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
      • (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
      • (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
      • (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
      • (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
  11. Database Stability Problems
      1. Applications contended on common tables
      2. Scalability limited by database connections
      3. One application could take down the entire company
  12. Stitch Fix Stability Retrospective
      • Step 1: Identify the Problem
      • Step 2: Understand the Problem
      • Step 3: Consensus and Prioritization
      • Step 4: Implementation and Follow-Up
      • Results
      @randyshoup
  13. Stability Solutions
      1. Focus on expensive queries
         o Log
         o Eliminate
         o Rewrite
         o Reduce
      2. Manage database connections via connection concentrator
      3. Stability and Scalability Program
         o Ongoing 25% investment in services migration
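The two most concrete items on the Stability Solutions slide, logging expensive queries and managing database connections, can be approximated in application code. The sketch below is illustrative only and assumes a Python service using SQLAlchemy against PostgreSQL; the deck does not name Stitch Fix's actual stack, and the DSN, threshold, and names are hypothetical. The "connection concentrator" the slide refers to would typically be a separate server-side pooler (PgBouncer is one example) rather than this client-side pool.

```python
# Illustrative sketch only: cap per-process database connections and log slow
# queries. Assumes SQLAlchemy and PostgreSQL; DSN and threshold are hypothetical.
import logging
import time

from sqlalchemy import create_engine, event, text

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("slow_queries")

SLOW_QUERY_THRESHOLD_SECS = 0.5  # hypothetical cutoff for "expensive"

# A small, fixed pool per process keeps the fleet's total connection count
# well below the database's limit (point 2 on the slide). A server-side
# concentrator such as PgBouncer does this across all processes at once.
engine = create_engine(
    "postgresql://app:secret@db.example.internal/app",  # hypothetical DSN
    pool_size=5,
    max_overflow=0,
    pool_timeout=10,
)

@event.listens_for(engine, "before_cursor_execute")
def start_timer(conn, cursor, statement, parameters, context, executemany):
    # Record the start time of each statement on the connection.
    conn.info.setdefault("query_start_time", []).append(time.monotonic())

@event.listens_for(engine, "after_cursor_execute")
def log_slow_queries(conn, cursor, statement, parameters, context, executemany):
    # Surface expensive queries so they can be eliminated, rewritten, or reduced.
    elapsed = time.monotonic() - conn.info["query_start_time"].pop()
    if elapsed > SLOW_QUERY_THRESHOLD_SECS:
        log.warning("slow query (%.2fs): %s", elapsed, statement)

if __name__ == "__main__":
    with engine.connect() as conn:
        conn.execute(text("select 1"))
```

Run against a real database, the pool settings bound how many connections each process can open, while the event hooks surface the queries the slide recommends logging, eliminating, rewriting, or reducing.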
  14. @randyshoup
  15. WeWork Login Issues
      • Problem: Some members unable to log in
      • Inconsistent representations across different services in the system
      • Over time, simple system interactions grew increasingly complex and convoluted
      • Not enough graceful degradation or automated repair
      @randyshoup
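Slide 15 calls out the lack of graceful degradation in the login path. As a hedged illustration only, and not a description of WeWork's actual services, the Python sketch below shows one common shape of graceful degradation: a short timeout on a downstream profile lookup, a fallback to a last-known-good cached copy, and a minimal degraded profile so login does not fail outright. The service URL, cache, and field names are hypothetical.

```python
# Illustrative sketch only: graceful degradation for a login-time profile
# lookup. Service URL, cache, and field names are hypothetical.
import json
import logging
from urllib.request import urlopen

log = logging.getLogger("login")

# Last-known-good profiles, refreshed on every successful lookup.
profile_cache: dict[str, dict] = {}

def fetch_profile(member_id: str) -> dict:
    """Fetch a member profile, degrading instead of failing the whole login."""
    url = f"https://profiles.example.internal/members/{member_id}"  # hypothetical
    try:
        # Fail fast: a short timeout keeps a slow dependency from hanging login.
        with urlopen(url, timeout=0.5) as resp:
            profile = json.loads(resp.read())
        profile_cache[member_id] = profile  # keep a last-known-good copy
        return profile
    except (OSError, ValueError):
        log.warning("profile service unavailable; degrading for %s", member_id)
        if member_id in profile_cache:
            return profile_cache[member_id]  # serve stale-but-consistent data
        # Last resort: a minimal profile so the member can still log in.
        return {"member_id": member_id, "degraded": True}
```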
  16. WeWork Login Retrospective
      • Step 1: Identify the Problem
      • Step 2: Understand the Problem
      • Step 3: Consensus and Prioritization
      • Step 4: Implementation and Follow-Up
      @randyshoup
  17. Common Elements
      • Unintentional, long-term accumulation of small, individually reasonable decisions
      • “Compelling event” catalyzes long-term change
      • Blameless culture makes learning and improvement possible
      • Structured post-incident approach
      @randyshoup
  18. If you don’t end up regretting your early technology decisions, you probably over-engineered.
  19. Lessons
      • Leadership and Collaboration
      • Quality and Discipline
      • Driving Change
  20. Lessons
      • Leadership and Collaboration
      • Quality and Discipline
      • Driving Change
  21. Blameless Post-Mortems
      • Open and Honest Discussion
        o Detect
        o Diagnose
        o Mitigate
        o Remediate
        o Prevent
      • Engineers compete to take personal responsibility (!)
      @randyshoup linkedin.com/in/randyshoup
  22. “Finally we can prioritize fixing that broken system!”
  23. Cross-Functional Collaboration
      • Best decisions made through partnership
      • Shared context
        o Tradeoffs and implications
        o Given common context, well-meaning people generally agree
      • “Disagree and Commit”
      @randyshoup
  24. 15 Million. “Never let a good crisis go to waste.”
  25. Lessons
      • Leadership and Collaboration
      • Quality and Discipline
      • Driving Change
  26. Quality and reliability are business concerns. @randyshoup
  27. “We don’t have time to do it right!” “Do you have time to do it twice?” @randyshoup
  28. The more constrained you are on time or resources, the more important it is to build it right the first time. @randyshoup
  29. “Do not try to do everything. Do one thing well.” @randyshoup
  30. Definition of Done
      • Feature complete
      • Automated Tests
      • Released to Production
      • Feature gate turned on
      • Monitored
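The “feature gate turned on” item in the Definition of Done treats enabling a feature as a step separate from deploying its code. Below is a minimal sketch of that idea, assuming a Python service and a plain environment-variable gate; the deck does not prescribe any particular flagging tool, and the function and flag names are illustrative only.

```python
# Illustrative sketch only of a feature gate: the new code ships to production
# dark and is enabled separately from the deploy. Names are hypothetical.
import os

def feature_enabled(name: str) -> bool:
    """Read a feature gate from the environment; default is off."""
    return os.environ.get(f"FEATURE_{name.upper()}", "off").lower() == "on"

def new_checkout_flow(items: list[str]) -> str:
    return f"new checkout: {len(items)} items"

def legacy_checkout_flow(items: list[str]) -> str:
    return f"legacy checkout: {len(items)} items"

def checkout(items: list[str]) -> str:
    # The gated path is feature complete and released to production,
    # but only takes traffic once the gate is turned on.
    if feature_enabled("new_checkout"):
        return new_checkout_flow(items)
    return legacy_checkout_flow(items)

if __name__ == "__main__":
    # e.g. FEATURE_NEW_CHECKOUT=on python feature_gate.py
    print(checkout(["socks", "shirt"]))
```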
  31. Vicious Cycle of Technical Debt
      (cycle diagram: Technical Debt → “No time to do it right” → Quick-and-dirty → back to Technical Debt)
      @randyshoup
  32. Virtuous Cycle of Investment
      (cycle diagram with labels: Solid Foundation, Confidence, Faster and Better, Quality Investment)
      @randyshoup
  33. Lessons
      • Leadership and Collaboration
      • Quality and Discipline
      • Driving Change
  34. Top-down and Bottom-up
  35. Funding the Improvement Program
      • Approach 1: Standard project
        o Prioritize and fund like any other project
        o Works best when the project is within one team
      • Approach 2: Explicit global investment
        o Agree on an up-front investment (e.g., 25% or 30% of engineering effort)
        o Works best when the program spans many teams
  36. Finance: an unexpected ally … @randyshoup
  37. Lessons
      • Leadership and Collaboration
      • Quality and Discipline
      • Driving Change
  38. “If you can’t change your organization, change your organization.” -- Martin Fowler @randyshoup
  39. We are Hiring!
      700 software engineers globally, in:
      • New York
      • Tel Aviv
      • San Francisco
      • Seattle
      • Shanghai
      • Singapore
      @randyshoup