One Terrible Day at Google, and How It Made Us Better

In October 2012, Google App Engine had an 8-hour global outage. This session walks through the incident and the "Reliability Fixit" it inspired. Learn how the team came together and, over the next 6 months, reduced reliability issues by 10x. Also take away broader insights on engineering tradeoffs, managing an incident, and driving improvement.

  1. One Terrible Day at Google, and How It Made Us Better. Randy Shoup, @randyshoup, linkedin.com/in/randyshoup
  2. Background
  3. App Engine Outage - Oct 2012
  4. App Engine Outage - Oct 2012: http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
  5. App Engine Reliability Fixit. Step 1: Identify the Problem
       o All team leads and senior engineers met in a room with a whiteboard
       o Enumerated all known and suspected reliability issues
       o Too much technical debt had accumulated
       o Reliability issues had not been prioritized
       o Identify 8-10 themes
  6. App Engine Reliability Fixit. Step 2: Understand the Problem
       o Each theme assigned to a senior engineer to investigate
       o Timeboxed for 1 week
       o After 1 week, all leads came back with
           • Detailed list of issues
           • Recommended steps to address them
           • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
  7. App Engine Reliability Fixit. Step 3: Consensus and Prioritization
       o Leads discussed themes and prioritized work
       o Assigned engineers to tasks
  8. App Engine Reliability Fixit. Step 4: Implementation and Follow-up
       o Engineers worked on assigned tasks
       o Simple spreadsheet of task status, which engineers updated weekly
       o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
  9. App Engine Reliability Fixit. Results
       o 10x reduction in reliability issues
       o Improved team cohesion and camaraderie
       o Broader participation and ownership of the future health of the platform
       o Still remembered several years later
  10. Lessons
       • Engineering Tradeoffs
       • Compelling Event
       • Driving Improvement
  11. Lessons
       • Engineering Tradeoffs
       • Compelling Event
       • Driving Improvement
  12. “We don’t have time to do it right!” “Do you have time to do it twice?”
  13. The more constrained you are on time or resources, the more important it is to get it done the first time.
  14. Negotiating Tradeoffs: Scope, Time, Quality
  15. Lessons
       • Engineering Tradeoffs
       • Compelling Event
       • Driving Improvement
  16. During the Incident
       • Focus on restoring service
           o Everything else is secondary, and should wait
       • Shield the team
       • Clear, structured communication
           o Even when there is nothing to report!
       https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  17. After the Incident
       • Blameless postmortem
       • Identify and understand the contributing factors
       • Action items and Learnings
       • Follow Up!
       https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  18. “Finally we can prioritize fixing that broken system!”
  19. Psychological Safety
       • Team is safe for interpersonal risk-taking
       • “Being able to show and employ one’s self without fear of negative consequences”
       • More important than any other factor in team success
  20. Inclusive Decisionmaking
       • Make better business decisions 87% of the time
       • Make decisions 2x faster with 1/2 the meetings
       • Deliver 60% better business results
       Source: Cloverpop Inclusive Decisionmaking study, 2016
       As we improve diversity, decisionmaking improves
  21. 15 Million “Never let a good crisis go to waste.”
  22. “Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.” -- John Allspaw, Adaptive Capacity Labs
  23. Lessons
       • Engineering Tradeoffs
       • Compelling Event
       • Driving Improvement
  24. Frame the Problem: Quality and reliability are business concerns
  25. Use Common Currency: Time, Money, People
  26. Improvement Budget
       • Explicit resource investment
           o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts)
       • Retain autonomy, Provide transparency
           o Making these decisions is exactly why they hired you
  27. Lessons
       • Engineering Tradeoffs
       • Compelling Event
       • Driving Improvement
  28. Common Elements
       • Unintentional, long-term accumulation of small, individually reasonable decisions
       • “Compelling event” catalyzes long-term change
       • Blameless culture makes learning and improvement possible
       • Structured post-incident approach
  29. Incident Response Patterns
       • Incident Roles
       • Incident Triggers
       • On-Call Rotation and Onboarding
       • Incident Command Training
       • Incident Communication Plan
       • Periodic Incident Updates
       • Shared Incident State Doc
       • Incident Call Recording
       • Incident Swarming
       • Local / Global Incident Reviews
       • Post-Review Improvement Items
       https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  30. Thank you! @randyshoup, linkedin.com/in/randyshoup, medium.com/@randyshoup
