One Terrible Day at Google, and How It Made Us Better
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Background
App Engine Outage - Oct 2012
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
App Engine Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identified 8-10 themes
• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with:
• Detailed list of issues
• Recommended steps to address them
• Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks
• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting
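The tracking described in Steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration, not the team's actual spreadsheet: the `Task` fields, the `EFFORT_DAYS` conversion from order-of-magnitude estimates to working days, and the sample tasks and owners are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed conversion of order-of-magnitude estimates to working days.
EFFORT_DAYS = {"1 day": 1, "1 week": 5, "1 month": 20}


@dataclass
class Task:
    theme: str    # one of the 8-10 themes from Step 1
    title: str
    effort: str   # key into EFFORT_DAYS
    owner: str
    status: str   # "todo", "in_progress", or "done"


def weekly_summary(tasks):
    """Count tasks by status and total the estimated days not yet done."""
    counts = Counter(t.status for t in tasks)
    remaining = sum(EFFORT_DAYS[t.effort] for t in tasks if t.status != "done")
    return {"counts": dict(counts), "remaining_days": remaining}


# Made-up sample rows, standing in for the shared spreadsheet.
tasks = [
    Task("monitoring", "Add alert on queue depth", "1 day", "alice", "done"),
    Task("monitoring", "Dashboard for router load", "1 week", "bob", "in_progress"),
    Task("capacity", "Load shedding in routing layer", "1 month", "carol", "todo"),
]

summary = weekly_summary(tasks)
print(summary)
```

A summary like this is roughly what the weekly team meeting needs: how many tasks are in each state, and how much estimated work is still open.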
• Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later
Lessons
• Engineering Tradeoffs
• Compelling Event
• Driving Improvement
Engineering Tradeoffs
“We don’t have time to do it right!”
“Do you have time to do it twice?”
The more constrained you are on time or resources, the more important it is to get it done the first time.
Negotiating Tradeoffs
Scope
Time
Quality
Compelling Event
During the Incident
• Focus on restoring service
o Everything else is secondary, and should wait
• Shield the team
• Clear, structured communication
o Even when there is nothing to report!
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
After the Incident
• Blameless postmortem
• Identify and understand the contributing factors
• Action items and Learnings
• Follow Up!
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
“Finally we can prioritize fixing that broken system!”
Psychological Safety
• Team is safe for interpersonal risk-taking
• “Being able to show and employ one’s self without fear of negative consequences”
• More important than any other factor in team success
Inclusive Decisionmaking
• Make better business decisions 87% of the time
• Make decisions 2x faster with 1/2 the meetings
• Deliver 60% better business results
Cloverpop Inclusive Decisionmaking study, 2016
As we improve diversity, decisionmaking improves
15 Million
“Never let a good crisis go to waste.”
“Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.”
-- John Allspaw, Adaptive Capacity Labs
Driving Improvement
Frame the Problem: Quality and reliability are business concerns
Use Common Currency
Time
Money
People
Improvement Budget
• Explicit resource investment
o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts)
• Retain autonomy, provide transparency
o Making these decisions is exactly why they hired you
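As a rough worked example of what an up-front improvement budget means in practice, the arithmetic can be sketched as below. The function name `improvement_budget` and the team size and time period are hypothetical values chosen for illustration; only the 25% figure comes from the slide.

```python
def improvement_budget(team_size, weeks, fraction):
    """Return the engineer-weeks reserved for improvement work.

    fraction is the agreed share of engineering effort, e.g. 0.25 for 25%.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    return team_size * weeks * fraction


# Hypothetical team: 8 engineers over a 12-week quarter at 25%.
reserved = improvement_budget(team_size=8, weeks=12, fraction=0.25)
print(reserved)  # 24.0 engineer-weeks
```

Stating the budget in engineer-weeks is one way to keep the investment explicit and transparent while leaving the choice of what to work on with the team.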
Common Elements
• Unintentional, long-term accumulation of small, individually reasonable decisions
• “Compelling event” catalyzes long-term change
• Blameless culture makes learning and improvement possible
• Structured post-incident approach
Incident Response Patterns
• Incident Roles
• Incident Triggers
• On-Call Rotation and Onboarding
• Incident Command Training
• Incident Communication Plan
• Periodic Incident Updates
• Shared Incident State Doc
• Incident Call Recording
• Incident Swarming
• Local / Global Incident Reviews
• Post-Review Improvement Items
https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
Thank you!
@randyshoup
linkedin.com/in/randyshoup
medium.com/@randyshoup