SlideShare a Scribd company logo
Anatomy of
Three Incidents
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Background
@randyshoup
@randyshoup
App Engine Outage - Oct 2012
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
App Engine Outage - Oct 2012
App Engine Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identify 8-10 themes
@randyshoup
• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with
• Detailed list of issues
• Recommended steps to address them
• Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
App Engine Reliability Fixit
@randyshoup
• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks
App Engine Reliability Fixit
@randyshoup
• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at
weekly team meeting
App Engine Reliability Fixit
@randyshoup
•  Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later
App Engine Reliability Fixit
@randyshoup
@randyshoup
Stitch Fix – Oct / Nov 2016
• (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
• (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
• (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack]
• (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
• (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
• (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
• (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
• (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
• (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
@randyshoup
Database Stability Problems
• 1. Applications contended on common tables
• 2. Scalability limited by database connections
• 3. One application could take down entire company
@randyshoup
Stability Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
•  Results
@randyshoup
Stability Solutions
• 1. Focus on expensive queries
o Log
o Eliminate
o Rewrite
o Reduce
• 2. Manage database connections via connection concentrator
• 3. Stability and Scalability Program
o Ongoing 25% investment in services migration
@randyshoup
@randyshoup
Login Issues - 2019
• Problem: Some members unable to log in
• Inconsistent representations across different services in the
system
• Over time, simple system interactions grew increasingly
complex and convoluted
• Not enough graceful degradation or automated repair
@randyshoup
Login Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
@randyshoup
Login Solutions
• 1. Clean up user data
o Find inconsistencies
o Track inconsistency metrics
o Identify and fix contributing processes and applications
• 2. User state machines
o Define user journeys as explicit state machines
o Refine and correct via cross-functional feedback
o Implement state machines in code
• 3. “Pandora” Program
o Rewrite core identity system into set of user capabilities
@randyshoup
Common Elements
• Unintentional, long-term accumulation of
small, individually reasonable decisions
• “Compelling event” catalyzes long-term
change
• Blameless culture makes learning and
improvement possible
• Structured post-incident approach
@randyshoup
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Vicious Cycle of Technical Debt
Technical
Debt
“No time
to do it
right”
Quick-
and-dirty
“Do you have time to do it
twice?”
“We don’t have time to do it
right!”
@randyshoup
The more constrained you are
on time or resources, the more
important it is to get it done
the first time.
@randyshoup
Negotiating Tradeoffs
Scope
Time
Quality
@randyshoup
Virtuous Cycle of Investment
Solid
Foundation
Confidence
Faster and
Better
Quality
Investment
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
During the Incident
• Focus on restoring service
o Everything else is secondary, and should wait
• Shield the team
• Clear, structured communication
o Even when there is nothing to report!
@randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
After the Incident
• Blameless postmortem
• Identify and understand the
contributing factors
• Action items and Learnings
• Follow Up!
@randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
Psychological Safety
• Team is safe for interpersonal
risk-taking
• “Being able to show and employ
one’s self without fear of
negative consequences”
• More important than any other
factor in team success
“Finally we can prioritize
fixing that broken system!”
@randyshoup
Inclusive Decisionmaking
• Make better business decisions
87% of the time
• Make decisions 2x faster with
1/2 the meetings
• Deliver 60% better business
results
Cloverpop Inclusive Decisionmaking study, 2016
As we improve diversity, decisionmaking improves
@randyshoup
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Frame the Problem:
Quality and reliability are
business concerns
@randyshoup
Use Common Currency
Time
Money People
@randyshoup
15 Million
“Never let a
good crisis go
to waste.”
@randyshoup
“Incidents are unplanned
investments, and they are also
opportunities. Your challenge
is to maximize the ROI on the
sunk cost.”
@randyshoup
-- John Allspaw, Adaptive Capacity Labs
Improvement Budget
• Explicit resource investment
o Agree on an up-front investment
(e.g., 25%, 30% of engineering efforts)
• Retain autonomy, Provide transparency
o Making these decisions is exactly why they hired you
@randyshoup
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Incident Response Patterns
• Incident Roles
• Incident Triggers
• On-Call Rotation and Onboarding
• Incident Command Training
• Incident Communication Plan
• Periodic Incident Updates
• Shared Incident State Doc
• Incident Call Recording
• Incident Swarming
• Local / Global Incident Reviews
• Post-Review Improvement Items
@randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
Thank you!
@randyshoup
linkedin.com/in/randyshoup
medium.com/@randyshoup

More Related Content

What's hot

DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps TransitionDOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
Gene Kim
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBay
Randy Shoup
 
A CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsA CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling Organizations
Randy Shoup
 
Service Architectures at Scale
Service Architectures at ScaleService Architectures at Scale
Service Architectures at Scale
Randy Shoup
 
Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015
Randy Shoup
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine Learning
Randy Shoup
 
Monoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesMonoliths, Migrations, and Microservices
Monoliths, Migrations, and Microservices
Randy Shoup
 
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupMinimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
Randy Shoup
 
Managing Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsManaging Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and Events
Randy Shoup
 
DevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsDevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of Operations
Randy Shoup
 
The agile elephant in the room
The agile elephant in the roomThe agile elephant in the room
The agile elephant in the room
AgileDenver
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and Events
Randy Shoup
 
Why Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudWhy Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the Cloud
Randy Shoup
 
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
DevOpsDays Tel Aviv
 
Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014
Chris Riley ☁
 
Lean Canvas for Internal Product Owners
Lean Canvas for Internal Product OwnersLean Canvas for Internal Product Owners
Lean Canvas for Internal Product Owners
Keith Klundt
 
Continuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a TimeContinuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a Time
TechWell
 
IAM - One Year Later
IAM - One Year LaterIAM - One Year Later
IAM - One Year Later
Dave Shields
 
Lazar Milovic - No estimates
Lazar Milovic - No estimatesLazar Milovic - No estimates
Lazar Milovic - No estimates
Serbian Product Community
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
Business of Software Conference
 

What's hot (20)

DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps TransitionDOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBay
 
A CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsA CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling Organizations
 
Service Architectures at Scale
Service Architectures at ScaleService Architectures at Scale
Service Architectures at Scale
 
Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine Learning
 
Monoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesMonoliths, Migrations, and Microservices
Monoliths, Migrations, and Microservices
 
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupMinimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
 
Managing Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsManaging Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and Events
 
DevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsDevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of Operations
 
The agile elephant in the room
The agile elephant in the roomThe agile elephant in the room
The agile elephant in the room
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and Events
 
Why Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudWhy Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the Cloud
 
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
 
Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014
 
Lean Canvas for Internal Product Owners
Lean Canvas for Internal Product OwnersLean Canvas for Internal Product Owners
Lean Canvas for Internal Product Owners
 
Continuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a TimeContinuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a Time
 
IAM - One Year Later
IAM - One Year LaterIAM - One Year Later
IAM - One Year Later
 
Lazar Milovic - No estimates
Lazar Milovic - No estimatesLazar Milovic - No estimates
Lazar Milovic - No estimates
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
 

Similar to Anatomy of Three Incidents -- Commonalities and Lessons

Visualisation&agile practices ai2014
Visualisation&agile practices ai2014Visualisation&agile practices ai2014
Visualisation&agile practices ai2014
Balaji Muniraja
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric World
Randy Shoup
 
Software Release Orchestration and the Enterprise
Software Release Orchestration and the EnterpriseSoftware Release Orchestration and the Enterprise
Software Release Orchestration and the Enterprise
XebiaLabs
 
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
AXIA Consulting Inc.
 
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Formulatedby
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolution
Grupa Unity
 
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
AppDynamics
 
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuHanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
DevConFu
 
How Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom LineHow Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom Line
Imaginet
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics Environment
Tasktop
 
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
AppDynamics
 
A Journey Through Agile in the Government
A Journey Through Agile in the GovernmentA Journey Through Agile in the Government
A Journey Through Agile in the Government
Richard Cheng
 
Turning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalTurning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational Capital
John Willis
 
Ahmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte CasinoAhmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte Casino
Zahid02
 
Agile Bureaucracy
Agile BureaucracyAgile Bureaucracy
Agile Bureaucracy
Return on Intelligence
 
The Web is Not a Project
The Web is Not a ProjectThe Web is Not a Project
The Web is Not a Project
Mark Greenfield
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Randy Shoup
 

Similar to Anatomy of Three Incidents -- Commonalities and Lessons (20)

Effective Scrum
Effective ScrumEffective Scrum
Effective Scrum
 
Visualisation&agile practices ai2014
Visualisation&agile practices ai2014Visualisation&agile practices ai2014
Visualisation&agile practices ai2014
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric World
 
Software Release Orchestration and the Enterprise
Software Release Orchestration and the EnterpriseSoftware Release Orchestration and the Enterprise
Software Release Orchestration and the Enterprise
 
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
 
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
 
Utils_Presentation_Richard U
Utils_Presentation_Richard UUtils_Presentation_Richard U
Utils_Presentation_Richard U
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolution
 
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
 
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuHanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
 
How Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom LineHow Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom Line
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics Environment
 
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
 
A Journey Through Agile in the Government
A Journey Through Agile in the GovernmentA Journey Through Agile in the Government
A Journey Through Agile in the Government
 
Turning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalTurning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational Capital
 
Ahmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte CasinoAhmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte Casino
 
Gaurav_CV
Gaurav_CVGaurav_CV
Gaurav_CV
 
Agile Bureaucracy
Agile BureaucracyAgile Bureaucracy
Agile Bureaucracy
 
The Web is Not a Project
The Web is Not a ProjectThe Web is Not a Project
The Web is Not a Project
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
 

More from Randy Shoup

Breaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsBreaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building Teams
Randy Shoup
 
Ten Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionTen Lessons of the DevOps Transition
Ten Lessons of the DevOps Transition
Randy Shoup
 
Managing Data in Microservices
Managing Data in MicroservicesManaging Data in Microservices
Managing Data in Microservices
Randy Shoup
 
Pragmatic Microservices
Pragmatic MicroservicesPragmatic Microservices
Pragmatic Microservices
Randy Shoup
 
From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015
Randy Shoup
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-Services
Randy Shoup
 
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYEQCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
Randy Shoup
 
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
Randy Shoup
 
The Importance of Culture: Building and Sustaining Effective Engineering Org...
The Importance of Culture:  Building and Sustaining Effective Engineering Org...The Importance of Culture:  Building and Sustaining Effective Engineering Org...
The Importance of Culture: Building and Sustaining Effective Engineering Org...
Randy Shoup
 

More from Randy Shoup (9)

Breaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsBreaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building Teams
 
Ten Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionTen Lessons of the DevOps Transition
Ten Lessons of the DevOps Transition
 
Managing Data in Microservices
Managing Data in MicroservicesManaging Data in Microservices
Managing Data in Microservices
 
Pragmatic Microservices
Pragmatic MicroservicesPragmatic Microservices
Pragmatic Microservices
 
From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-Services
 
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYEQCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
 
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
 
The Importance of Culture: Building and Sustaining Effective Engineering Org...
The Importance of Culture:  Building and Sustaining Effective Engineering Org...The Importance of Culture:  Building and Sustaining Effective Engineering Org...
The Importance of Culture: Building and Sustaining Effective Engineering Org...
 

Recently uploaded

top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
varshanayak241
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2
 
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Hivelance Technology
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
Jelle | Nordend
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
NaapbooksPrivateLimi
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
XfilesPro
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 

Recently uploaded (20)

top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 

Anatomy of Three Incidents -- Commonalities and Lessons

  • 1. Anatomy of Three Incidents Randy Shoup @randyshoup linkedin.com/in/randyshoup
  • 4. App Engine Outage - Oct 2012
  • 6. App Engine Reliability Fixit • Step 1: Identify the Problem o All team leads and senior engineers met in a room with a whiteboard o Enumerated all known and suspected reliability issues o Too much technical debt had accumulated o Reliability issues had not been prioritized o Identify 8-10 themes @randyshoup
  • 7. • Step 2: Understand the Problem o Each theme assigned to a senior engineer to investigate o Timeboxed for 1 week o After 1 week, all leads came back with • Detailed list of issues • Recommended steps to address them • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.) App Engine Reliability Fixit @randyshoup
  • 8. • Step 3: Consensus and Prioritization o Leads discussed themes and prioritized work o Assigned engineers to tasks App Engine Reliability Fixit @randyshoup
  • 9. • Step 4: Implementation and Follow-up o Engineers worked on assigned tasks o Simple spreadsheet of task status, which engineers updated weekly o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting App Engine Reliability Fixit @randyshoup
  • 10. •  Results o 10x reduction in reliability issues o Improved team cohesion and camaraderie o Broader participation and ownership of the future health of the platform o Still remembered several years later App Engine Reliability Fixit @randyshoup
  • 12. Stitch Fix – Oct / Nov 2016 • (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database] • (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database] • (10/25/2016) All systems unavailable for ~5 minutes [Shared Database] • (10/24/2016) All systems unavailable for ~5 minutes [Shared Database] • (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack] • (10/18/2016) All systems unavailable for ~3 minutes [Shared Database] • (10/17/2016) All systems unavailable for ~20 minutes [Shared Database] • (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage] • (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage] • (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage] • (10/10/2016) All systems unavailable for ~10 minutes [Shared Database] @randyshoup
  • 13. Database Stability Problems • 1. Applications contended on common tables • 2. Scalability limited by database connections • 3. One application could take down entire company @randyshoup
  • 14. Stability Retrospective • Step 1: Identify the Problem • Step 2: Understand the Problem • Step 3: Consensus and Prioritization • Step 4: Implementation and Follow-Up •  Results @randyshoup
  • 15. Stability Solutions • 1. Focus on expensive queries o Log o Eliminate o Rewrite o Reduce • 2. Manage database connections via connection concentrator • 3. Stability and Scalability Program o Ongoing 25% investment in services migration @randyshoup
  • 17. Login Issues - 2019 • Problem: Some members unable to log in • Inconsistent representations across different services in the system • Over time, simple system interactions grew increasingly complex and convoluted • Not enough graceful degradation or automated repair @randyshoup
  • 18. Login Retrospective • Step 1: Identify the Problem • Step 2: Understand the Problem • Step 3: Consensus and Prioritization • Step 4: Implementation and Follow-Up @randyshoup
  • 19. Login Solutions • 1. Clean up user data o Find inconsistencies o Track inconsistency metrics o Identify and fix contributing processes and applications • 2. User state machines o Define user journeys as explicit state machines o Refine and correct via cross-functional feedback o Implement state machines in code • 3. “Pandora” Program o Rewrite core identity system into set of user capabilities @randyshoup
  • 20. Common Elements • Unintentional, long-term accumulation of small, individually reasonable decisions • “Compelling event” catalyzes long-term change • Blameless culture makes learning and improvement possible • Structured post-incident approach @randyshoup
  • 23. Vicious Cycle of Technical Debt Technical Debt “No time to do it right” Quick- and-dirty
  • 24. “Do you have time to do it twice?” “We don’t have time to do it right!” @randyshoup
  • 25. The more constrained you are on time or resources, the more important it is to get it done the first time. @randyshoup
  • 27. Virtuous Cycle of Investment Solid Foundation Confidence Faster and Better Quality Investment
  • 29. During the Incident • Focus on restoring service o Everything else is secondary, and should wait • Shield the team • Clear, structured communication o Even when there is nothing to report! @randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  • 30. After the Incident • Blameless postmortem • Identify and understand the contributing factors • Action items and Learnings • Follow Up! @randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  • 31. Psychological Safety • Team is safe for interpersonal risk-taking • “Being able to show and employ one’s self without fear of negative consequences” • More important than any other factor in team success
  • 32. “Finally we can prioritize fixing that broken system!” @randyshoup
  • 33. Inclusive Decisionmaking • Make better business decisions 87% of the time • Make decisions 2x faster with 1/2 the meetings • Deliver 60% better business results Cloverpop Inclusive Decisionmaking study, 2016 As we improve diversity, decisionmaking improves @randyshoup
  • 35. Frame the Problem: Quality and reliability are business concerns @randyshoup
  • 36. Use Common Currency Time Money People @randyshoup
  • 37. 15 Million “Never let a good crisis go to waste.” @randyshoup
  • 38. “Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.” @randyshoup -- John Allspaw, Adaptive Capacity Labs
  • 39. Improvement Budget • Explicit resource investment o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts) • Retain autonomy, Provide transparency o Making these decisions is exactly why they hired you @randyshoup
  • 41. Incident Response Patterns • Incident Roles • Incident Triggers • On-Call Rotation and Onboarding • Incident Command Training • Incident Communication Plan • Periodic Incident Updates • Shared Incident State Doc • Incident Call Recording • Incident Swarming • Local / Global Incident Reviews • Post-Review Improvement Items @randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response