Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/29dDOee.
John Allspaw provides a glimpse into how other fields handle incident response, including active steps companies can take to support engineers in those uncertain and ambiguous scenarios. Examples include fields such as military, surgical trauma units, space transportation, aviation and air traffic control, and wildland firefighting. Filmed at qconnewyork.com.
John Allspaw is a web operations manager and engineer with 12 years of systems engineering experience.
Watch the video with slide synchronization on InfoQ.com:
https://www.infoq.com/presentations/incident-response
3. Presented at QCon New York
www.qconnewyork.com
10. List of Important Topics We Don’t Have Time For Today
1. Distributed cognition (Hutchins)
2. Plans and situated actions (Suchman)
3. Directed attention and alert design (Woods)
4. Expertise (Klein, Hoffman)
13. Danish Maritime Accident Investigation Board
Pilot + Air Traffic Control Instructor
Pilot and Chief Instructor, Airline
Mining Engineering
US Forestry, Wildland Fire
Child Welfare Services
Oil & Gas
Construction
18. Tools Do Not Always Make Things Easier To Understand
Never?
19. Anomaly Response in Dynamic Fault Management
• Cascading effects in monitored processes
• Tempo changes and time pressure
• Multiple interleaved tasks
• Multiple interacting goals
• Need to revise assessments as new evidence comes in
Patterns in Joint Cognitive Systems
Woods, Hollnagel (2006)
21. Observations
“Hey, look at this graph…”
“This page loads for me…”
“It’s only X that seems broken…”
Hypotheses
“Didn’t we recently change…?”
“I wonder if it’s because…”
“Is it possible that…?”
Coordination & Clarification
“Is this just for signed-in users?”
“Steve - just to be sure you think…?”
“What do you mean, Lisa?”
Suggesting Action
“I think we need to flip that flag.”
“One sec, let me cook up a diff…”
“Should I clear the cache…?”
25. “In dynamic fault management, intervention precedes or is interwoven
with diagnosis.”
Woods, 1994
26. “An alert points out an abnormality, having your attention drawn to
an abnormality should make things clearer, therefore alerts should
make things clearer.
But if it’s alerting that much, you just start tuning it out. I have tuned
it out and missed things….”
Asked an engineer about alert noise during an outage…
27. “An alert points out an abnormality, having your attention drawn to
an abnormality should make things clearer, therefore alerts should
make things clearer.
But if it’s alerting that much, you just start tuning it out. I have tuned
it out and missed things….”
Wait, actually…
28. “An alarm points out an abnormality, having your attention drawn
to an abnormality should make things safer, therefore alarms should
make things clearer.
But if it’s alarming that much, you just start tuning it out. I have
tuned it out and missed things….”
Wait, actually…
33. Engineer quotes when diagnosing a real outage
• “What is it doing now?”
• “Why is it doing that?”
• “What will it do next?”
• “How did it get into this state?”
34. Wait, wrong reference…
• “What is it doing now?”
• “Why is it doing that?”
• “What will it do next?”
• “How did it get into this state?”
Wiener, E. L. (1989). Human factors of advanced technology ("glass cockpit") transport aircraft (NASA Contractor Report No. 177528). Moffett Field, CA: NASA-Ames
39. [Diagram: study method, from the data collected around an event through to synthesis]
Event
Data sources
• What they said in situ (in-situ utterances)
• What they did in situ
• IRC transcripts
• Systems logs
• Dashboard (graphs)
• Access logs
• Alert logs
• Application logs (error)
Analysis steps
• Open coding
• Pilot interview
• Refine coding scheme
• Semi-structured interviews
Synthesis
- Heuristic identification
- Validation
- Design modifications/support
44. [Figure: timeline of an outage response from 13:06:44 to 14:30:00, with axis ticks at 13:15:00, 13:30:00, 13:45:00, 14:00:00, 14:15:00 and 14:30:00. Markers are tagged IE1, IE2, IE3, IE5, PE1, PE2 and PE3. Legend: Stated hypothesis; Critical relayed observation.]
Stated hypotheses shown on the timeline
• Disable a CDN?
• Load balancer changes?
• Network changes?
• Wordpress issue?
• Frozen shop?
• Featured shop?
• Varnish queuing?
• Featured staff shop?
• Sidebar loading staff shop?
• Varnish not caching?
• Database schema change?
• Varnish queuing, not caching 400 responses?
Relayed observations shown on the timeline
• Errors from homepage sidebar
• 400 response code
• PublicShops_GetShopCards API method
• Featured shop loading OK
• “Shop 1234567 does not exist”
Actions shown on the timeline
• ProdEng2 turns off homepage sidebar module
• ProdEng1 re-enables the sidebar, with blog turned off
45. • Does the incident correlate with a known change?
• No? Freak out, because it could be anything.
• Validate hypotheses that most easily come to mind.
• When gaining confidence around changes during an outage, bias towards peer review.
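The first two heuristics above can be sketched in code. This is only an illustration and not a tool from the talk: the `Change` record, the `candidate_changes` and `first_steps` functions, the two-hour lookback window, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A deploy or config change. Hypothetical record, not from the talk."""
    description: str
    author: str
    applied_at: datetime


def candidate_changes(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes applied in the lookback window before the incident, newest first.

    The two-hour default is an arbitrary assumption.
    """
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)


def first_steps(changes, incident_start):
    """Turn the first two heuristics into a list of suggested next steps."""
    candidates = candidate_changes(changes, incident_start)
    if not candidates:
        # Heuristic 2: no known change correlates, so the search space is wide open.
        return ["No known change correlates: it could be anything, widen the search."]
    # Heuristic 1: each recent change becomes a hypothesis to check.
    return [
        f"Check: did '{c.description}' by {c.author} cause this?"
        for c in candidates
    ]


if __name__ == "__main__":
    # Hypothetical sample data.
    incident = datetime(2016, 1, 1, 13, 0)
    log = [
        Change("enable sidebar module", "alice", datetime(2016, 1, 1, 12, 30)),
        Change("schema change", "bob", datetime(2016, 1, 1, 9, 0)),
        Change("cache config tweak", "carol", datetime(2016, 1, 1, 12, 50)),
    ]
    for step in first_steps(log, incident):
        print(step)
```

A sketch like this only narrows where to look first. As the hypothesis timeline in the talk shows, responders still propose and discard many explanations that no change log would surface.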
46. PUNCH LINE
Anomaly response does not happen the way we might imagine it does.
We have to look hard at how it actually happens.