The DevOps movement has influenced not only the tools we use in modern development and operations engineering, but also how we work. It has changed how we respond when systems stop working or don't work as expected. In this session, we share methods and techniques for gathering data and using it effectively to mitigate or avoid future failures. This session is brought to you by AWS Partner, Datadog.
3. “THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.”
- Henry Ford
@dbenamy @datadoghq
4. COLLECTING DATA IS CHEAP; NOT HAVING DATA WHEN YOU NEED IT CAN BE EXPENSIVE.
5. “The problems we work on at Datadog are hard and often don't have obvious, clear solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.”
Internal Datadog Developer Guide
6. DAMON EDWARDS & JOHN WILLIS - DEVOPSDAYS MOUNTAIN VIEW 2010
WHAT IS DEVOPS?
▸ Culture
▸ Automation
▸ Metrics
▸ Sharing
9. CULTURE & SHARING RESOURCES
BLAMELESS POSTMORTEMS
▸ Blameless Postmortems, by John Allspaw
http://bit.ly/etsy-blameless
▸ The Human Side of Postmortems, by Dave Zwieback
http://bit.ly/human-postmortem
25. Query-Based Monitoring
“What’s the average throughput of application:nginx per version?”
“How many requests per second is my role:accounting-app running application:postgresql hosted in region:us-west-1 compared to region:us-east-1?”
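The first question above can be sketched as a filter-then-group operation over tagged data points. This is a minimal illustration only: the sample points, the tag values, and the `average_by` helper are all hypothetical, and the code is not Datadog's query language or API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical throughput samples; each point carries key:value tags.
POINTS = [
    {"value": 120.0, "tags": {"application": "nginx", "version": "1.9", "region": "us-west-1"}},
    {"value": 100.0, "tags": {"application": "nginx", "version": "1.9", "region": "us-east-1"}},
    {"value": 90.0, "tags": {"application": "nginx", "version": "1.10", "region": "us-east-1"}},
    {"value": 40.0, "tags": {"application": "postgresql", "version": "9.5", "region": "us-west-1"}},
]


def average_by(points, filters, group_key):
    """Average the values of points matching every filter tag, grouped by one tag."""
    groups = defaultdict(list)
    for point in points:
        tags = point["tags"]
        matches = all(tags.get(key) == wanted for key, wanted in filters.items())
        if matches and group_key in tags:
            groups[tags[group_key]].append(point["value"])
    return {group: mean(values) for group, values in groups.items()}


# "What's the average throughput of application:nginx per version?"
per_version = average_by(POINTS, {"application": "nginx"}, "version")
print(per_version)
```

The same helper answers the region comparison by grouping on "region" instead of "version", which is the point of querying by tags rather than by host name.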
41. HUMAN DATA
DATA COLLECTION: WHEN?
▸ As soon as possible
▸ Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases
▸ Get managers involved!
42. HUMAN DATA
DATA SKEW / CORRUPTION
▸ Stress
▸ Sleep deprivation
▸ Burnout
43. HUMAN DATA
DATA SKEW / CORRUPTION
▸ Blame
▸ Fear of punitive action
44. HUMAN DATA
DATA SKEW / CORRUPTION
▸ Bias
▸ Anchoring
▸ Hindsight
▸ Outcome
▸ Availability (recency)
▸ Bandwagon effect
46. DATADOG POSTMORTEMS
A FEW NOTES
▸ Findings emailed company wide
▸ Scheduled recurring review meetings
48. DATADOG’S POSTMORTEM TEMPLATE (1/5)
SUMMARY: WHAT HAPPENED?
▸ Describe what happened here at a high level. Think of it as an abstract in a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?
52. DATADOG’S POSTMORTEM TEMPLATE (2/5)
HOW WAS THE OUTAGE DETECTED?
▸ We want to make sure that we detected the issue early and that we would catch the same issue if it were to repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
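The last question on this slide, how long it took to declare an outage, comes down to timestamp arithmetic once the timeline is written down. A rough sketch follows; the `detection_delays` function and the example timestamps are hypothetical and not taken from any real incident.

```python
from datetime import datetime, timedelta


def detection_delays(started, first_alert, declared):
    """Return (time to first alert, time to declare an outage).

    first_alert may be None when no monitor fired; the first value is then None.
    """
    to_alert = first_alert - started if first_alert is not None else None
    to_declare = declared - started
    return to_alert, to_declare


# Hypothetical timeline for illustration only.
started = datetime(2016, 1, 1, 10, 0)
first_alert = datetime(2016, 1, 1, 10, 4)
declared = datetime(2016, 1, 1, 10, 15)

to_alert, to_declare = detection_delays(started, first_alert, declared)
print(to_alert, to_declare)
```

A `None` in the first position is a direct answer to "Was there a monitor on that metric?" and is worth calling out in the write-up.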
60. DATADOG’S POSTMORTEM TEMPLATE (3/5)
HOW DID WE RESPOND?
▸ Who was the incident owner, and who else was involved?
▸ Slack archive links and timeline of events!
▸ What went well?
▸ What didn’t go so well?
65. DATADOG’S POSTMORTEM TEMPLATE (4/5)
WHY DID IT HAPPEN?
▸ Deep dive into the cause
▸ Examples from this incident
▸ http://bit.ly/dd-statuspage
▸ http://bit.ly/alq-postmortem
67. DATADOG’S POSTMORTEM TEMPLATE (5/5)
HOW DO WE PREVENT IT IN THE FUTURE?
▸ Link to GitHub issues and Trello cards
▸ Now?
▸ Next?
▸ Later?
▸ Follow-up notes
70. DATADOG’S POSTMORTEM TEMPLATE
RECAP:
▸ What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸ Why did it happen (deep dive)?
▸ Actionable next steps!
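One way to keep the five sections from being skipped is to model the template as a small data structure and check it before the review meeting. The sketch below is an assumption about how that could look: the `Postmortem` and `ActionItem` classes and their field names are hypothetical, and only the section names and the now/next/later horizons come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    horizon: str  # "now", "next", or "later", as on the 5/5 slide
    link: str = ""  # e.g. a GitHub issue or Trello card URL


@dataclass
class Postmortem:
    summary: str = ""
    detection: str = ""
    response: str = ""
    cause: str = ""
    actions: List[ActionItem] = field(default_factory=list)

    def missing_sections(self):
        """List the template sections that are still empty."""
        sections = {
            "What happened?": self.summary,
            "How did we detect it?": self.detection,
            "How did we respond?": self.response,
            "Why did it happen?": self.cause,
            "Actionable next steps": self.actions,
        }
        return [name for name, content in sections.items() if not content]


draft = Postmortem(summary="Hypothetical: API latency rose for some customers.")
draft.actions.append(ActionItem("Add a monitor on queue depth", horizon="now"))
print(draft.missing_sections())
```

Here the draft still lacks the detection, response, and cause sections, so those three names are printed.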
71. KEEP LEARNING
MORE RESOURCES
▸ Our template
http://bit.ly/postmortem-template
▸ Post-Incident Reviews, by Jason Hand
http://bit.ly/post-incident-review
▸ The Infinite Hows, by John Allspaw
http://bit.ly/infinite-hows
▸ “Blameless” Postmortems Don’t Work, by J. Paul Reed
http://bit.ly/blameless-dont-work