“The only real mistake is the one from which we learn nothing.” So how do we learn from system failures? This session will move beyond “blameless” postmortems and show how to use data to avoid and mitigate future failures. We will share best practices for gathering systems-related data and people-related data. You will then learn how to apply that data to formulate actionable response plans and avoid repeating failures. This session is brought to you by AWS Summit San Francisco Platinum Sponsor Datadog.
3. “The problems we work on at
Datadog are hard and often don't
have obvious, clean-cut solutions,
so it's useful to cultivate your
troubleshooting skills, no matter
what role you work in.”
Internal Datadog Developer Guide
TW: @gitbisect @datadoghq
5. “AN ANALYSIS OR
DISCUSSION OF AN EVENT
HELD SOON AFTER IT HAS
OCCURRED, ESPECIALLY
IN ORDER TO DETERMINE
WHY IT WAS A FAILURE.”
OXFORD ENGLISH DICTIONARY
POSTMORTEM
6. DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
WHAT IS
DEVOPS?
▸ Culture
▸ Automation
▸ Metrics
▸ Sharing
11. “You’re either building a
learning organization… or you
will be losing to someone who
is.”
Andrew Clay Shafer
WINNING OR LOSING?
12. CULTURE & SHARING ARE GREAT, BUT WHAT
ABOUT
18. How granular?
• AWS CloudWatch – 1 minute
• Google Stackdriver – 1 minute
• MS Azure
• 1 minute up to 24 hours
• 1 hour up to 7 days
• 1 day up to 30 days
• Datadog – seconds
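Why granularity matters can be shown with a minimal sketch. It assumes a hypothetical per-second latency series (the numbers are invented for illustration) and averages it into one 1-minute point, the way a coarser collection interval would. The `downsample` helper is illustrative only and is not any vendor's API.

```python
from statistics import mean

def downsample(samples, bucket_size):
    """Average consecutive samples into buckets of bucket_size points."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]

# Hypothetical per-second latency in ms: steady 100 ms with a 5-second spike.
per_second = [100] * 60
per_second[30:35] = [2000] * 5

# One point per minute, as a 1-minute granularity would report it.
per_minute = downsample(per_second, 60)

print("peak at 1-second granularity:", max(per_second))
print("value at 1-minute granularity:", round(per_minute[0], 1))
```

The 5-second spike is obvious in the per-second series but is flattened to a modest bump once the minute is averaged.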
25. How long is retention?
• AWS CloudWatch: 15 months!
• 1 minute granularity up to 15 days
• 5 minute granularity up to 63 days
• 1 hour granularity up to 15 months
• Google Stackdriver: 6 weeks
• MS Azure: 90 days
• Datadog: 15 months, no aggregation
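The tiered retention above determines how much detail is still available when a postmortem looks back at older data. The sketch below encodes only the CloudWatch tiers listed on this slide; the function name and the 455-day approximation of 15 months are assumptions made for illustration.

```python
# CloudWatch tiers as listed on the slide:
# (maximum data age in days, finest granularity in seconds)
CLOUDWATCH_TIERS = [
    (15, 60),      # 1 minute granularity up to 15 days
    (63, 300),     # 5 minute granularity up to 63 days
    (455, 3600),   # 1 hour granularity up to 15 months (assumed ~455 days)
]

def finest_granularity(age_days, tiers=CLOUDWATCH_TIERS):
    """Return the finest granularity (seconds) still retained for data of
    the given age, or None if the data is past the last tier."""
    for max_age_days, granularity_seconds in tiers:
        if age_days <= max_age_days:
            return granularity_seconds
    return None

for age in (1, 30, 100, 500):
    print(age, "days old ->", finest_granularity(age))
```

Under these tiers, an incident reviewed a month later can only be examined at 5-minute resolution.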
38. HUMAN DATA
DATA COLLECTION:
WHEN?
▸ As soon as possible.
▸ Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases
▸ Get your project managers involved!
41. CULTURE & SHARING RESOURCES
BLAMELESS
POSTMORTEMS
▸ Blameless Postmortems by John Allspaw
http://bit.ly/etsy-blameless
▸ The Human Side of Postmortems by Dave
Zwieback
http://bit.ly/human-postmortem
44. DATADOG POSTMORTEMS
A FEW NOTES
▸ Postmortems emailed company-wide
▸ Scheduled recurring postmortem meetings
45. DATADOG’S POSTMORTEM TEMPLATE (1/5)
SUMMARY: WHAT
HAPPENED?
▸ Describe what happened here at a high level;
think of it as the abstract of a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?
48. DATADOG’S POSTMORTEM TEMPLATE (2/5)
HOW WAS THE
OUTAGE DETECTED?
▸ We want to make sure we detected the issue early
and would catch the same issue if it were to
repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
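The detection questions above come down to a few timestamps. As a rough sketch, assuming you can recover when the outage started, when the first monitor fired, and when an outage was declared (the function name and all timestamps below are hypothetical), the durations can be computed like this:

```python
from datetime import datetime

def detection_timeline(started, first_alert, declared):
    """Return how long it took to alert on and to declare an outage."""
    return {
        "time_to_alert": first_alert - started,
        "time_to_declare": declared - started,
    }

# Hypothetical incident timestamps, for illustration only.
timeline = detection_timeline(
    started=datetime(2017, 4, 19, 14, 0),
    first_alert=datetime(2017, 4, 19, 14, 4),
    declared=datetime(2017, 4, 19, 14, 12),
)

for name, duration in timeline.items():
    print(name, duration)
```

A large gap between the two values suggests the monitor worked but the decision to declare an outage was slow.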
51. DATADOG’S POSTMORTEM TEMPLATE (3/5)
HOW DID WE
RESPOND?
▸ Who was the incident owner & who else was
involved?
▸ Slack archive links and timeline of events!
▸ What went well?
▸ What didn’t go so well?
52. *Names and links changed for privacy/security.
55. DATADOG’S POSTMORTEM TEMPLATE (4/5)
WHY DID IT HAPPEN?
▸ Deep dive into the cause
▸ Examples from this incident:
▸ http://bit.ly/dd-statuspage
▸ http://bit.ly/alq-postmortem
56. DATADOG’S POSTMORTEM TEMPLATE (5/5)
HOW DO WE PREVENT IT
IN THE FUTURE?
▸ Link to GitHub issues and Trello cards
▸ Now?
▸ Next?
▸ Later?
▸ Follow up notes
58. DATADOG’S POSTMORTEM TEMPLATE
RECAP:
▸ What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸ Why did it happen (deep dive)?
▸ Actionable next steps!
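The five sections in the recap can be captured as a small data structure so that every postmortem starts from the same skeleton. This is a minimal sketch, not Datadog's actual tooling: the `Postmortem` class, its field names, and the example values are all assumptions, and only the section headings come from the slides.

```python
from dataclasses import dataclass, field

# Section headings taken from the recap slide, paired with hypothetical field names.
SECTIONS = [
    ("What happened (summary)?", "summary"),
    ("How did we detect it?", "detection"),
    ("How did we respond?", "response"),
    ("Why did it happen (deep dive)?", "root_cause"),
]

@dataclass
class Postmortem:
    title: str
    summary: str = ""
    detection: str = ""
    response: str = ""
    root_cause: str = ""
    next_steps: list = field(default_factory=list)

    def render(self):
        """Render the postmortem as plain text; empty sections show TODO."""
        lines = [self.title, ""]
        for heading, attr in SECTIONS:
            lines += [heading.upper(), getattr(self, attr) or "TODO", ""]
        lines.append("ACTIONABLE NEXT STEPS")
        if self.next_steps:
            lines += [f"▸ {step}" for step in self.next_steps]
        else:
            lines.append("TODO")
        return "\n".join(lines)

# Hypothetical example values.
pm = Postmortem(
    title="Example outage",
    summary="Intake was delayed for 20 minutes.",
    next_steps=["Add a monitor on intake latency"],
)
text = pm.render()
print(text)
```

Leaving unfilled sections marked as TODO makes gaps visible before the postmortem is shared.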
59. KEEP LEARNING
MORE RESOURCES
▸ Postmortem Template
http://bit.ly/postmortem-template
▸ The Infinite Hows - John Allspaw
http://bit.ly/infinite-hows
▸ “Blameless” Postmortems don’t work - J Paul Reed
http://bit.ly/blameless-dont-work