Scaling Humans
Ops teams and incident management
dotScale, Paris 2015
David Mytton, CEO, Server Density
Cost of uptime?
• $2.9bn (Q1 2015)
• $870m (Q1 2015)
• $4.1bn (Q1 2015)
How much are you spending?
Expect downtime
• Prepare
• Respond
• Postmortem
Prepare
• On call
• Primary/secondary
• Reachability
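One way to picture the primary/secondary arrangement is as an escalation order: page the primary first, and fall back to the secondary if the primary cannot be reached. The Python sketch below is only an illustration under that assumption; the `Responder` type, the `reachable` flag, and the example names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Responder:
    name: str
    reachable: bool  # hypothetical flag: did this person acknowledge the page?


def pick_responder(rota: Sequence[Responder]) -> Optional[Responder]:
    """Return the first reachable person in escalation order (primary first)."""
    for person in rota:
        if person.reachable:
            return person
    return None  # nobody acknowledged: escalate beyond the rota


rota = [
    Responder("primary", reachable=False),
    Responder("secondary", reachable=True),
]
chosen = pick_responder(rota)
print(chosen.name if chosen else "nobody reachable")
```

Returning `None` rather than raising keeps the "nobody reachable" case explicit, so the caller has to decide how to escalate further.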
Prepare
• On call
• Off call
• Docs
• Searchable
• Independent
Prepare
• Key info
• Team contacts
• Vendor contacts
• Key credentials
Prepare
• Key info
• Unexpected situations
• Communication
• Internet access
• Support access
Respond
• First responder
1. Load incident response checklist
2. Log into Ops War Room
3. Log incident in JIRA
4. Begin investigation
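The first responder steps form an ordered checklist, which could be modelled roughly as follows. The step wording follows the slide; the `Checklist` class and its method names are hypothetical, and nothing here talks to a real chat room or to JIRA.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Step wording follows the slide; the class itself is illustrative only.
FIRST_RESPONDER_STEPS = [
    "Load incident response checklist",
    "Log into Ops War Room",
    "Log incident in JIRA",
    "Begin investigation",
]


@dataclass
class Checklist:
    steps: List[str]
    done: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with every step outstanding unless a state was supplied.
        if not self.done:
            self.done = [False] * len(self.steps)

    def next_step(self) -> Optional[str]:
        """Return the first step not yet completed, or None when finished."""
        for step, finished in zip(self.steps, self.done):
            if not finished:
                return step
        return None

    def complete_next(self) -> None:
        """Mark the earliest outstanding step as done, preserving the order."""
        for i, finished in enumerate(self.done):
            if not finished:
                self.done[i] = True
                return


checklist = Checklist(FIRST_RESPONDER_STEPS)
checklist.complete_next()
print(checklist.next_step())
```

Completing steps strictly in order reflects the numbered list: investigation starts only after the incident has been logged.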
Respond
• Key response principles
• Log everything
• Frequent public updates
• Gather the team
• Escalate!
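"Log everything" can be as simple as an append-only, timestamped list of who did what. The sketch below assumes an in-memory log; the `log_event` and `render` helpers and the sample messages are hypothetical, not a tool named in the talk.

```python
from datetime import datetime, timezone
from typing import List, Tuple

# Each entry is (UTC timestamp, author, message).
IncidentLog = List[Tuple[datetime, str, str]]


def log_event(log: IncidentLog, author: str, message: str) -> None:
    """Append a UTC-timestamped entry so the timeline can be rebuilt later."""
    log.append((datetime.now(timezone.utc), author, message))


def render(log: IncidentLog) -> str:
    """Format the log as one line per entry, oldest first."""
    lines = []
    for when, author, message in log:
        stamp = when.strftime("%Y-%m-%d %H:%M:%S UTC")
        lines.append(f"{stamp} [{author}] {message}")
    return "\n".join(lines)


log: IncidentLog = []
log_event(log, "responder", "Alert received, checklist loaded")
log_event(log, "responder", "Escalated to secondary")
print(render(log))
```

A log like this supplies the raw timeline for both the public updates and the later postmortem.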
Postmortem
• Within a few days
• Tell the story
• Appropriate technical detail
• What failed, why?
Postmortem
• How it’s going to be fixed
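The postmortem bullets map naturally onto a document skeleton. The sketch below generates one; the section titles paraphrase the slide bullets, while the five-day publish window is an assumed reading of "within a few days" and the `postmortem_skeleton` helper is hypothetical.

```python
from datetime import date, timedelta

# Section titles mirror the slide bullets. The five-day window is an assumed
# reading of "within a few days", not a figure from the talk.
SECTIONS = (
    "The story",
    "Technical detail",
    "What failed, and why",
    "How it is going to be fixed",
)
PUBLISH_WINDOW = timedelta(days=5)


def postmortem_skeleton(title: str, incident_date: date) -> str:
    """Build an empty postmortem document with a publish-by date."""
    due = incident_date + PUBLISH_WINDOW
    lines = [f"Postmortem: {title}", f"Publish by: {due.isoformat()}", ""]
    for section in SECTIONS:
        lines.extend([section, "(to be written)", ""])
    return "\n".join(lines)


print(postmortem_skeleton("Example outage", date(2015, 6, 8)))
```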
Thank you very much
david@serverdensity.com
@davidmytton
Scaling humans - Ops teams and incident management

100% uptime is impossible. Modern architectures are designed around failure but what does that mean for the human aspect of incident management? This talk considers how to prepare for outages, how to structure the response, and how those experiences and techniques differ for small and large companies.

Presented by David Mytton at dotScale Paris 2015-06-08
