Nagios Conference 2012 - Alex Solomon - Managing Your Heroes

Alex Solomon's presentation on the people aspect of monitoring, given during the Nagios World Conference North America, held September 25-28, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

  1. MANAGING YOUR HEROES: The People Aspect of Monitoring (a.k.a. Dealing with Outages and Failures) Alex Solomon, alex@pagerduty.com
  2. WHO AM I? Alex Solomon • Founder / CEO of PagerDuty • Intersect Inc. • Amazon.com
  3. DEFINITIONS
  4. Service Level Agreement (SLA) • Mean Time To Resolution (MTTR) • Mean Time To Response • Mean Time Between Failures (MTBF) (an illustrative availability sketch follows the transcript)
  5. OUTAGES
  6. Can we prevent them?
  7. PREVENTING OUTAGES • Single Points of Failure (SPOFs) ➞ redundant systems • Complex, monolithic systems ➞ service-oriented architecture
  8. Netflix distributed SOA system
  9. PREVENTING OUTAGES • Change (not much you can do about this one)
  10. OUTAGES
  11. FAILURE LIFECYCLE
  12. Monitoring detects failure ➞ Alert ➞ Investigate ➞ Fix ➞ Root-cause Analysis
  13. Critical Incident Timeline: Issue is detected ➞ Engineer starts working on issue ➞ Issue is fixed. RESPONSE TIME covers the Alert phase (detection to engineer starting work); RESOLUTION TIME covers Investigate and Fix (engineer starting work to issue fixed).
  14. MONITOR
  15. MONITOR EVERYTHING! All levels of the stack: • Data center • Network • Servers • Database • Application • Website • Business metrics
  16. WHY MONITOR EVERYTHING? Metrics! Metrics! Metrics!
  17. TOOLS • Internal monitoring (behind the firewall): [tool logos shown on slide] • External monitoring (SaaS-based): [tool logos shown on slide] • Metrics: Graphite or [tool logo shown on slide]
  18. ALERT
  19. Best Practice: Categorize alerts by severity.
  20. SEVERITIES. Define severities based on business impact: • sev1 - large scale business loss • sev2 - small to medium business loss (sev1 and sev2 are the 2 critical severities) • sev3 - no immediate business loss, customers may be impacted • sev4 - no business loss, no customers impacted (sev3 and sev4 are the 2 non-critical severities)
  21. Each severity level should have its own standard operating procedure (SOP): • Who • How • Response time
  22. • Sev1: Major outage, all hands on deck • Notify the entire team via phone and SMS • Response time: 5 min • Sev2: Critical issue • Notify the on-call person via phone and SMS • Response time: 15 min • Sev3: Non-critical issue • Notify the on-call person via email • Response time: next day during business hours (a severity-to-SOP sketch follows the transcript)
  23. • Sev1 incidents • Rare • Rarely auto-generated • Frequently start as sev2 which are upgraded to sev1
  24. • Sev2 incidents • More common • Mostly auto-generated
  25. • Sev3 incidents • Non-critical incidents • Can be auto-generated • Can also be manually generated
  26. • Severities can be downgraded or upgraded • ex. sev2 ➞ sev1 (problem got worse) • ex. sev1 ➞ sev2 (problem was partially fixed) • ex. sev2 ➞ sev3 (critical problem was fixed but we still need to investigate root cause)
  27. One more best practice: Alert before your systems fail completely (a warning-vs-critical threshold sketch follows the transcript)
  28. Main benefit of severities: Only page on critical issues (sev1 or sev2)
  29. Preserve sanity
  30. Avoid “Peter and the wolf” scenarios
  31. ON-CALL BEST PRACTICES: Person Level • Team Level
  32. ON-CALL AT THE PERSON LEVEL: Cellphone
  33. Cellphone, smart phone [slide shows device combinations with OR / AND]
  34. 4G / 3G internet: 4G hotspot • 4G USB modem • 3G/4G tethering (don’t forget your laptop)
  35. Page multiple times until you respond • Time zero: email and SMS • 1 min later: phone-call on cell • 5 min later: phone-call on cell • 5 min later: phone-call on landline • 5 min later: phone-call to girlfriend (a repeated-paging sketch follows the transcript)
  36. Bonus: vibrating Bluetooth bracelet
  37. ON-CALL AT THE TEAM LEVEL: Do not send alerts to the entire team (sev1: rarely OK • sev2: no)
  38. On-call schedules: • Simple rotation-based schedule • ex. weekly - everyone is on-call for a week at a time • Set up a follow-the-sun schedule • people in multiple timezones • no night shifts (a weekly-rotation sketch follows the transcript)
  39. What happens if the on-call person doesn’t respond at all?
  40. If you care about uptime, you need redundancy in your on-call.
  41. Set up multiple on-call levels with automatic escalation between them: Level 1: Primary on-call ➞ Escalate after 15 min ➞ Level 2: Secondary on-call ➞ Escalate after 20 min ➞ Level 3: Team on-call (alert entire team)
  42. Best Practice: Put management in the on-call chain. Level 1: Primary on-call ➞ Escalate after 15 min ➞ Level 2: Secondary on-call ➞ Escalate after 20 min ➞ Level 3: Team on-call (alert entire team) ➞ Escalate after 20 min ➞ Level 4: Manager / Director (an escalation-chain sketch follows the transcript)
  43. Best Practice: put software engineers in the on-call chain • Devops model • Devs need to own the systems they write • Getting paged provides a strong incentive to engineer better systems
  44. Best Practice: measure on-call performance. “You can’t improve what you don’t measure.” • Measure: mean-time-to-response • Measure: % of issues that were escalated • Set up policies to encourage good performance • Put managers in the on-call chain • Pay people extra to do on-call (an on-call metrics sketch follows the transcript)
  45. Network Operations Center
  46. NOC with lots of Nagios goodness
  47. NOCs: • Reduce the mean-time-to-response drastically • Expensive (staffed 24x7 with multiple people) • Train NOC staff to fix a good percentage of issues • As you scale your org, you may want a hybrid on-call approach (where the NOC handles some issues, teams handle other issues directly)
  48. Critical Incident Timeline (expanded): Issue is detected ➞ Engineer is aware of issue ➞ Engineer starts working on issue ➞ Issue is fixed. The Alert phase breaks into two gaps: the alerting system gets ahold of somebody (detected ➞ aware), then the engineer gets to a computer and connects to the internet (aware ➞ starts working). RESPONSE TIME covers the Alert phase; RESOLUTION TIME covers Investigate and Fix.
  49. How to minimize “issue is detected ➞ engineer is aware of issue” (alerting system gets ahold of somebody): • Alert via phone & SMS • Alert multiple times via multiple channels • Failing that, escalate! How to minimize “engineer is aware ➞ engineer starts working on issue” (engineer gets to a computer, connects to internet): • Carry 4G internet device + laptop at all times • Set loud ringtone at night • Failing that, escalate to manager!
  50. RESEARCH & FIX
  51. How do we reduce the amount of time needed to investigate and fix?
  52. Set up an Emergency Ops Guide: • When you encounter a new failure, document it in the Guide • Document symptoms, research steps, fixes • Use a wiki
  53. [image-only slide]
  54. [image-only slide]
  55. Automate fixes or add more fault tolerance
  56. You need the right tools: • Tools to help you diagnose problems faster • Comprehensive monitoring, metrics and dashboards • Tools that help search for problems in log files quickly (e.g. Splunk) • Tools to help your team communicate efficiently • Voice: conference bridge, Skype, Google Hangout • Chat: HipChat, Campfire
  57. Best Practice: Incident Commander
  58. Incident Commander: • Essential for dealing with sev1 issues • In charge of the situation • Provides leadership, prevents analysis paralysis • He/she directs people to do things • Helps save time making decisions
  59. Questions? Alex Solomon, alex@pagerduty.com
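
The sketches below are not part of the original slides; they are rough illustrations of a few mechanisms the talk describes. First, the reliability terms defined on slide 4 relate through a standard rule of thumb: availability can be estimated from MTBF and MTTR. The function name and example numbers here are made up.

```python
# Rough illustration (not from the talk): how MTBF and MTTR combine into availability.
# Assumes "MTTR" here means mean time to resolution, measured in hours.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example: a failure every 30 days on average, 2 hours to resolve each one.
print(f"{availability(mtbf_hours=30 * 24, mttr_hours=2):.2%}")  # 99.72%
```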
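Slides 20-22 pair each severity with a business impact and a standard operating procedure. A minimal sketch of that mapping as a table-driven lookup; the dictionary layout and field names are illustrative, not PagerDuty's or the speaker's actual configuration.

```python
# Hypothetical severity-to-SOP table in the spirit of slides 20-22.
SEVERITY_SOP = {
    "sev1": {"impact": "large scale business loss", "who": "entire team",
             "notify": ["phone", "sms"], "response_minutes": 5, "page": True},
    "sev2": {"impact": "small to medium business loss", "who": "on-call person",
             "notify": ["phone", "sms"], "response_minutes": 15, "page": True},
    "sev3": {"impact": "no immediate business loss, customers may be impacted",
             "who": "on-call person", "notify": ["email"],
             "response_minutes": None, "page": False},  # next business day
    "sev4": {"impact": "no business loss, no customers impacted",
             "who": "on-call person", "notify": ["email"],
             "response_minutes": None, "page": False},
}

def should_page(severity: str) -> bool:
    """Only the critical severities (sev1/sev2) wake somebody up."""
    return SEVERITY_SOP[severity]["page"]

print(should_page("sev2"))  # True
```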
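Slide 27's advice to alert before systems fail completely usually comes down to pairing a non-critical warning threshold with a critical one. A rough sketch with made-up disk-usage thresholds:

```python
# Sketch of "alert before your systems fail completely": raise a non-critical
# warning well before the hard limit is hit. Thresholds are made up.

def check_disk(percent_used: float, warn_at: float = 80.0, crit_at: float = 95.0) -> str:
    """Return a Nagios-style state for a disk-usage reading."""
    if percent_used >= crit_at:
        return "CRITICAL"  # sev1/sev2 territory: page the on-call person
    if percent_used >= warn_at:
        return "WARNING"   # sev3: fix during business hours, before it becomes an outage
    return "OK"

print(check_disk(83.0))  # WARNING: act while the system is still up
```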
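Slide 35's repeated-paging schedule can be expressed as a list of offsets and channels. The offsets below follow the slide; the send, acknowledged, and wait_minutes hooks are hypothetical, not a real PagerDuty API.

```python
# Sketch of "page multiple times until you respond" (slide 35).
NOTIFICATION_PLAN = [      # (minutes after the alert opens, channel)
    (0,  "email + sms"),
    (1,  "phone: cell"),
    (6,  "phone: cell"),
    (11, "phone: landline"),
    (16, "phone: partner"),
]

def page_until_acknowledged(send, acknowledged, wait_minutes):
    """Fire each notification in turn until the responder acknowledges."""
    elapsed = 0
    for offset, channel in NOTIFICATION_PLAN:
        wait_minutes(offset - elapsed)
        elapsed = offset
        if acknowledged():
            return True
        send(channel)
    return False  # nobody answered: escalate (slides 41-42)

# Dry run with stub hooks: nothing is acknowledged, so every channel fires.
page_until_acknowledged(send=print, acknowledged=lambda: False, wait_minutes=lambda m: None)
```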
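Slide 38's simple weekly rotation boils down to counting weeks since a fixed start date. A sketch with a made-up roster and rotation start:

```python
from datetime import date

TEAM = ["alice", "bob", "carol", "dave"]   # illustrative roster
ROTATION_EPOCH = date(2012, 9, 24)         # a Monday when the rotation started (made up)

def on_call(today: date) -> str:
    """Whoever owns the week containing `today` in a simple weekly rotation."""
    weeks_elapsed = (today - ROTATION_EPOCH).days // 7
    return TEAM[weeks_elapsed % len(TEAM)]

print(on_call(date(2012, 10, 9)))  # third week of the rotation -> "carol"
```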
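Slides 41-42 describe automatic escalation between on-call levels. A sketch of that chain using the timeouts shown on the slides; the notify and acknowledged_within hooks are hypothetical.

```python
ESCALATION_CHAIN = [                  # (level, minutes to acknowledge before escalating)
    ("primary on-call",    15),
    ("secondary on-call",  20),
    ("entire team",        20),
    ("manager / director", None),     # last stop, nowhere left to escalate
]

def run_escalation(notify, acknowledged_within):
    """Notify each level in turn until someone acknowledges the incident."""
    for level, timeout_min in ESCALATION_CHAIN:
        notify(level)
        if timeout_min is None or acknowledged_within(timeout_min):
            return level              # handled here (or the chain is exhausted)

# Dry run with stub hooks: nobody acknowledges, so the chain walks to the top.
print(run_escalation(notify=print, acknowledged_within=lambda minutes: False))
```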
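Slide 44 recommends measuring mean time to response and the share of incidents that had to be escalated. A sketch over illustrative incident records, not a real data model:

```python
# Made-up incident records: minutes from detection to first response, plus
# whether the incident had to be escalated past the primary on-call.
incidents = [
    {"detected_min": 0, "responded_min": 4,  "escalated": False},
    {"detected_min": 0, "responded_min": 22, "escalated": True},
    {"detected_min": 0, "responded_min": 7,  "escalated": False},
]

mean_response = sum(i["responded_min"] - i["detected_min"] for i in incidents) / len(incidents)
escalated_pct = 100.0 * sum(i["escalated"] for i in incidents) / len(incidents)

print(f"mean time to response: {mean_response:.1f} min")  # 11.0 min
print(f"escalated: {escalated_pct:.0f}% of incidents")    # 33%
```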
