Leveling up monitoring: A decade of automating and scaling Nagios

Monitoring - we all have to do it, but most people don’t seem to like it very much. Etsy has been using Nagios for over a decade to monitor its infrastructure, and over that time has created a set of tools that has allowed multiple teams to deploy, manage, and scale it. In this talk we will offer guidelines on how to scale monitoring and alerting setups, ideas for workflows around monitoring, and methods of reducing friction and alert fatigue for on-call engineers.

Published in: Software

  1. 1. Leveling Up Monitoring: A Decade of Automating and Scaling Nagios Katherine Daniels and Laurie Denness @beerops - @lozzd Velocity 2016
  2. 2. Katherine Daniels (@beerops): Senior Operations Engineer, Etsy; Co-Author of Effective DevOps • Laurie Denness (@lozzd): Staff Operations Engineer, Etsy; Official Graph Enthusiast
  4. 4. Agenda • 1: In The Beginning... • 2: Automation • 3: Deployinator • 4: Scaling + Tooling
  5. 5. About Etsy • 25M Active Buyers • 1.6M Active Sellers • $2.39B 2015 Annual GMS (As of March 31, 2016)
  6. 6. Monitoring!
  9. 9. bit.ly/yaynagios
  10. 10. https://kartar.net/2015/08/monitoring-survey-2015---tools/
  11. 11. In The Beginning
  14. 14. Sometimes your statement needs emphasis with a black background.
  16. 16. LESSONS LEARNED: Templates are awesome.
  21. 21. define service {
            use                  generic-service
            hostgroups           Linux_hosts,!email-only-servers
            service_description  SSH
            check_command        check_ssh
          }
  22. 22. define service {
            use             disk-space-service
            hostgroup_name  email-only-servers
            contact_groups  ops_nonurgent
          }
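
The SSH check on slide 21 targets `Linux_hosts,!email-only-servers`: every host in the first group except those in the second. As a rough illustration of how such an expression narrows down to a host list, here is a minimal Python sketch of a simplified model, not Nagios's own resolver. The function name and the host names are hypothetical.

```python
def resolve_hostgroups(expression, members):
    """Return the hosts selected by a comma-separated hostgroup expression.

    A plain group name adds that group's hosts; a name prefixed with "!"
    removes that group's hosts from the result.
    """
    included, excluded = set(), set()
    for term in expression.split(","):
        term = term.strip()
        if not term:
            continue
        if term.startswith("!"):
            excluded |= set(members[term[1:].strip()])
        else:
            included |= set(members[term])
    return included - excluded


# Hypothetical hostgroup membership, for illustration only.
members = {
    "Linux_hosts": ["web01", "web02", "mail01"],
    "email-only-servers": ["mail01"],
}

ssh_hosts = resolve_hostgroups("Linux_hosts,!email-only-servers", members)
print(sorted(ssh_hosts))
```

In this model `mail01` drops out of the SSH check, while the separate definition on slide 22 still covers it through the `email-only-servers` group.
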
  24. 24. LESSONS LEARNED: Start small.
  25. 25. Nagios and Chef
  29. 29. LESSONS LEARNED: Automation is awesome!
  30. 30. LESSONS LEARNED: Automation is awesome! HA HA JUST KIDDING
  33. 33. LESSONS LEARNED: Trust but verify.
  34. 34. How Many Repos?
  37. 37. LESSONS LEARNED: ?!?!?!?!??!?!
  38. 38. LESSONS LEARNED: Try, fail, learn, and try again.
  44. 44. Problems • Four git repos, inconsistent mess, duplication • Broken semi-useful automation - need to regain trust • Some shared config, some unique • Gain confidence in changes • Stop editing on the production box
  45. 45. Nagios and Chef
  46. 46. Nagios and Chef and Deployinator!
  47. 47. Solution 1: Merge everything: find and remove duplication, shared configs
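
One way to approach the "find and remove duplication" part of Solution 1 is to parse every `define ... { ... }` block and compare the blocks after normalising whitespace and comments. The Python sketch below does that for configs held in memory. It is an assumption about how such a check could look, not the tooling used in the talk; the file names and definitions are hypothetical, and the parser covers only the simple block form shown on the slides.

```python
import re
from collections import defaultdict

DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_definitions(text):
    """Yield (object_type, directives) for each `define type { ... }` block."""
    for match in DEFINE_RE.finditer(text):
        directives = {}
        for line in match.group(2).splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ; comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        yield match.group(1), directives


def find_duplicates(files):
    """Map each definition seen more than once to the files that contain it."""
    seen = defaultdict(list)
    for name, text in files.items():
        for object_type, directives in parse_definitions(text):
            key = (object_type, tuple(sorted(directives.items())))
            seen[key].append(name)
    return {key: names for key, names in seen.items() if len(names) > 1}


# Hypothetical contents of two of the repos being merged.
files = {
    "repo-a/services.cfg": """
define service {
    use                  generic-service
    hostgroup_name       web
    service_description  SSH
    check_command        check_ssh
}
""",
    "repo-b/services.cfg": """
define service {
    use generic-service   ; same check, different whitespace
    hostgroup_name web
    service_description SSH
    check_command check_ssh
}
define service {
    use generic-service
    hostgroup_name web
    service_description HTTP
    check_command check_http
}
""",
}

duplicates = find_duplicates(files)
for (object_type, directives), names in duplicates.items():
    print(object_type, dict(directives)["service_description"], names)
```

Comparing sorted directive pairs rather than raw text means two copies that differ only in spacing, ordering, or comments are still reported as the same definition.
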
  48. 48. Thanks Murphy!
  49. 49. Super Secret Option!!!
  53. 53. Solution 2: Using Jenkins CI to test changes before production
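
The slides do not show what the Jenkins job runs. One plausible gate, sketched here in Python as an assumption, is to run Nagios's built-in pre-flight check (`nagios -v <config>`) and fail the build unless it exits cleanly and reports zero errors in its `Total Warnings:` / `Total Errors:` summary. The function names, the binary name, and the sample output are illustrative.

```python
import re
import subprocess

SUMMARY_RE = re.compile(r"^Total (Warnings|Errors):\s*(\d+)", re.MULTILINE)


def parse_preflight(output):
    """Extract (warnings, errors) from `nagios -v` style output.

    Either value is None if its summary line is missing.
    """
    counts = {kind.lower(): int(n) for kind, n in SUMMARY_RE.findall(output)}
    return counts.get("warnings"), counts.get("errors")


def config_is_deployable(output, returncode):
    """Pass only if the check exited cleanly and reported zero errors."""
    _, errors = parse_preflight(output)
    return returncode == 0 and errors == 0


def verify(config_path, nagios_bin="nagios"):
    """Run the pre-flight check; binary name and path are assumptions."""
    result = subprocess.run(
        [nagios_bin, "-v", config_path], capture_output=True, text=True
    )
    return config_is_deployable(result.stdout, result.returncode)


# Canned output so the parsing can be exercised without Nagios installed.
sample_ok = "Checking objects...\nTotal Warnings: 2\nTotal Errors:   0\n"
sample_bad = "Checking objects...\nTotal Warnings: 0\nTotal Errors:   3\n"

print(parse_preflight(sample_ok))
print(config_is_deployable(sample_ok, 0), config_is_deployable(sample_bad, 1))
```

Treating a missing summary as a failure keeps a crashed or truncated check from being mistaken for a clean one.
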
  54. 54. Solution 3: Use Deployinator to run Chef recipe to generate automated configs
  57. 57. Solution 4: Use Deployinator to rsync config to all boxes
  63. 63. • git pull repo on deploy host • Run Chef recipe to add automated pieces • Re-run the try-nagios script against that • rsync copy from deploy box to Nagios hosts • Create symlink for nagios.cfg • Restart Nagios
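
The six deploy steps above can be sketched as an ordered plan of commands that stops at the first failure. This is a hedged illustration, not Deployinator's implementation. Every path, host name, and recipe name is hypothetical, `chef-client --override-runlist` is an assumed way of running the recipe, reaching the Nagios hosts over `ssh` is an assumption, and `nagios -v` stands in for the try-nagios script, whose interface the slides do not show.

```python
import subprocess


def build_deploy_plan(repo_dir, nagios_hosts, recipe, live_dir, nagios_cfg):
    """Return the deploy as an ordered list of (description, argv) steps."""
    plan = [
        ("pull repo on deploy host", ["git", "-C", repo_dir, "pull"]),
        ("generate automated config with Chef",
         ["chef-client", "--override-runlist", f"recipe[{recipe}]"]),
        # Stand-in for the try-nagios script mentioned on the slide.
        ("verify the combined config", ["nagios", "-v", f"{repo_dir}/nagios.cfg"]),
    ]
    for host in nagios_hosts:
        plan.append((f"rsync config to {host}",
                     ["rsync", "-a", "--delete",
                      f"{repo_dir}/", f"{host}:{live_dir}/"]))
    for host in nagios_hosts:
        plan.append((f"symlink nagios.cfg on {host}",
                     ["ssh", host, "ln", "-sfn",
                      f"{live_dir}/nagios.cfg", nagios_cfg]))
    for host in nagios_hosts:
        plan.append((f"restart Nagios on {host}",
                     ["ssh", host, "service", "nagios", "restart"]))
    return plan


def run_plan(plan, runner=subprocess.run):
    """Run steps in order; check=True makes a failing step raise and stop."""
    for description, argv in plan:
        print(description)
        runner(argv, check=True)


# Dry run: print the commands instead of executing them.
plan = build_deploy_plan(
    repo_dir="/srv/deploy/nagios-config",
    nagios_hosts=["nagios01.example.com", "nagios02.example.com"],
    recipe="nagios::generated_config",
    live_dir="/etc/nagios/deployed",
    nagios_cfg="/etc/nagios/nagios.cfg",
)
for description, argv in plan:
    print(f"{description}: {' '.join(argv)}")
```

Building the plan as data before running it makes the sequence easy to review or dry-run, and placing verification before any rsync means a broken config never leaves the deploy host.
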
  64. 64. LESSONS LEARNED: Use the tools you have.
  65. 65. Scaling things up!
  70. 70. Core Workers
  73. 73. LESSONS LEARNED: If at first you don’t succeed, rub some webscale on it.
  74. 74. Iterating and Iterating
  75. 75. LESSONS LEARNED: Iterate Iterate Iterate
  76. 76. To Infinity and Beyond
  78. 78. http://github.com/etsy/opsweekly
  82. 82. Final Lessons Learned
  83. 83. • Templates are awesome • Start small • Automation is awesome • Trust but verify • Learn from (y)our mistakes • Iterate on the tools you have
  85. 85. Open Source Summary • http://github.com/etsy/deployinator • http://github.com/etsy/pushbot • http://github.com/etsy/trylib • http://github.com/etsy/opsweekly • http://github.com/etsy/nagios-herald • http://github.com/RJ/irccat
  86. 86. THANK YOU! @beerops - @lozzd Velocity 2016
