Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Sensu and Sensibility - Puppetconf 2014

4,621 views

Published on

As the Yelp infrastructure and engineering team grew, so did the pain of managing Nagios. Problems like splitting alerting across multiple teams, providing high availability and managing nagios systems in multiple environments had become pressing. As we grew towards a service oriented architecture and pushed some services out into the cloud, we rapidly needed more automated monitoring configuration.

An evolutionary solution wasn’t going to solve all of our problems, we needed to revolutionize our monitoring. Sensu is built from the ground up to solve many of our issues and be easy to extend.

This talk covers our puppet ‘monitoring_check’ API (that sets up monitoring for our services within puppet), how and why we deploy Sensu and our custom handlers and escalations, along with how we provide automatic ‘self service’ monitoring for dynamic services and how we deal with the challenges posed by the more ephemeral nature of cloud architectures.

Published in: Software, Technology
  • Nice !! Download 100 % Free Ebooks, PPts, Study Notes, Novels, etc @ https://www.ThesisScientist.com
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Sensu and Sensibility - Puppetconf 2014

  1. 1. Sensu and Sensibility Tomas Doran @bobtfish 2014-­‐09-­‐23
  2. 2. 2 Sensu and Sensibility
  3. 3. Cycle of failure and disappointment • Manually edited and deployed monitoring • Changes require two teams • Low developer visibility about production 3
  4. 4. 4
  5. 5. Cycle of failure and disappointment • Manually edited and deployed monitoring • Changes require two teams • Low developer visibility about production • Escalation of issues is hard • Ops ignore alerts from services • Postmortems 5
  6. 6. 6
  7. 7. Cycle of failure and disappointment • Manually edited and deployed monitoring • Changes require two teams • Low developer visibility about production • Escalation of issues is hard • Ops ignore alerts from services • Postmortems • High friction, low trust, low visibility. 7
  8. 8. “Normality” 8 -­‐ http://gunshowcomic.com/648
  9. 9. “Normality” dysfunctional 9 This is -­‐ http://gunshowcomic.com/648
  10. 10. 10 Sensibility
  11. 11. 11 Sensibility
  12. 12. “51 % viewed their ERP implementation as unsuccessful” 12 The Robbins-Gioia Survey (2001)
  13. 13. The Conference Board Survey (2001) “40 % of the projects failed to achieve their business case within one year of going live” 13
  14. 14. McKinsey & Company in conjunction with the University of Oxford (2012) • “17 percent of large IT projects go so badly that they can threaten the very existence of the company” • “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted” 14
  15. 15. Failure is an option -­‐ blog.parasoft.com/single-­‐greatest-­‐barrier-­‐with-­‐sw-­‐delivery 15
  16. 16. Sensibility 16
  17. 17. 17 Sensibility
  18. 18. Why Sensu? • Designed to be pluggable / extensible • Arbitrary check metadata • Simple model • Components do exactly one thing • Ruby • Not afraid to extend (or fork!) 18
  19. 19. ‘industry standard’ ‘enterprise class’ 19
  20. 20. Cheap shot 20
  21. 21. 21
  22. 22. status.dat cmd.dat 22
  23. 23. cmd.dat 23
  24. 24. 24 Centralized
  25. 25. 25
  26. 26. How we use Sensu • Don’t use all of this! • ‘Standalone’ checks only • Default in the puppet module 26
  27. 27. Sensu data flow • Sensu client runs checks on each machine • Pushes results to RabbitMQ • Clustered, clients/messages will fail over. • Sensu server (multiple, ha) • Processes check results, invokes handlers • Writes state to redis • Redis + sentinel • Read by API (2 instances) • All layers behind haproxy 27
  28. 28. Quis custodiet ipsos custodes? 28 “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
  29. 29. Mutually assured monitoring • Multiple independent Sensu installs (per-datacenter) • Monitor each other! 29
  30. 30. Machine readable config • /etc/sensu/conf.d/checks/check_name.json • Extensible with arbitrary metadata • Hash merge • Never edit by hand! 30
  31. 31. monitoring_check monitoring_check { 'systems-apache-external': page => true, command => "/usr/lib/nagios/plugins/ check_tcp -H ${external_ip_address} -p 443", check_every => ‘5m', alert_after => '30m', realert_every => 10, runbook => 'y/apache', } 31
  32. 32. monitoring_check monitoring_check { 'systems-apache-external': page => true, command => "/usr/lib/nagios/plugins/ check_tcp -H ${external_ip_address} -p 443", check_every => ‘5m', alert_after => '30m', realert_every => 10, runbook => 'y/apache', } 32
  33. 33. monitoring_check monitoring_check { 'systems-apache-external': page => true, command => "/usr/lib/nagios/plugins/ check_tcp -H ${external_ip_address} -p 443", check_every => ‘5m', alert_after => '30m', realert_every => 10, runbook => 'y/apache', } 33
  34. 34. monitoring_check monitoring_check { 'systems-apache-external': page => true, command => "/usr/lib/nagios/plugins/ check_tcp -H ${external_ip_address} -p 443", check_every => ‘5m', alert_after => '30m', realert_every => 10, runbook => 'y/apache', } 34
  35. 35. sensu::check • monitoring_check wraps this • Writes a JSON file for each check • Comment safe 35
  36. 36. "disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/? p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false } 36
  37. 37. "disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/? p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false } 37
  38. 38. "disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/? p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false } 38
  39. 39. "disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/? p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false } 39
  40. 40. "disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/? p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false } 40
  41. 41. "disk_ro_mounts": { "standalone": true, "handlers": [“default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": “-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/? p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": “OPS”, "page": false, "tip": false } 41
  42. 42. Check scripts • Same as nagios checks • Simple (text) output • Exit code • Result sent to server, along with check definition • Including all the custom metadata • Our handlers use the extra data. 42
  43. 43. Handlers • base • JIRA • email • irc • pagerduty • awsprune 43
  44. 44. How do checks get run? • Every machine runs the client. • Client managed by puppet • Client has a TCP socket you can send JSON to • Custom checks + pysensu-yelp 44
  45. 45. 45
  46. 46. Situational awareness 46
  47. 47. Single source of truth • DNS is canonical for sensu servers • Configure things in one place! 47
  48. 48. Single source of truth • DNS is canonical for sensu servers • Configure things in one place! 48
  49. 49. Automatic monitoring • E.g. cron jobs - check successful recently! • cron::d 49
  50. 50. Automatic monitoring • E.g. cron jobs - check successful recently! • cron::d 50
  51. 51. Generate monitoring_check 51
  52. 52. User specified monitoring 52
  53. 53. User specified monitoring 53 • Data lives in the service config • Next to the code to emit metrics!
  54. 54. • Simple checks for free! 54 User specified monitoring
  55. 55. User specified monitoring • Data lives in the service config • Next to the code to emit metrics • Next to metadata about SLAs and LB timeouts • Developers can push without OPS 55
  56. 56. Cluster checks • We’re working on this currently • Assert some % of machines are healthy. • Use to reduce alert noise. • If a service becomes fully unavailable to clients, you want to page someone. • If one machine goes belly up, you don’t (make a JIRA ticket for handling later!) 56
  57. 57. WIP • This is all still a work in progress. • We’ve not 100% migrated off of Nagios • Open sourcing the pieces 57
  58. 58. Thanks! • Slides will be online shortly: • slideshare.net/bobtfish • @bobtfish • Some (most?) of our code is open source: • https://github.com/Yelp/sensu/commit/ aa5c43c2fdfde5e8739952c0b8082000934f3ad2 • https://github.com/Yelp/puppet-monitoring_check • https://github.com/Yelp/puppet-netstdlib • https://github.com/Yelp/sensu_handlers • https://github.com/Yelp/pysensu-yelp 58

×