“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #monitoringlove - PuppetConf 2014
Tomas Doran, Yelp

4,157 views
Published in: Technology
1. Sensu and Sensibility. Tomas Doran, @bobtfish, 2014-09-23.
2. Sensu and Sensibility. I’m part of the SRE team at Yelp. One of my jobs is “don’t break the site, ever”. Another job is to enable developer productivity and fast innovation. These two things can be in conflict.
3. Cycle of failure and disappointment • Manually edited and deployed monitoring • Changes require two teams • Low developer visibility about production. This talk is about one particular instance of this conflict: monitoring. We used Nagios. It sucked. This is half to do with Nagios, half to do with the way we used it.
4. This leads to developers being separated from production. Pager details out of date. Not all hosts running a service monitored, as services move. Permissions issues so developers can’t ack alerts. No sane acks system.
5. Cycle of failure and disappointment • Manually edited and deployed monitoring • Changes require two teams • Low developer visibility about production • Escalation of issues is hard • Ops ignore alerts from services • Postmortems. Ops have a lot of pain too. Alerts are too noisy, and when they’re for services we can’t triage them. Host issues end up with ops sending email to developers@ and praying. Ops get alert fatigue, stuff gets missed, everything is terrible.
6. If monitoring is ‘ops problem’, everything looks on fire all the time. It’s very hard to know what’s actually broken. Lack of situational awareness; coming to expect broken windows stops people taking responsibility.
7. Cycle of failure and disappointment • Manually edited and deployed monitoring • Changes require two teams • Low developer visibility about production • Escalation of issues is hard • Ops ignore alerts from services • Postmortems • High friction, low trust, low visibility. Both sides are actually being reasonable. This isn’t even a Hanlon’s razor situation - everyone is really trying.
8. “Normality” (http://gunshowcomic.com/648). It’s just that the way we’ve built our monitoring system is killing us with a thousand cuts. And we’ve got Stockholm syndrome.
9. “Normality”: this is dysfunctional (http://gunshowcomic.com/648). I’m painting a bleak picture here, not actually saying that everything was _this_ bad in our organization. But these were the types of problems we identified.
10. Sensibility. Monitoring is about enabling communication.
11. Sensibility. One of our core competencies is getting monitoring right! So, we decided to change everything!!!!1111
12. “51% viewed their ERP implementation as unsuccessful” - The Robbins-Gioia Survey (2001). Why the hell would we do that? It’s clearly a massive project.
13. “40% of the projects failed to achieve their business case within one year of going live” - The Conference Board Survey (2001). And pretty high risk. If we screw the monitoring up, well, let’s just not do that?
14. McKinsey & Company in conjunction with the University of Oxford (2012): • “17 percent of large IT projects go so badly that they can threaten the very existence of the company” • “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”. This is actually really scary.
15. Failure is an option (blog.parasoft.com/single-greatest-barrier-with-sw-delivery). You’re not gonna get it right first time. Different teams want to work in different ways. Different environments are different. How do you test your monitoring system?
16. Sensibility. Large team + many teams, decentralized (multiple time zones for some teams). Integration: we can’t pick a product off of a shelf (and get the level of value we need).
17. Sensibility. No big bang change, has to be incremental. We don’t know what our requirements are (beyond that the current system doesn’t meet them). Iteration is absolutely key to project success.
18. Why Sensu? • Designed to be pluggable / extensible • Arbitrary check metadata • Simple model • Components do exactly one thing • Ruby • Not afraid to extend (or fork!). So why did we choose Sensu - Nagios is workable, right? We want to work with the monitoring system to integrate it into our infra, not hack around it.
19. ‘industry standard’, ‘enterprise class’. So we do have / did have Nagios. It’s workable. In fact, it works fine, and scales pretty well (to a point). This is not a hate on Nagios. It _could_ do all the things I talk about here…
20. Cheap shot: it’s ugly.
21. It tries to solve the full-stack monitoring problem. We’d already migrated most contact handling to PagerDuty, rest to follow. Half the objects are useless to us. Monolithic.
22. status.dat, cmd.dat. The data formats are gross.
23. cmd.dat.
24. Centralized. Ephemeral clients are a problem. Whitelisting (needing to explicitly add hosts/services) is a problem. Exported resources are horrible (slow + bad for ephemeral environments).
25. To be fair, this diagram does Sensu no favors at all :)
26. How we use Sensu • Don’t use all of this! • ‘Standalone’ checks only • Default in the puppet module. We don’t use it like this - much simpler model!
27. Sensu data flow • Sensu client runs checks on each machine • Pushes results to RabbitMQ • Clustered; clients/messages will fail over • Sensu server (multiple, HA) • Processes check results, invokes handlers • Writes state to Redis • Redis + Sentinel • Read by API (2 instances) • All layers behind HAProxy.
28. Quis custodiet ipsos custodes? “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.” Nagios does all of these things itself, with no introspection: ‘how deep are my queues, why are things not getting scheduled?’
29. Mutually assured monitoring • Multiple independent Sensu installs (per-datacenter) • Monitor each other! We have a big environment; we run a Sensu per DC, and they can monitor each other.
30. Machine readable config • /etc/sensu/conf.d/checks/check_name.json • Extensible with arbitrary metadata • Hash merge • Never edit by hand! One of (IMO) the nice decisions is the use of JSON for config. JSON is a terrible format for hand-edited config, but we deploy all the config with puppet.
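The “hash merge” bullet refers to how Sensu builds its configuration: every JSON file under conf.d is parsed and deep-merged into one config hash, so puppet can drop one file per check. A minimal Python sketch of that merge behaviour (the file contents below are illustrative fragments, not real config):

```python
def deep_merge(base, override):
    """Recursively merge two dicts; values in `override` win,
    except nested dicts, which are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Two hypothetical conf.d fragments, already parsed from JSON:
check_file = {"checks": {"disk_ro_mounts": {"interval": 60}}}
site_file = {"checks": {"disk_ro_mounts": {"page": False, "team": "operations"}}}

config = deep_merge(check_file, site_file)
# The check definition ends up with keys from both files.
```

Because the merge is per-key, a later file can add metadata (team, page) without restating the whole check definition.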
31. monitoring_check { 'systems-apache-external': page => true, command => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443", check_every => '5m', alert_after => '30m', realert_every => 10, runbook => 'y/apache', } This is our interface to Sensu in puppet. It’s a custom define which applies our business rules.
32. monitoring_check (same snippet). Default to not paging people (for sanity), but turn that on easily. Automatically uses the default team (whoever owns the box). Can be overridden.
33. monitoring_check (same snippet). We didn’t like Sensu’s alert scheduling logic, so we rewrote it :) (This is easy - just in the base class.)
34. monitoring_check (same snippet). Mandatory documentation!
35. sensu::check • monitoring_check wraps this • Writes a JSON file for each check • Comment safe. We do use the official Sensu puppet module. “Comment safe”: if you comment the puppet code out, the check goes away. Working on auto-resolving checks that are deleted now!
36. "disk_ro_mounts": { "standalone": true, "handlers": ["default"], "subscribers": [], "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts", "interval": 60, "alert_after": 0, "realert_every": "-1", "dependencies": [], "runbook": "http://lmgtfy.com/?q=linux+read+only+disk", "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80", "team": "operations", "irc_channels": "operations-notifications", "notification_email": "undef", "ticket": true, "project": "OPS", "page": false, "tip": false } This is what an actual auto-generated check JSON looks like. BIG BLOB OF JSON! Don’t stress, we’ll work through it.
37. (Same JSON; "standalone", "handlers", "subscribers" highlighted.) This looks the same for all of our Sensu checks. This is using ‘simple mode’ and turning off half the features - servers can’t/don’t trigger checks on clients; it’s all client scheduled.
38. (Same JSON; "interval", "alert_after", "realert_every" highlighted.) These are custom (in our base handler), as noted before in the define. Times are converted to seconds (in puppet) so that all time intervals in JSON are seconds.
39. (Same JSON; "runbook" highlighted.) Every check has to have a runbook!
40. (Same JSON; "annotation" highlighted.) Generated by a custom function. It goes up the parser stack and finds where it was called from.
41. (Same JSON; "team" onwards highlighted.) This stuff (more than half the check!) is the custom metadata. Every alert has a team owning it. We can report in IRC, JIRA, email (why? but some people do want this) or page!
42. Check scripts • Same as nagios checks • Simple (text) output • Exit code • Result sent to server, along with check definition • Including all the custom metadata • Our handlers use the extra data. So, to recap: checks are scheduled and run on the client, which pushes its results and definitions to the server via RabbitMQ. This is then all piped to the handlers.
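A check script is just a program with Nagios semantics: a line of text on stdout and an exit code (0 OK, 1 WARNING, 2 CRITICAL). A minimal hypothetical example in Python (the mount point and thresholds are made up for illustration; this is not one of Yelp’s actual checks):

```python
import os
import sys

def check_disk_free(path="/", warn_pct=20, crit_pct=10):
    """Nagios-style check: exit 0/1/2 based on free disk percentage."""
    stat = os.statvfs(path)
    free_pct = 100.0 * stat.f_bavail / stat.f_blocks
    if free_pct < crit_pct:
        print("CRITICAL: %.1f%% free on %s" % (free_pct, path))
        return 2
    if free_pct < warn_pct:
        print("WARNING: %.1f%% free on %s" % (free_pct, path))
        return 1
    print("OK: %.1f%% free on %s" % (free_pct, path))
    return 0

# When run by sensu-client, the exit status is the check result:
# sys.exit(check_disk_free())
```

The client wraps the output, exit code, and the full check definition (metadata included) into the event it sends to RabbitMQ.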
43. Handlers • base • JIRA • email • irc • pagerduty • awsprune.
44. How do checks get run? • Every machine runs the client • Client managed by puppet • Client has a TCP socket you can send JSON to • Custom checks + pysensu-yelp. Check scripts are simple (as per nagios). Can write them in shell/ruby/python/whatever. More complex things can send data to the local socket. We have a python library for this (we also use the ruby libraries from the sensu project).
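The sensu-client of this era listens on a local TCP socket (port 3030 by default) and accepts ad-hoc JSON check results, which flow through the same handler pipeline as scheduled checks. A hedged sketch of the kind of thing pysensu-yelp does (the check name and metadata below are invented for illustration):

```python
import json
import socket

def build_event(name, status, output, **metadata):
    """Build a Sensu check-result dict.
    Extra keyword args become custom metadata for the handlers."""
    return dict(metadata, name=name, status=status, output=output)

def send_event(event, host="localhost", port=3030):
    """Push a JSON check result to the local sensu-client TCP socket."""
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(json.dumps(event).encode("utf-8"))
    finally:
        sock.close()

# Example: report a successful batch job, with Yelp-style metadata.
# send_event(build_event("my_batch_job", 0, "job completed",
#                        team="operations", page=False))
```

This is what makes “user specified monitoring” (later slides) possible: any process on the box can emit a check result without being scheduled by Sensu.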
45. Sensu servers know which machine is the master right now (their own leadership election). Deploy some checks to the Sensu servers (e.g. CloudWatch checks!), run on the master. Fake hostname!
46. Situational awareness. Send alerts about dev box resource usage to the developers using that box. Why page ops because a developer used 90% of the disk?
47. Single source of truth • DNS is canonical for sensu servers • Configure things in one place! One place can be DNS, or hiera, or whatever - but not multiple places. DNS AND hiera sucks.
48. Single source of truth • DNS is canonical for sensu servers • Configure things in one place! puppet-netstdlib, structured facts.
49. Automatic monitoring • E.g. cron jobs - check they succeeded recently! • cron::d. There are a bunch of general patterns where you can automate monitoring. Who hates ‘cron spam’? We use a custom define which defaults output to /dev/null. Check jobs completed successfully (with Sensu) - make JIRA tickets!
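The pattern behind “check the cron job succeeded recently” is simple: the wrapped job touches a state file on success, and a check alerts when that file goes stale. A minimal sketch, not the actual cron::d implementation (the stamp-file convention and statuses are illustrative):

```python
import os
import time

def check_cron_freshness(stamp_path, max_age_seconds):
    """Return (status, output) in Nagios terms: 0 OK, 2 CRITICAL.
    The wrapped cron job is expected to touch `stamp_path` on each
    successful run, so staleness implies missed or failed runs."""
    if not os.path.exists(stamp_path):
        return 2, "CRITICAL: %s missing - job never succeeded?" % stamp_path
    age = time.time() - os.path.getmtime(stamp_path)
    if age > max_age_seconds:
        return 2, "CRITICAL: last success %ds ago" % age
    return 0, "OK: last success %ds ago" % age
```

Because the check carries the usual metadata (team, ticket, runbook), a stale job can quietly open a JIRA ticket instead of spamming everyone’s inbox.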
50. Automatic monitoring • E.g. cron jobs - check they succeeded recently! • cron::d. Generic handling! Annotations!
51. Generate monitoring_check. Under the hood this runs create_resources to generate monitoring_checks. create_resources is your friend!
52. User specified monitoring. This is a cunning one. The check returns OK (assuming it can hit graphite), but also emits a bunch of additional check results to the local socket.
53. User specified monitoring • Data lives in the service config • Next to the code to emit metrics! This is awesome, as it reads our service configs. Developers can add their own alerts.
54. User specified monitoring • Simple checks for free! This example is in ruby :)
55. User specified monitoring • Data lives in the service config • Next to the code to emit metrics • Next to metadata about SLAs and LB timeouts • Developers can push without OPS. Allowing developers to add their own monitoring is awesome. Putting the config for the monitoring in their application codebase is awesome.
56. Cluster checks • We’re working on this currently • Assert some % of machines are healthy • Use to reduce alert noise • If a service becomes fully unavailable to clients, you want to page someone • If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
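The cluster-check idea can be sketched as a pure function: given per-host results for one check, page only when the healthy fraction drops below a threshold, and otherwise just ticket the stragglers. This is an illustrative sketch under stated assumptions, not Yelp’s in-progress implementation (the threshold and status mapping are invented):

```python
def cluster_status(results, min_healthy_pct=50):
    """results: dict of host -> exit status (0 means healthy).
    Returns 2 (CRITICAL, page) if too few hosts are healthy,
    1 (WARNING, ticket) if some individual hosts are down,
    0 (OK) if everything is fine."""
    if not results:
        return 2  # no data at all: treat as an outage
    healthy = sum(1 for status in results.values() if status == 0)
    pct = 100.0 * healthy / len(results)
    if pct < min_healthy_pct:
        return 2
    if healthy < len(results):
        return 1
    return 0
```

The payoff is exactly the noise reduction described above: one dead machine in a big pool becomes a ticket, while a service-wide failure still pages.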
57. WIP • This is all still a work in progress • We’ve not 100% migrated off of Nagios • Open sourcing the pieces.
58. Thanks! • Slides will be online shortly: slideshare.net/bobtfish • @bobtfish • Some (most?) of our code is open source: • https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2 • https://github.com/Yelp/puppet-monitoring_check • https://github.com/Yelp/puppet-netstdlib • https://github.com/Yelp/sensu_handlers • https://github.com/Yelp/pysensu-yelp
