Sensu and Sensibility
Tomas Doran
@bobtfish
2014-09-23
Cycle of failure and disappointment
• Manually edited and deployed monitoring 
• Changes require two teams 
• Low developer visibility about production 
• Escalation of issues is hard 
• Ops ignore alerts from services 
• Postmortems 
• High friction, low trust, low visibility. 
“Normality”
This is dysfunctional
- http://gunshowcomic.com/648
Sensibility
“51% viewed their ERP implementation as unsuccessful”
The Robbins-Gioia Survey (2001)
The Conference Board Survey (2001)
“40% of the projects failed to achieve their business case within one year of going live”
McKinsey & Company in conjunction with the University of Oxford (2012)
• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”
Failure is an option
- blog.parasoft.com/single-greatest-barrier-with-sw-delivery
Sensibility 
Why Sensu? 
• Designed to be pluggable / extensible 
• Arbitrary check metadata 
• Simple model 
• Components do exactly one thing 
• Ruby 
• Not afraid to extend (or fork!) 
‘industry standard’ 
‘enterprise class’ 
Cheap shot 
status.dat 
cmd.dat 
Centralized
How we use Sensu 
• Don’t use all of this! 
• ‘Standalone’ checks only 
• Default in the puppet module 
Sensu data flow 
• Sensu client runs checks on each machine 
• Pushes results to RabbitMQ 
• RabbitMQ is clustered; clients and messages fail over
• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis
• Redis + Sentinel
• Read by API (2 instances)
• All layers behind HAProxy
Quis custodiet ipsos custodes? 
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
Mutually assured monitoring 
• Multiple independent Sensu installs (per-datacenter) 
• Monitor each other! 
Machine readable config 
• /etc/sensu/conf.d/checks/check_name.json 
• Extensible with arbitrary metadata 
• Hash merge 
• Never edit by hand! 
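As a rough illustration of the "hash merge" idea, the sketch below merges several JSON fragments into one configuration by recursing into nested objects. It is a simplified model, not Sensu's own loader: the function names and the sample fragments are made up for this example, and Sensu's exact rules (for instance, how arrays are combined) are not reproduced here.

```python
import json


def deep_merge(base, override):
    """Return a new dict with override merged into base, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the later fragment simply win (an assumption).
            merged[key] = value
    return merged


def merge_fragments(fragments):
    """Merge JSON documents (given as strings) in order into a single config dict."""
    config = {}
    for text in fragments:
        config = deep_merge(config, json.loads(text))
    return config


# Two hypothetical fragments, as they might appear in separate conf.d files.
FRAGMENTS = [
    '{"checks": {"check_disk": {"command": "check_disk -w 20%", "interval": 60}}}',
    '{"checks": {"check_disk": {"team": "operations", "page": false}}}',
]
MERGED = merge_fragments(FRAGMENTS)
```

Because fragments merge rather than overwrite each other, tooling can add metadata such as a team or a paging flag to a check without touching the file that defines its command.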
monitoring_check
monitoring_check { 'systems-apache-external':
  # Page the on-call when this check fails.
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  # Run every 5 minutes; only alert once it has been failing for 30 minutes.
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
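The check above uses human-friendly durations such as '5m' and '30m', while the JSON that Sensu reads expresses intervals in seconds. A minimal sketch of that kind of conversion follows, assuming a simple number-plus-unit format; the function name and the accepted units are assumptions for illustration, not the module's actual code.

```python
import re

# Seconds per unit suffix; the set of supported suffixes is an assumption.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert '5m', '30m', '1h' and similar strings to seconds; pass integers through."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]
```

With this helper, `to_seconds('5m')` gives 300 and `to_seconds('30m')` gives 1800, which are the kinds of values that would end up in fields like "interval" and "alert_after".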
sensu::check 
• monitoring_check wraps this 
• Writes a JSON file for each check 
• Comment safe 
"disk_ro_mounts": {
  "standalone": true,
  "handlers": ["default"],
  "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0,
  "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}
Check scripts 
• Same as Nagios checks
• Simple (text) output 
• Exit code 
• Result sent to server, along with check definition 
• Including all the custom metadata 
• Our handlers use the extra data. 
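To make the contract concrete, here is a small sketch of a Nagios-style check: one line of text output plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It looks for read-only mounts in /proc/mounts-style text. This is not Yelp's check_ro_mounts; the function names and the choice to look only at /dev/ devices are assumptions made for the example.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_read_only_mounts(mounts_text):
    """Return (exit_code, message) for the contents of a /proc/mounts-style file."""
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, _fs_type, options = fields[:4]
        # Only consider real block devices (a simplification for this sketch).
        if not device.startswith("/dev/"):
            continue
        if "ro" in options.split(","):
            read_only.append(mount_point)
    if read_only:
        return CRITICAL, "CRITICAL: read-only mounts: " + ", ".join(read_only)
    return OK, "OK: no read-only mounts"


def main(path="/proc/mounts"):
    """Print the one-line result and return the exit code for the caller to use."""
    try:
        with open(path) as handle:
            code, message = check_read_only_mounts(handle.read())
    except OSError as error:
        code, message = UNKNOWN, f"UNKNOWN: cannot read {path}: {error}"
    print(message)
    return code
```

A real script would finish with `sys.exit(main())` so that the client sees the exit code; the client then ships the output and status to the server together with the check definition.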
Handlers 
• base 
• JIRA 
• email 
• irc 
• pagerduty 
• awsprune 
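A handler of this kind receives the event as JSON, including the check definition with its custom metadata, and decides what to do with it. The sketch below shows one way such routing could look, using the metadata fields from the check definition shown earlier (page, ticket, project, irc_channels). The action strings and function names are hypothetical, and this is not the logic of the Yelp sensu_handlers code.

```python
import json


def as_list(value):
    """Accept a single string or a list of strings; treat empty values as no entries."""
    if not value:
        return []
    return [value] if isinstance(value, str) else list(value)


def choose_actions(event):
    """Map an event's check metadata to a list of notification actions."""
    check = event.get("check", {})
    if check.get("status", 0) == 0:
        # Status 0 means the check is passing again.
        return ["resolve"]
    actions = []
    if check.get("page"):
        actions.append("pagerduty")
    if check.get("ticket"):
        actions.append("jira:" + str(check.get("project")))
    for channel in as_list(check.get("irc_channels")):
        actions.append("irc:" + channel)
    return actions


def handle(stdin_text):
    """Entry point for a pipe-style handler: parse the event JSON and route it."""
    return choose_actions(json.loads(stdin_text))
```

The point of the design is that routing lives in the check's own metadata, so changing who gets paged is a change to the check definition rather than to the handler.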
How do checks get run? 
• Every machine runs the client. 
• Client managed by Puppet
• Client has a TCP socket you can send JSON to 
• Custom checks + pysensu-yelp 
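Sending a result to the client socket can be sketched as below: build a JSON document with a name, status and output (plus any custom metadata) and write it to the local client. The port 3030 default and the helper names are assumptions for this sketch; pysensu-yelp wraps this kind of call, and its actual API is not reproduced here.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Serialise a check result, with optional custom metadata, as a JSON string."""
    result = dict(metadata)
    result.update({"name": name, "status": status, "output": output})
    return json.dumps(result)


def send_result(payload, host="localhost", port=3030, timeout=5.0):
    """Write the JSON payload to the local client's TCP socket (port is an assumption)."""
    with socket.create_connection((host, port), timeout=timeout) as connection:
        connection.sendall(payload.encode("utf-8"))
```

For example, a batch job could call `send_result(build_result("nightly_export", 2, "export failed", team="operations", page=False))` when it fails; the job name and metadata here are illustrative only.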
Situational awareness 
Single source of truth 
• DNS is canonical for Sensu servers
• Configure things in one place! 
Automatic monitoring 
• E.g. cron jobs: check that they ran successfully recently!
• cron::d 
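One way to check that a job "succeeded recently" is to have the job touch a stamp file on success and to alert when that file is too old. The sketch below assumes exactly that convention; the stamp-file approach, the function name and the thresholds are assumptions for illustration, and the slides do not say how the generated check actually works.

```python
import os
import time


def check_recent_success(stamp_path, max_age_seconds, now=None):
    """Return (exit_code, message) based on the age of a job's success stamp file."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(stamp_path)
    except OSError:
        return 2, f"CRITICAL: {stamp_path} is missing; no recorded success"
    if age > max_age_seconds:
        return 2, f"CRITICAL: last success {int(age)}s ago (limit {max_age_seconds}s)"
    return 0, f"OK: last success {int(age)}s ago"
```

Because the check needs only the job name and its schedule, a wrapper around the cron definition could generate it automatically for every job.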
Generate monitoring_check 
User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Simple checks for free!
• Developers can push without OPS
Cluster checks 
• We’re working on this currently 
• Assert some % of machines are healthy. 
• Use to reduce alert noise. 
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
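The cluster-check idea can be sketched as a function that takes the latest status per host and compares the healthy fraction against a threshold. Since the slides describe this as work in progress, everything here (names, the threshold parameter, the input shape) is an assumption for illustration.

```python
def cluster_status(host_statuses, min_healthy_fraction):
    """Return (exit_code, message) for a mapping of host name to check exit code."""
    if not host_statuses:
        return 3, "UNKNOWN: no hosts reported"
    total = len(host_statuses)
    healthy = sum(1 for status in host_statuses.values() if status == 0)
    summary = f"{healthy}/{total} hosts healthy"
    if healthy / total < min_healthy_fraction:
        return 2, "CRITICAL: " + summary
    return 0, "OK: " + summary
```

A paging check would be attached to the cluster result, while the per-host checks would only open tickets, so one failed machine out of many does not wake anyone up.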
WIP 
• This is all still a work in progress. 
• We’ve not 100% migrated off of Nagios 
• Open sourcing the pieces 
Thanks! 
• Slides will be online shortly: 
• slideshare.net/bobtfish 
• @bobtfish 
• Some (most?) of our code is open source: 
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check 
• https://github.com/Yelp/puppet-netstdlib 
• https://github.com/Yelp/sensu_handlers 
• https://github.com/Yelp/pysensu-yelp 

Sensu and Sensibility - Puppetconf 2014
