Sensu @ Yelp!: A Guided Tour

This is a presentation demonstrating how Sensu is used at Yelp to support dynamic infrastructure, and promote self-service monitoring among teams.

Video Part 1:
Video Part 2:



Usage Rights

© All Rights Reserved


Sensu @ Yelp!: A Guided Tour Presentation Transcript

  • Sensu @ Yelp - A Guided Tour Kyle Anderson
  • Disclaimer I’m just a dude. I know that when I watch a presentation by a company that I recognize, I think to myself, “Hmm, $company, I’ve heard of them. They probably have their stuff together. Let’s see what they do…” I’m here to describe, not persuade. I may not have everything together. Just because I have things with “Unit Tests” doesn’t mean I’m “Right”. Especially with a “framework” like Sensu, there can be more than one way to do things. The trick is figuring out what works for you. I hope that by giving a real, concrete example, I might inspire you to step up your monitoring game.
  • Outline
      1. Overall Architecture
      2. Sensu Server Setup
         a. Custom Base Handler
      3. Client Configuration
         a. Sensu Check Puppet Wrapper
      4. Yelp SOA Checks
      5. AWS/Cloudwatch Checks
      6. Dealing with Ephemeral EC2 Servers
      7. Cron Job Monitoring
      8. Future Work
  • Overall Architecture
      ● profile::sensu_client
        ○ Sensu clients connect to RabbitMQ on one of the servers (DNS round robin)
      ● profile::sensu_server
        ○ Base HAProxy install
        ○ RabbitMQ in mirror mode, load balanced via HAProxy
        ○ Redis in master/slave mode, load balanced via HAProxy (only the master passes the health check)
        ○ Sensu server installed, subscribes on RabbitMQ
        ○ API load balanced via HAProxy
        ○ Dashboard load balanced via HAProxy
  • Logical Diagram
  • Puppet Modules in Use
      puppetlabs/rabbitmq
      puppetlabs/haproxy
      kyleanderson/redis_sentinel
      arioch/redis
      sensu/sensu
  • Addressing Complexity “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.” Laurie Denness bit-longer-thank-you-very-much/
  • Addressing Complexity “I will be honest; I haven’t used Sensu, because I’m in a happy place right now, but just the architectural diagram of how it works scares the shit out of me. When you need 7 arrow colours to describe where data is going in a monitoring system, I’m starting to fear it slightly. But hey, if it works, good on you guys. It just looks a lot like this. Nothing wrong with that, if you can make it stable and reliable.” Laurie Denness bit-longer-thank-you-very-much/
  • First Principle: Single Point of Truth
  • Pop Quiz: Which servers are puppetmasters?
      • A: Puppet manifests (include puppetmaster)
      • B: DNS (puppet.local A 10.5.x.x)
      • C: update-live script (for Server in ….)
      • D: The servers that have had the puppetmaster bootstrap script run on them
      • E: What MCollective says (mco find -C puppetmaster)
      Answer: All / None of the above!
  • Sensu Server Detection
      # Use DNS to detect if this server is a sensu server
      $local_sensu_server_array = gethostbyname2array("sensu.local-${::habitat}")
      $ip_address_array = split($::all_ipaddresses, ',')
      validate_array($local_sensu_server_array)
      validate_array($ip_address_array)
      $array_intersection = intersection($ip_address_array, $local_sensu_server_array)
      # If our ipaddresses are in the dns entries, we must be a sensu server!
      if size($array_intersection) > 0 {
        $is_sensu_server = true
      } else {
        $is_sensu_server = false
      }
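The same "DNS is the single point of truth" test can be sketched outside of Puppet. This is only an illustration in Ruby, not the code Yelp runs: the helper names `sensu_server?`, `resolve_ips`, and `local_ips` are made up for this example, and the hostname you pass in is whatever your own round-robin record is called.

```ruby
require 'resolv'
require 'socket'

# Pure decision: we are a Sensu server if any of our own addresses
# appears in the round-robin DNS record for the Sensu cluster.
def sensu_server?(local_ips, dns_ips)
  !(local_ips & dns_ips).empty?
end

# Hypothetical helper: every address the DNS name resolves to.
def resolve_ips(hostname)
  Resolv.getaddresses(hostname)
end

# Hypothetical helper: every address configured on this machine.
def local_ips
  Socket.ip_address_list.map(&:ip_address)
end
```

On a real host the call would look like `sensu_server?(local_ips, resolve_ips(cluster_name))`. Keeping the intersection in its own function makes the decision testable without touching DNS.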
  • HAProxy
      • Every server in the Sensu cluster runs its own HAProxy
      • HAProxy listens on the “standard” ports; individual instances listen on standard + 1
      • Having an array of Sensu servers from DNS allows us to grow the backends
      • If HAProxy dies, clients will re-resolve and reconnect.
  • RabbitMQ
      • Every server in the Sensu cluster runs a RabbitMQ server in mirror mode (with autoheal for AP)
      • Lots of individual clusters, not doing shoveling.
      • Client authentication via SSL client certs (controlled by Puppet)
      • Load balanced by HAProxy
      • Sensu clients automatically reconnect on failure
  • Redis
      • Redis is the persistent store used by Sensu to keep track of heartbeats, what alerts are silenced, how many times a check has failed, etc.
      • Redis is set up in a cluster mode, with redis-sentinel doing automatic master/slave promotion. (Kinda CP)
      • We use the redis-role haproxy master pattern suggestion from availability-sensu/
  • Sensu API + Dashboard
      • sensu-api provides a REST API with JSON output for integration.
      • sensu-cli is provided for easy interactive command-line use.
      • Both the API and Dashboard use basic auth internally (shared secret), and then LDAP+SSL auth externally.
      • sensu-dashboard uses this API, and is behind our external-facing Apache for authentication.
  • Sensu Servers:
      • Automatically does master election, good. Build for 3.
      • Connects to RabbitMQ, pulls events off, and acts on them
      • Runs “handlers” on the event data
      • That’s kinda it
      • Which leads to handlers….
  • Sensu Timing Tunables Before/After
      Custom check definition key-values
      Custom key-values can be added to a check definition, which will be included in event data, enabling handler creativity.
      Common custom check definitions:
      • interval: How frequently (in seconds) the check will be executed
      • occurrences: Number of event occurrences before the handler should take action
      • refresh: Number of seconds handlers should wait before taking second action. Relies on sensu-plugin.
      Yelp Monitoring Check Definition Key Values
      The custom base handler interprets these values:
      • check_every = '5m',
      • alert_after = '0s',
      • realert_every = '1',
  • Custom Base Handler
      def filter_repeated
        interval = @event['check']['interval'] || 0
        alert_after = @event['check']['alert_after'] || 0
        realert_every = @event['check']['realert_every'] || 1
        # How long the check has been failing, in seconds
        failing_for = @event['occurrences'].to_i * @event['check']['interval'].to_i
        if failing_for < alert_after
          bail "Only failing for #{failing_for}, less than #{alert_after}. Not performing any action yet."
        elsif interval > 0 and @event['action'] == 'create'
          # Occurrences that elapsed while we were still inside the alert_after window
          initial_failing_occurrences = alert_after.fdiv(interval).to_i
          number_of_failed_attempts = @event['occurrences'] - initial_failing_occurrences
          unless number_of_failed_attempts == 0 || number_of_failed_attempts % realert_every == 0
            bail 'only handling every ' + realert_every.to_s + ' occurrences'
          end
        end
      end
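To see how the three tunables interact, here is a minimal standalone sketch of the same decision as a pure function. It is an illustration, not the handler itself: the name `should_handle?` is invented for this example, and it assumes `interval` and `alert_after` have already been converted to integer seconds and `realert_every` to a positive integer.

```ruby
# Returns true when a handler should act on this event, false when it
# should bail. Assumes integer seconds for interval and alert_after.
def should_handle?(occurrences:, interval:, alert_after: 0, realert_every: 1, action: 'create')
  failing_for = occurrences * interval
  # Still inside the grace period: stay quiet.
  return false if failing_for < alert_after
  # Resolve events (and checks without an interval) always pass through.
  return true unless interval > 0 && action == 'create'

  # Occurrences that elapsed while we were still inside alert_after.
  initial_failing_occurrences = alert_after.fdiv(interval).to_i
  failed_attempts = occurrences - initial_failing_occurrences
  failed_attempts.zero? || (failed_attempts % realert_every).zero?
end
```

With a 60 second interval, `alert_after` of 600 and `realert_every` of 2, the first action happens on the 10th occurrence, the 11th is skipped, and the 12th acts again.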
  • Other Handlers In Use
      ● IRC (triaged by who is “on-point”)
      ● Email (not a thing)
      ● Pagerduty (handled by “on-call”)
      ● OpsGenie (trialing)
      ● aws_prune (only on ec2 nodes)
      ● motd (sensu-report, not really a handler; used for situational awareness)
      Future Handlers
      ● JIRA (auto create/close a ticket after a while?)
      ● Flapjack?
  • Sensu Clients
      • Almost every server @yelp runs the Sensu client (thank you omnibus packages!)
      • They connect to the round-robin DNS entry local to their zone.
      • All checks are standalone, configured by Puppet
  • Monitoring Check Puppet Wrapper
      define monitoring_check (
        $command,
        $runbook,
        $check_every = '5m',
        $alert_after = '0s',
        $realert_every = '1',
        $irc_channels = undef,
        $tip = false,
        $page = false,
        $wake = true,
        $needs_sudo = false,
        $sudo_user = 'root',
        $team = 'operations',
        $ensure = 'present',
        $dependencies = [],
        $sensu_custom = {},
      ) {
        ……
      Notes:
      • Lots of validation. Lots of tests.
      • Mandatory runbook!
      • Human readable time units!
      • Easy to add sudo rules!
      • TIP: The one line runbook for lazy humans!
      • Team defaults to ops for convenience. Usually set to $::profile::server::team
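The slides do not show how the wrapper turns values like '5m' or '4h' into the integer seconds Sensu expects. A minimal sketch of such a conversion in Ruby could look like this. The function name `human_time_to_seconds` and the supported suffixes (s, m, h, d, with a bare number meaning seconds) are assumptions made for this example only.

```ruby
# Seconds per supported unit suffix (assumed set of units).
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze

# Convert a human readable duration such as '5m' into integer seconds.
# A bare number is treated as seconds; anything else is rejected.
def human_time_to_seconds(value)
  match = /\A(\d+)([smhd])?\z/.match(value.to_s.strip)
  raise ArgumentError, "unrecognized time value: #{value.inspect}" unless match

  match[1].to_i * UNIT_SECONDS.fetch(match[2] || 's')
end
```

Rejecting unknown input loudly, rather than guessing, fits the "lots of validation" spirit of the wrapper: a typo in a check definition should fail at compile time, not silently change alert timing.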
  • Monitoring Check Puppet Wrapper Example
      # Make sure apt-mirroring is working by checking the age of the NEW file left over.
      monitoring_check { 'apt-mirror':
        check_every => '4h',
        team        => 'operations',
        page        => false,
        runbook     => 'y/rb-package-mirroring',
        tip         => 'Talk to kwa. Check /var/spool/apt-mirror/var/cron.log, then /nail/apt-mirror/var/apt-mirror.lock.',
        command     => '/usr/lib/nagios/plugins/check_file_age /nail/apt-mirror/var/NEW -w 86400 -c 172800',
      }
  • Why Not Use The Native Puppet Type?
      ● The wrapper reduces the boilerplate and gives good defaults
      ● Enforces site-specific policies and validation (team names, mandatory runbooks)
      ● Allows us to modify all puppet-controlled Sensu checks in the future from a single spot
      ● Custom tests
      ● Allows us to be backend agnostic (maybe)
  • Yelp SOA Checks
      • How do we (Yelp) empower our developers to monitor their services?
      • How can we safely and conveniently allow devs to define checks within our SOA framework?
      • How can devs not be blocked by Ops for service deployment?
  • Define the Meta Check
      # Defined on all hosts that run yelp SOA infrastructure
      monitoring_check { 'check-yelp_soa':
        check_every => '1m',
        alert_after => '10m',
        page        => true,
        runbook     => 'http://y/rb-check-yelpsoa',
        tip         => 'Run /etc/sensu/plugins/check-yelp_soa.rb --debug to see what is wrong?',
        command     => '/etc/sensu/plugins/check-yelp_soa.rb',
        require     => Class['::yelp_soa']
      }
  • check-yelp_soa.rb redux
      def run
        # TODO: Parallelize?
        configs.each do | service, config |
          next unless services_that_run_here.include?(service)
          $log.debug "Processing #{service} as apparently it runs here"
          srv_configs = read_srv_configs(service)
          next unless srv_configs.include?('monitoring_check')
          monitoring_check = srv_configs['monitoring_check']
          if numeric?(config['port'])
            ...
            if command == 'check_http'
              url = monitoring_check['check_url'] || '/status'
              $log.debug "Making a http check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
              output, status = check_http(port,url,http_expect,warn_timeout,crit_timeout)
            elsif monitoring_check['command'] == 'check_tcp'
              $log.debug "Making a tcp check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
              output, status = check_tcp(port,warn_timeout,crit_timeout)
            else
              $log.debug "Not spawning a check for #{service} because I don't know how to run #{command}"
              next
            end
            send_result_to_sensu(service, status, output, team, runbook, tip, page, alert_after, realert_every, irc_channels)
            services_checked << service
          end # End port check
        end # End for loop
        ok "Finished run. Ran checks on #{services_checked}"
      end
  • What was that?
      • Iterate through the SOA services that are configured to run on a server.
      • Determine if that service has monitoring metadata defined by the authors.
      • Operate on that metadata to check it (usually check_http).
      • Send the results of the check to the localhost:3030 socket as a *Different* check (“soa_$servicename”).
      See for another example
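The last step, handing a result to the local Sensu client socket under a different check name, might be sketched as follows. This is not the body of Yelp's `send_result_to_sensu` (the slides do not show it); it is a reduced illustration with fewer parameters. The Sensu client accepts a JSON document with `name`, `output`, and `status` on its local socket, and `team` and `page` ride along as the custom fields described earlier. The helper name `build_check_result` is invented here.

```ruby
require 'json'
require 'socket'

# Build the result hash for one SOA service. The check name is prefixed
# with "soa_" so it shows up as its own check, separate from the meta check.
def build_check_result(service, status, output, team:, page: false)
  {
    'name'   => "soa_#{service}",
    'status' => status, # 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
    'output' => output,
    'team'   => team,   # custom field read by the base handler
    'page'   => page    # custom field read by the base handler
  }
end

# Write the result as JSON to the Sensu client's local input socket.
def send_result_to_sensu(result, host: 'localhost', port: 3030)
  TCPSocket.open(host, port) do |socket|
    socket.write(JSON.generate(result))
  end
end
```

Splitting the payload builder from the socket write keeps the interesting part (what gets reported, and under which name) testable on a machine with no Sensu client running.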
  • An example service (request_blocking)
      # from request_blocking.yaml
      monitoring_check:
        team: 'infra'
        alert_after: 2m
        realert_every: 2
        irc_channels: 'infra'
        url: '/status'
        tip: "no tips yet"
        warn_timeout: 2.0
        crit_timeout: 5.0
  • AWS/Cloudwatch Checks
      • Pretty much the same thing, except:
      • Checks are executed on special monitoring hosts in the AZ (not on the ephemeral node)
      • Runs graphite/check_data.rb against the provided metric name
      • Written in Python this time! (https://pypi.python.org/pypi/sensu)
  • Dealing with Ephemeral EC2 Nodes
      • Yelp lives in a hybrid world: we have lots of “ephemeral” EC2 nodes that are baked and do NOT run Puppet. Can Sensu still work on them?
      • How do we prevent ourselves from being spammed when hosts go away “normally”?
      • How do we know what a host is without logging into it? (EC2 metadata)
      • Baking………..
  • EC2 Considerations
      • We use Puppet to bake AMIs for ELBs, so we can control (via Puppet) how Sensu is configured at bake time.
      • We can query the AWS API to know if a host has gone away, and prune it from the Queue to squelch alerts.
      • Using custom client metadata, we can add things like puppet cert name, AMI_ID, etc. at runtime with a special init script.
  • For Non-Ephemeral Instances
      if str2bool($::is_ec2) == true {
        $client_custom = {
          'instance_id' => $::ec2_instanceid,
          'keepalive'   => {
            'handlers' => [ 'aws_prune', 'default' ],
            'team'     => $team,
            'page'     => true
          }
        }
      } else {
        $client_custom = { 'team' => $team, 'page' => true }
      }
      Notes:
      • Only EC2 servers need the special aws_prune handler
      • A Fact! Embed it for easy troubleshooting
  • For Ephemeral (baked) Instances
      description "Fix Sensu clientinfo on startup for baked ec2 instances"
      author "Kyle Anderson <>"
      start on starting sensu-client
      task
      script
        ADDRESS=$(curl -s
        AMI_ID=$(curl -s
        INSTANCE_ID=$(curl -s
        /usr/bin/jq " = "$(/usr/local/sbin/puppet-certname)" | .client.address = "$ADDRESS" | .client.instance_id = "$INSTANCE_ID" | .client.ami_id = "$AMI_ID" " /etc/sensu/conf.d/client.json > /etc/sensu/conf.d/newclient.json
        mv /etc/sensu/conf.d/client.json /etc/sensu/conf.d/client.json.old
        mv /etc/sensu/conf.d/newclient.json /etc/sensu/conf.d/client.json
      end script
      Notes:
      • Only run once, right before sensu-client
      • Real data. Can’t lie.
      • Overwrite what we were baked with. It is wrong.
      • jq FTW
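The idea of overwriting the baked-in client data with real runtime values can also be shown without jq. The Ruby sketch below is purely illustrative: the function name `apply_runtime_client_metadata` is hypothetical, and the caller is assumed to have already fetched the real address, instance ID, and AMI ID from wherever it trusts.

```ruby
require 'json'

# Merge runtime metadata into the "client" section of a Sensu client
# config, overwriting whatever was baked into the image and keeping
# every other key untouched.
def apply_runtime_client_metadata(client_json, metadata)
  config = JSON.parse(client_json)
  config['client'] = (config['client'] || {}).merge(metadata)
  JSON.pretty_generate(config)
end
```

As in the upstart job, the safe way to use this is to write the result to a new file and rename it over client.json, so sensu-client never starts against a half-written config.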
  • Pruning Terminated EC2 Nodes
      ● Modification of plugins/blob/master/handlers/other/ec2_node.rb
      ● Instead of calling AWS from the handler, we use a cron job to cache the results of the API call into JSON so we can be nice to AWS
      ● Then we can have *every* check use this handler, as it is easy to just check on disk if the instance_id is active.
      ● Use the instance_id from the client data to figure out who you are (which should be correct from the above).
  • What Does It Look Like?
      file { '/etc/sensu/plugins/cache_instance_list.rb':
        owner  => 'root',
        group  => 'root',
        mode   => '0500',
        source => 'puppet:///modules/profile/sensu/handlers/cache_instance_list.rb',
      } ->
      cron::d { 'cache_instance_list':
        minute  => '*',
        user    => 'root',
        command => "/etc/sensu/plugins/cache_instance_list.rb -a ${access_key} -r ${region} -k ${secret_key}",
      } ->
      monitoring_check { 'cache_instance_list-staleness':
        check_every => '10m',
        alert_after => '1h',
        team        => 'test',
        runbook     => 'y/rb-aws-prune',
        command     => "/usr/lib/nagios/plugins/check_file_age /var/cache/instance_list.json -w 1800 -c 3600",
        page        => false,
      }
  • The Handler (puppet)
      $access_key = hiera('sensu::aws_key')
      $secret_key = hiera('sensu::aws_secret')
      $aws_config_hash = {
        access_key           => $access_key,
        secret_key           => $secret_key,
        region               => $region,
        blacklist_name_array => [ 'bake_soa_ami', 'Packer Builder' ]
      }
      sensu::handler { 'aws_prune':
        type    => 'pipe',
        source  => 'puppet:///modules/profile/sensu/handlers/aws_prune.rb',
        config  => $aws_config_hash,
        require => [ Package['rubygem-fog'], Package['rubygem-sensu-plugin'], Package['rubygem-unf'] ],
      }
      }
  • The Handler (Ruby)
      def ec2_node_exists?
        running_instances = load_instances_cache
        instance_ids = running_instances.collect { |s| Hash[ 'id', s['id'], 'tags', s['tags'] ]}
        my_instance_id = @event['client']['instance_id']
        instance_ids.each do |instance|
          # YELP SPECIFIC CODE
          instance_name = instance['tags']['Name'].to_s
          # Yelp specific: pretend that the node does not exist if we are in our blacklist
          return false if blacklist_name_array.include?(instance_name)
          return true if my_instance_id == instance['id']
        end
        return false # no match found, node doesn't exist
      end
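The slides do not show `load_instances_cache`. One way it could work, given that a cron job writes /var/cache/instance_list.json, is sketched below. Everything beyond that file path is an assumption for illustration: the `max_age` and `now` parameters, and the choice to raise on a stale cache, are not taken from the talk.

```ruby
require 'json'

# Read the instance list written by the caching cron job. Refuses to
# return data older than max_age seconds, so a broken cron job cannot
# cause live clients to be pruned based on an outdated view of EC2.
def load_instances_cache(path = '/var/cache/instance_list.json', max_age: 3600, now: Time.now)
  age = now - File.mtime(path)
  raise "instance cache #{path} is stale (#{age.round}s old)" if age > max_age

  JSON.parse(File.read(path))
end
```

Failing closed here is a design choice: with a stale cache the handler would rather skip pruning and let an alert through than delete a client that is still running.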
  • Cron Job Monitoring
      • I believe cron sending emails is an anti-pattern and not *web-scale*
      • Let’s use Sensu to monitor our cron jobs!
      • Use a combination of a cron puppet type wrapper and my Sensu-Shell-Helper
      • Modified sensu-shell-helper includes fields for team and page for yelp-specific things: https://github.com/solarkennedy/sensu-shell-helper
  • What does it look like?
      $command = 'chgrp -R admin /nail/packages/'
      cron::d { 'fix-packages-permissions':
        mailto  => '',
        minute  => '10',
        user    => 'root',
        comment => 'Make permissions group writable for collaboration purposes',
        command => "sensu-shell-helper -n fix-packages-permissions -p false -t operations ${command}",
        ensure  => 'present'
      }
      See for related work.
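The general pattern behind wrapping a cron job, run the command, capture its output, and turn the exit status into a check result, can be sketched in Ruby. This is not how sensu-shell-helper is implemented; the function name `run_and_build_result` and the mapping of any non-zero exit to CRITICAL (2) are assumptions chosen for this example.

```ruby
require 'open3'

# Run a command (given as an argv array) and describe the outcome as a
# Sensu-style check result. stdout and stderr are captured together so
# the alert carries whatever the job printed.
def run_and_build_result(name, command, team:, page: false)
  output, process_status = Open3.capture2e(*command)
  {
    'name'   => name,
    'output' => output,
    'status' => process_status.success? ? 0 : 2, # assumed mapping
    'team'   => team,
    'page'   => page
  }
end
```

The resulting hash can then be serialized to JSON and written to the local client socket, which is what replaces cron's email: a failing job becomes an event routed by team, instead of a message in root's mailbox.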
  • Future Work
      ● battle-test more of the pagerduty stuff (blocked on bogus aws nodes still)
      ● sort out AWS pruning, harder (#61626)
      ● make tools that work on nagios *and* sensu?
      ● really monitor the sensu instances in nagios with alerts (#60164)
      ● enable self-serve sensu alerts for services (#62201)
      ● make a library for sending passive checks (#62440)
      ● set up infrastructure for “aggregate” checks (cluster checks)
      ● better test the alerting tunables we have (#61628)
      ● enable sensu alerts for Asgardy services (#57450)
      ● set up easy to use metric based alerting (like horsefly, blocked on #67000)
      ● write my sensu-downtime tool
      ● write a super-dashboard (hackathon)
      ● write the sensu archive service (sensu-db?)
  • Thanks!