OSMC 2011 | Case Study - Icinga at Hyves.nl by Jeffrey Lensen

Icinga at Hyves.nl
Jeffrey Lensen
System Engineer

2
Hyves
• Dutch social network website
• 3 billion pageviews / month
• 10M Dutch members (17M population)
• ~7M unique visitors / month (Comscore 09/2011)
• ~2.3M unique visitors / day
• 800,000 photo uploads / day
• 7M chat messages / day
• 6Gbps daily outgoing traffic

3
Hyves environment
• 3000 hosts running Gentoo
• 3 Datacenters
• 190 types of server functions
• 160 Employees
• System Engineering team: 12
• Developers: 45

4
Back in the day
• 1 Datacenter
• 150 servers
• 4 System Engineers
• 1 Nagios instance
• Manual configuration

5
Keep up with serverpark growth
• Popularity required expansion
• Receiving 100 - 200 servers at a time
• Manual configuration became unmanageable

6
Solutions to growth
• Templates for host and hostgroup configurations
• Servicechecks defined per hostgroup
• Automated configuration with scripts (hosts, hostgroups,
servicedependencies)
• Server management database as source
• Servicedependencies generated based on check_name prefix
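
The slides do not show the generator scripts. As a rough sketch of the prefix idea, the snippet below emits one servicedependency object per check whose name starts with a known prefix. The prefix table, the host name and the use of the check name as service description are all assumptions made for illustration.

```python
# Hypothetical prefix -> parent service mapping; the real rules are not in the slides.
PREFIX_PARENTS = {"nrpe_": "NRPE", "snmp_": "SNMP"}

TEMPLATE = """define servicedependency{
    host_name                       %(host)s
    service_description             %(parent)s
    dependent_host_name             %(host)s
    dependent_service_description   %(child)s
    notification_failure_criteria   w,u,c
}
"""


def build_dependencies(host, check_names):
    """Return one servicedependency block per check that has a known prefix."""
    blocks = []
    for check in sorted(check_names):
        for prefix, parent in PREFIX_PARENTS.items():
            if check.startswith(prefix):
                # Assumption: the check name doubles as the service description.
                blocks.append(TEMPLATE % {"host": host, "parent": parent, "child": check})
    return blocks


if __name__ == "__main__":
    print("".join(build_dependencies("web001", ["nrpe_disk", "nrpe_load", "http"])))
```

With this layout a failing NRPE service suppresses notifications for every nrpe_* check on the same host, while checks without a known prefix stay independent.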

7
Keep up with more serverpark growth
• From 1 to 3 datacenters
• Serverpark grew to 1500 hosts
• 1 Nagios host isn’t enough anymore

8
Solutions to more growth
• Distributed Nagios setup consisting of:
  • 1 central Nagios server for alerting and web interface
  • 9 distributed Nagios servers
• Required only minor changes to the configuration scripting
• Distribution based on location and function

9
Watching the watchers
• Monitoring Nagios hosts with Nagios on NOC
• NOC monitored by one of the Nagios hosts
• Monitoring all datacenters from HQ

10
Distributed Nagios scaling problems
• Long reloads due to large configuration (mainly on the central server)
• Freezes during large (network) outages -> no alerting!
• Web interface could no longer load

11
Icinga
• Switched in November 2010
• No more central monitoring server needed
• Standalone web interface
• Database backend
• API
• Rapid development
• Painless migration:
  • sed -i 's/nagios/icinga/g' /etc/nagios/*cfg
  • mv /etc/nagios/* /etc/icinga/

12
Icinga setup
• 12 Icinga hosts
• 1 NOC Icinga host
• 100,000 service checks
• 3,500 hosts

13
Icinga setup
• 2 Icinga-web + database hosts
• Load-balanced database and API
• Easy failover

14
Make use of the API: Overview checks
• Overview checks for hostgroups and services
• Minimizes alerts during large failures
• Python script using API
• Example:
python check_monitoring_overview.py --hostgroup webserver
--service HTTP,HipHop -w 5% -c 10%
All 472 'HTTP', 'HipHop' services for 'mainweb' are OK
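
The script itself is not shown in the slides. The threshold logic behind such an overview check might look like the sketch below. It assumes the service states have already been fetched from the Icinga API; the function names and message format are illustrative only.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios/Icinga plugin exit codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def parse_percent(text):
    """Turn a threshold such as '5%' into the float 5.0."""
    return float(text.rstrip("%"))


def overview_status(states, warn, crit):
    """states: list of service states (0 = OK). Returns (exit_code, message)."""
    total = len(states)
    if total == 0:
        return CRITICAL, "No services found"
    failed = sum(1 for state in states if state != OK)
    if failed == 0:
        return OK, "All %d services are OK" % total
    pct = 100.0 * failed / total
    if pct >= parse_percent(crit):
        code = CRITICAL
    elif pct >= parse_percent(warn):
        code = WARNING
    else:
        code = OK  # some failures, but still below the warning threshold
    return code, "%s: %d of %d services not OK (%.1f%%)" % (LABELS[code], failed, total, pct)


if __name__ == "__main__":
    code, message = overview_status([0] * 470 + [2] * 2, "5%", "10%")
    print(message)
    raise SystemExit(code)
```

Alerting on the percentage instead of on each service is what keeps a large failure from producing hundreds of separate notifications.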

15
Missing monitoring
• Is everything that should be monitored actually being monitored?
• You won’t realize something is missing until it’s too late
• Angry people...

16
Solution: Puppet
Puppet is an open-source next-generation server automation
tool. It is composed of a declarative language for expressing
system configuration, a client and server for distributing it, and a
library for realizing the configuration.
• Modules for each application (Nginx, Postfix, SNMP etc.)
• Roles based on function as set in server management database
• Everything is defined in Puppet

17
Example: Nginx module
class nginx {
  tag("nginx")
  package { "nginx":
    ensure   => "latest",
    category => "www-servers"
  }
  service { "nginx":
    enable => true,
    ensure => running
  }
}

18
Example: Role module
class role::webserver inherits role {
  include nginx
}

19
Using Puppet to generate configs
• Supports “Nagios” Exported Resources
• Exported Resources stored in MySQL backend
• Define nagios_services in the matching modules

20
Include monitoring in NGINX module
modules/nginx/manifests/init.pp:
class nginx {
  tag("nginx")
  <snip>
  @@nagios_service { "HTTP $hostname":
    service_description => "HTTP",
    check_command       => "check_web_http",
    event_handler       => "service_restart!nginx!CRITICAL",
    contact_groups      => "sysadmins"
  }
}

21
Predefine defaults in defines.pp
$__notifications_enabled = $systemstatus ? {
  operational => "1",
  fail        => "0"
}
Nagios_service {
  ensure                => present,
  host_name             => "${hostname}.${domain}",
  use                   => "generic-service",
  notifications_enabled => $__notifications_enabled,
  target                => "/etc/icinga/puppetgenerated/services/${hostname}.cfg",
  notes                 => $monitoringhost
}

22
Nagios_host {
ensure => present,
host_name => $hostname.$domain,
hostgroups => $role,
use => "generic-host",
alias => $hostname,
notifications_enabled => $__notifications_enabled,
target => "/etc/icinga/puppetgenerated/hosts/$hostname.cfg",
notes => $monitoringhost
}
Predefine defaults in defines.pp

Deﬁne host in monitoring module
23
modules/monitoring/manifests/init.pp:
class monitoring {
@@nagios_host { "$hostname":
address => $ip
}
}
modules/role/manifests/init.pp:
class role {
include monitoring
}

24
Retrieving resources
class icinga {
  tag("icinga")
  Nagios_host <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/hosts"]
  }
  Nagios_service <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/services"]
  }
}

25
Checking generated configuration
class icinga {
  <snip>
  exec { "verify new cfg":
    command => "/usr/bin/icinga -v /etc/icinga/verify-puppetgenerated.cfg",
    require => Class["get_icinga_puppet_resources"]
  }
  exec { "mv cfgs":
    command => "rm -rf /etc/icinga/puppet/*; mv /etc/icinga/puppetgenerated/* /etc/icinga/puppet/",
    require => Exec["verify new cfg"]
  }
  exec { "restart icinga":
    command => "/usr/bin/printf '[] RESTART_PROGRAM\n' > /var/icinga/rw/icinga.cmd",
    require => [
      Exec["mv cfgs"],
      Service["icinga"]
    ]
  }
}

26
Problems exporting resources
• Puppet runs on Icinga hosts took between 10 and 30 minutes!
• Makes it hard to quickly change monitoring
• Most time spent retrieving and processing (Nagios) resources

27
get_icinga_puppet_resources.py
• Determined queries used by Puppet
• Get all resource IDs
• For each ID get parameter name and value
• Write to defined file (“target”)
• Finishes in 15 seconds!
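
The actual get_icinga_puppet_resources.py is not included in the slides. The steps above can be sketched as follows, using an in-memory SQLite database in place of the MySQL backend. The table and column names are a simplified assumption loosely modelled on a stored-configs layout, and the sample host is made up.

```python
import sqlite3
from collections import defaultdict

# Puppet-only parameters that must not end up in the Icinga object definition.
PUPPET_ONLY = {"ensure", "target"}


def fetch_resources(conn, restype):
    """Return [(title, {param: value})] for all exported resources of one type."""
    ids = conn.execute(
        "SELECT id, title FROM resources WHERE restype = ? AND exported = 1 ORDER BY id",
        (restype,),
    ).fetchall()
    resources = []
    for res_id, title in ids:
        # One query per resource ID to collect its parameter names and values.
        rows = conn.execute(
            "SELECT n.name, v.value FROM param_values v "
            "JOIN param_names n ON n.id = v.param_name_id "
            "WHERE v.resource_id = ?",
            (res_id,),
        ).fetchall()
        resources.append((title, dict(rows)))
    return resources


def render(resources, object_type):
    """Render Nagios-style object definitions, grouped by their 'target' file."""
    files = defaultdict(list)
    for _title, params in resources:
        lines = ["define %s{" % object_type]
        for name in sorted(params):
            if name not in PUPPET_ONLY:
                lines.append("    %s %s" % (name, params[name]))
        lines.append("}")
        files[params["target"]].append("\n".join(lines))
    return {target: "\n\n".join(blocks) + "\n" for target, blocks in files.items()}


def demo_db():
    """Build a tiny in-memory database with one exported Nagios_host (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE resources (id INTEGER PRIMARY KEY, title TEXT, restype TEXT, exported INTEGER);
        CREATE TABLE param_names (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE param_values (resource_id INTEGER, param_name_id INTEGER, value TEXT);
        INSERT INTO resources VALUES (1, 'web001', 'Nagios_host', 1);
        INSERT INTO param_names VALUES (1, 'address'), (2, 'target'), (3, 'ensure'), (4, 'host_name');
        INSERT INTO param_values VALUES
            (1, 1, '10.0.0.1'),
            (1, 2, '/etc/icinga/puppetgenerated/hosts/web001.cfg'),
            (1, 3, 'present'),
            (1, 4, 'web001.example');
    """)
    return conn


if __name__ == "__main__":
    for target, text in render(fetch_resources(demo_db(), "Nagios_host"), "host").items():
        print("# %s\n%s" % (target, text))
```

A real script would write each rendered text to its target path instead of printing it.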

28
Retrieving resources ourselves
class icinga {
  <snip>
  exec { "get_icinga_puppet_resources":
    command => "/usr/bin/python /usr/local/bin/get_icinga_puppet_resources.py",
    require => [
      File["/etc/icinga/puppetgenerated/hosts"],
      File["/etc/icinga/puppetgenerated/services"]
    ]
  }
}

29
Other cool stuff to do with Puppet
• Generate daemon checks for servers based on config file
• Generate overview daemon checks using Icinga API

30
Retrieve daemons from config
modules/role/lib/facter/hyvesfacters.rb:
Facter.add("hyves_daemons") do
  setcode do
    # Start with an empty list so hosts without a daemons.conf return no daemons
    daemonarray = []
    if File.exist?("/<path_to_config>/daemons.conf")
      daemonconf = %x{grep name /<path_to_config>/daemons.conf}
      daemonconf.each_line do |daemon|
        # Strip everything up to and including "name:" and keep the daemon name
        daemonarray.push(daemon.sub(/.*name:\s*/, '').chomp)
      end
    end
    daemonarray.uniq
  end
end

31
Create services for daemons
modules/daemons/manifests/init.pp:
class daemons {
  define add_daemon_check {
    @@nagios_service { "$name Daemon $hostname":
      use                 => "Daemon-check",
      service_description => "$name Daemon",
      check_command       => "check_daemon!$name"
    }
  }
  add_daemon_check { $hyves_daemons: }
}

32
Retrieving unique daemons from API
require 'net/http'
module Puppet::Parser::Functions
  newfunction(:get_daemons, :type => :rvalue, :doc => "
    This function returns an array of all current daemons, based on the Icinga API
  ") do |args|
    domain = "<icinga-web_url>"
    url = "/icinga-web/web/api/service/filter[AND(SERVICE_NAME%7Clike%7C*Daemon)]/columns[SERVICE_NAME]/order[SERVICE_NAME;ASC]/authkey=<api_key>/json"
    response = Net::HTTP.get_response(domain, url)
    data = response.body
    results = PSON.parse(data)
    daemons = Array.new
    results.each { |result|
      daemon = result['SERVICE_NAME']
      daemon.sub!(/ Daemon/, '')
      daemons << daemon
    }
    daemons.uniq
  end
end

33
Create overview services for daemons
modules/icinga/manifests/noc.pp:
$__daemons = get_daemons()
templatefile { "/etc/icinga/puppetgenerated/other/daemons.cfg":
  template => template("icinga/hyvesdaemons.cfg.erb")
}
hyvesdaemons.cfg.erb:
<% __daemons.each do |daemon| -%>
define service{
  use                 DaemonOverview-check
  host_name           daemons
  service_description <%= daemon %>
}
<% end -%>

34
Deployment
• Deploy script to start Puppet runs on all monitoring hosts
• Reports status of Puppet runs once they’re finished
• Starts Puppet run on NOC monitoring host

35
What if a machine doesn’t run Puppet?
• A check that verifies the monitoring configuration
• Retrieve all operational hosts from the server management DB
• Retrieve all hosts from the Icinga API
• Alert if something is missing or notifications are off
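
The comparison described above comes down to a set difference. The sketch below assumes both host lists have already been retrieved; the function name, input shapes and message format are illustrative, not taken from the real check.

```python
OK, CRITICAL = 0, 2  # Nagios/Icinga plugin exit codes


def check_coverage(operational_hosts, icinga_hosts):
    """Compare the server management DB against Icinga.

    operational_hosts: iterable of host names marked operational in the DB.
    icinga_hosts: dict of host name -> True if notifications are enabled.
    Returns (exit_code, message).
    """
    expected = set(operational_hosts)
    missing = sorted(expected - set(icinga_hosts))
    muted = sorted(h for h in expected & set(icinga_hosts) if not icinga_hosts[h])
    if missing or muted:
        parts = []
        if missing:
            parts.append("not monitored: " + ", ".join(missing))
        if muted:
            parts.append("notifications off: " + ", ".join(muted))
        return CRITICAL, "CRITICAL: " + "; ".join(parts)
    return OK, "OK: all %d operational hosts are monitored" % len(expected)


if __name__ == "__main__":
    print(check_coverage(["web001", "db001"], {"web001": False})[1])
```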

36
What about failover?
• Requires a Puppet run on all servers
• Speed up Puppet “runs” with --noop
• Redeploy Icinga

37
ICL (Icinga CommandLine)
• Python based script
• Libraries for access to Icinga API and MK_Livestatus
• Library for things like translating exit codes, and statuses
• See host/service status information
• Control monitoring and alerting
• Quickly see open problems
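
ICL had not been published at the time of the talk, so the following is only a guess at what the translation library and the open-problems view could look like. The state codes are the standard Nagios/Icinga ones; the function names and the shape of the service records are assumptions.

```python
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def state_name(code, is_host=False):
    """Translate a numeric state into its name; unrecognized codes map to UNKNOWN."""
    table = HOST_STATES if is_host else SERVICE_STATES
    return table.get(code, "UNKNOWN")


def open_problems(services):
    """services: list of dicts with 'host', 'service' and 'state' keys (assumed shape)."""
    problems = [s for s in services if s["state"] != 0]
    problems.sort(key=lambda s: (s["host"], s["service"]))
    return ["%s / %s: %s" % (s["host"], s["service"], state_name(s["state"]))
            for s in problems]


if __name__ == "__main__":
    rows = [
        {"host": "web002", "service": "HTTP", "state": 2},
        {"host": "web001", "service": "HTTP", "state": 0},
    ]
    print("\n".join(open_problems(rows)))
```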

38
Integration with other tools
• Integration with server administration script to change status
  • Fail -> disable notifications
  • Operational -> check if everything is OK + enable notifications
  • Deprecated -> disable notifications + remove from Puppet DB
• Integration with failover scripts
• Deploy monitoring when adding new servers
• Scripts can check status of hosts and services before continuing

40
Plans for the (near) future
• Upgrade Icinga to 1.6
• Clean up ICL and make compatible with Icinga 1.6
• Put ICL on GitHub
• Expose API to developers
• Trend analysis / integration with Ganglia/Graphite

41
Thank you, questions?
Puppet: http://puppetlabs.com/
Github: https://github.com/hyves-org/
Email: jeffrey@hyves.nl
Hyves: http://skyler.hyves.nl/
Twitter: @0skyler0
