Icinga at Hyves.nl
Jeffrey Lensen
System Engineer
2
Hyves
• Dutch social network website
• 3 billion pageviews / month
• 10M Dutch members (17M population)
• ~7M unique visitors / month (Comscore 09/2011)
• ~2.3M unique visitors / day
• 800,000 photo uploads / day
• 7M chat messages / day
• 6Gbps daily outgoing traffic
3
Hyves environment
• 3000 hosts running Gentoo
• 3 Datacenters
• 190 types of server functions
• 160 Employees
• System Engineering team: 12
• Developers: 45
4
Back in the day
• 1 Datacenter
• 150 servers
• 4 System Engineers
• 1 Nagios instance
• Manual configuration
5
Keep up with serverpark growth
• Popularity required expansion
• Receiving 100 - 200 servers at a time
• Manual configuration became unmanageable
6
Solutions to growth
• Templates for host and hostgroup configurations
• Servicechecks defined per hostgroup
• Automated configuration with scripts (hosts, hostgroups, servicedependencies)
• Server management database as source
• Servicedependencies generated based on check_name prefix
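The script-driven generation described on this slide could look roughly like the following Python sketch. The record fields (`name`, `address`, `hostgroup`), the sample data and the rule that a check named `<prefix>_...` depends on a check named `<prefix>` are illustrative assumptions, not the actual Hyves scripts.

```python
# Hypothetical records, standing in for rows from the server management database.
SERVERS = [
    {"name": "web001", "address": "10.0.0.1", "hostgroup": "webserver"},
    {"name": "db001", "address": "10.0.1.1", "hostgroup": "database"},
]


def render_host(server, template="generic-host"):
    """Render one Nagios host definition that inherits from a template."""
    return (
        "define host {\n"
        f"    use         {template}\n"
        f"    host_name   {server['name']}\n"
        f"    address     {server['address']}\n"
        f"    hostgroups  {server['hostgroup']}\n"
        "}\n"
    )


def dependencies_by_prefix(check_names):
    """Pair each check with the check named after its prefix (assumed rule).

    A check called "mysql_replication" is made dependent on a check called
    "mysql" when both exist. Returns (parent, dependent) tuples.
    """
    names = set(check_names)
    pairs = []
    for name in sorted(names):
        prefix = name.split("_", 1)[0]
        if prefix != name and prefix in names:
            pairs.append((prefix, name))
    return pairs
```

Keeping the rendering in small functions means a batch of 100 to 200 new servers only needs new database rows, not hand-written configuration.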
7
Keep up with more serverpark growth
• From 1 to 3 datacenters
• Serverpark grew to 1500 hosts
• 1 Nagios host isn’t enough anymore
8
Solutions to more growth
• Distributed Nagios setup consisting of:
• 1 Central Nagios server for alerting and webinterface
• 9 Distributed Nagios servers
• Required only minor changes to the configuration scripts
• Distribution based on location and function
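One way to express "distribution based on location and function" is a lookup that prefers a function-specific monitoring server and falls back to the datacenter default. The sketch below is only an illustration: the server names, the datacenter labels and the fallback rule are all assumptions.

```python
# Hypothetical mapping of (datacenter, function) to a monitoring server.
# A function of None marks the default monitor for that datacenter.
MONITORS = {
    ("dc1", "webserver"): "mon-dc1-web",
    ("dc1", None): "mon-dc1",
    ("dc2", None): "mon-dc2",
}


def assign_monitor(datacenter, function, monitors=MONITORS):
    """Return the monitoring server responsible for a host."""
    specific = monitors.get((datacenter, function))
    if specific is not None:
        return specific
    # No dedicated monitor for this function: use the datacenter default.
    return monitors[(datacenter, None)]
```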
9
Watching the watchers
• Monitoring Nagios hosts with Nagios on NOC
• NOC monitored by one of the Nagios hosts
• Monitoring all datacenters from HQ
10
Distributed Nagios scaling problems
• Long reloads due to large configuration (mainly Central server)
• Freezes during large (network) outages -> No alerting!
• Webinterface could no longer load
11
Icinga
• Switched in November 2010
• No more central monitoring server needed
• Standalone web interface
• Database backend
• API
• Rapid development
• Painless migration:
• sed -i 's/nagios/icinga/g' /etc/nagios/*cfg
• mv /etc/nagios/* /etc/icinga/
• 12 Icinga hosts
• 1 NOC Icinga host
• 100,000 service checks
• 3,500 hosts
12
Icinga setup
• 2 Icinga-web + database hosts
• Loadbalanced database and API
• Easy failover
13
Icinga setup
14
Make use of the API: Overview checks
• Overview checks for hostgroups and services
• Minimizes alerts during large failures
• Python script using API
• Example:
python check_monitoring_overview.py --hostgroup webserver --service HTTP,HipHop -w 5% -c 10%
All 472 'HTTP', 'HipHop' services for 'mainweb' are OK
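The threshold logic behind such an overview check can be sketched as follows. This is not the actual `check_monitoring_overview.py`: it works on a plain list of service states instead of calling the Icinga API, and treating the `-w` and `-c` percentages as inclusive lower bounds is an assumption.

```python
# Standard plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def overview_status(states, warn_pct, crit_pct):
    """Aggregate many service states (0 = OK) into one check result.

    Returns (return_code, message). The overview only alerts when the share
    of non-OK services reaches the warning or critical percentage.
    """
    total = len(states)
    if total == 0:
        return UNKNOWN, "No services found"
    failed = sum(1 for state in states if state != OK)
    failed_pct = 100.0 * failed / total
    if failed_pct >= crit_pct:
        return CRITICAL, f"{failed} of {total} services are not OK"
    if failed_pct >= warn_pct:
        return WARNING, f"{failed} of {total} services are not OK"
    return OK, f"All {total} services are OK"
```

With one overview service per hostgroup, a large failure produces a single alert instead of one per host.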
15
Missing monitoring
• Is everything that should be monitored actually being monitored?
• Won’t realize until it’s too late
• Angry people..
16
Solution: Puppet
Puppet is an open-source next-generation server automation
tool. It is composed of a declarative language for expressing
system configuration, a client and server for distributing it, and a
library for realizing the configuration.
• Modules for each application (Nginx, Postfix, SNMP etc.)
• Roles based on function as set in server management database
• Everything is defined in Puppet
17
Example: Nginx module
class nginx {
  tag("nginx")
  package { "nginx":
    ensure   => "latest",
    category => "www-servers"
  }
  service { "nginx":
    enable => true,
    ensure => running
  }
}
18
Example: Role module
class role::webserver inherits role {
  include nginx
}
19
Using Puppet to generate configs
• Supports “Nagios” Exported Resources
• Exported Resources stored in MySQL backend
• Define nagios_services in the matching modules
20
Include monitoring in NGINX module
modules/nginx/manifests/init.pp:
class nginx {
  tag("nginx")
  <snip>
  @@nagios_service { "HTTP $hostname":
    service_description => "HTTP",
    check_command       => "check_web_http",
    event_handler       => "service_restart!nginx!CRITICAL",
    contact_groups      => "sysadmins"
  }
}
21
Predefine defaults in defines.pp
$__notifications_enabled = $systemstatus ? {
  operational => "1",
  fail        => "0"
}
Nagios_service {
  ensure                => present,
  host_name             => "${hostname}.${domain}",
  use                   => "generic-service",
  notifications_enabled => $__notifications_enabled,
  target                => "/etc/icinga/puppetgenerated/services/$hostname.cfg",
  notes                 => $monitoringhost
}
22
Predefine defaults in defines.pp
Nagios_host {
  ensure                => present,
  host_name             => "${hostname}.${domain}",
  hostgroups            => $role,
  use                   => "generic-host",
  alias                 => $hostname,
  notifications_enabled => $__notifications_enabled,
  target                => "/etc/icinga/puppetgenerated/hosts/$hostname.cfg",
  notes                 => $monitoringhost
}
23
Define host in monitoring module
modules/monitoring/manifests/init.pp:
class monitoring {
  @@nagios_host { "$hostname":
    address => $ip
  }
}
modules/role/manifests/init.pp:
class role {
  include monitoring
}
24
Retrieving resources
class icinga {
  tag("icinga")
  Nagios_host <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/hosts"]
  }
  Nagios_service <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/services"]
  }
}
25
Checking generated configuration
class icinga {
  <snip>
  exec { "verify new cfg":
    command => "/usr/bin/icinga -v /etc/icinga/verify-puppetgenerated.cfg",
    require => Class["get_icinga_puppet_resources"]
  }
  exec { "mv cfgs":
    command => "rm -rf /etc/icinga/puppet/*; mv /etc/icinga/puppetgenerated/* /etc/icinga/puppet/",
    require => Exec["verify new cfg"]
  }
  exec { "restart icinga":
    command => "/usr/bin/printf '[] RESTART_PROGRAM\n' > /var/icinga/rw/icinga.cmd",
    require => [
      Exec["mv cfgs"],
      Service["icinga"]
    ]
  }
}
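Icinga and Nagios read external commands from the command pipe in the form `[timestamp] COMMAND`, where the timestamp is in Unix epoch seconds. A minimal Python sketch of the restart step follows; the pipe path is the one shown on the slide, while the function names are made up for illustration.

```python
import time

COMMAND_FILE = "/var/icinga/rw/icinga.cmd"  # path as shown on the slide


def format_command(command, now=None):
    """Build one external command line: the epoch timestamp in brackets,
    a space, the command name and a trailing newline."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] {command}\n"


def send_command(command, command_file=COMMAND_FILE):
    """Write one command to the external command pipe."""
    with open(command_file, "w") as pipe:
        pipe.write(format_command(command))
```

Calling `send_command("RESTART_PROGRAM")` on a monitoring host would ask the daemon to restart, provided the pipe exists and is writable by the calling user.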
26
Problems exporting resources
• Puppet runs on Icinga hosts took between 10 and 30 minutes!
• Makes it hard to quickly change monitoring
• Most time spent retrieving and processing (Nagios) resources
27
get_icinga_puppet_resources.py
• Determined queries used by Puppet
• Get all resource IDs
• For each ID get parameter name and value
• Write to defined file (“target”)
• Finishes in 15 seconds!
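The steps on this slide can be sketched in Python as below. This is not the actual `get_icinga_puppet_resources.py`: the rows are hard-coded sample data (including the `example.org` names) standing in for the database query results, and the function returns the file contents per target instead of writing them to disk.

```python
from collections import defaultdict

# Hypothetical rows as (resource_id, resource_type, parameter, value),
# standing in for what the queries against the Puppet database return.
ROWS = [
    (1, "nagios_host", "host_name", "web001.example.org"),
    (1, "nagios_host", "address", "10.0.0.1"),
    (1, "nagios_host", "target", "/etc/icinga/puppetgenerated/hosts/web001.cfg"),
    (2, "nagios_service", "host_name", "web001.example.org"),
    (2, "nagios_service", "service_description", "HTTP"),
    (2, "nagios_service", "target", "/etc/icinga/puppetgenerated/services/web001.cfg"),
]


def group_by_resource(rows):
    """Collect the parameters of each resource ID into one dictionary."""
    resources = defaultdict(lambda: {"type": None, "params": {}})
    for resource_id, resource_type, parameter, value in rows:
        resources[resource_id]["type"] = resource_type
        resources[resource_id]["params"][parameter] = value
    return resources


def render_configs(rows):
    """Return a mapping of target file path to generated object definitions."""
    configs = defaultdict(str)
    for resource in group_by_resource(rows).values():
        params = dict(resource["params"])
        target = params.pop("target")  # "target" names the output file
        object_type = resource["type"].replace("nagios_", "", 1)
        lines = [f"define {object_type} {{"]
        lines += [f"    {name}  {value}" for name, value in sorted(params.items())]
        lines.append("}")
        configs[target] += "\n".join(lines) + "\n"
    return dict(configs)
```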
28
Retrieving resources ourselves
class icinga {
  <snip>
  exec { "get_icinga_puppet_resources":
    command => "/usr/bin/python /usr/local/bin/get_icinga_puppet_resources.py",
    require => [
      File["/etc/icinga/puppetgenerated/hosts"],
      File["/etc/icinga/puppetgenerated/services"]
    ]
  }
}
29
Other cool stuff to do with Puppet
• Generate daemon checks for servers based on config file
• Generate overview daemon checks using Icinga API
30
Retrieve daemons from config
modules/role/lib/facter/hyvesfacters.rb:
Facter.add("hyves_daemons") do
  setcode do
    # Default value for hosts without a daemon configuration file
    daemons = ["None"]
    if File.exist?("/<path_to_config>/daemons.conf")
      daemons = []
      daemonconf = %x{grep name /<path_to_config>/daemons.conf}
      # Strip everything up to and including " name:" from each matching line
      daemonconf.each_line do |daemon|
        daemons.push(daemon.sub(/.* name:/, '').chomp)
      end
    end
    daemons.uniq
  end
end
31
Create services for daemons
modules/daemons/manifests/init.pp:
class daemons {
  define add_daemon_check {
    @@nagios_service { "$name Daemon $hostname":
      use                 => "Daemon-check",
      service_description => "$name Daemon",
      check_command       => "check_daemon!$name"
    }
  }
  add_daemon_check { $hyves_daemons: }
}
32
Retrieving unique daemons from API
require 'net/http'
module Puppet::Parser::Functions
  newfunction(:get_daemons, :type => :rvalue, :docs => "
    This function returns an array of all current daemons, based on the Icinga API
  ") do |args|
    domain = "<icinga-web_url>"
    url = "/icinga-web/web/api/service/filter[AND(SERVICE_NAME%7Clike%7C*Daemon)]/columns[SERVICE_NAME]/order[SERVICE_NAME;ASC]/authkey=<api_key>/json"
    response = Net::HTTP.get_response(domain, url)
    data = response.body
    results = PSON.parse(data)
    daemons = Array.new
    results.each { |result|
      daemon = result['SERVICE_NAME']
      daemon.sub!(/ Daemon/, '')
      daemons << daemon
    }
    daemons.uniq
  end
end
33
Create overview services for daemons
modules/icinga/manifests/noc.pp:
$__daemons = get_daemons()
templatefile { "/etc/icinga/puppetgenerated/other/daemons.cfg":
  template => template("icinga/hyvesdaemons.cfg.erb")
}
hyvesdaemons.cfg.erb:
<% __daemons.each do |daemon| -%>
define service{
  use                  DaemonOverview-check
  host_name            daemons
  service_description  <%= daemon %>
}
<% end -%>
34
Deployment
• Deploy script to start Puppet runs on all monitoring hosts
• Reports status of Puppet runs once they’re finished
• Starts Puppet run on NOC monitoring host
35
What if a machine doesn’t run Puppet?
• Check to check configuration
• Retrieve all operational hosts from server management DB
• Retrieve all hosts from Icinga API
• Alert if something is missing or notifications are off
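The comparison behind this "check to check configuration" can be sketched as a set difference. In the sketch below both inputs are passed in as plain Python values; how they are fetched from the server management DB and the Icinga API is left out, and the function name and data shapes are assumptions.

```python
def find_monitoring_gaps(operational_hosts, icinga_hosts):
    """Compare the hosts that should be monitored with what Icinga knows.

    operational_hosts: iterable of host names marked operational in the DB.
    icinga_hosts: dict of host name -> True if notifications are enabled.
    Returns (missing, silenced) as sorted lists of host names.
    """
    expected = set(operational_hosts)
    missing = sorted(expected - set(icinga_hosts))
    silenced = sorted(
        name for name in expected
        if name in icinga_hosts and not icinga_hosts[name]
    )
    return missing, silenced
```

A wrapper script could then exit with a CRITICAL return code whenever either list is non-empty.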
36
What about failover?
• Requires puppet run on all servers
• Speed up puppet “runs” with --noop
• Redeploy Icinga
37
ICL (Icinga CommandLine)
• Python based script
• Libraries for access to Icinga API and MK_Livestatus
• Library for things like translating exit codes, and statuses
• See host/service status information
• Control monitoring and alerting
• Quickly see open problems
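As an illustration of the "translating exit codes and statuses" helper, the sketch below maps the standard numeric states to names and filters a list down to open problems. It is not taken from ICL; the function names and the tuple layout are assumptions.

```python
# Standard numeric states used by Nagios and Icinga.
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def service_state_name(code):
    """Translate a service state or plugin exit code into its name."""
    return SERVICE_STATES.get(code, "UNKNOWN")


def open_problems(services):
    """Return (host, service, state_name) for every service that is not OK.

    services: iterable of (host, service_description, state_code) tuples.
    """
    return [
        (host, description, service_state_name(state))
        for host, description, state in services
        if state != 0
    ]
```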
38
Integration with other tools
• Integration with server administration script to change status
• Fail -> disable notifications
• Operational -> check if everything is OK + enable notifications
• Deprecated -> disable notification + remove from Puppet DB
• Integration with failover scripts
• Deploy monitoring when adding new servers
• Scripts can check status of hosts and services before continuing
39
Demo time
40
Plans for the (near) future
• Upgrade Icinga to 1.6
• Clean up ICL and make compatible with Icinga 1.6
• Put ICL on GitHub
• Expose API to developers
• Trend analysis / integration with Ganglia/Graphite
41
Thank you, questions?
Puppet: http://puppetlabs.com/
Github: https://github.com/hyves-org/
Email: jeffrey@hyves.nl
Hyves: http://skyler.hyves.nl/
Twitter: @0skyler0

OSMC 2011 | Case Study - Icinga at Hyves.nl by Jeffrey Lensen

  • 1. Icinga at Hyves.nl Jeffrey Lensen System Engineer
  • 2. 2 Hyves • Dutch social network website • 3 billion pageviews / month • 10M dutch members (17M population) • ~7M unique visitors / month (Comscore 09/2011) • ~2.3M unique visitors / day • 800.000 photo uploads / day • 7M chat messages / day • 6Gbps daily outgoing traffic
  • 3. 3 Hyves environment • 3000 hosts running Gentoo • 3 Datacenters • 190 types of server functions • 160 Employees • System Engineering team: 12 • Developers: 45
  • 4. 4 Back in the day • 1 Datacenter • 150 servers • 4 System Engineers • 1 Nagios instance • Manual configuration
  • 5. 5 Keep up with serverpark growth • Popularity required expansion • Receiving 100 - 200 servers at a time • Manual configuration became unmanageable
  • 6. 6 Solutions to growth • Templates for host and hostgroup configurations • Servicechecks defined per hostgroup • Automated configuration with scripts (hosts, hostgroups, servicedependencies) • Server management database as source • Servicedependencies generated based on check_name prefix
  • 7. 7 Keep up with more serverpark growth • From 1 to 3 datacenters • Serverpark grew to 1500 hosts • 1 Nagios host isn’t enough anymore
  • 8. 8 Solutions to more growth • Distributed Nagios setup consisting of: • 1 Central Nagios server for alerting and webinterface • 9 Distributed Nagios servers • Required little changes to configuration scripting • Distribution based on location and function
  • 9. 9 Watching the watchers • Monitoring Nagios hosts with Nagios on NOC • NOC monitored by one of the Nagios hosts • Monitoring all datacenters from HQ
  • 10. 10 Distributed Nagios scaling problems • Long reloads due to large configuration (mainly Central server) • Freezes during large (network) fall-outs -> No alerting! • Webinterface could no longer load
  • 11. 11 Icinga • Switched in November 2010 • No more central monitoring server needed • Standalone web interface • Database backend • API • Rapid development • Painless migration: • sed -i 's/nagios/icinga/g' /etc/nagios/*cfg • mv /etc/nagios/* /etc/icinga/
  • 12. 12 Icinga setup • 12 Icinga hosts • 1 NOC Icinga host • 100.000 service checks • 3.500 hosts
  • 13. 13 Icinga setup • 2 Icinga-web + database hosts • Loadbalanced database and API • Easy failover
  • 14. 14 Make use of the API: Overview checks • Overview checks for hostgroups and services • Minimizes alerts during large failures • Python script using API • Example: python check_monitoring_overview.py --hostgroup webserver --service HTTP,HipHop -w 5% -c 10% All 472 'HTTP', 'HipHop' services for 'mainweb' are OK
  • 15. 15 Missing monitoring • Is everything that should be monitored, being monitored? • Won’t realize until it’s too late • Angry people..
  • 16. 16 Solution: Puppet Puppet is an open-source next-generation server automation tool. It is composed of a declarative language for expressing system configuration, a client and server for distributing it, and a library for realizing the configuration. • Modules for each application (Nginx, Postfix, SNMP etc.) • Roles based on function as set in server management database • Everything is defined in Puppet
  • 17. 17 Example: Nginx module class nginx { tag("nginx") package { "nginx": ensure => "latest", category => "www-servers" } service { "nginx": enable => true, ensure => running } }
  • 18. 18 Example: Role module class role::webserver inherits role { include nginx }
  • 19. 19 Using Puppet to generate configs • Supports “Nagios” Exported Resources • Exported Resources stored in MySQL backend • Define nagios_services in the matching modules
  • 20. 20 Include monitoring in NGINX module modules/nginx/manifests/init.pp: class nginx { tag("nginx") <snip> @@nagios_service { "HTTP $hostname": service_description => "HTTP", check_command => "check_web_http", event_handler => "service_restart!nginx!CRITICAL", contact_groups => "sysadmins" } }
  • 21. 21 Predefine defaults in defines.pp $__notifications_enabled = $systemstatus ? { operational => "1", fail => "0" } Nagios_service { ensure => present, host_name => $hostname.$domain, use => "generic-service", notifications_enabled => $__notifications_enabled, target => "/etc/icinga/puppetgenerated/services/$hostname.cfg", notes => $monitoringhost }
  • 22. 22 Predefine defaults in defines.pp Nagios_host { ensure => present, host_name => $hostname.$domain, hostgroups => $role, use => "generic-host", alias => $hostname, notifications_enabled => $__notifications_enabled, target => "/etc/icinga/puppetgenerated/hosts/$hostname.cfg", notes => $monitoringhost }
  • 23. 23 Define host in monitoring module modules/monitoring/manifests/init.pp: class monitoring { @@nagios_host { "$hostname": address => $ip } } modules/role/manifests/init.pp: class role { include monitoring }
  • 24. 24 Retrieving resources class icinga { tag("icinga") Nagios_host <<| notes == "$hostname" |>> { require => File["/etc/icinga/puppetgenerated/hosts"] } Nagios_service <<| notes == "$hostname" |>> { require => File["/etc/icinga/puppetgenerated/services"] } }
  • 25. 25 Checking generated configuration class icinga { <snip> exec { "verify new cfg": command => "/usr/bin/icinga -v /etc/icinga/verify-puppetgenerated.cfg", require => Class["get_icinga_puppet_resources"] } exec { "mv cfgs": command => "rm -rf /etc/icinga/puppet/*; mv /etc/icinga/puppetgenerated/* /etc/icinga/puppet/", require => Exec["verify new cfg"] } exec { "restart icinga": command => "/usr/bin/printf '[] RESTART_PROGRAM\n' > /var/icinga/rw/icinga.cmd", require => [ Exec["mv cfgs"], Service["icinga"] ] } }
  • 26. 26 Problems exporting resources • Puppet runs on Icinga hosts took between 10 and 30 minutes! • Makes it hard to quickly change monitoring • Most time spent retrieving and processing (Nagios) resources
  • 27. 27 get_icinga_puppet_resources.py • Determined queries used by Puppet • Get all resource IDs • For each ID get parameter name and value • Write to defined file (“target”) • Finishes in 15 seconds!
  • 28. 28 Retrieving resources ourselves class icinga { <snip> exec { "get_icinga_puppet_resources": command => "/usr/bin/python /usr/local/bin/get_icinga_puppet_resources.py", require => [ File["/etc/icinga/puppetgenerated/hosts"], File["/etc/icinga/puppetgenerated/services"] ] } }
  • 29. 29 Other cool stuff to do with Puppet • Generate daemon checks for servers based on config file • Generate overview daemon checks using Icinga API
  • 30. 30 Retrieve daemons from config modules/role/lib/facter/hyvesfacters.rb: Facter.add("hyves_daemons") do daemons = ["None"] if File::exists?( "/<path_to_config>/daemons.conf" ) daemons = [] daemonarray = [] daemonconf = %x{grep name /<path_to_config>/daemons.conf} for daemon in daemonconf daemon.sub!(/.** name:/, '') daemonarray.push(daemon.chomp) end end setcode do daemonarray.uniq end end
  • 31. 31 Create services for daemons modules/daemons/manifests/init.pp: class daemons { define add_daemon_check { @@nagios_service { "$name Daemon $hostname": use => "Daemon-check", service_description => "$name Daemon", check_command => "check_daemon!$name" } } add_daemon_check { $hyves_daemons: } }
  • 32. 32 Retrieving unique daemons from API require 'net/http' module Puppet::Parser::Functions newfunction(:get_daemons, :type => :rvalue, :docs => " This function returns an array of all current daemons, based on the Icinga API ") do |args| domain = "<icinga-web_url>" url = "/icinga-web/web/api/service/filter[AND(SERVICE_NAME%7Clike%7C*Daemon)]/columns[SERVICE_NAME]/order[SERVICE_NAME;ASC]/authkey=<api_key>/json" response = Net::HTTP.get_response(domain, url) data = response.body results = PSON.parse(data) daemons = Array.new results.each { |result| daemon = result['SERVICE_NAME'] daemon.sub!(/ Daemon/, '') daemons << daemon } daemons.uniq end end
  • 33. 33 Create overview services for daemons modules/icinga/manifests/noc.pp: $__daemons = get_daemons() templatefile { "/etc/icinga/puppetgenerated/other/daemons.cfg": template => template("icinga/hyvesdaemons.cfg.erb") } hyvesdaemons.cfg.erb: <% __daemons.each do |daemon| -%> define service{ use DaemonOverview-check host_name daemons service_description <%= daemon %> } <% end -%>
  • 34. 34 Deployment • Deploy script to start Puppet runs on all monitoring hosts • Reports status of Puppet runs once they’re finished • Starts Puppet run on NOC monitoring host
  • 35. 35 What if a machine doesn’t run Puppet? • A check that verifies the monitoring configuration • Retrieve all operational hosts from server management DB • Retrieve all hosts from Icinga API • Alert if something is missing or notifications are off
  • 36. 36 What about failover? • Requires a Puppet run on all servers • Speed up Puppet “runs” with --noop • Redeploy Icinga
  • 37. 37 ICL (Icinga CommandLine) • Python based script • Libraries for access to Icinga API and MK_Livestatus • Library for things like translating exit codes, and statuses • See host/service status information • Control monitoring and alerting • Quickly see open problems
  • 38. 38 Integration with other tools • Integration with server administration script to change status • Fail -> disable notifications • Operational -> check if everything is OK + enable notifications • Deprecated -> disable notifications + remove from Puppet DB • Integration with failover scripts • Deploy monitoring when adding new servers • Scripts can check status of hosts and services before continuing
  • 40. 40 Plans for the (near) future • Upgrade Icinga to 1.6 • Clean up ICL and make compatible with Icinga 1.6 • Put ICL on GitHub • Expose API to developers • Trend analysis / integration with Ganglia/Graphite
  • 41. 41 Thank you, questions? Puppet: http://puppetlabs.com/ Github: https://github.com/hyves-org/ Email: jeffrey@hyves.nl Hyves: http://skyler.hyves.nl/ Twitter: @0skyler0
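
The overview check on slide 14 collapses many service states into a single result using percentage thresholds (-w 5% -c 10%). The slides do not show how check_monitoring_overview.py does this internally, so the following is only a minimal sketch of the threshold logic. It assumes the service states have already been fetched from the Icinga API as plugin exit codes (0 = OK); the function name overview_status and the message format are hypothetical.

```python
def overview_status(states, warn_pct, crit_pct):
    """Aggregate plugin states into one Nagios-style (exit_code, message).

    states   -- list of plugin exit codes, 0 meaning OK (assumed input format)
    warn_pct -- percentage of non-OK services at which to return WARNING
    crit_pct -- percentage of non-OK services at which to return CRITICAL
    """
    total = len(states)
    if total == 0:
        # Nothing matched the hostgroup/service filter: report UNKNOWN.
        return 3, "UNKNOWN: no matching services found"

    not_ok = sum(1 for state in states if state != 0)
    pct = 100.0 * not_ok / total
    summary = "%d of %d services not OK (%.1f%%)" % (not_ok, total, pct)

    # Check the critical threshold first so it wins over the warning one.
    if pct >= crit_pct:
        return 2, "CRITICAL: " + summary
    if pct >= warn_pct:
        return 1, "WARNING: " + summary
    return 0, "OK: " + summary
```

Because only the aggregate crosses a threshold, a single failing webserver stays silent at this level while a widespread failure produces one alert instead of hundreds.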
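
Slide 27 describes get_icinga_puppet_resources.py as fetching every exported resource, reading its parameter names and values, and writing the result to the file named in "target". The database queries are not shown in the slides, so this sketch covers only the last step and assumes the resources are already in memory as (kind, params) pairs. The names render_object, group_by_target and PUPPET_ONLY are hypothetical, as is the choice to drop "target" and "ensure" from the output because they are Puppet parameters rather than Icinga directives.

```python
# Puppet-side parameters that should not appear in the Icinga object (assumption).
PUPPET_ONLY = {"target", "ensure"}


def render_object(kind, params):
    """Render one exported resource as an Icinga/Nagios object definition."""
    lines = ["define %s {" % kind]
    for name in sorted(params):
        if name in PUPPET_ONLY:
            continue
        lines.append("    %-24s %s" % (name, params[name]))
    lines.append("}")
    return "\n".join(lines) + "\n"


def group_by_target(resources):
    """Map each target file path to the full text that should be written to it.

    resources -- iterable of (kind, params) pairs, e.g. ("service", {...})
    """
    chunks = {}
    for kind, params in resources:
        chunks.setdefault(params["target"], []).append(render_object(kind, params))
    return {target: "\n".join(parts) for target, parts in chunks.items()}
```

Writing each returned string to its path in one pass is what would replace the per-resource processing that made the Puppet runs on slide 26 slow.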
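
The configuration check on slide 35 compares two host lists: operational hosts from the server management database and hosts known to Icinga. A rough sketch of that comparison follows, assuming both lists have already been retrieved. The function name find_monitoring_gaps and the input shapes (a list of hostnames, and a dict mapping hostname to a notifications-enabled flag) are assumptions, not the format used by the actual check.

```python
def find_monitoring_gaps(operational_hosts, icinga_hosts):
    """Return (missing, muted) host lists for operational machines.

    operational_hosts -- iterable of hostnames that should be monitored
    icinga_hosts      -- dict: hostname -> True if notifications are enabled
    """
    expected = set(operational_hosts)
    # Operational in the server database but unknown to Icinga.
    missing = sorted(h for h in expected if h not in icinga_hosts)
    # Known to Icinga, but alerts would never be sent.
    muted = sorted(h for h in expected if h in icinga_hosts and not icinga_hosts[h])
    return missing, muted
```

A check built on this would alert whenever either list is non-empty, which catches machines that never completed a Puppet run.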