Hyves.nl runs on almost 3,000 servers, all of which need complete monitoring. Jeffrey will talk about how Hyves.nl uses Puppet to configure the entire Icinga cluster automatically and so ensure full monitoring coverage. Another topic is the integration of Icinga with several other tools by means of MK Livestatus and the Icinga API. He will present tools built specifically for customers to monitor and request status information from Icinga.
Hyves environment
• 3000 hosts running Gentoo
• 3 Datacenters
• 190 types of server functions
• 160 Employees
• System Engineering team: 12
• Developers: 45
Back in the day
• 1 Datacenter
• 150 servers
• 4 System Engineers
• 1 Nagios instance
• Manual configuration
Keep up with serverpark growth
• Popularity required expansion
• Receiving 100 - 200 servers at a time
• Manual configuration became unmanageable
Solutions to growth
• Templates for host and hostgroup configurations
• Servicechecks defined per hostgroup
• Automated configuration with scripts (hosts, hostgroups, servicedependencies)
• Server management database as source
• Servicedependencies generated based on check_name prefix
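The generation step above can be sketched roughly as follows. This is a minimal illustration, not the scripts Hyves used: the record layout, the `generic-host` template name, and the prefix rule (a service named `PREFIX_something` depends on the service `PREFIX` on the same host) are all assumptions, since the slide does not say how the check_name prefix is interpreted.

```python
# Sketch only: record layout, template name and prefix rule are assumptions.

def render_host(name, address, hostgroup, template="generic-host"):
    """Render one Nagios host definition that inherits from a template."""
    return (
        "define host{\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        f"    hostgroups {hostgroup}\n"
        "}\n"
    )


def render_dependencies(host, service_names, separator="_"):
    """Make every 'PREFIX_something' service depend on the 'PREFIX' service.

    Assumed rule: a service depends on another service of the same host
    whose name equals the part before the first separator.
    """
    known = set(service_names)
    blocks = []
    for service in sorted(known):
        prefix = service.split(separator, 1)[0]
        if prefix != service and prefix in known:
            blocks.append(
                "define servicedependency{\n"
                f"    host_name                     {host}\n"
                f"    service_description           {prefix}\n"
                f"    dependent_host_name           {host}\n"
                f"    dependent_service_description {service}\n"
                "}\n"
            )
    return blocks


# Hypothetical rows as they might come from the server management database
servers = [
    {"name": "web001", "address": "10.0.0.1", "hostgroup": "webserver"},
    {"name": "db001", "address": "10.0.0.2", "hostgroup": "database"},
]
config = "".join(render_host(**row) for row in servers)
deps = render_dependencies("db001", ["MySQL", "MySQL_replication", "SSH"])
```

Because the database is the single source, adding 100 servers only means adding 100 rows; the host and dependency definitions follow from the rows.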
Keep up with more serverpark growth
• From 1 to 3 datacenters
• Serverpark grew to 1500 hosts
• 1 Nagios host isn’t enough anymore
Solutions to more growth
• Distributed Nagios setup consisting of:
• 1 Central Nagios server for alerting and webinterface
• 9 Distributed Nagios servers
• Required only minor changes to the configuration scripts
• Distribution based on location and function
Watching the watchers
• Monitoring Nagios hosts with Nagios on NOC
• NOC monitored by one of the Nagios hosts
• Monitoring all datacenters from HQ
Distributed Nagios scaling problems
• Long reloads due to large configuration (mainly Central server)
• Freezes during large (network) outages -> no alerting!
• Web interface would no longer load
Icinga
• Switched in November 2010
• No more central monitoring server needed
• Standalone web interface
• Database backend
• API
• Rapid development
• Painless migration:
• sed -i 's/nagios/icinga/g' /etc/nagios/*cfg
• mv /etc/nagios/* /etc/icinga/
Icinga setup
• 2 Icinga-web + database hosts
• Loadbalanced database and API
• Easy failover
Make use of the API: Overview checks
• Overview checks for hostgroups and services
• Minimizes alerts during large failures
• Python script using API
• Example:
python check_monitoring_overview.py --hostgroup webserver \
    --service HTTP,HipHop -w 5% -c 10%
All 472 'HTTP', 'HipHop' services for 'mainweb' are OK
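The threshold logic behind such an overview check might look like the sketch below. It is not the actual check_monitoring_overview.py: the function names are hypothetical, the service states are passed in as a plain list instead of being fetched from the Icinga API, and treating the thresholds as "greater than or equal" is an assumption.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def parse_percent(value):
    """Turn a threshold such as '5%' into the float 5.0."""
    return float(value.rstrip("%"))


def overview_state(states, warning="5%", critical="10%"):
    """Return (exit_code, message) for a list of service states.

    `states` holds one Nagios state per service (0 means OK).
    Assumption: a threshold is reached when the failed share is >= its value.
    """
    total = len(states)
    failed = sum(1 for state in states if state != 0)
    percent = 100.0 * failed / total if total else 0.0
    if failed == 0:
        message = f"All {total} services are OK"
    else:
        message = f"{failed} of {total} services are not OK ({percent:.1f}%)"
    if percent >= parse_percent(critical):
        return CRITICAL, message
    if percent >= parse_percent(warning):
        return WARNING, message
    return OK, message
```

One overview service per hostgroup then produces a single alert when, say, a switch failure takes out a tenth of the web servers, instead of one alert per service.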
Missing monitoring
• Is everything that should be monitored actually being monitored?
• You won’t notice a gap until it’s too late
• Angry people...
Solution: Puppet
Puppet is an open-source next-generation server automation
tool. It is composed of a declarative language for expressing
system configuration, a client and server for distributing it, and a
library for realizing the configuration.
• Modules for each application (Nginx, Postfix, SNMP etc.)
• Roles based on function as set in server management database
• Everything is defined in Puppet
Using Puppet to generate configs
• Supports “Nagios” Exported Resources
• Exported Resources stored in MySQL backend
• Define nagios_services in the matching modules
Problems exporting resources
• Puppet runs on Icinga hosts took between 10 and 30 minutes!
• Makes it hard to quickly change monitoring
• Most time spent retrieving and processing (Nagios) resources
Other cool stuff to do with Puppet
• Generate daemon checks for servers based on config file
• Generate overview daemon checks using Icinga API
Retrieve daemons from config
modules/role/lib/facter/hyvesfacters.rb:
# Custom fact: the unique daemon names listed in daemons.conf
Facter.add("hyves_daemons") do
  setcode do
    config = "/<path_to_config>/daemons.conf"
    daemons = ["None"]
    if File.exist?(config)
      # Keep the text after "name:" on every line that mentions a name
      daemons = File.readlines(config).grep(/name/).map { |line|
        line.sub(/.*name:\s*/, '').chomp
      }.uniq
    end
    daemons
  end
end
Retrieving unique daemons from API
require 'net/http'

module Puppet::Parser::Functions
  newfunction(:get_daemons, :type => :rvalue, :doc => "
    This function returns an array of all current daemons, based on the Icinga API
  ") do |args|
    # Host of the Icinga-web server and the API path that selects all
    # services whose name matches *Daemon
    domain = "<icinga-web_url>"
    url = "/icinga-web/web/api/service/filter[AND(SERVICE_NAME%7Clike%7C*Daemon)]/" +
          "columns[SERVICE_NAME]/order[SERVICE_NAME;ASC]/authkey=<api_key>/json"
    response = Net::HTTP.get_response(domain, url)
    results = PSON.parse(response.body)
    # Strip the ' Daemon' suffix and return every daemon name once
    daemons = results.map { |result| result['SERVICE_NAME'].sub(/ Daemon/, '') }
    daemons.uniq
  end
end
Create overview services for daemons
modules/icinga/manifests/noc.pp:
$__daemons = get_daemons()
templatefile { "/etc/icinga/puppetgenerated/other/daemons.cfg":
  template => template("icinga/hyvesdaemons.cfg.erb")
}

hyvesdaemons.cfg.erb:
<% __daemons.each do |daemon| -%>
define service{
    use                  DaemonOverview-check
    host_name            daemons
    service_description  <%= daemon %>
}
<% end -%>
Deployment
• Deploy script to start Puppet runs on all monitoring hosts
• Reports status of Puppet runs once they’re finished
• Starts Puppet run on NOC monitoring host
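The ordering described above (all monitoring hosts first, a status report, then the NOC host) can be sketched as follows. The function name `deploy` and the `run_puppet` callable are hypothetical; how a Puppet run is actually triggered on a host is left to the caller, and the sketch starts the NOC run even if an earlier run failed, which the slides do not specify.

```python
from concurrent.futures import ThreadPoolExecutor


def deploy(monitoring_hosts, noc_host, run_puppet):
    """Run Puppet on all monitoring hosts, report, then run it on the NOC host.

    run_puppet is a caller-supplied function (hypothetical) that triggers a
    Puppet run on one host and returns True on success.
    """
    hosts = list(monitoring_hosts)
    results = {}
    if hosts:
        # Start the runs in parallel and wait until all of them have finished
        with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
            results = dict(zip(hosts, pool.map(run_puppet, hosts)))
    for host in hosts:
        print(f"{host}: {'ok' if results[host] else 'FAILED'}")
    # The NOC monitoring host goes last, after the other runs are done
    results[noc_host] = run_puppet(noc_host)
    return results
```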
What if a machine doesn’t run Puppet?
• A check that verifies the monitoring configuration itself
• Retrieve all operational hosts from the server management DB
• Retrieve all hosts from Icinga API
• Alert if something is missing or notifications are off
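The comparison itself is a set difference plus a notification flag lookup. A minimal sketch, assuming the two host lists have already been fetched; the function name, the input shapes, and the choice of CRITICAL as the alert level are assumptions:

```python
def check_monitoring_coverage(operational_hosts, icinga_hosts):
    """Compare the server management DB with what Icinga actually monitors.

    operational_hosts: iterable of host names marked operational in the DB.
    icinga_hosts: dict mapping host name -> True if notifications are enabled.
    Returns a Nagios-style (exit_code, message) pair.
    """
    expected = set(operational_hosts)
    missing = sorted(expected - set(icinga_hosts))
    muted = sorted(
        host for host in expected
        if host in icinga_hosts and not icinga_hosts[host]
    )
    problems = []
    if missing:
        problems.append("not monitored: " + ", ".join(missing))
    if muted:
        problems.append("notifications off: " + ", ".join(muted))
    if problems:
        return 2, "CRITICAL - " + "; ".join(problems)
    return 0, f"OK - all {len(expected)} operational hosts are monitored"
```

Running this as a regular service check closes the gap left by machines where Puppet never ran and therefore never exported their monitoring resources.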
What about failover?
• Requires a Puppet run on all servers
• Speed up Puppet “runs” with --noop
• Redeploy Icinga
ICL (Icinga CommandLine)
• Python based script
• Libraries for access to Icinga API and MK_Livestatus
• Library for things like translating exit codes and statuses
• See host/service status information
• Control monitoring and alerting
• Quickly see open problems
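ICL itself is not shown in the slides, so the following is only a sketch of what two of its building blocks could look like: translating state codes into names, and asking MK Livestatus for all services that are not OK. The function names and the socket path argument are hypothetical; the query uses the standard Livestatus query language and its default output of one row per line with `;` between columns.

```python
import socket

SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def state_name(code):
    """Translate a plugin exit code / service state into its name."""
    return SERVICE_STATES.get(code, "UNKNOWN")


def open_problems_query():
    """Build a Livestatus query for all services that are not OK."""
    return (
        "GET services\n"
        "Columns: host_name description state\n"
        "Filter: state != 0\n"
        "\n"
    )


def parse_rows(raw):
    """Parse the default Livestatus output into (host, service, state name).

    Assumes that host names and service descriptions contain no ';'.
    """
    rows = []
    for line in raw.splitlines():
        if line:
            host, description, state = line.split(";")
            rows.append((host, description, state_name(int(state))))
    return rows


def query_livestatus(socket_path, query):
    """Send a query to the Livestatus UNIX socket and return the raw answer."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell Livestatus the request is complete
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")
```

On a monitoring host, `parse_rows(query_livestatus(path, open_problems_query()))` would give the "open problems" view, with `path` pointing at wherever the Livestatus socket is configured.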
Integration with other tools
• Integration with server administration script to change status
• Fail -> disable notifications
• Operational -> check if everything is OK + enable notifications
• Deprecated -> disable notifications + remove from Puppet DB
• Integration with failover scripts
• Deploy monitoring when adding new servers
• Scripts can check status of hosts and services before continuing
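The status transitions listed above map naturally onto a small lookup table. In this sketch the step names are hypothetical labels for the actions on the slide, not functions of any real tool, and `safe_to_continue` illustrates the kind of gate a script could use before proceeding.

```python
# Steps per server status as listed on the slide; the step names are
# hypothetical labels, not real API calls.
STATUS_PLANS = {
    "fail": ["disable_notifications"],
    "operational": ["check_everything_ok", "enable_notifications"],
    "deprecated": ["disable_notifications", "remove_from_puppet_db"],
}


def plan_status_change(new_status):
    """Return the ordered monitoring steps for a server status change."""
    key = new_status.strip().lower()
    if key not in STATUS_PLANS:
        raise ValueError(f"unknown server status: {new_status!r}")
    return list(STATUS_PLANS[key])


def safe_to_continue(service_states):
    """Gate for scripts: continue only if every checked state is OK (0)."""
    return all(state == 0 for state in service_states)
```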
Plans for the (near) future
• Upgrade Icinga to 1.6
• Clean up ICL and make compatible with Icinga 1.6
• Put ICL on GitHub
• Expose API to developers
• Trend analysis / integration with Ganglia/Graphite