William Leibzon's presentation on using Nagios in a cloud computing environment. The presentation was given during the Nagios World Conference North America held Sept 27-29th, 2011 in Saint Paul, MN.
For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
30. Clouds can be as small as 10 servers or as large as 10,000+. When developing your architecture, you need to support future growth from the start.
34. A good system design should be fully fault-tolerant, and the application as a whole should continue to function without interruption if any one server instance dies. This means a cluster!
36. "Old Way" - NSCA is used to forward results of checks from client servers to the main Nagios server; not robust.
55. Partitioning the monitoring infrastructure among servers is still a manual process. It is not easy to use this in a dynamic cloud environment; however, it works very well for fault tolerance.
57. - Similar to passive service checks, there is a central Nagios server; it does not execute any plugins itself.
58. - Unlike with passive checks, Nagios does schedule the checks. Thereafter the NEB module takes over.
59. - The module passes information on which plugin(s) to run to the DNX server (or the Gearman server for Mod-Gearman), which manages the worker nodes. - Worker nodes are separate servers, each running a special worker daemon. The daemon communicates with the management server and receives the plugin command to run. It then passes the results back to the management server, and the NEB module writes these results directly into Nagios memory.
61. All worker nodes are essentially the same, and no additional reconfiguration is necessary to add a new node.
66. The author of this presentation has a patch to DNX that allows results to be multicast to multiple Nagios server instances (the second would be a standby, not scheduling checks, only receiving results). This is experimental.
69. Almost all communication is from client to server. The client contacts the DNX server's dispatcher port, receives a list of checks to run, runs them, and returns the results on the collector port.
70. The DNX client can support common checks built into the client. check_nrpe was included before, but was pulled out of the package because it required the Nagios source.

#poolInitial = 20
#poolMin = 20
#poolMax = 100
#poolGrow = 10
channelDispatcher = udp://10.1.1.1:12480
channelCollector = udp://10.1.1.1:12481
71. DNX System Internals [diagrams: DNX server system internals; DNX client (worker node) system internals]
81. Ideal Fully Fault-Tolerant Nagios Cluster Architecture. Ideally you would have each of the components below as a separate cloud server, but even those with 1000s of servers may find this hard to maintain. [Diagram components: Nagios Server and Backup Nagios Server (UDP heartbeat, udpecho cross-monitoring), Merlin/NDO DB with replication, DB Proxy and Standby DB Proxy, Nagios Web Interface Server and backup, Worker Nodes, Performance Data (RRD) Server (like NagiosGrapher) and backup.]
84. If the main server dies, the backup takes over and registers itself in the dynamic DNS server, replacing the primary.
85. DNX clients use the dynamic DNS address; they are restarted on a server switch. [Diagram: two mirrored stacks (Nagios Daemon, Apache, MySQL DB, Merlin, PNP w/ RRD, DNX Server) with replication and cross-monitoring, plus DNX clients.]
88. Trigger based on the total number of open HTTP sockets (check_netstat, check_apache_status) from all servers.
89. Write a custom script that keeps the number of currently active servers in a DB or local file and uses it to set the name of the new server.
90. Pass the new server name as a parameter when launching the cloud instance. Write startup scripts that use it to set the hostname and register the IP in a local dynamic DNS server.
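A startup script along these lines can be sketched in Python. This is an illustration only: the zone name, TTL, and use of BIND's nsupdate utility are assumptions, not part of the original setup.

```python
import socket
import subprocess

def nsupdate_script(hostname, ip, zone="cloud.example.com", ttl=60):
    # Build input for BIND's nsupdate utility; zone and TTL are assumed values.
    return (
        f"update delete {hostname}.{zone} A\n"
        f"update add {hostname}.{zone} {ttl} A {ip}\n"
        "send\n"
    )

def register_self(hostname):
    # Set the hostname passed in at instance launch, then register our IP
    # in the local dynamic DNS server (requires root and a running nsupdate).
    subprocess.run(["hostname", hostname], check=True)
    ip = socket.gethostbyname(socket.gethostname())
    subprocess.run(["nsupdate"], input=nsupdate_script(hostname, ip),
                   text=True, check=True)
```

The delete-then-add pattern makes the registration idempotent, so the same script can run on every boot.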
91. For Amazon EC2, the aws utility is very useful for automating the launching of new servers. Get it at http://timkay.com/aws/
92. An extra Nagios worker node is launched similarly; this is triggered when enough servers have been launched. You can also do it based on Nagios stats (check_nagios).
93. Scale down after an hour or more of low resource usage; you can do this with a check that relies on RRD data.
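The scale-down decision can be as simple as averaging the last hour of samples. A minimal sketch, assuming the samples were already parsed from something like `rrdtool fetch ... AVERAGE` (the threshold and window values are made up for illustration):

```python
def should_scale_down(samples, low_threshold=20, window=60):
    # samples: per-minute connection counts, most recent last.
    # Scale down only if the whole last hour stayed below the threshold.
    recent = [s for s in samples[-window:] if s is not None]  # skip RRD NaN gaps
    if len(recent) < window // 2:   # not enough data: play it safe, keep the server
        return False
    return max(recent) < low_threshold

print(should_scale_down([5] * 60))  # a quiet hour
```

Using max() rather than the mean avoids shutting a server down right after a short traffic spike.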
94. Use of an SQL DB for Auto-Scaling. This is for illustration of the logic only, not real code.

CREATE TABLE ServerData (
    id bigint(10) unsigned NOT NULL,
    name varchar(50) default NULL,
    connections bigint(20) unsigned default 0,
    started_on date default NULL,
    PRIMARY KEY (id)
);

After you get the results of a server check (e.g. from an event handler that runs):

UPDATE ServerData SET connections=<data from nagios check> WHERE name=<server host>

Custom check to see if a new server should be started:

$count = sqlexec("SELECT COUNT(id) FROM ServerData")
$sumit = sqlexec("SELECT SUM(connections) FROM ServerData")
$lastlaunched = sqlexec("SELECT MAX(started_on) FROM ServerData")
if $sumit/$count > $threshold && ($now - $lastlaunched) > 600 {
    <figure out the name and id>
    launch_new_server_instance($newname)
    sqlexec("INSERT INTO ServerData VALUES ($newid, $newname, 0, CURDATE())")
    enable_nagios_service_checks($newname)
}
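The same logic can be made runnable with Python and SQLite. This is a sketch, not the presenter's code: the threshold, cooldown, naming scheme, and launch_new_server_instance stub are all assumptions.

```python
import sqlite3
import time

THRESHOLD = 100   # average connections per server before scaling up (assumed value)
COOLDOWN = 600    # seconds to wait between launches

def launch_new_server_instance(name):
    # Stand-in for a real cloud API call (e.g. the aws utility mentioned above).
    print(f"launching {name}")

def maybe_scale_up(db, now=None):
    now = now or time.time()
    count, total, last_launched = db.execute(
        "SELECT COUNT(id), SUM(connections), MAX(started_on) FROM ServerData"
    ).fetchone()
    if not count:
        return None
    # Launch only if average load is high AND we have not launched recently.
    if total / count > THRESHOLD and now - last_launched > COOLDOWN:
        newname = f"web{count + 1}"
        launch_new_server_instance(newname)
        db.execute("INSERT INTO ServerData VALUES (?, ?, 0, ?)",
                   (count + 1, newname, now))
        return newname
    return None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ServerData (id INTEGER PRIMARY KEY, name TEXT,"
           " connections INTEGER DEFAULT 0, started_on REAL)")
db.execute("INSERT INTO ServerData VALUES (1, 'web1', 250, 0)")
print(maybe_scale_up(db))
```

The cooldown check is what enforces the "no more often than once every 10 minutes" rule from the talk.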
96. But if you control the cloud, find a way to get the cloud hardware's system load. Write a check showing the physical server name.
Hi, my name is William Leibzon, and today I'm going to talk about Nagios clusters in a cloud computing environment. I want to apologize because I do not have much experience speaking at conferences. What is even worse, I got sick yesterday and have a sore throat. However, I made sure to put everything I could into the slides, so you can follow along and take them home.
Ok, so let's begin. You've all heard the buzzword "cloud computing", but what is it? I pulled up this definition from some site, but it is hardly THE definition. In a nutshell, cloud computing allows you to run a lot of virtual servers on a smaller number of hardware machines. And the key to that is virtualization.
Virtualization allows us to separate hardware from software. The OS is supposed to provide us this level of indirection, but the OS gets tied to hardware too much, and software packages are now tied to a specific OS. With virtualization, multiple systems running on the same hardware can utilize resources more efficiently: if, say, we have one system that uses more CPU and another that does more network I/O, we can potentially put them together on the same machine and utilize its resources fully. And of course, if we can put many systems on a smaller piece of hardware that takes less space in a datacenter, it's less expensive. So the business side loves it.
Cloud computing is an extension of virtualization: instead of having virtual servers on specific hardware, we assume there is an unlimited amount of hardware available for virtual servers to run on, and we just focus on the virtual servers. A good cloud environment will keep these servers running even if there is an issue with the hardware, so servers can potentially move live from one hardware host to another. What is even better is that we have control over which hosts we want to run and for how long. So we can have the largest number of servers running at peak traffic load and scale down to the minimum otherwise. Of course, being able to do this requires monitoring which resources are utilized and how.
Now, for those who want to build a cloud environment, there are a number of solutions available, both open-source and commercial. VMware is by far the largest commercial vendor. For open source, there are a number of packages available to create a cloud; most OS vendors have one. As far as hypervisors, Xen dominates in open source and gives better performance than VMware for Linux virtual servers on Linux. There are also several competing hypervisors gaining popularity, and in my opinion they are better. If you don't want to build your own cloud hardware infrastructure, buying from cloud infrastructure providers is an option. Amazon EC2 is by far the most well known and widely used.
And these are the links to the open-source cloud software from the previous slide.
So after this brief intro to cloud computing, we now come to what we're here for: monitoring. There are two pieces to cloud monitoring - the hardware systems that run the hypervisors, and the virtual servers. Hardware monitoring is similar to normal server monitoring; it's static in that new servers don't get added often and there aren't really any changes once everything is set up. Monitoring of system resources is often taken care of by the cloud software, but if possible you should still monitor Unix resources like system load, memory, etc., and environmental data can also be monitored. For virtual servers, monitoring is dynamic and should handle the addition and removal of servers well. The focus is application and network performance. The good thing about a cloud is that once you reach the limit of what the current servers can do, you can just launch a new server. This is auto-scaling, and it is what makes the cloud so useful. Nagios can be used to drive this scaling, and it should itself also be scalable.
What we want from a monitoring architecture is the same as with other applications: something that is easy to grow automatically, has no single bottleneck, and still functions if any one server dies. This means horizontal scaling, scaling on demand, and high availability. And that means a cluster.
There are 3 main ways to build a Nagios cluster. The first is what I call the "Old Way", otherwise known as the "Classic Distributed Model". This uses passive service checks on a central Nagios server, with NSCA forwarding information from the client Nagios servers. The second is "Shared Database" or the "Central Dashboard Model": a database is used to create a shared, centralized view of several Nagios hosts. The third way is what I call "Worker Nodes", and in Nagios that is represented by the DNX and Mod-Gearman projects. Here all plugin checks get distributed to a set of worker node servers automatically, and a cluster can handle many more checks than a single Nagios server could.
So here is the passive service checks model. I think everyone here already knows about it, so I'll not go into it other than to say it's not robust and it is difficult to configure the client Nagios hosts. It is also not a way to handle a dynamically changing number of hosts and services.
The shared database model in Nagios is represented by the Merlin and NDO projects. Of these two, I use Merlin. The advantage is that there is no master Nagios server: we just have a set of peer servers that share data by means of a database, and you can have a centralized view of that database through a web interface. The disadvantage is that you still need to manually partition which set of hosts each server monitors. Plus, you replace a central Nagios server with a central database, which, despite me listing it as an advantage, is a single bottleneck.
Now here comes what you've all been waiting to hear from me - DNX :) or, more generally, the worker nodes model. It is similar to the classic distributed model in that you offload all active checks to a set of other servers. However, this is all done automatically: Nagios schedules these checks rather than just seeing them as passive. With the NEB module architecture, results of checks are written directly into Nagios memory rather than put in a command queue. Both DNX and Mod-Gearman have 3 main components: the NEB module, a distribution server, and client nodes. A single distribution daemon runs side by side with the Nagios daemon, client nodes talk to it and run all the checks, and the NEB module is the interface between Nagios and the distribution server. In Mod-Gearman, two of these components come from the Gearman project, and only the module is custom-written for Nagios. DNX also includes a sync script which can be used to make sure plugins are the same on all servers, but personally I've just done it with ssh and rsync from cron.
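The worker-node model described above can be sketched in a few lines. This is a toy illustration, not DNX or Gearman code: the two queues stand in for DNX's dispatcher and collector ports, and the echo commands stand in for real plugins.

```python
import queue
import subprocess
import threading

dispatch_q = queue.Queue()   # checks handed out by the "server" (dispatcher port)
collect_q = queue.Queue()    # results sent back (collector port)

def worker():
    # Each worker node pulls plugin commands and returns (command, exit code, output).
    while True:
        cmd = dispatch_q.get()
        if cmd is None:      # shutdown signal
            break
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        collect_q.put((cmd, proc.returncode, proc.stdout.strip()))
        dispatch_q.task_done()

# Any number of identical workers can be started; adding a node needs no reconfiguration.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

# The "Nagios server" schedules the checks; trivial shell commands stand in for plugins.
for check in ["echo OK - disk", "echo OK - load", "echo OK - http"]:
    dispatch_q.put(check)

dispatch_q.join()
for t in threads:
    dispatch_q.put(None)     # stop the workers
for t in threads:
    t.join()

results = []
while not collect_q.empty():
    results.append(collect_q.get())
print(len(results), "results collected")
```

The point of the pattern is visible here: the scheduler never runs a plugin itself, and workers are interchangeable.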
So the advantage of this solution is that it scales to handle essentially any number of service checks just by adding more servers, with no additional configuration necessary. This is pretty much what you want for horizontal scaling. And since all nodes are the same, it works very well for cloud computing, where you can just clone the server. Its integration with Nagios is, as mentioned, through an NEB module: it offloads checks and writes the results directly to and from Nagios memory structures.
There is a whole slide here, but the disadvantage is essentially that you still have one single Nagios server that has to handle all scheduling and notification. This also means no fault tolerance, although I wrote a patch to DNX and Nagios to address it. I have another Nagios installation to do in October on which to try it, and after that I will release it with some documentation.
I have a couple more slides on DNX. Basically, it is a multi-threaded server. On the server side there are Timer, Collector, Registrar and Dispatcher threads, and the client will increase and decrease the number of threads as needed to run plugins. The settings to control this are similar to Apache's. You should test your systems to find the upper limit. Communication between the DNX client and server uses a custom UDP-based XML protocol: UDP because we expect DNX clients to be located on the same network and don't want the TCP overhead, and if one or two packets get lost sometimes, it's not as important because Nagios will schedule more checks. DNX can support extensions that are meant to replace some of the common plugins without the necessity of running external code. The only one that has been tried is the check_nrpe module, which was basically the NRPE source with a patch to make it into a library.
And this is the internal diagram of threads. The client uses a manager-worker thread model; the server is several static threads.
This is the Mod-Gearman architecture. Gearman is a little like a MapReduce system. Essentially, you have clients that check whether there are any commands to run in one or more queues they belong to, and the server distributes checks among the queues. This queue system is rather flexible, and it's possible to create queues for a specific hostgroup, servicegroup, etc. I do not know the internals of Gearman well, but I believe it is also written with a manager-worker thread model.
Now here is a comparison of DNX and Mod-Gearman. DNX aims to be a single package with no external dependencies; it even has a simple XML parsing library written as part of it. Unfortunately, this also means it's harder to maintain and test new releases. Neither project has a full-time developer, but Mod-Gearman is basically 90% Gearman, so it gets all the benefits of the larger project. DNX was sponsored by LDS, but since the 0.20 release it's all done by the community, with John Calcote still its main maintainer; the last release was in 2010, so the project is alive. However, planned features do not get added until somebody volunteers to program them. The features that were planned are: embedded Perl, encrypting the communication channel for security reasons, optional TCP rather than just UDP, and passing Nagios environment variables to the worker nodes to make it even more like running inside Nagios. Load balancing of event handlers may be added as well. I do want to mention that DNX can support handling of certain checks by a subset of servers using the localCheckPattern directive; it was added in the 0.20 release and was a patch before. Mod-Gearman, as I mentioned, supports this very nicely with its queues, and it supports offloading of event handlers too.
So the best news of all is that you can combine the different Nagios cluster models to create something better. The picture is from the DNX project. I've done this, but I personally prefer Merlin over NDO because it offers failover capabilities.
Now here is an overloaded diagram of a full Nagios infrastructure that is fault-tolerant and can be horizontally scaled. If you have all the resources in the world, you can have each of the boxes above as a separate server; I've never gone quite that extreme, and my largest install was 500 hosts. Also, just to explain the diagram: the DB proxy and web interface server should cross-monitor each other with a heartbeat, and you should set it up so that if one server dies, the other starts to announce itself on the same IP. For those using Amazon, this would be done by changing the Elastic IP.
If you're starting small, this is a reasonable setup for a cluster. All checks are offloaded to the worker nodes, and this frees up CPU resources on the Nagios server to do performance graphing. An elastic or shared IP can be used to point to the active Nagios server, or you can register the primary server in dynamic DNS. The standby server does not do any checks but is ready if something happens to the primary. One thing to mention: monitoring of the worker nodes and of the other Nagios server is an exception and should be done directly by the Nagios server, not by the worker nodes. As you grow, you can begin to separate components onto their own servers, such as a separate database server and a separate performance graphing server.
I wanted to mention configuring hosts. I find it best to create a template for each type of server and to tie all services to hostgroups. This makes adding a new host just a matter of adding it with a new name. But as you all know, Nagios is not great at live addition of hosts, so what works best is to add a few extra servers in the config and disable all their checks by default. Then, once a server is up, a script can re-enable all checks on the host.
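Re-enabling checks for a pre-provisioned host can be done through the Nagios external command file. A minimal sketch follows; ENABLE_HOST_CHECK and ENABLE_HOST_SVC_CHECKS are standard Nagios external commands, while the command file path and host name here are illustrative assumptions.

```python
import tempfile
import time

def enable_host_checks(host, cmdfile):
    # In production, cmdfile would be Nagios's command pipe,
    # typically /usr/local/nagios/var/rw/nagios.cmd.
    now = int(time.time())
    lines = (
        f"[{now}] ENABLE_HOST_CHECK;{host}\n"
        f"[{now}] ENABLE_HOST_SVC_CHECKS;{host}\n"
    )
    with open(cmdfile, "w") as f:
        f.write(lines)
    return lines

# Demo against a plain temp file instead of the real command pipe.
with tempfile.NamedTemporaryFile("r", suffix=".cmd") as tmp:
    out = enable_host_checks("web5", tmp.name)
    print(out, end="")
```

The same mechanism, with the matching DISABLE_* commands, can be used by the scale-down path.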
Doing auto-scaling with a Nagios event handler is slightly better than with a custom check. The trigger should be the total number of open sockets. One option: if on any one of the servers it exceeds a threshold, a new server is launched, but no more often than, say, once every 10-15 minutes. Another option is to keep track of the total number of connections from all hosts of this type. You can do this by combining RRD data or with a database; my preference is a database.
This is an illustrative example of the auto-scaling logic when using an SQL database. I write these in Perl, but the slide is not real Perl or full SQL.
I also wanted to give a few additional tips for those just starting to monitor virtual systems. First of all, as you will quickly learn, system load is not always entirely accurate; you are better off using other parameters like the total number of connections a server is handling and the time it takes to process requests. Another tip: if you control the cloud, integrate with it and add an "empty" Nagios check just showing the name of the physical server. You will find it useful for diagnostics. And remember - you're on the cloud, you can just launch a new server if the current one is not working right. For a production system, that is more important than debugging the exact issue right away.
Lastly, here are the links to the Nagios software I mentioned in the presentation. Of those I did not mention, Ganglia is good for monitoring a large grid of servers, so it is a good fit if you want to monitor the hypervisor hardware on which the cloud servers run.