Monitoring Large-scale Cloud Infrastructures with OpenNebula

Simon Boulet
OpenNebula Consultant
Co-founder of the Cloudnorth.com Project
simon@nostalgeek.com

Goals
1. Show how to configure OpenNebula to
achieve a sub-1-minute monitoring interval
2. Demonstrate the use of OpenNebula in
large-scale cloud infrastructures
3. Suggest enhancements to OpenNebula
performance and monitoring

How Big Exactly is Large-scale?
How many hosts?
1,000? 2,000? 10,000 VMs?

Monitoring in OpenNebula
● Detects when a VM or host changes status
(Running, Stopped, etc.)
● Built-in metrics: CPU, memory and network
usage
● You can add as many metrics as you like by
customizing the driver
● Can be used to perform various tasks (auto
scaling, high-availability redeployment, etc.)

Don't Expect the Default
Configuration to Perform Optimally
● Database: Use MySQL database backend,
not the default SQLite
● Logs: Use the syslog logging system, and disable
debug logging (debug_level=1)
● Number of threads: Adjust the number of
driver threads (see the -t option in your *MAD
config options)

Use OpenNebula >= 4.0
Prior versions did monitoring in two phases:
1. The IM Monitor action monitored Hosts
2. The VMM Poll action monitored VMs
(100 Hosts + 1,000 VMs) = 1,100 actions every 15 seconds
= 4,400 actions per minute
Since OpenNebula 4.0, the IM Monitor action is
capable of returning the information of VMs
running on the monitored host
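
The arithmetic above can be checked with a short Python sketch. The function name and parameters are illustrative only, and the 4.0+ figure assumes that one IM Monitor action per host per interval is enough once it also reports the VMs, which is an inference from this slide rather than a measured value.

```python
def actions_per_minute(hosts, vms, interval_s, combined=False):
    """Estimate monitoring actions issued per minute.

    combined=False models the pre-4.0 two-phase scheme: one IM Monitor
    action per host plus one VMM Poll action per VM in every interval.
    combined=True assumes a single IM Monitor action per host also
    covers that host's VMs (assumption about 4.0+ behavior).
    """
    per_interval = hosts if combined else hosts + vms
    return per_interval * 60 / interval_s


print(actions_per_minute(100, 1000, 15))                 # pre-4.0
print(actions_per_minute(100, 1000, 15, combined=True))  # 4.0+ (assumed)
```

Under these assumptions the same 100 hosts and 1,000 VMs need roughly a tenth of the actions once VM information rides along with the host probe.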

Monitoring History
By default OpenNebula keeps 24h of
monitoring history
15-second interval × 24h = 5,760 records per VM
Average record size: 4KB
≈ 23MB of monitoring history per VM
100 VMs = 2.3GB
10,000 VMs = 230GB
Retention is controlled by the HOST_MONITORING_EXPIRATION_TIME and
VM_MONITORING_EXPIRATION_TIME config options
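
A minimal Python sketch of the storage estimate on this slide. The function name is illustrative, 1KB is taken as 1,000 bytes, and the 4KB record size is the average quoted above, not a value read from a live database.

```python
def history_size_bytes(interval_s, retention_s, record_bytes, vms=1):
    """Bytes of monitoring history kept for `vms` VMs.

    One record is stored per VM per monitoring interval and kept for
    `retention_s` seconds.
    """
    records_per_vm = retention_s // interval_s
    return records_per_vm * record_bytes * vms


KB, MB, GB = 1000, 1000**2, 1000**3
DAY = 24 * 3600

print(DAY // 15, "records per VM")
print(history_size_bytes(15, DAY, 4 * KB) / MB, "MB per VM")
print(history_size_bytes(15, DAY, 4 * KB, vms=100) / GB, "GB for 100 VMs")
print(history_size_bytes(15, DAY, 4 * KB, vms=10_000) / GB, "GB for 10,000 VMs")
```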

Monitoring History (continued)
● Reduce history to 30 minutes (1800
seconds)
● Use MySQL MEMORY storage engine for
vm_monitoring and host_monitoring tables
It's OK to lose monitoring history when MySQL
is restarted
Most recent monitoring values are stored in VM
template
Set MySQL max_heap_table_size large enough to hold all your monitoring
history
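
As a rough guide for choosing max_heap_table_size, the sketch below estimates a lower bound for the VM monitoring table. The function and its defaults are hypothetical. It reuses the 15-second interval, the 1800-second history and the 4KB average record from these slides, and ignores host_monitoring. Because the MEMORY engine stores rows at a fixed length, the real figure depends on the declared column widths and may be noticeably higher.

```python
def min_heap_table_bytes(vms, interval_s=15, retention_s=1800, row_bytes=4000):
    """Lower-bound size of an in-memory vm_monitoring table, in bytes.

    row_bytes should be the declared (fixed) row width of the table;
    4000 is only the average record size quoted on the slides.
    """
    rows = vms * (retention_s // interval_s)
    return rows * row_bytes


print(min_heap_table_bytes(4000) / 1000**3, "GB for 4,000 VMs")
```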

Watch your Load Average
As of 4.2, the maximum number of
simultaneous XML-RPC API connections is
limited to 15
Overloaded OpenNebula = Slow XML-RPC API response =
API Limit / Timeout
● Reduce load at deployment time by
adjusting number of VMs simultaneously
deployed by scheduler
● Watch next release (4.4) for
XML-RPC API concurrency
enhancements
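
One way for API clients to stay under the connection limit is to cap their own concurrency. The sketch below is a generic Python pattern, not an OpenNebula feature: run_capped and fake_request are hypothetical names, fake_request stands in for whatever issues one XML-RPC call, and the cap of 10 is an arbitrary choice that leaves headroom below 15.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

API_LIMIT = 15  # simultaneous XML-RPC connections quoted on this slide


def run_capped(call, items, max_in_flight=10):
    """Run call(item) for every item with at most max_in_flight at once."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(call, items))


# Demo: a stand-in request that records the peak number of concurrent calls.
_lock = threading.Lock()
_in_flight = 0
peak = 0


def fake_request(vm_id):
    global _in_flight, peak
    with _lock:
        _in_flight += 1
        peak = max(peak, _in_flight)
    time.sleep(0.01)  # pretend the API call takes a moment
    with _lock:
        _in_flight -= 1
    return vm_id


results = run_capped(fake_request, range(50))
print(len(results), "calls, peak concurrency", peak)
```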

Local Caching Nameserver
OpenNebula uses DNS names to monitor
hosts (unless you named your hosts using their
IP addresses instead of names)
● Use a local caching nameserver to speed up
DNS lookup (such as dnsmasq).
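
To see whether name resolution is worth caching, lookups can be timed from the OpenNebula front-end. This is a minimal Python sketch. time_lookups is a hypothetical helper, and the resolver argument exists only so the function can be exercised without a network.

```python
import socket
import time


def time_lookups(hostnames, resolver=socket.getaddrinfo, repeats=3):
    """Return {hostname: [seconds per lookup, or None if it failed]}."""
    timings = {}
    for name in hostnames:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            try:
                resolver(name, None)
            except OSError:  # socket.gaierror is a subclass of OSError
                samples.append(None)
                continue
            samples.append(time.perf_counter() - start)
        timings[name] = samples
    return timings
```

Calling time_lookups(["host01.example.com"]) (a placeholder name) before and after enabling a caching nameserver gives a rough before/after comparison. With a cache in place, repeated lookups should be consistently fast.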

Beware of SSH Transport
Most OpenNebula drivers (KVM, Xen, etc.) use
SSH connections to perform actions
OK for deploying a new VM, but expensive when
doing VM monitoring

Meet Ganglia
<< Ganglia is a scalable distributed system monitor tool for high-performance
computing systems such as clusters and grids. >>
- Wikipedia
OpenNebula has built-in support for Ganglia
By default, Ganglia and OpenNebula must run
on the same machine
To use a Ganglia collector on another machine, set GANGLIA_HOST in
/var/lib/one/remotes/im/ganglia.d/ganglia_probe and
/var/lib/one/remotes/vmm/kvm/poll_ganglia

Ganglia Driver Limitations
1. Currently only 1 Ganglia Collector is
supported
2. Need to run a script on each host to export the
OpenNebula-specific metric
(OPENNEBULA_VMS_INFORMATION)
3. Ganglia has a maximum length of 1392 bytes
for string metrics
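
The string-length limit in point 3 can be checked before a value is published. The Python sketch below is illustrative only: both helper names are hypothetical, and splitting a value into several chunks is a possible workaround under the assumption of an ASCII payload, not something the OpenNebula Ganglia driver is known to do.

```python
GANGLIA_MAX_STRING_BYTES = 1392  # limit quoted on this slide


def fits_in_ganglia_string(value):
    """True if value fits in a single Ganglia string metric."""
    return len(value.encode("utf-8")) <= GANGLIA_MAX_STRING_BYTES


def split_for_ganglia(value, limit=GANGLIA_MAX_STRING_BYTES):
    """Split an ASCII string into chunks of at most `limit` bytes.

    Hypothetical workaround: each chunk would need its own metric name
    and the consumer would have to reassemble them in order.
    """
    data = value.encode("ascii")
    return [data[i:i + limit].decode("ascii") for i in range(0, len(data), limit)]


payload = "x" * 3000  # stand-in for a long VM information string
print(fits_in_ganglia_string(payload))
print([len(chunk) for chunk in split_for_ganglia(payload)])
```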

Host sFlow
<< The Host sFlow agent exports physical and virtual server performance
metrics using the sFlow protocol. The agent provides scalable, multi-vendor,
multi-OS performance monitoring with minimal impact on the systems being
monitored.>>
- http://host-sflow.sourceforge.net/
Exports a standard set of hypervisor and VM
metrics
Official support for Xen, KVM and Hyper-V, but
uses Libvirt to gather metrics (and Libvirt
supports LXC, OpenVZ, VMware, etc.)

Host sFlow (continued)
Source: http://blog.sflow.com/2012/02/ganglia-33-released.html

Host sFlow (continued)
Sample Metrics
Hosts Metrics
vnode_mem_total: Hypervisor Total Memory
vnode_domains: Hypervisor VM Count
VMs Metrics
<VM ID>.vcpu_state: VM State (Running, Stopped, etc.)
<VM ID>.vmem_util: VM Memory Utilization
<VM ID>.vdisk_free: VM Free Disk Space
Not currently supported in OpenNebula. Contact me if you're interested.

4,000 VMs at Sub-1 Minute Interval
OpenNebula 4.2 + xml-rpc patch (upcoming in 4.4)
Experimental Host sFlow Driver
1 OpenNebula Core (EC2 High-CPU XLarge instance)
1 Sunstone Web Server (EC2 Standard Medium instance)
1 Ganglia Collector (EC2 Standard Medium instance)
100 Hosts (EC2 High-CPU Medium instances)
~40 VMs per Host
~4,000 VMs (OpenVZ)
15 - 60 second monitoring interval

Looking Forward
There’s room for optimizations
● The command line tools can get very slow when
returning very large result sets (but not the API…)
● Distributed driver, for example using ZeroMQ for
distributing tasks to multiple workers
● Investigate PoolSQL locks being held for long periods
and blocking other threads (discussed in bug #1818)
● Gather metrics about OpenNebula internals: lock wait times,
effective monitoring interval, memory footprint, etc.
● Investigate very large Sunstone memory usage

Thank you!
Questions?
“OpenNebula captured my interest for several technical
reasons besides the fact that it is truly open. It's architecture
is very elegant; it has C++ bones, ruby muscles and bash
tendons. It's extensible and understandable. It has no peer
as far as I can tell.”
Christopher Barry, Infrastructure Engineer, RJMetrics,
September 2012
http://opennebula.org/users:testimonials
