Monalytics - Online Monitoring and Analytics for Large Scale Data Centers
1. Monalytics - Online Monitoring and Analytics for Managing Large Scale Data Centers
Mahendra Kutare*, Greg Eisenhauer*, Chengwei Wang*, Karsten Schwan*, Vanish Talwar#, and Matthew Wolf*
(*Georgia Tech, #HP Labs)
2. Data Center Management: State of the Art
• Rich platform-level monitoring, including hardware counters.
• Monitoring and management systems:
– Dedicated firmware and infrastructure for monitoring at the rack level (e.g., HP iLO, IBM Director).
– Middleware-level tools and support at the center level (e.g., HP OpenView, IBM Tivoli):
• statically configured, with standards-based (XML) logging, and
• centralized analysis and management.
4. Key Idea
Monalytics – for online management 'at scale':
Combine monitoring with analysis for scalability and fast response.
Lightweight, dynamic, and distributed.
Enable 'local' control loops for fast action on analyzed monitoring data.
5. Issues and Goals
Scale to future datacenter systems:
– 'in space': e.g., large numbers of entities, even per node, due to consolidation – implies large monitoring data volumes;
– 'in time': e.g., fault localization made difficult by cascading effects of failures at scale – requires short response times.
Dynamics in utility clouds:
– e.g., changed endpoints due to VM migration require re-deployment of capture, aggregation, and analysis components;
– e.g., changing needs demand capture of more detailed or alternative metrics and analyses.
6. Monalytics - Design
Monitoring:
− flexible and dynamic monalytics topologies;
− 'at source' lightweight data manipulation (e.g., filtering);
− distributed, concurrent, and supporting multiple administrative domains (zones).
Analysis combined with monitoring:
− 'at source', 'during aggregation', and global;
− dynamic: whenever and wherever needed.
Overhead of control loops proportional to the processes being controlled.
10. Technical Issues
Elasticity: dynamic topologies/methods:
– adapting to different time and length scales, e.g., by dynamically using alternate metrics and analyses, or by 'zooming in' on select detail;
– dealing with VM migration and arrivals/departures.
Overhead-proportional monalytics:
– performing analysis and actions locally, dynamically, and selectively;
– limiting overhead by providing only summary and aggregate data to higher-level brokers and zone leaders.
Flexible, lightweight building blocks for monalytics topologies.
12. Illustrations - I: Dynamic Local Control Loop
Recreate an Apache bug that segfaults and finally stops all interactions between RUBiS components:
– behavior detection via runtime monitoring;
– behavior diagnosis triggers corrective action by instantiating a control loop to stop/restart the VM (a control-loop sketch follows the table below).
                         Total Requests   Unsuccessful Requests
Without Control Action   52535            13976
With Control Action      52535            5763
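The following is a minimal sketch of such a dynamic local control loop in Python. The failure-ratio threshold, the `sample_unsuccessful_ratio` callback, and the use of `virsh` for the stop/restart action are illustrative assumptions, not details taken from the Monalytics implementation.

```python
import subprocess
import time

FAILURE_RATIO_THRESHOLD = 0.25   # assumed diagnosis threshold, not from the slides
CHECK_INTERVAL_SECONDS = 5

def restart_vm(vm_name):
    # Hypothetical corrective action via the hypervisor's virsh CLI.
    subprocess.run(["virsh", "destroy", vm_name], check=False)
    subprocess.run(["virsh", "start", vm_name], check=True)

def control_loop(vm_name, sample_unsuccessful_ratio, max_iterations=None):
    """Detect (monitor), diagnose (threshold), act (restart) - all locally."""
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        ratio = sample_unsuccessful_ratio()   # behavior detection at the source
        if ratio > FAILURE_RATIO_THRESHOLD:   # behavior diagnosis
            restart_vm(vm_name)               # local corrective action
        time.sleep(CHECK_INTERVAL_SECONDS)
        iteration += 1
```

Because detection, diagnosis, and action all run locally, the loop can react quickly without routing monitoring data through a central manager.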
13. Illustration - II: Scalability through Local Analysis
Trace the HTTP requests processed by the Apache web server:
– behavior detection by monitoring request processing time for abnormalities;
– behavior diagnosis using a predefined processing-time threshold of 200 ms (a filtering sketch follows the table below).
                             Total Request Trace Data Generated (10 min)
Without Filtering Operator   1.46 MB
With Filtering Operator      60.45 KB
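A minimal sketch of such an 'at source' filtering operator, assuming trace records are dictionaries with a `processing_time_ms` field; the record layout is an illustrative assumption, not the actual Monalytics event format.

```python
THRESHOLD_MS = 200   # predefined processing-time threshold from the illustration

def filter_abnormal(records):
    """Forward only abnormal requests, shrinking the trace volume at the source."""
    return [r for r in records if r["processing_time_ms"] > THRESHOLD_MS]

# Usage: only the slow request crosses the filter and leaves the node.
trace = [{"url": "/browse", "processing_time_ms": 12},
         {"url": "/bid", "processing_time_ms": 431}]
print(filter_abnormal(trace))   # [{'url': '/bid', 'processing_time_ms': 431}]
```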
14. Illustration - III: Zoom-In Analysis
Failure-proportional scalability:
– behavior detection by monitoring CPU utilization across the application's multiple VMs;
– behavior diagnosis using entropy-based statistical techniques (an entropy sketch follows the table below).
                                           Total Data Transferred (3 hours)
With Centralized Decision                  394.06 KB
With Local Decision and Zoom-In Analysis   123.32 KB
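A minimal sketch of entropy-based behavior diagnosis over per-VM CPU utilization, assuming samples are binned into a histogram; the bin width and sample values are illustrative assumptions rather than the paper's exact parameters.

```python
import math
from collections import Counter

def utilization_entropy(samples, bin_width=10):
    """Shannon entropy of the binned CPU-utilization distribution."""
    bins = Counter(int(s // bin_width) for s in samples)
    total = sum(bins.values())
    return sum(-(c / total) * math.log2(c / total) for c in bins.values())

# A sharp change in entropy across the application's VMs can flag anomalous
# behavior and trigger 'zoom-in' analysis on the affected nodes only.
normal = [23, 57, 81, 45, 66, 34, 72, 19]   # utilization spread across bins
faulty = [97, 98, 99, 97, 96, 99, 98, 97]   # utilization collapsed into one bin
print(utilization_entropy(normal), utilization_entropy(faulty))
```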
15. Lightweight Building Blocks: Event Tracing Overheads (DomU)
• Overheads grow as trace event size increases from 50 to 150 Kb.
• Overheads remain low for reasonably high request rates.
16. Lightweight Building Blocks: Trace Event Logging Overheads
• Logging overheads are similar to those of Apache's native logging.
• Monalytics achieves better results when logging numerical rather than string data.
17. Summary
Lightweight monalytics enables flexible management.
The illustrations demonstrate the importance of dynamic capabilities for attaining overhead proportionality and operating at larger scales.
Statistical methods for behavior detection are key to effective monitoring.
18. Related Publications
• Online Detection of Utility Cloud Anomalies Using Metric Distributions – NOMS 2010
– Chengwei Wang, Vanish Talwar, Karsten Schwan, Partha Ranganathan
• Look Who's Talking: Discovering Dependencies between Virtual Machines Using CPU Utilization – HotCloud 2010
– Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh
19. Monalytics – Cloud Visibility
• Issues – service provider/user separation:
– Multiple administrative domains between application and infrastructure operators reduce system visibility.
– Cooperative vs. non-cooperative components.
• Problem – how to deduce/infer from limited information? Answer questions such as:
• What are the source-destination communication pairs for each VM?
• Which VM interacts most heavily in the infrastructure?
20. Current Research
• Communication – current techniques work only under simplistic communication patterns and hence are not generic enough.
– More complex communication patterns arise with load balancing and with multiple instances of the same application residing on the infrastructure.
• Techniques – time-series analysis techniques based on a stationary model are unrealistic and inaccurate.
21. Problem and Intended Contribution
• Problem – find all VMs (sources and destinations) communicating with a given VM.
– A VM can communicate with multiple VMs due to application design or load balancing.
• Contribution:
– A model for communication between VMs using network-level metrics.
– A demonstration of the model's validity for realistic cloud deployments.
22. Approach
Collect network traffic information.
Build a profile of normal network traffic patterns.
Monitor VM operation against the traffic profile (a collection sketch follows below).
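A minimal sketch of the collection step, reading per-interface byte counters from Linux's /proc/net/dev; mapping interfaces to VMs (e.g., per-VM vif or tap devices) is deployment-specific and assumed here.

```python
def read_net_counters(path="/proc/net/dev"):
    """Return cumulative rx/tx byte counts per network interface."""
    counters = {}
    with open(path) as f:
        for line in f.readlines()[2:]:        # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            counters[iface.strip()] = {
                "rx_bytes": int(fields[0]),   # incoming traffic
                "tx_bytes": int(fields[8]),   # outgoing traffic
            }
    return counters

# Sampling this periodically and differencing consecutive readings yields the
# (incoming, outgoing) rate pairs used to build the normal-traffic profile.
```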
23. Issues
• Differentiating traffic from multiple sources and destinations.
• Traffic data becomes high-dimensional.
25. Key Idea
• Use a Gaussian distribution to model the relationship between incoming and outgoing traffic.
• Issue – directly applying the Gaussian model is expensive and inaccurate:
– Perform dimensionality reduction using PCA.
• Issue – in real operation, the relationship can be more complex due to different request types:
– Build a mixture of Gaussians.
– Apply the Gaussian mixture model to the PCA features rather than to the original monitored data (a sketch follows below).
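A minimal sketch of this idea using scikit-learn: reduce the high-dimensional traffic vectors with PCA, then fit a Gaussian mixture on the reduced features. The dimensions, component count, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
traffic = rng.normal(size=(500, 16))   # stand-in for high-dimensional traffic metrics

features = PCA(n_components=2).fit_transform(traffic)                # dimensionality reduction
gmm = GaussianMixture(n_components=3, random_state=0).fit(features)  # mixture on features
print(gmm.predict(features[:5]))       # mixture component assigned to each point
```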
27. Model Building
• Model:
– Use the EM algorithm to fit a Gaussian mixture model.
– For a given network-metrics data point, EM identifies which Gaussian distribution in the mixture the point belongs to.
• Base case:
– Each Gaussian distribution represents a VM, and the data points associated with that Gaussian represent all the source-destination network-metric points associated with that VM.
– Through this, we can find all the source and destination VMs interacting with a given VM (a fitting sketch follows below).
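A minimal sketch of the base case: each mixture component stands for one peer VM, so the labels that EM assigns partition the metric points by peer. The three synthetic clusters are an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three peers with distinct (incoming, outgoing) traffic signatures.
points = np.vstack([rng.normal(loc, 0.3, size=(100, 2))
                    for loc in ([1, 1], [4, 2], [2, 5])])

gmm = GaussianMixture(n_components=3, random_state=1).fit(points)
labels = gmm.predict(points)   # EM decides which Gaussian owns each data point
for k in range(gmm.n_components):
    print(f"peer component {k}: {np.sum(labels == k)} points, "
          f"mean traffic {gmm.means_[k].round(2)}")
```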
28. Runtime Monitoring
• Check the system's current status by comparing it against the learned model:
– If a new VM starts interacting with a given VM, the change shows up in the source-destination network-metric points for the Gaussian representing that VM.
– Similarly, we can detect when interaction with a source or destination VM stops.
– An open question is whether we can rank all the source and destination VMs interacting with a VM; such a ranking would show which VM has communicated the most (a scoring sketch follows below).
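A minimal sketch of the runtime check: score incoming metric points under the learned mixture and flag low-likelihood points, while per-component point counts give a rough interaction ranking. The likelihood threshold is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def monitor(gmm, new_points, log_likelihood_threshold=-10.0):
    """Compare current traffic against the learned mixture model."""
    log_density = gmm.score_samples(new_points)   # per-point log-likelihood
    anomalies = new_points[log_density < log_likelihood_threshold]
    counts = np.bincount(gmm.predict(new_points),
                         minlength=gmm.n_components)
    ranking = np.argsort(counts)[::-1]   # peers ranked by interaction volume
    return anomalies, ranking
```

Points far from every learned component suggest a new source-destination pair; a component that stops receiving points suggests a peer has gone silent.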
29. Experimental Evaluation
• Steps:
– Set up a testbed with multiple instances of the application and load-balancer components.
– Collect the network-in and network-out metrics.
– Plot the scatter plots.
– Derive the mixture model from the collected data (a plotting/fitting sketch follows below).
• Evaluation scenarios:
– Zero traffic.
– Significantly delayed traffic.
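A minimal sketch of the plotting and fitting steps with matplotlib and scikit-learn; the synthetic two-instance, load-balanced traffic is an assumption for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Stand-in for two load-balanced application instances with different rates.
rx = np.concatenate([rng.normal(2, 0.3, 150), rng.normal(5, 0.4, 150)])
tx = 0.5 * rx + rng.normal(0, 0.1, 300)
points = np.column_stack([rx, tx])

gmm = GaussianMixture(n_components=2, random_state=2).fit(points)

plt.scatter(rx, tx, c=gmm.predict(points), s=8)   # color by mixture component
plt.xlabel("network-in rate")
plt.ylabel("network-out rate")
plt.title("Per-VM traffic scatter with fitted mixture components (synthetic)")
plt.show()
```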