Monalytics - Online Monitoring and Analytics for Large Scale Data Centers
1. Monalytics - Online Monitoring and Analytics for Managing Large Scale Data Centers
Mahendra Kutare*, Greg Eisenhauer*, Chengwei Wang*, Karsten Schwan*, Vanish Talwar#, and Matthew Wolf*
(*Georgia Tech, #HP Labs)
2. Data Center Management: State of the Art
• Rich platform-level monitoring, including hardware counters.
• Monitoring and management systems:
– Dedicated firmware and infrastructure for monitoring at the rack level (e.g., HP iLO, IBM Director).
– Middleware-level tools and support at the center level (e.g., HP OpenView, IBM Tivoli):
• statically configured, with standards-based (XML) logging, and
• centralized analysis and management.
4. Key Idea
Monalytics – for online management 'at scale':
Combine monitoring with analysis for scalability and fast response.
Lightweight, dynamic, and distributed.
Enable 'local' control loops for fast action on analyzed monitoring data.
5. Issues and Goals
Scale to future datacenter systems:
– 'in space': e.g., large numbers of entities, even per node, due to consolidation – implies large monitoring data volumes;
– 'in time': e.g., fault localization made difficult by cascading effects of failures at scale – requires short response times.
Dynamics in utility clouds:
– e.g., changed endpoints due to VM migration require re-deployment of capture, aggregation, and analysis components;
– e.g., changing needs demand capture of more detailed or alternative metrics and analyses.
6. Monalytics - Design
Monitoring:
− flexible and dynamic monalytics topologies;
− 'at source' lightweight data manipulation (e.g., filtering);
− distributed, concurrent, and supporting multiple administrative domains (zones).
Analysis combined with monitoring:
− 'at source', 'during aggregation', and global;
− dynamic: whenever and wherever needed.
Overhead of control loops proportional to the processes being controlled.
10. Technical Issues
Elasticity: dynamic topologies/methods:
– adapting to different time and length scales, e.g., by dynamically using alternate metrics and analyses, or by 'zooming in' on select detail;
– dealing with VM migration and arrivals/departures.
Overhead-proportional monalytics:
– performing analysis and actions locally, dynamically, and selectively;
– limiting overhead by providing only summary and aggregate data to higher-level brokers and zone leaders.
Flexible, lightweight building blocks for monalytics topologies.
12. Illustrations - I: Dynamic Local Control Loop
Recreate an Apache bug that segfaults and finally stops all interactions between RUBiS components:
– behavior detection via runtime monitoring;
– behavior diagnosis triggers corrective action by instantiating a control loop to stop/restart the VM (a control-loop sketch follows the table below).
                         Total Requests   Unsuccessful Requests
Without Control Action   52535            13976
With Control Action      52535            5763
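The following is a minimal sketch of such a dynamic local control loop in Python. The failure-ratio threshold, the `sample_unsuccessful_ratio` callback, and the use of `virsh` for the stop/restart action are illustrative assumptions, not details taken from the Monalytics implementation.

```python
import subprocess
import time

FAILURE_RATIO_THRESHOLD = 0.25   # assumed diagnosis threshold, not from the slides
CHECK_INTERVAL_SECONDS = 5

def restart_vm(vm_name):
    # Hypothetical corrective action via the hypervisor's virsh CLI.
    subprocess.run(["virsh", "destroy", vm_name], check=False)
    subprocess.run(["virsh", "start", vm_name], check=True)

def control_loop(vm_name, sample_unsuccessful_ratio, max_iterations=None):
    """Detect (monitor), diagnose (threshold), act (restart) - all locally."""
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        ratio = sample_unsuccessful_ratio()   # behavior detection at the source
        if ratio > FAILURE_RATIO_THRESHOLD:   # behavior diagnosis
            restart_vm(vm_name)               # local corrective action
        time.sleep(CHECK_INTERVAL_SECONDS)
        iteration += 1
```

Because detection, diagnosis, and action all run locally, the loop can react quickly without routing monitoring data through a central manager.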
13. Illustration - II: Scalability through Local Analysis
Trace the HTTP requests processed by the Apache web server:
– behavior detection by monitoring request processing time for abnormalities;
– behavior diagnosis using a predefined processing-time threshold of 200 ms (a filtering sketch follows the table below).
                             Total Request Trace Data Generated (10 min)
Without Filtering Operator   1.46 MB
With Filtering Operator      60.45 KB
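A minimal sketch of such an 'at source' filtering operator, assuming trace records are dictionaries with a `processing_time_ms` field; the record layout is an illustrative assumption, not the actual Monalytics event format.

```python
THRESHOLD_MS = 200   # predefined processing-time threshold from the illustration

def filter_abnormal(records):
    """Forward only abnormal requests, shrinking the trace volume at the source."""
    return [r for r in records if r["processing_time_ms"] > THRESHOLD_MS]

# Usage: only the slow request crosses the filter and leaves the node.
trace = [{"url": "/browse", "processing_time_ms": 12},
         {"url": "/bid", "processing_time_ms": 431}]
print(filter_abnormal(trace))   # [{'url': '/bid', 'processing_time_ms': 431}]
```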
14. Illustration - III: Zoom-In Analysis
Failure-proportional scalability:
– behavior detection by monitoring CPU utilization across the application's multiple VMs;
– behavior diagnosis using entropy-based statistical techniques (an entropy sketch follows the table below).
                                           Total Data Transferred (3 hours)
With Centralized Decision                  394.06 KB
With Local Decision and Zoom-In Analysis   123.32 KB
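A minimal sketch of entropy-based behavior diagnosis over per-VM CPU utilization, assuming samples are binned into a histogram; the bin width and sample values are illustrative assumptions rather than the paper's exact parameters.

```python
import math
from collections import Counter

def utilization_entropy(samples, bin_width=10):
    """Shannon entropy of the binned CPU-utilization distribution."""
    bins = Counter(int(s // bin_width) for s in samples)
    total = sum(bins.values())
    return sum(-(c / total) * math.log2(c / total) for c in bins.values())

# A sharp change in entropy across the application's VMs can flag anomalous
# behavior and trigger 'zoom-in' analysis on the affected nodes only.
normal = [23, 57, 81, 45, 66, 34, 72, 19]   # utilization spread across bins
faulty = [97, 98, 99, 97, 96, 99, 98, 97]   # utilization collapsed into one bin
print(utilization_entropy(normal), utilization_entropy(faulty))
```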
15. Lightweight Building Blocks: Event Tracing Overheads (DomU)
• Overheads grow as trace event size increases from 50 to 150 Kb.
• Overheads remain low for reasonably high request rates.
16. Lightweight Building Blocks: Trace Event Logging Overheads
• Logging overheads are similar to those of Apache's native logging.
• Monalytics achieves better results when logging numerical rather than string data.
17. Summary
Lightweight monalytics enables flexible management.
The illustrations demonstrate the importance of dynamic capabilities for attaining overhead proportionality and operating at larger scales.
Statistical methods for behavior detection are key to effective monitoring.
18. Related Publications
• Online Detection of Utility Cloud Anomalies Using Metric Distributions – NOMS 2010
– Chengwei Wang, Vanish Talwar, Karsten Schwan, Partha Ranganathan
• Look Who's Talking: Discovering Dependencies between Virtual Machines Using CPU Utilization – HotCloud 2010
– Renuka Apte, Liting Hu, Karsten Schwan, Arpan Ghosh
19. Monalytics – Cloud Visibility
• Issues – service provider/user separation:
– Multiple administrative domains between application and infrastructure operators reduce system visibility.
– Cooperative vs. non-cooperative components.
• Problem – how to deduce/infer from limited information? Answer questions such as:
• What are the source-destination communication pairs for each VM?
• Which VM interacts most heavily in the infrastructure?
20. Current Research
• Communication – current techniques work only under simplistic communication patterns and hence are not generic enough.
– More complex communication patterns arise with load balancing and with multiple instances of the same application residing on the infrastructure.
• Techniques – time-series analysis techniques based on a stationary model are unrealistic and inaccurate.
21. Problem and Intended Contribution
• Problem – find all VMs (sources and destinations) communicating with a given VM.
– A VM can communicate with multiple VMs due to application design or load balancing.
• Contribution:
– A model for communication between VMs using network-level metrics.
– A demonstration of the model's validity for realistic cloud deployments.
22. Approach
Collect network traffic information.
Build a profile of normal network traffic patterns.
Monitor VM operation against the traffic profile (a collection sketch follows below).
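A minimal sketch of the collection step, reading per-interface byte counters from Linux's /proc/net/dev; mapping interfaces to VMs (e.g., per-VM vif or tap devices) is deployment-specific and assumed here.

```python
def read_net_counters(path="/proc/net/dev"):
    """Return cumulative rx/tx byte counts per network interface."""
    counters = {}
    with open(path) as f:
        for line in f.readlines()[2:]:        # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            counters[iface.strip()] = {
                "rx_bytes": int(fields[0]),   # incoming traffic
                "tx_bytes": int(fields[8]),   # outgoing traffic
            }
    return counters

# Sampling this periodically and differencing consecutive readings yields the
# (incoming, outgoing) rate pairs used to build the normal-traffic profile.
```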
23. Issues
• Differentiating traffic from multiple sources and destinations.
• Traffic data becomes high-dimensional.
25. Key Idea
• Use a Gaussian distribution to model the relationship between incoming and outgoing traffic.
• Issue – directly applying the Gaussian model is expensive and inaccurate:
– Perform dimensionality reduction using PCA.
• Issue – in real operation, the relationship can be more complex due to different request types:
– Build a mixture of Gaussians.
– Apply the Gaussian mixture model to the PCA features rather than to the original monitored data (a sketch follows below).
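A minimal sketch of this idea using scikit-learn: reduce the high-dimensional traffic vectors with PCA, then fit a Gaussian mixture on the reduced features. The dimensions, component count, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
traffic = rng.normal(size=(500, 16))   # stand-in for high-dimensional traffic metrics

features = PCA(n_components=2).fit_transform(traffic)                # dimensionality reduction
gmm = GaussianMixture(n_components=3, random_state=0).fit(features)  # mixture on features
print(gmm.predict(features[:5]))       # mixture component assigned to each point
```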
27. Model Building
• Model:
– Use the EM algorithm to fit a Gaussian mixture model.
– For a given network-metrics data point, EM identifies which Gaussian distribution in the mixture the point belongs to.
• Base case:
– Each Gaussian distribution represents a VM, and the data points associated with that Gaussian represent all the source-destination network-metric points associated with that VM.
– Through this, we can find all the source and destination VMs interacting with a given VM (a fitting sketch follows below).
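A minimal sketch of the base case: each mixture component stands for one peer VM, so the labels that EM assigns partition the metric points by peer. The three synthetic clusters are an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three peers with distinct (incoming, outgoing) traffic signatures.
points = np.vstack([rng.normal(loc, 0.3, size=(100, 2))
                    for loc in ([1, 1], [4, 2], [2, 5])])

gmm = GaussianMixture(n_components=3, random_state=1).fit(points)
labels = gmm.predict(points)   # EM decides which Gaussian owns each data point
for k in range(gmm.n_components):
    print(f"peer component {k}: {np.sum(labels == k)} points, "
          f"mean traffic {gmm.means_[k].round(2)}")
```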
28. Runtime Monitoring
• Check the system's current status by comparing it against the learned model:
– If a new VM starts interacting with a given VM, the change shows up in the source-destination network-metric points for the Gaussian representing that VM.
– Similarly, we can detect when interaction with a source or destination VM stops.
– An open question is whether we can rank all the source and destination VMs interacting with a VM; such a ranking would show which VM has communicated the most (a scoring sketch follows below).
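A minimal sketch of the runtime check: score incoming metric points under the learned mixture and flag low-likelihood points, while per-component point counts give a rough interaction ranking. The likelihood threshold is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def monitor(gmm, new_points, log_likelihood_threshold=-10.0):
    """Compare current traffic against the learned mixture model."""
    log_density = gmm.score_samples(new_points)   # per-point log-likelihood
    anomalies = new_points[log_density < log_likelihood_threshold]
    counts = np.bincount(gmm.predict(new_points),
                         minlength=gmm.n_components)
    ranking = np.argsort(counts)[::-1]   # peers ranked by interaction volume
    return anomalies, ranking
```

Points far from every learned component suggest a new source-destination pair; a component that stops receiving points suggests a peer has gone silent.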
29. Experimental Evaluation
• Steps:
– Set up a testbed with multiple instances of the application and load-balancer components.
– Collect the network-in and network-out metrics.
– Plot the scatter plots.
– Derive the mixture model from the collected data (a plotting/fitting sketch follows below).
• Evaluation scenarios:
– Zero traffic.
– Significantly delayed traffic.
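A minimal sketch of the plotting and fitting steps with matplotlib and scikit-learn; the synthetic two-instance, load-balanced traffic is an assumption for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Stand-in for two load-balanced application instances with different rates.
rx = np.concatenate([rng.normal(2, 0.3, 150), rng.normal(5, 0.4, 150)])
tx = 0.5 * rx + rng.normal(0, 0.1, 300)
points = np.column_stack([rx, tx])

gmm = GaussianMixture(n_components=2, random_state=2).fit(points)

plt.scatter(rx, tx, c=gmm.predict(points), s=8)   # color by mixture component
plt.xlabel("network-in rate")
plt.ylabel("network-out rate")
plt.title("Per-VM traffic scatter with fitted mixture components (synthetic)")
plt.show()
```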