2. 12
Management Challenges
Performance problems often occur with no real warning
– Many times end users are the first to notice problems
– Root cause determination is difficult and time-consuming
– Solving problems requires all-hands-on-deck bridge calls
Real-time understanding of performance is lacking
– No reliable understanding of the health of IT infrastructure makes IT too reactive
– Siloed monitoring tools do not allow a common “truth”
– No correlation across IT silos
Optimizing IT infrastructure is difficult if not impossible
– Understanding the abnormal metric behaviors that lead to degradation of Key
Performance Indicators is not possible with current tools
– Understanding the abnormal behaviors that define your worst performing devices is
not possible with current tools
– Heavy reliance on “Tribal Knowledge” of a few application experts
3. 13
What If You Could…
Automate
• Eliminate time-consuming problem resolution
processes
Correlate and Accelerate
• “One Click” to root cause of emerging
performance problems to reduce MTTI/MTTR
Get Proactive
• Avert end user and business impact of
building performance problems
Collaborate
• Aggregate and correlate data from monitoring
landscape to create a single “truth”
Optimize
• Tune components to deliver optimal
performance for application transactions
5. 15
1st Generation - Event-Centric, Hard-Threshold Based
3/4/08 16:45 Host 1 processingTimeServ The Processing Time Service Level on process… n/a n/a n/a
3/4/08 16:45 Host 1 Processor_Table 0 Processor 0 is at 87.0%. A CPU Bottleneck is….. n/a 0 Windows_System
3/4/08 16:44 Host 2 System_Table The number of hardware interrupts per second… n/a 0 Windows_System
3/4/08 16:30 Host 2 Processor_Table 1 Processor 1 is at 84.0%. A CPU Bottleneck is …. n/a 0 Windows_System
3/4/08 16:25 n/a responseTimeServ… The Response Time Service Level on Toadwor.. n/a n/a n/a
3/4/08 16:20 n/a processingTimeServ.. The Processing Time Service Level on Prospec.. n/a n/a n/a
3/4/08 16:08 Host 1 Ora_Sql_Hogs_Alert Oracle: SFPRD A CPU Hog has been detected n/a OraSF Oracle
3/4/08 16:08 Host 1 Ora_Sql_Hogs_Alert Oracle: SFPRD SQL with high I/O has been de.. n/a OraSF Oracle
3/4/08 14:40 n/a responseTimeServ… The Response Time Service Level on Siebel Sa.. n/a n/a n/a
3/4/08 14:20 n/a processingTimeServ.. The Processing Time Service Level on Siebel S. n/a n/a n/a
3/4/08 14:39 Host 3 Top_CPU_Table Process ‘siebsh.exe(svc-siebel, 6780)’: is cons.. n/a 0 Windows_System
3/4/08 14:39 Host 3 Top_CPU_Table Process ‘siebsh.exe(svc-siebel, 7940)’: is cons.. n/a 0 Windows_System
3/4/08 14:15 n/a responseTimeServ… The Response Time Service Level on Toadwor.. n/a n/a n/a
3/4/08 14:15 n/a processingTimeServ.. The Processing Time Service Level on Prospec.. n/a n/a n/a
3/4/08 13:55 Host 1 Ora_Sql_Hogs_Alert Oracle: SFPRD A CPU Hog has been detected n/a OraSF Oracle
How “1st Generation” Tools Attempt to Solve These Problems
DATA FEEDS
6. 16
2nd Generation - Rudimentary Baselining, Rules/Templates, Charting
3/4/08 16:45 Host 1 processingTimeServ The Processing Time Service Level on process… n/a n/a n/a
3/4/08 16:45 Host 1 Processor_Table 0 Processor 0 is at 87.0%. A CPU Bottleneck is….. n/a 0 Windows_System
3/4/08 16:44 Host 2 System_Table The number of hardware interrupts per second… n/a 0 Windows_System
3/4/08 16:30 Host 2 Processor_Table 1 Processor 1 is at 84.0%. A CPU Bottleneck is …. n/a 0 Windows_System
3/4/08 16:25 n/a responseTimeServ… The Response Time Service Level on Toadwor.. n/a n/a n/a
3/4/08 16:20 n/a processingTimeServ.. The Processing Time Service Level on Prospec.. n/a n/a n/a
3/4/08 16:08 Host 1 Ora_Sql_Hogs_Alert Oracle: SFPRD A CPU Hog has been detected n/a OraSF Oracle
3/4/08 16:08 Host 1 Ora_Sql_Hogs_Alert Oracle: SFPRD SQL with high I/O has been de.. n/a OraSF Oracle
3/4/08 14:40 n/a responseTimeServ… The Response Time Service Level on Siebel Sa.. n/a n/a n/a
3/4/08 14:20 n/a processingTimeServ.. The Processing Time Service Level on Siebel S. n/a n/a n/a
3/4/08 14:39 Host 3 Top_CPU_Table Process ‘siebsh.exe(svc-siebel, 6780)’: is cons.. n/a 0 Windows_System
3/4/08 14:39 Host 3 Top_CPU_Table Process ‘siebsh.exe(svc-siebel, 7940)’: is cons.. n/a 0 Windows_System
3/4/08 14:15 n/a responseTimeServ… The Response Time Service Level on Toadwor.. n/a n/a n/a
3/4/08 14:15 n/a processingTimeServ.. The Processing Time Service Level on Prospec.. n/a n/a n/a
3/4/08 13:55 Host 1 Ora_Sql_Hogs_Alert Oracle: SFPRD A CPU Hog has been detected n/a OraSF Oracle
How “2nd Generation” Tools Attempt to Solve These Problems
DATA FEEDS
7. 17
VMware’s Approach to Real-Time Performance Management
Flexible
INTEGRATION
to many data sources
Enterprise
SCALABILITY
Patented performance
ANALYTICS
I can put all my
monitoring tools to good
use and get better
performance analytics.
Powerful information
DASHBOARDS
3rd Generation – Holistic, Real Time Analytics
8. 18
vCenter Operations 3rd Generation Approach – An Analogy
My brain is understanding the health of my body.
Should I do anything?
Your Brain Understands Context:
If my heart rate and temperature are increasing I
should go to the hospital
If I’m tired, rest more
If I tire easily, start exercising!
Heart Rate, Respiration, Temperature
Musculoskeletal, Cardiovascular
Monitoring User Experience Metrics
Monitoring Business Metrics
Monitoring App Layer Metrics – JVM, DB Connections, etc.
Monitoring Server O/S Metrics – CPU, RAM, Disk, I/O, etc.
vCenter Operations is understanding the health of
my enterprise by analyzing millions of
measurements. Should I do anything?
vCenter Operations Understands Context:
Act based on urgency of emerging problems
Act based on real-time performance dashboards
Act based on long term correlations and trends
vCenter
Operations
Nervous
9. 19
Data Agnostic Approach to Data Collection
Accepts any time series data (examples)
• Server OS
• Server App layer (e.g., IIS, Oracle, WebSphere, etc.)
• Network
• Storage
• User Experience
• Transactional
• Business Data
• Change Events
Minimal Required Fields (4)
• Object Name, Metric Name, Value, Timestamp
Data Extraction - *not* an analytic question
• No rules/templates to write and maintain
• vCenter Operations Analytics do all of the “work”
vCenter
Operations
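The four minimal required fields can be pictured as a small record type. This is only an illustrative sketch: the class name `DataPoint`, the helper `parse_row`, and the input key names (`object`, `metric`, `value`, `ts`) are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataPoint:
    """One time-series sample carrying the four minimal fields."""
    object_name: str     # e.g. a host, database, or application
    metric_name: str     # e.g. "cpu_pct"
    value: float
    timestamp: datetime


def parse_row(row: dict) -> DataPoint:
    """Build a DataPoint from a dict; the key names are hypothetical."""
    return DataPoint(
        object_name=str(row["object"]),
        metric_name=str(row["metric"]),
        value=float(row["value"]),
        # "ts" is assumed to be seconds since the Unix epoch, in UTC.
        timestamp=datetime.fromtimestamp(float(row["ts"]), tz=timezone.utc),
    )


sample = parse_row(
    {"object": "Host 1", "metric": "cpu_pct", "value": "87.0", "ts": 1204649100}
)
print(sample)
```

Any feed that can be reduced to these four fields, whatever its source, fits the same shape.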
10. 20
Learn Normal Behavior and Identify Abnormalities
Doesn’t assume IT data has a normal bell-shaped distribution
Sophisticated Analytics – 8 different algorithms
Learns your dynamic ranges of “Normal” without templates
Learns patterns of behavior and identifies Abnormalities
BLUE LINE
Metric’s
Measured Value
GRAY BAR
Learned Upper and
Lower band of Dynamic
Threshold - “Normal”
RED Zone
Breached Dynamic
Threshold – “Abnormal”
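One way to picture a learned "normal" band that does not assume a bell-shaped distribution is a percentile band per hour of day. The sketch below is an assumption for illustration only: the slide does not describe the eight algorithms, and the function names, the 5th/95th percentile choice, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def learn_bands(history, lower_pct=5, upper_pct=95):
    """history: iterable of (hour_of_day, value) pairs.
    Returns {hour: (low, high)} using empirical percentiles, so no
    particular distribution shape is assumed."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    bands = {}
    for hour, values in by_hour.items():
        # quantiles(n=100) returns the 99 percentile cut points;
        # it needs at least two samples per hour.
        cuts = quantiles(values, n=100, method="inclusive")
        bands[hour] = (cuts[lower_pct - 1], cuts[upper_pct - 1])
    return bands


def is_abnormal(bands, hour, value):
    """True when the value falls outside the learned band for that hour."""
    low, high = bands[hour]
    return value < low or value > high


# Hypothetical history: busy at 09:00 (values 40..60), quiet at 03:00 (5..15).
history = [(9, v) for v in range(40, 61)] + [(3, v) for v in range(5, 16)]
bands = learn_bands(history)
print(bands[9], bands[3])
print(is_abnormal(bands, 9, 50), is_abnormal(bands, 3, 50))
```

The same measured value can be normal at one hour and abnormal at another, which is the point of a dynamic rather than a hard threshold.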
11. 21
Proactive Alerting – Smart Alerts
User Experience (e.g., RUM, etc.)
Database Silo (e.g., Quest, etc.)
App Data (e.g., Wily, etc.)
Network Data (e.g., Ionix IPPM, etc.)
Smart Alert Generation (“When”)
Business Data (e.g., Finance)
! SMART ALERT
Business Application
13. 23
Drill down to the Root Cause
Smart Alert Summary (“What”)
Early Warning
SMART ALERT
Noise Line Crossed
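The "noise line" idea can be sketched as a count of concurrent anomalies compared with a background level. How the product actually derives the noise line is not stated on the slide; the fixed `noise_line` value, the function name, and the sample counts below are assumptions.

```python
def first_noise_line_breach(anomaly_counts, noise_line):
    """Return the index of the first interval whose anomaly count
    exceeds the noise line, or None if it is never crossed."""
    for index, count in enumerate(anomaly_counts):
        if count > noise_line:
            return index
    return None


# Hypothetical anomaly counts per collection interval for one application.
counts = [2, 3, 1, 4, 9, 14, 20]
breach = first_noise_line_breach(counts, noise_line=8)
print(breach)
```

In this sketch the alert would be raised at the interval where the count first rises above the background level, before the later and larger spikes.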
14. 24
Drill down to the Root Cause
Smart Alert Summary (“What”)
Impact to
application
health
Impact to health of
each technology tier
No major impact to
application key Performance
Indicators (KPIs)…yet.
15. 25
Drill down to the Root Cause
Smart Alert Summary (“What”)
Root cause technology
tier is the DB
Metric-level
root cause
symptoms -
START HERE
16. 26
Drill down to the Root Cause
See the effect of change
and other external events
on application
health with this
“mash-up” view
Smart Alert Summary (“What”)
17. 27
One Source of Truth Across the
Enterprise
Health - Objective measure of
performance based on
underlying level of abnormal
behavior
Analytics provide a Health
score for any resource or
grouping
• A single Server, Device, Resource
• Entire Tier or Silo
• Entire Application or Service
• Entire Datacenter
• Any Arbitrary Group of Resources
Dynamic Performance Dashboards – Health Scores
“How is our world doing?”
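As a rough illustration of a health score driven by the level of abnormal behavior, the sketch below scores a resource by the share of its metrics that are currently abnormal and scores a group by pooling its members' metrics. The formula, function names, and sample data are assumptions; the slide does not give the product's actual calculation.

```python
def health_score(metric_states):
    """metric_states: iterable of booleans, True = metric currently abnormal.
    Returns 0-100, where 100 means no abnormal metrics."""
    states = list(metric_states)
    if not states:
        return 100.0
    abnormal = sum(states)
    return 100.0 * (1 - abnormal / len(states))


def group_health(resources, members):
    """Pool the metrics of every member resource and score the group."""
    pooled = [state for name in members for state in resources[name]]
    return health_score(pooled)


# Hypothetical resources with per-metric abnormality flags.
resources = {
    "db-server": [True, True, False, False],
    "web-server": [False, False, False, False],
}
print(health_score(resources["db-server"]))
print(group_health(resources, ["db-server", "web-server"]))
```

Because the group function accepts any list of member names, the same calculation covers a single server, a tier, an application, or an arbitrary grouping.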
18. 32
vCenter Operations - OPEX Savings
Incident Management
Lifecycle Savings
Manage/Resolve incidents
Proactive alerts reduce costs
30-40%
Change Lifecycle
Savings
Manage changes to
apps/infrastructure
“Before/after” analysis reduces
change-related incidents 30-40%
Incident Management
Savings
Managing Service Desk issues
(Incidents)
Manual threshold elimination
reduces erroneous tickets by
50-60%
Problem Management
Savings
Closing problems after systems
restored, includes root cause
analysis
Root cause analysis reduces
problem closure time by 30%
19. 33
Customer Success: IT Operations
Before
400 critical alerts/hour
End user complaints
alerted IT to the problem
End users impacted (avg. 2
hours/outage)
12 Level-2 engineers on
bridge call to address
problem
After
20 alerts/MONTH
3 hours' advance warning
of slowdown, with root cause
NO end user impact
1 Level-2 Engineer and 1
DBA to address problems
Learn Normal
Smart Alerting
Root Cause
Solve performance issues before end users are affected
and reduce total alerts
21. 35
vCenter Operations Enterprise - Standalone Architecture
Four Main Services:
Collector, Analytics,
Web, ActiveMQ
Architecture includes
MS SQL or Oracle DB,
plus File-based DB
(FSDB) for raw metric
storage
Collectors can be
distributed for
scalability, or to span
DCs & firewalls
22. 36
vCenter Operations Enterprise - Standalone Processing
1a: Collectors and adapters collect metrics, topology & change events - Ongoing -
1b: Data stored in FSDB
2a: Analytics runs daily to determine hour-by-hour DTs (dynamic thresholds) for the next 24 hours
2b: Full FSDB is scanned by the 8 analytic algorithms to determine, per metric, the best match for the next 24-hour period
2c: Store DT data in SQL DB
3: Incoming data points are tested against DT bands
4a: Metric-level anomalies are tracked for Alerting and Dashboarding
4b: Correlate anomalies, generate Smart Alerts, and determine RC (root cause)
5: Data provided to “Northbound” integration with products like Ionix SMARTS SAM
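Steps 3 and 4a of the flow can be sketched as a lookup of each incoming point against the hour-by-hour DT band for its metric. The layout of `dt_bands` (one `(low, high)` pair per hour, keyed by object and metric), the function name, and the sample values are hypothetical, not the product's storage format.

```python
from datetime import datetime


def check_point(dt_bands, object_name, metric_name, value, timestamp):
    """Step 3: compare one incoming point with the DT band for its hour.
    Returns an anomaly record (step 4a) or None when the point is in band."""
    low, high = dt_bands[(object_name, metric_name)][timestamp.hour]
    if low <= value <= high:
        return None
    return {
        "object": object_name,
        "metric": metric_name,
        "value": value,
        "time": timestamp,
        "band": (low, high),
    }


# Hypothetical result of the daily analytics run (steps 2a-2c):
# 24 hourly bands, wider during business hours (08:00-17:59).
dt_bands = {
    ("Host 1", "cpu_pct"): [(5.0, 40.0)] * 8 + [(20.0, 80.0)] * 10 + [(5.0, 40.0)] * 6
}

when = datetime(2008, 3, 4, 16, 45)
anomaly = check_point(dt_bands, "Host 1", "cpu_pct", 87.0, when)
in_band = check_point(dt_bands, "Host 1", "cpu_pct", 50.0, when)
print(anomaly)
print(in_band)
```

Records returned by a check like this would then feed the correlation in step 4b.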
23. 37
Deployment Prerequisites and Sizing
OS Support
• Win Server 2003 R2 (x64)
• Red Hat Enterprise Linux (RHEL) 5 (x64)
DB Support*
• SQL Server 2005
• Oracle 10g R2
* Customer supplied
Sizing (metrics collected every 5 min on avg.; processors >2.8 GHz; memory is the minimum; disk space is the initial allocation)
                          Analytics (Main) Server             DB Server
Size        Metrics       Processors  Memory  Disk Space      Processors  Memory  Disk Space
Small       <250,00       4 Cores     12GB    500GB           2 Cores     4GB     10GB
Medium      <1,000,000    8 Cores     24GB    500GB           4 Cores     8GB     25GB
Large       <5,000,000    16 Cores    64GB    5TB             8 Cores     16GB    100GB
Very Large  <10,000,000   24 Cores    128GB   10TB            8 Cores     16GB    100GB