Badge – Capacity Remaining Answer complex questions like:     • How many more VM can we put without impacting       perfo...
Capacity Remaining Calculation Determine Capacity Constraint Resource Deployed or Powered On VMs     • Powered Off VMs o...
Capacity and Time details You can drill down to see details     • You can check the 9 components, as      shown on the ri...
Badge – Stress Answer complex questions like:     • In our vDC, do we have stress points or       periods? How bad is it?...
Stress Calculation                            100                                        Stress Zone                      ...
Badge – Efficiency Answer complex questions like:     • Are there optimization opportunities in our      vDC?     • How w...
Badge – Reclaimable Waste Answer complex questions like:     • Do we over provisioned the VMs in terms of CPU,       RAM ...
Badge – Density Answer complex questions like:     • How high can we push our consolidation       ratio before we experie...
Badge ThresholdsThere are 2 different threshold:VM and Infra (ESXi, Cluster,Datastore, etc)Notice that Major badge hasdiff...
Using badges together Workload High & Anomalies Low & Stress High     • Workload – Object is Running Hot. Potentially Sta...
Discussion Point                       Is Badge the way to go?                    Are these the right 11 badges?          ...
Heat Map Built-in heat maps     • Basic:                                                         A great way to show a lo...
Storage: Datastore + VM vs workload + latency Since all the datastores are on the same array, how do we quickly tell the ...
Each square is a VM. They are grouped by datastore.     Bigger square: bigger throughput     Color: latency.51
Storage: Throughput vs Latency at cluster level Which cluster is generating high storage workload? Are they getting the ...
Storage: Throughput vs Latency at cluster level53
Storage: Throughput vs Latency at host level54
Storage: Throughput vs Latency at VM level             Can we show at VM level now?             That’s why you need a 24” ...
Storage: Space vs Latency Any big VM that is not getting the SLA we agreed on?56
Storage: Datastore space contention Do we have space contention at any of the datastore? If yes, how bad is the     conte...
Storage: Space contention We use thin provisioning58
CPU: Contention vs Usage at cluster level Which clusters are doing the most work? Which are not doing much? How is the C...
CPU: Contention vs Usage at host level Same questions with previous, but for host. We can expect some “drill down” in th...
CPU: Contention vs Usage at VM level             Can we show at VM level now?             That’s why you need a 24” full H...
VM Health Current Health     • Are all the VMs healthy? Especially those VMs which have high workload!     • Which VMs ar...
VM: color by health, size by workload63
VM: color by capacity, size by workload This is now showing future projection. We can see that the VM vCenter 5 is having...
Drill down to specific VM Screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in ...
Drill down to specific VM Showing value in absolute terms is good, but can be confusing. vCenter Ops can also     show in...
Discussion Point                Which heat maps are useful for you?          What other heat maps or cold maps do you need...
Smart Alert vs Normal Alert Smart Alert     • Relies on the advanced analytics instead of simple raw counters.     • Not ...
Application-level smart alert Needs Enterprise edition.69
Alert When does Alert happen     • When a badge change color     • When a fault happens     • VC Ops own alert       • A ...
Advance edition: Alert main window Filter by the 11 badges Filter the VC Ops own alert: system or environment71
72
Enterprise edition: Alert main window New alerts: Early Warning, KPI Breach, KPI Prediction, KPI High Threshold breach, C...
Enterprise edition: alert detail74
75
Email Notification Rules76
Email Notification Rules77
Anomalies – Symptoms Window The example is from an ESXi host with 11 VM.                                                 ...
vCenter Operations presents     datastore with all the details79
Storage in vCenter Operations             Automatic learning of storage             performance.             Calculating b...
vSphere 5 Performance Chart (fat client)Can onlychoose 1componentat a time.e.g. cannotshow CPUand RAM atthe sametime.81
vSphere 5 Performance Chart (fat client)         Can only show 1 chart at a time.         Hence can only show 2 units at a...
vCenter Operation charts      Can show >1 charts at a time. Can combine/split charts.      Can show different data type fr...
Capacity Management in vSphere is hard     CPU Optimizations                           Reserved                           ...
Capacity Management                  What are my historical utilization trends?                  What resources have bee...
Understanding Behavior Need to understand the weekly pattern     • Business week     • Weekend     • E.g. workload spike ...
87
88
89
Planning  Summary  Export90
Planning  Summary  Resources91
Planning  Summary  Resources92
Planning  Summary  Resources93
What-if Visualise     • Add or remove VMs.       • Add based on existing VMs as profiles       • Add based on spec you su...
3 choice of views95
Average VM Capacity (trend view)96
97
98
Modeling a what-if scenario           Change Supply   Change Host/Datastore                                               ...
Modeling a what-if scenario100
Modeling a what-if scenario – Specifying VM Configuration101
Modeling a what-if scenario – Using Existing VMs                                            Columns you can see102
Modeling a what-if scenario – Using Existing VMs103
Modeling a what-if scenario – Using Existing VMs104
Modeling a what-if scenario – Changing hosts105
Modeling a what-if scenario – Changing datastores106
Modeling a what-if scenario107
108
Capacity state                          today               VM count               capacity                               ...
Common VM distribution110
Datastore waste111
112
Reclaim waste capacity113
VMs can appear in Stress and Waste at the Same Time                            Undersized for CPU                         ...
Powered-Off VM and Idle VM: setting115
Powered-off VMs116
Capacity Planning: Is the VM really sized properly? Setting a threshold of under-utilisation alone is not enough         ...
Oversized VM & Undersized VM118
Oversized VMs - Calculation                  Same concept applies to undersize.                   Same concept applies to ...
Planning  Summary tab      Planning  Views tab120
Tips No of intervals and data points used for analysis      • Tied to your business cycles.      • Pick correct number of...
Change Events Correlated with Performance Overview      • Integration between vCM and vC Ops Mgr for change events      •...
VCM Events in vC Ops – Event Collected vC Ops does not pull in every event from vCenter      • Only events that could aff...
Event Types in vC Ops Mgr Circle Events are vCM Initiated      • Change log in vCM updated when change is completed      ...
125
126
vCM Change Events Correlated with Performance   A pop-up for a vCM event related to uninstalling a piece of software on t...
vCM Change Events Correlated with Performance 128
Terms The terms Attribute, Metric, Counter mean the same thing.      • CPU Ready Time is an attribute.      • CPU Ready T...
Adapter, Resource, Attribute, Package      VC Ops                              Adapter                          Source of ...
Actual Resource Kinds Sample adapters with their associated resource kinds.                                 This is a spe...
vSphere resource kinds Unlike the Advanced edition, we can utilise Folder and Resource Pool      • This means you can cre...
Resource Kind: default settings133
Attribute & Attribute Package Package      • A collection of Attributes from 1 Resource with the same collection interval...
135
136
137
138
Editing a resource property139
140
Resource Kind: Tags                      What’s the difference between Applications and Application? Looks like           ...
Resource Kind: Tags You can control which resource kinds      are shown      • In the picture below, ESX was hidden.142
Predefined Tags143
Drag selected objects to the tag value144
Resource Kind: Tags145
VC Ops generated metrics146
Monitoring the big workload You have convinced your CIO to virtualise the remaining 50% of the servers. Your CIO needs y...
Super Metrics148
Super Metric: Functions 2 types:      • looping functions: take multiple input value         • Average, sum, min, max, co...
Super Metric: hierarchy Example: super metric for Average CPU usage of a cluster                                         ...
151
152
Super Metric: Operators To calculate a value for each VM based on metrics for that VM, use the ‘$This’      operator. An...
154
155
Super Metric: package156
157
158
Discussion Point               Think of super metrics that you need.              Explain why and how you will need them.159
Applications and Application Tiers App Team often view things from their own application-centric. We can create custom da...
Drag selected objects to the tag va161
Parent-Child Resource Relationships162
163
What counters do you check?Component                            ESX                                                  VM   ...
Test your vSphere knowledge!      How are Disk, Datastore,      Adapter and Path related?165
CPU counters               Test your vSphere knowledge!               Which one is ESX, which one               is VM? How...
%OVRLP and %SYS        Run Wait         Ready                                                             Time            ...
Memory counters         ESXi     VM168
Storage counters: ESXi host           Datastore                    Disk      Storage Adapter or Storage Path169
ESXi: Adapter, Device and Path                                  1 adapter can many Devices (LUN).                         ...
ESXi: Disk171
ESXi: Adapter, Device and Path                                                     ESXi 5.0      vmnic         Storage Ada...
Storage counters: VM       Virtual Disk (VMDK, RDM)                                                VM                     ...
Network counters        ESXi        VM174
Other Counters: ESXi Host        vSphere Replication   System (vmkernel)                                       See        ...
176
A long list of vmkernel      resources. Some are familiar,      such as vMotion, FT, hostd,      Vpxa, DCUI, logging177
178
Widget179
Widget: Full List180
Dashboard: creating a new Tab181
Alerts182
Application Overview and Application Detail183
184
185
Data Distribution186
187
188
Health Status189
Health Status190
Health Tree191
Health Tree192
Health Tree193
Advanced Health Tree194
Advanced Health Tree195
Scoreboard: Health or Workload196
197
Scoreboard: Generic198
199
Heat Map200
201
202
Mashup Charts203
204
Mashup Chart205
Metric Graph206
207
Metric Graph (Rolling View)208
209
Metric Selector210
Metric Sparklines211
212
214
Resources215
216
Tag Selector217
Top-N Analysis218
219
Geographic220
The VC Relationship There are 2 widgets that are vSphere related. Use the advanced edition instead.      • Enterprise ed...
Interaction between widget Controlled at the dashboard level, not individual widget Providing widget and Receiving widge...
Interaction between widget223
Interaction between widget224
Practice session: creating your dashboard Goal: have a dashboard to help you investigates all non-local datastores quickl...
226
vCenter “equivalent” dashboard227
Configuration228
229
230
Cross Silo231
Fingerprint232
Maintenance Mode233
Maintenance Schedules234
Major Steps in implementation Define who             Create               Create                Create             Create ...
Who needs to see what                                              Simple Dashboard.                                      ...
Who needs to see what (samples)Roles                    Info presented                         Health of overall IT in the...
Designing Super Metric Leverage existing derived metrics Leverage Objects that vCenter cannot provide performance data  ...
Custom Heat Map or Cold Map Component                                Heat Map                                             ...
vCenter: network impact of vCenter Ops240
Choice of Tools vCenter Operations      • 1-15 minutes accuracy (for other sources)      • 5 minutes accuracy (for vSpher...
242
243
244
245

vCenter Operations 5: Level 300 training

Iwan ‘e1’ Rahabok, who works as a Staff SE, Strategic Accounts in Singapore, has created an awesome vCenter Operations 5 training deck. It is available in PowerPoint format, and I would really advise you to read the slide notes. The presentation serves 2 purposes: first, it provides in-depth training for those who are learning or evaluating vCenter Operations 5; second, it provides material that a vCenter Ops champion can share with internal colleagues (e.g. storage team, app team, etc.).

Published in: Technology, Education

Transcript of "vCenter Operations 5: Level 300 training"

  1. vCenter Operations 5: Level 300 training
     Singapore, Q2 2012
     Iwan ‘e1’ Rahabok, VCAP-DCD
     Staff SE, Strategic Accounts
     e1@vmware.com | Skype: e1_ang | 9119-9226 | Linkedin.com/in/e1ang
     © 2010 VMware Inc. All rights reserved
  2. Document Information
     This deck is part 2 of a series.
     • Part 1 is Management in the Virtual World: a technical introduction.
     • http://communities.vmware.com/docs/DOC-17841
     This deck has pre-requisites:
     • Intro video: http://www.youtube.com/watch?v=Z-DJuTiqKag
     • VC Ops 5 technical introduction at Vault or Partner Central.
     This deck only covers vCenter Operations (Enterprise + Advanced).
     • Focuses on concepts & ‘under the hood’ to help you understand the product more deeply.
     • Does not cover: competitive, installation, configuration.
     • Does not run through feature after feature. See the official training deck for that at Vault or Partner Central.
     • vCenter Operations modules that it does not cover:
       • Chargeback
       • Infrastructure Navigator
       • Configuration Manager
     This is a very long training material. Use the Section feature to see how it is organised.
     Further reading
     • virtual-red-dot.blogspot.com
  3. Table of Contents: Built for vCenter Standard Core: Metrics, Threshold, Analytics Badges Heat Map Smart Alert Details & Charts Capacity Management Settings VCM integration Concepts & Advance Concepts Deep dive into Metrics Dashboard and Widgets
  4. Managing Performance/Capacity in vSphere: the basics
     Is it healthy?
     • Is every VM & ESX performing well? CPU, RAM, Network, Disk?
     • Are they behaving as expected?
     • Any fault on any component?
     Is it enough?
     • Enough CPU, RAM, Network, Disk? Future risk?
     • Time remaining?
     • Capacity remaining?
     • Where are the “Stress points” in time?
     Is it optimised?
     • Which VMs need adjustment?
     • What are my key ratios?
     • How much can I claim back from the “fat” VMs?
     • How many more VMs can I put without impacting performance?
  5. Direct Mapping by vCenter Operations
     Is it healthy = Health
     • Workload
     • Anomalies
     • Faults
     Is it enough = Risk
     • Time remaining
     • Capacity remaining
     • Stress period
     Is it optimised = Efficiency
     • What can we reclaim?
     • Density. Key ratios for management.
     Daily update at midnight.
  6. Bird's-eye view
  7. Visibility across vCenters
     Sample from ASEAN Lab: 6 vCenters. A mix of Appliance and Windows. 2 are in Linked Mode (SRM).
  8. Performance Troubleshooting: a day in the life…
     You got an email from the app team, saying the main Intranet application was slow.
     • The email was sent 1 hour ago. It stated that the application was slow for 1 hour, and was ok after that.
     • So it was slow between 1 and 2 hours ago, but is ok now.
     • You did a check. Everything has indeed been ok in the past 1 hour.
     • The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM.
     • You are not familiar with the application. You do not know which app runs on each VM as you have no access to the Guest OS.
     • Your environment: 1 VC, 4 clusters, 30 hosts, 300 VMs, 20 datastores, 1 midrange array, 10 GE FCoE.
     Test your vSphere knowledge! How do you solve/approach this with just vSphere?
     What do you do?
     A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE.
     B: No sweat, you’re VCDX + CCIE + ITIL Master. You were born for this.
     C: SMS your wife, “Honey, I’m staying overnight at the datacenter.”
     D: Take blood pressure medicine so it won’t shoot up.
     E: Buy the app team a very nice dinner, and tell them to keep quiet.
  9. Performance Troubleshooting: a day in the life…
     The minimum you need to prove
     • The performance problem is not caused by your infrastructure, or at least not by your VMware layer.
       • Infrastructure = VMware + Storage + Network
       • Application = VM + App inside the VM
     What you need to prove
     • For each of the 10 VMs, the following was ok between 1 and 2 hours ago: CPU, RAM, Disk, Network.
     • To strengthen the above, prove that:
       • The shared infrastructure was also healthy: relevant ESX, relevant Datastore.
       • The overall platform was also healthy.
       • No relevant faults happened 1-2 hours ago.
     • Give the list of ports (that the 10 VMs use) to the network team to ensure the firewall is not dropping them.
     What challenges do you face in vSphere to do the above?
     • Group discussion: what limitations do you face, if vCenter + vMA + PowerGUI + RVTools is all you have?
     The ideal you need to prove
     • Show the exact application-level counters that are slow, with the underlying infrastructure-level counters that caused it. In other words, application-specific + root-cause analysis.
  10. Challenge 1: details are lost after 1 hour
  11. Challenge 1: details are lost after 1 hour
      The following counters are lost:
      1. Used
      2. System
      3. Idle
      4. Latency
      5. Overlap
      6. Demand
      7. Wait
      8. Run
      9. Swap wait
      10. Max Limited
  12. Challenge 1: details are lost after 1 hour
      [Screenshots: Memory Counters and Disk Counters, each compared for <1 hour vs >1 hour]
  13. [No text on slide]
  14. Challenge 2: no application awareness
  15. [No text on slide]
  16. [No text on slide]
  17. Deep understanding of vCenter is required
      Here is a common example of why a deep understanding of vSphere counters makes a huge difference.
      Buy more RAM?
  18. Deep understanding of vCenter is required
      Yes, buy more RAM. ESXi has 32 GB RAM. It is highly used.
  19. Deep understanding of vCenter is required
      What?! It’s been high constantly for the last 24 hours! Better buy more RAM now.
      But hang on! This is the ESXi-06 host in the VMware ASEAN lab. We know who uses it.
      vCenter Ops shows very different data. Memory is only 32%. Plenty of headroom.
  20. vCenter Ops shows very different data. Memory is only 32%. Plenty of headroom. It just saved us from a costly RAM upgrade project.
  21. Live Demo
      1 engine, 2 UIs. Dashboard. Badges. Configuration.
  22. Counters and Badges
      A vCenter farm with 500 VMs and 50 ESX hosts will have >10000 counters!
      • It is not humanly possible to look at them, let alone analyse them.
      vCenter presents raw counters
      • e.g. What does a Ready Time of 1500 in the Real Time chart mean? Is a value of 2000 in the Real Time chart better than a value of 75000 in the Daily chart?
      • e.g. Is memory.usage at 90% at ESXi level good or bad?
      • e.g. Is an IOPS of 300 good or bad for datastore XYZ?
      A single counter can be misleading
      • e.g. Low CPU usage does not mean the VM is getting the CPU, if there is Limit, Contention and Co-Stop.
      • e.g. To see disk performance, we need to see multiple counters at multiple layers (VM, kernel, physical).
      Different counters have different units
      • GHz, %, MB, kbps, ops/sec, ms
      • This makes analysis even more complex.
      Derived Counters
      • Standardise the scale into 0 - 100.
      • 1 universal unit. Minimises the “translation” in our head.
      • Can be >100 if demand is unmet.
      • Universal. Apply to CPU, RAM, Disk, Net, etc.
      • Counters are derived using sophisticated formulas, not just aggregated.
      • For the same counter, different objects use different formulas.
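The Ready Time question on this slide can be worked through as a small sketch. It assumes the standard vCenter chart sampling intervals (20 seconds for Real Time, 300 seconds for the Daily chart) and the usual conversion of a ready summation in milliseconds into a percentage of the interval; the function name and the `vcpus` parameter are illustrative, not part of any VMware API.

```python
# Sample intervals, in seconds, of the vCenter performance charts.
INTERVAL_SECONDS = {"realtime": 20, "day": 300, "week": 1800, "month": 7200, "year": 86400}


def ready_percent(ready_ms: float, chart: str, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the sample interval.

    Dividing by `vcpus` gives a per-vCPU average; this assumes the value read
    is the VM-level total summed across its vCPUs.
    """
    interval_ms = INTERVAL_SECONDS[chart] * 1000
    return ready_ms * 100 / (interval_ms * vcpus)


print(ready_percent(1500, "realtime"))  # 7.5
print(ready_percent(2000, "realtime"))  # 10.0
print(ready_percent(75000, "day"))      # 25.0
```

Under these assumptions the raw value of 2000 in the Real Time chart (10%) is actually better than 75000 in the Daily chart (25%), which is the kind of mental “translation” the slide argues a 0 - 100 derived counter removes.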
  23. Samples of Derived Metric: Health
      Health Score of an Object = MAX (Abnormal Workload, Faults)
      • Abnormal Workload per Metric = Geometric Mean (MAX (Abnormality (Capacity/Entitlement), Abnormality (Demand/Usage)), Workload)
      • Abnormal Workload per Object = Score Aggregation (Abnormal Workload per Metric)
      • Fault depends on the object:
        Cluster = HA Issues = MAX (HA Insufficient Failover Resources, HA Failover In Progress, HA Cannot Find Master)
        Host = MAX (Hardware Issues, HA Issues)
          Hardware Issues = MAX (Network Issues, Storage Issues, Compute Issues, CIM Issues)
            Network Issues = MAX (Network, DVPort, VMNic)
              Network = Max_of_all_instances (Network Device)
              DVPort = Max_of_all_instances (DVPort Device)
              VMNic = Max_of_all_instances (VMNic Device)
            Storage Issues = MAX (Storage, SCSI, VMFS heartbeat, NFS server, CIM Storage)
              Storage = Max_of_all_instances (Storage Device)
              SCSI = Max_of_all_instances (SCSI Device)
              VMFS heartbeat = Max_of_all_instances (VMFS heartbeat Device)
              NFS server = Max_of_all_instances (NFS server Device)
            Compute Issues = MAX (Error, PCIe)
            CIM Issues = MAX (Processor, Memory, Fan, Voltage, Temperature, Power, System Board, Battery, Other Health, IPMI, BMC)
          HA Issues = HA Host Status
        VM = MAX (FT Issues, HA Issues)
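A minimal sketch of how the MAX-based roll-up on this slide could be evaluated for a host. The dictionary keys mirror the names on the slide, but the input format, the function names and the example numbers are assumptions made for illustration; the slide does not define "Score Aggregation", so the per-object abnormal workload is simply taken as an input here.

```python
from math import sqrt

CIM_KEYS = ("Processor", "Memory", "Fan", "Voltage", "Temperature", "Power",
            "System Board", "Battery", "Other Health", "IPMI", "BMC")


def abnormal_workload_per_metric(abn_capacity: float, abn_demand: float, workload: float) -> float:
    """Geometric mean of the worse abnormality and the workload, as on the slide."""
    return sqrt(max(abn_capacity, abn_demand) * workload)


def host_fault_score(faults: dict) -> float:
    """Roll up per-component fault scores for a host; missing components count as 0."""
    get = lambda key: faults.get(key, 0)
    network = max(get("Network"), get("DVPort"), get("VMNic"))
    storage = max(get("Storage"), get("SCSI"), get("VMFS heartbeat"),
                  get("NFS server"), get("CIM Storage"))
    compute = max(get("Error"), get("PCIe"))
    cim = max(get(key) for key in CIM_KEYS)
    hardware = max(network, storage, compute, cim)
    return max(hardware, get("HA Host Status"))


def health_score(abnormal_workload: float, faults: float) -> float:
    """Health Score of an Object = MAX (Abnormal Workload, Faults)."""
    return max(abnormal_workload, faults)


workload_part = abnormal_workload_per_metric(25, 64, 36)   # sqrt(64 * 36) = 48.0
fault_part = host_fault_score({"VMNic": 50, "Fan": 25})    # worst component wins: 50
print(health_score(workload_part, fault_part))             # 50
```

Taking the maximum at every level means a single bad component is never averaged away by healthy siblings, which matches the intent that a red badge really needs attention.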
  24. Threshold: a shift in mindset needed
      vCenter sets “static” thresholds, which can be misleading
      • During peak, it is common for a VM to reach high utilisation.
        • A static threshold will generate alerts when it should not.
        • The vSphere admin quickly learns to ignore them, defeating the purpose of the alert to begin with.
      • During non-peak, it might be abnormal for a VM to reach even 50% utilisation.
        • A static threshold will not generate alerts when it should have.
      vCenter only sets high thresholds
      • Do you set a static threshold for when CPU or RAM utilisation drops below 5%?
      • A drop in entire array storage IOPS might be a sign of a terrible day ahead.
      • It will not alert when these happen:
        • Utilisation drops from 75% to 1% when it should not.
        • Utilisation changes from 5% to 70% when it should not.
      • We need to plot both the upper range and the lower range.
      But each VM differs. And the same VM differs depending on day/time…
      • Intelligence is required to analyse each metric and its expected “normal” behaviour.
  25. Dynamic threshold & alerts
      vCenter Operations uses dynamic thresholds
      • They are dynamic and personalised down to the individual metric.
      • They vary from object to object. 1000 VMs will each have their own threshold.
      • They vary from time to time. The same CPU Usage counter has a different threshold at different times. This caters for peaks. See the chart below.
      • They vary from metric to metric. On an ESX with 12 cores, each core has its own CPU Usage threshold.
      • You can fix hard thresholds if you need to.
        • This needs Enterprise edition. It comes with no static threshold defined.
        • Steps: http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
      Notice that the range varies in size.
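To make the contrast with a static threshold concrete, here is a deliberately simple sketch of a per-hour band learned from history. It is not the algorithm vCenter Operations uses; the mean ± k·sigma rule, the function names and the sample data are all assumptions chosen only to show why a band with both an upper and a lower bound, varying by time of day, catches cases a single static high threshold misses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_bands(samples, k: float = 2.0) -> dict:
    """samples: iterable of (hour_of_day, value). Returns {hour: (low, high)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals) - k * pstdev(vals), mean(vals) + k * pstdev(vals))
            for hour, vals in by_hour.items()}


def is_abnormal(bands: dict, hour: int, value: float) -> bool:
    low, high = bands[hour]
    return value < low or value > high


STATIC_HIGH = 80  # a typical static "high utilisation" threshold

# Hypothetical CPU usage history: quiet at 03:00, busy at 10:00.
history = [(3, v) for v in (4, 5, 6, 5)] + [(10, v) for v in (85, 90, 88, 87)]
bands = learn_bands(history)

print(is_abnormal(bands, 10, 88), 88 > STATIC_HIGH)  # False True : normal peak, static alert fires anyway
print(is_abnormal(bands, 3, 50), 50 > STATIC_HIGH)   # True False : unusual at 03:00, static alert stays silent
print(is_abnormal(bands, 10, 1), 1 > STATIC_HIGH)    # True False : a drop during peak is also flagged
```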
  26. Dynamic Threshold Analysis
      DT analysis runs nightly
      • New dynamic thresholds are computed for each metric.
      Data categorization
      • Tries to identify the stat as linear, multinomial, step function, etc.
      • If one of those matches, that DT function is used.
      Otherwise: competition
      • Sigma: assumes hourly cycles
      • CCPD: tries to find normal cycles
      • ACPD: tries to find abnormal cycles
      • The winner is assigned based on metric trending accuracy.
      The same metric may get a different DT function on a different day.
      [Diagram: For each metric → Data Categorization → Linear DT, Multinomial DT, Sparse Sigma DT, Step Function DT, Quantile Sigma DT, or CCPD / ACPD → DT Scoring → Dynamic Thresholds]
  27. Dynamic Threshold: Algorithm
      [The Greek symbols on this slide were lost in extraction; below, α denotes the Dirichlet parameters, Γ the gamma function, and A the sum of all parameters.]
      \[
      f_{P_{1,1},P_{1,2},\ldots,P_{m,m}}(p_{1,1},p_{1,2},\ldots,p_{m,m})
      = \frac{\Gamma\!\left(\alpha_{0,0} + \sum_{i=1}^{m-1}\sum_{j=1}^{m+1}\alpha_{i,j} + \sum_{j=1}^{m}\alpha_{m,j}\right)}
             {\Gamma(\alpha_{0,0})\,\prod_{i=1}^{m-1}\prod_{j=1}^{m+1}\Gamma(\alpha_{i,j})\,\prod_{j=1}^{m}\Gamma(\alpha_{m,j})}
        \left(1 - \sum_{i=1}^{m-1}\sum_{j=1}^{m+1} p_{i,j} - \sum_{j=1}^{m} p_{m,j}\right)^{\alpha_{0,0}-1}
        \prod_{i=1}^{m-1}\prod_{j=1}^{m+1} p_{i,j}^{\alpha_{i,j}-1}\,
        \prod_{j=1}^{m} p_{m,j}^{\alpha_{m,j}-1}
      \]
      where
      \[
      \sum_{i=1}^{m-1}\sum_{j=1}^{m+1} p_{i,j} + \sum_{j=1}^{m} p_{m,j} \le 1, \qquad 0 \le p_{i,j} \le 1, \qquad
      \Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t}\,dt .
      \]
      The marginal distribution of the i-th row of J is:
      \[
      (p_{i,1},\ldots,p_{i,m+1}) \sim
      \begin{cases}
      \mathrm{Dirichlet}\!\left(A - \sum_{j=1}^{m+1}\alpha_{i,j},\ \alpha_{i,1},\alpha_{i,2},\ldots,\alpha_{i,m+1}\right) & \text{for } i = 1,\ldots,m-1 \\[4pt]
      \mathrm{Dirichlet}\!\left(A - \alpha_{0,0} - \sum_{j=1}^{m}\alpha_{m,j},\ \alpha_{m,1},\alpha_{m,2},\ldots,\alpha_{m,m},\ \alpha_{0,0}\right) & \text{for } i = m
      \end{cases}
      \]
      where
      \[
      A = \alpha_{0,0} + \sum_{i=1}^{m-1}\sum_{j=1}^{m+1}\alpha_{i,j} + \sum_{j=1}^{m}\alpha_{m,j} .
      \]
      It is pretty difficult for a human to beat the computer in analysis of the data. The above is one of the many algorithms applied by vCenter Operations.
  28. Analytics
      7 different analytics areas. For the DT feature, there are 8 algorithms. Only in Enterprise Edition. These advanced features create Smart Alerts.
  29. Discussion Point
      Raw Counters vs Derived Counters
      Dynamic Threshold vs Static Threshold
  30. Badge – Health
      Answers complex questions like:
      • How is the entire virtual data center doing? What’s the degree of its health?
      • For every cluster, host, datastore, what’s their health?
      Health is a current Operational State.
      • It represents what is wrong now that should be addressed within 1 day. Thus Health needs to be scored such that if it is red, then it really needs attention.
      Weather Map
      • A simple way to check that the entire farm is healthy.
      • For a child object, it is replaced with Health Trend.
      • Shows the Health of all parent and child objects.
      • Each square can be a VM, ESX, datastore, cluster, datacenter, or vCenter.

      Value      Explanation
      75 – 100   Normal behaviour.
      50 – 75    The object experiences some problems.
      25 – 50    The object might have serious problems. Check and take action as soon as possible.
      0 – 25     The object is either not functioning properly or will stop functioning soon.
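The Health value table can be read as a simple lookup. The sketch below assumes each band includes its lower bound (the slide's ranges overlap at 25, 50 and 75 and do not say which side the boundary belongs to); the function name and the shortened labels are illustrative.

```python
# (lower bound, explanation), checked from the healthiest band downwards.
HEALTH_BANDS = [
    (75, "Normal behaviour"),
    (50, "Some problems"),
    (25, "Possibly serious problems: check and act as soon as possible"),
    (0, "Not functioning properly, or will stop functioning soon"),
]


def health_band(score: float) -> str:
    """Map a 0-100 Health score to the explanation from the slide's table."""
    if not 0 <= score <= 100:
        raise ValueError("Health score must be between 0 and 100")
    for lower, explanation in HEALTH_BANDS:
        if score >= lower:
            return explanation
    raise AssertionError("unreachable: the last band starts at 0")


print(health_band(95))  # Normal behaviour
print(health_band(40))  # Possibly serious problems: check and act as soon as possible
```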
  31. Badge – Workload
      Answers complex questions like:
      • For every object, how is Demand vs Supply?
      • For every single VM, is it CPU/Memory/Disk/Network bound?
      • Is any VM not getting what it is entitled to?
      • What’s the normal workload range for every object in our vDC?
      Workload is not utilisation or usage
      • It is more accurate than utilisation as it takes into account more factors than just utilisation.
      Workload = (Demand/Entitlement)
      • Entitlement is dynamic. It is affected by shares, limit, etc.
      • Demand ≠ Usage.
        • Usage may mean passive usage. E.g. the RAM page is there but there is no write/read.
      • Score is Max (CPU, RAM, Disk IO, Net IO)
        • To bring it to attention.

      Value     Explanation
      0 – 80    Workload is not high.
      80 – 90   The object is experiencing some high resource workloads.
      90 – 95   Workload on the object is approaching its capacity in ≥1 area.
      >95       Workload on the object is at or over its capacity in ≥1 area.
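A small sketch of the Workload arithmetic described on this slide: demand over entitlement per resource, expressed on a 0 - 100 scale, with the badge taking the maximum across resources. The resource names, units and example numbers are assumptions for illustration only.

```python
def workload_score(demand: dict, entitlement: dict) -> tuple:
    """Return (score, bound_resource) where score = max over resources of 100 * demand / entitlement."""
    per_resource = {name: 100 * demand[name] / entitlement[name] for name in demand}
    bound = max(per_resource, key=per_resource.get)
    return per_resource[bound], bound


# Hypothetical VM: CPU in MHz, RAM in MB, disk in IOPS, network in Mbps.
demand = {"cpu": 1200, "ram": 3800, "disk": 50, "net": 10}
entitlement = {"cpu": 2000, "ram": 4000, "disk": 200, "net": 1000}
print(workload_score(demand, entitlement))  # (95.0, 'ram')

# Unmet demand pushes the score above 100.
print(workload_score({"cpu": 2500}, {"cpu": 2000}))  # (125.0, 'cpu')
```

Using the maximum rather than an average means the VM above shows as memory bound at 95 even though its CPU, disk and network are mostly idle.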
  32. Derived Metric: Demand
      The chart below shows Demand in action. I generated IOPS on a local datastore, resulting in a spike in latency (read latency went up from 3 ms to 60 ms). Demand correspondingly went up from 4 to 100!
  33. Badge – Anomalies
      Answers complex questions like:
      • Is our vDC doing business as usual today? Or is it a dynamic environment with lots of unexpected changes?
      • Which VMs, ESX hosts, clusters, datastores, etc. are behaving abnormally?
      • …and exactly which counters are the culprits?
      Identifying metric abnormalities
      • It needs to learn the dynamic ranges of “Normal” for each metric, so give it >3 cycles per metric.
        • A month-end job means it needs 3 months.
      • The normal range changes after configuration or application changes.
      Anomalies score
      • A high number of anomalies:
        • Usually an indication of a problem
        • Demand change
        • Application team changes code/app
      • KPI metrics impact the Anomalies score more than non-KPI metrics.

      Value     Explanation
      0 – 50    Normal Anomaly range.
      50 – 75   The score exceeds the normal range.
      75 – 90   The score is very high.
      > 90      Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.
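The slide does not give the Anomalies formula, so the following is only an illustrative sketch of the two properties it does state: the score rises with the number of metrics outside their normal range, and KPI metrics count for more than non-KPI metrics. The weighted-percentage rule, the `kpi_weight` value and the sample metrics are all assumptions.

```python
def anomaly_score(metrics, kpi_weight: float = 2.0) -> float:
    """metrics: iterable of (value, normal_low, normal_high, is_kpi).

    Returns the weighted percentage of metrics outside their normal range.
    """
    total = 0.0
    outside = 0.0
    for value, low, high, is_kpi in metrics:
        weight = kpi_weight if is_kpi else 1.0
        total += weight
        if not low <= value <= high:
            outside += weight
    return 100 * outside / total if total else 0.0


metrics = [
    (95, 10, 60, True),    # KPI metric, above its normal range
    (5, 0, 10, False),     # within range
    (40, 20, 50, False),   # within range
    (70, 20, 50, False),   # above its normal range
]
print(anomaly_score(metrics))  # 60.0
```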
  34. 34. This virtual DC spans multiple vCenters. vCenter Ops shows all the counters that are behaving abnormally.
  35. 35. Badge – Faults
      Answers complex questions like:
      • What faults do we experience in our vDC?
      • For every object, what faults does it have?
      Specific knowledge of vCenter Events
      • Which events affect the Availability and Performance of which object?
      • Pulled from active vCenter events
      • Examples:
        • Loss of redundancy in NICs or HBAs
        • Memory checksum errors
        • HA failover problems
      • Each fault has a default score (e.g. 25, 50, 75, 100)
      • The highest individual Fault score drives the object's Fault score
      Best Practices:
      • Do not change the Faults threshold.
      • Use the Alerts view to manage Faults. Filter it to show only Faults.
      Value     Explanation
      0 – 25    No fault is registered on the object.
      25 – 50   Faults of low importance have happened on the object.
      50 – 75   Faults of high importance have happened on the object.
      >75       Faults of critical importance have happened on the object.
  36. 36. Badge – Risk
      Answers complex questions like:
      • Do we have risk from performance and capacity in our vDC? If yes, where is it, and can you quantify the seriousness?
      • Which objects are at risk? What is the specific risk?
      Risk Score takes into account
      • Time Remaining
      • Capacity Remaining
      • Stress
      Risk is an early warning system.
      • Identifies potential problems that could eventually hurt performance
      • The Risk Chart shows the Risk score over the last 7 days, giving a view of the trend.
      Value      Explanation
      0 – 50     No problems are expected in the future.
      50 – 75    There is a low chance of future problems, or a potential problem might occur in the far future.
      75 – 100   There is a chance of a more serious problem, or a problem might occur in the medium-term future.
      100        The chances of a serious future problem are high, or a problem might occur in the near future.
  37. 37. Badge – Time Remaining
      Answers complex questions like:
      • How much time do we have before we need to buy more servers, storage or network, before performance starts to degrade or we run out of capacity?
      • For every cluster, VM and datastore, how much time do we have?
      Measures time remaining before each resource type reaches its capacity
      • CPU
      • Memory
      • Disk (IOPS & Space)
      • Network I/O
      Early warning of upcoming provisioning needs
      • Based on the Score Provisioning (SP) buffer. Default value is 30 days.
      • Set in the “Capacity & Time Remaining” section
      Value      Time remaining
      50 – 100   > 2x SP buffer (60 days)
      25 – 50    < 2x SP buffer
      <25        Near SP buffer
      0          < SP buffer (30 days)
  38. 38. Badge – Capacity Remaining
      Answers complex questions like:
      • How many more VMs can we add without impacting performance or using up capacity?
      • For every cluster, VM and datastore, which component (CPU, RAM, Disk, Network) would run out first?
      Early warning system
      • In the screenshot, 333 more VMs correlates to 77% Capacity Remaining for this object.
      • A low score of 1 means you still have >30 days.
      • Measures how many more VMs can be placed on the object
      Percentage of total VM “slots” remaining
      • Based on the average size of the VMs on the object (e.g. VM profile)
      • Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.
      Value     Capacity remaining
      >10       >120 days
      5 – 10    60 – 120 days
      0 – 5     30 – 60 days
      0         <30 days
      From the table, notice that the value is not linear
      • It is also not the same as the Time Remaining threshold.
      • A value of 30 means >120 days for capacity but around 40 days for time.
  39. 39. Capacity Remaining Calculation
      Determine the capacity constraint resource
      Deployed or Powered On VMs
      • Powered Off VMs only use disk space resources
      • Powered On VMs use ALL 4 resources
      Calculation example shown:
      • The limiting resource is Disk Space, with 333 VMs available
      • Use the Deployed VM number of 99 to calculate the percentage of space remaining
      • Determine Capacity Remaining: 333 / (333 + 99) = 77%
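The calculation above can be reproduced in a few lines. This is only a sketch of the slide's arithmetic: the function name is made up, and every number other than 333 and 99 is a hypothetical placeholder.

```python
def capacity_remaining_pct(vms_remaining_by_resource, vm_count):
    """Find the limiting resource and express its remaining VM slots as a percentage.

    vm_count is the Deployed VM count when disk space is the constraint
    (powered-off VMs still use disk space), otherwise the Powered On count.
    """
    limiting = min(vms_remaining_by_resource, key=vms_remaining_by_resource.get)
    remaining = vms_remaining_by_resource[limiting]
    pct = 100.0 * remaining / (remaining + vm_count)
    return limiting, remaining, pct


# Only the disk_space figure (333) comes from the slide; the rest is illustrative.
remaining = {"cpu": 520, "memory": 410, "disk_io": 600, "disk_space": 333}
limiting, slots, pct = capacity_remaining_pct(remaining, vm_count=99)
print(f"{limiting}: {slots} more VMs, {pct:.0f}% capacity remaining")
```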
  40. 40. Capacity and Time details You can drill down to see details • You can check the 9 components, as shown on the right. • This helps answer the question which components have how many days or VM left! • Summary = Min (all 9 components)40
  41. 41. Badge – Stress
      Answers complex questions like:
      • In our vDC, do we have stress points or periods? How bad is it?
      • For every cluster, VM and datastore, which ones are experiencing stress, and how bad is it?
      Measures long-term or chronic workload (6 weeks)
      • The chart shows a weekly breakdown of Stress for each day/hour, averaged over the last 6 weeks
      • Workloads > 70% = “Stressed”
      • The threshold is configurable, as per the screenshot below
      Value     Explanation
      0 – 1     Normal score. No action needed.
      1 – 5     Some of the object resources are not enough to meet the demands.
      5 – 30    The object is experiencing regular resource shortage.
      >30       Most of the resources on the object are constantly insufficient. The object might stop functioning properly.
  42. 42. Stress Calculation
      [Chart: workload (0 to 100) over 6 weeks, showing the Workload Line, the Stress Line at 70 and the Stress Zone above it; 12% of the zone is filled.]
      The Stress Score is a % and is based on the area of Workload above the “Stress Line” threshold, compared to the total capacity of the object
      • Stress Score = (Stress area / Stress Zone) * 100
      • But the max value can be > 100%, as the workload can be >100.
      Example
      • Stress Line is 70% Workload
      • 12% of the area is above the 70% threshold
      • Stress Score is 12
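One way to read the formula above, sketched with evenly spaced workload samples. Treating the Stress Zone as the band between the Stress Line and 100 over the whole period is an interpretation, not something the slide states explicitly, and the function name and sample data are hypothetical.

```python
def stress_score(workload_samples, stress_line=70.0, capacity=100.0):
    """Area of workload above the stress line, as a % of the stress zone area."""
    if not workload_samples:
        return 0.0
    # Area above the line: only the excess over the threshold counts.
    stress_area = sum(max(0.0, w - stress_line) for w in workload_samples)
    # Assumed zone: the band from the stress line up to 100, for every sample.
    stress_zone = (capacity - stress_line) * len(workload_samples)
    return 100.0 * stress_area / stress_zone


# Eight quiet samples and two at 88% workload give a score of 12.
samples = [50.0] * 8 + [88.0] * 2
print(stress_score(samples))

# Workload above 100 pushes the score past 100, as the slide notes.
print(stress_score([130.0] * 4))
```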
  43. 43. Badge – Efficiency
      Answers complex questions like:
      • Are there optimization opportunities in our vDC?
      • How well do we do in terms of VM provisioning? Do we get them right?
      Efficiency Score factors
      • Reclaimable waste
      • Density ratio
      Graph depicts VMs by percent
      • Optimal – Optimally provisioned VMs
      • Waste – Over-provisioned VMs
      • Stress – Under-provisioned VMs
        • Not used in the Efficiency calculation (see Risk)
      Three resources considered
      • CPU
      • Memory
      • Disk Space
      Note: VMs can appear in both Stress and Waste.
      Value     Explanation
      >25       The efficiency is good. The resource use on the selected object is optimal.
      10 – 25   The efficiency is good, but can be improved. Some resources are not fully used.
      0 – 10    The resources on the selected object are not used in the most optimal way.
      0         The efficiency is bad. Many resources are wasted.
  44. 44. Badge – Reclaimable Waste
      Answers complex questions like:
      • Have we over-provisioned the VMs in terms of CPU, RAM and Disk? If yes, what is the degree of over-provisioning?
      • For every cluster, VM and datastore, what can we reclaim?
      It identifies the amount of reclaimable resources
      • CPU
      • Memory
      • Disk
      Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
      • Waste Score = Max (CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
      • The disk calculation can also include old snapshots and templates
      Value      Explanation
      0 – 50     No resources are wasted on the selected object.
      50 – 75    Some resources can be used better.
      75 – 100   Many resources are underused.
      100        Most of the resources on the selected object are wasted.
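A small sketch of the two formulas on this slide: a per-resource ratio of reclaimable to deployed capacity, and the maximum across CPU, RAM and disk space. The function name, units and figures are hypothetical.

```python
def waste_score(reclaimable, deployed):
    """Return (score, worst): the highest reclaimable/deployed ratio, as a percentage."""
    per_resource = {
        name: 100.0 * reclaimable[name] / deployed[name]
        for name in deployed
    }
    worst = max(per_resource, key=per_resource.get)
    return per_resource[worst], worst


# Hypothetical cluster: CPU in GHz, RAM in GB, disk space in TB.
deployed = {"cpu": 100.0, "ram": 512.0, "disk_space": 10.0}
reclaimable = {"cpu": 20.0, "ram": 256.0, "disk_space": 3.0}

score, worst = waste_score(reclaimable, deployed)
print(score, worst)
```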
  45. 45. Badge – Density
      Answers complex questions like:
      • How high can we push our consolidation ratio before we experience performance problems?
        • Now that's a million dollar question!
      • For every datacenter, cluster and ESXi host, what are our key ratios and how much headroom do we have?
      Contrasts Actual vs Ideal Density
      • Identify optimal resource deployment before contention occurs
      • Ideal is based on demand, not simple configuration.
      • High Density is good. 100 is not too high.
      Value     Explanation
      >25       Good consolidation.
      10 – 25   Some resources are not fully consolidated.
      0 – 10    The consolidation for many resources is low.
      0         The resource consolidation is extremely low.
  46. 46. Badge Thresholds There are 2 different thresholds: VM and Infra (ESXi, Cluster, Datastore, etc). Notice that a Major badge has a different threshold from its minor badges. Even “similar” badges have different thresholds: notice that Time Remaining and Capacity Remaining have very different thresholds. Disable a color threshold by clicking the level off.
  47. 47. Using badges together
      Workload High & Anomalies Low & Stress High: add resources
      • Workload – Object is running hot. Potentially starving for resources
      • Anomalies – Normal behavior for this timeframe
      • Stress – Object is often running under high Workload.
      Workload High & Anomalies Low & Stress Low: not likely a big problem... a cyclical workload spike?
      • Workload – Object is running hot. Potentially starving for resources
      • Anomalies – Normal behavior for this timeframe
      • Stress – Object usually has enough resources
      Workload High & Anomalies High: something is amiss! Immediate attention.
      • Workload – Object is running hot. Potentially starving for resources
      • Anomalies – Abnormal behavior for this timeframe
      If there are Alerts and Faults too, then it is a sign of a major issue.
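The three combinations above amount to a small decision table. The sketch below encodes only those three cases; the function name and the wording of the fallback are assumptions, and what counts as "high" for each badge is left to the caller.

```python
def interpret(workload_high, anomalies_high, stress_high):
    """Map the badge combinations from the slide to the suggested reading."""
    if workload_high and anomalies_high:
        return "Something is amiss: immediate attention"
    if workload_high and stress_high:
        return "Add resources"
    if workload_high:
        return "Not likely a big problem: a cyclical workload spike?"
    return "Combination not covered by the slide"


print(interpret(workload_high=True, anomalies_high=False, stress_high=True))
```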
  48. 48. Discussion Point Is Badge the way to go? Are these the right 11 badges? What other badges do you need?48
  49. 49. Heat Map
      A great way to show a lot of information on 1 screen.
      A heat map can quickly highlight information, as it can present relative information. It is good for relative comparison among VMs.
      Built-in heat maps
      • Basic:
        • Storage: space, IO
        • CPU
        • RAM
        • Network
      • Advanced (or composite)
        • Health
        • Workload
        • Capacity
      A heat map is a 2-dimensional chart, so it takes 2 parameters. You cannot choose >2 data items. For example, you cannot show the following at the same time:
      • IOPS, Latency and Throughput. Also, these 3 have different units, so it is hard to combine them using a Super Metric.
      • ESX, VM and Datastore.
      Custom heat map or cold map
      • Since we can change the color, we can actually create a cold map.
      • In a cold map, the bigger the size, the colder (less utilised) it is. The bluer it is, the less utilised it is.
      • Hence it focuses on Waste
  50. 50. Storage: Datastore + VM vs workload + latency
      Since all the datastores are on the same array, how do we quickly tell the relative workload generated by every one of them?
      • This answers: which datastores are heavily loaded?
      For each of these datastores, how do we know the relative workload generated by each VM?
      • This answers: which VMs dominate within a datastore?
      For every VM, how do we know its performance is reasonable?
      • This answers: which VM has a storage bottleneck?
      How do we show all the above data on one page, without the need to show a lot of numbers?
      • And we still want to be able to drill down to each VM and datastore.
  51. 51. Each square is a VM. They are grouped by datastore. Bigger square: bigger throughput Color: latency.51
  52. 52. Storage: Throughput vs Latency at cluster level Which cluster is generating high storage workload? Are they getting the SLA they ask? What’s the latency? The cluster owner wants to know that his entire cluster is getting <10 ms latency. We expect these X, Y, Z clusters to be doing little work. Can we prove this? Basically, the same concept from previous slide, but looking from cluster point of view as Cluster & Datastore has a Many-to-Many relationship.52
  53. 53. Storage: Throughput vs Latency at cluster level53
  54. 54. Storage: Throughput vs Latency at host level54
  55. 55. Storage: Throughput vs Latency at VM level Can we show at VM level now? That’s why you need a 24” monitor :)
  56. 56. Storage: Space vs Latency Any big VM that is not getting the SLA we agreed on?56
  57. 57. Storage: Datastore space contention Do we have space contention on any of the datastores? If yes, how bad is the contention? • While we use thick provisioning at vSphere level (and thin at array level), we still have space risk from snapshots, vRAM increase, new VMs, new vDisks, storage vMotion, storage DRS, etc. Are the datastores uniformly sized?
  58. 58. Storage: Space contention We use thin provisioning58
  59. 59. CPU: Contention vs Usage at cluster level Which clusters are doing the most work? Which are not doing much? How is the CPU workload on every cluster? For each of those clusters, can we see if there is CPU contention?59
  60. 60. CPU: Contention vs Usage at host level Same questions as the previous slide, but for hosts. We can expect some “drill down” in this heat map.
  61. 61. CPU: Contention vs Usage at VM level Can we show at VM level now? That’s why you need a 24” full HD monitor :)
  62. 62. VM Health
      Current Health
      • Are all the VMs healthy? Especially those VMs which have high workload!
      • Which VMs are experiencing problems?
      • Are more demanding VMs less healthy?
      • Can we see this by cluster? By host?
      Future Health
      • Will all the VMs be okay in future (30 days)? We need to check CPU, RAM, Disk IO, Disk Space and Network for every single VM!
      • For those VMs which are not OK, can we be specific about which value will run out first? Can we “drill down” to an individual VM?
  63. 63. VM: color by health, size by workload63
  64. 64. VM: color by capacity, size by workload This now shows the future projection. We can see that the VM vCenter 5 is red: its capacity will run out within 30 days. So we click on it to drill down.
  65. 65. Drill down to specific VM Screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in 10 days. We can go as far as 6 months. This is good enough as you should not buy hardware >6 months in advance. It makes sense in the physical world as it’s fixed, but unwise in virtual world.65
  66. 66. Drill down to specific VM Showing value in absolute terms is good, but can be confusing. vCenter Ops can also show in %66
  67. 67. Discussion Point Which heat maps are useful for you? What other heat maps or cold maps do you need?67
  68. 68. Smart Alert vs Normal Alert
      Smart Alert
      • Relies on the advanced analytics instead of simple raw counters.
      • Not static, as it is based on Dynamic Thresholds
      • Examples:
        • Early warning alerts: use total anomalies to predict when a problem is happening, sometimes before users are impacted
        • KPI predictive: prediction that a KPI might soon go abnormal, due to an event occurring that has preceded the KPI going abnormal on previous occasions
        • Fingerprint: a set of metric anomalies matches a previously seen problem (and its associated resolution)
      Comparison
      Advanced Edition                                    | Enterprise Edition
      Provides alerts on Minor Badges only.               | Provides alerts on any counter (raw, badge, super metric)
        E.g. Workload YES, Health NO                      |
      Can only do infrastructure-level alerts             | Can do application-level alerts
      Good for alerts on single objects (e.g. VM)         | Good for single or multiple objects
      Driven by the badge's changing color                | Driven by threshold anomaly breaches and KPI threshold breaches
      Not customisable                                    | Highly customisable
      Cannot alert at Resource Pool or Folder level       | Can do it
  69. 69. Application-level smart alert Needs Enterprise edition.69
  70. 70. Alert When does Alert happen • When a badge change color • When a fault happens • VC Ops own alert • A component in VC Ops itself has failed. • VC Ops cannot get data Can do SNMP and SMTP • Both are set at set on the Administration Web page. The URL format is https://VM-IP/admin/70
  71. 71. Advanced edition: Alert main window Filter by the 11 badges. Filter VC Ops' own alerts: system or environment.
  72. 72. 72
  73. 73. Enterprise edition: Alert main window New alerts: Early Warning, KPI Breach, KPI Prediction, KPI High Threshold breach, Classic (static) We can also color the row by criticality, and specify period (start – end)73
  74. 74. Enterprise edition: alert detail74
  75. 75. 75
  76. 76. Email Notification Rules76
  77. 77. Email Notification Rules77
  78. 78. Anomalies – Symptoms Window
      Example of an ESXi Anomalies symptoms window, taken from an ESXi host with 11 VMs.
      • It shows 3 resource kinds: VM, Datastore, Host System
      • The VM resource kind has 7 metric groups with anomalies.
      The VM resource kind (30 out of 71 Symptoms)
      • 71 – Total number of Symptoms under the VM object
        • We are reporting on an ESX host here, and a VM is a child of the host. So all children metrics are included.
        • The metric groups come from the vSphere adapter + VC Ops' own.
      • 30 – Total number of displayed Symptoms
        • Based on the limit of 5 metrics shown for each Metric Group
        • The metric groups (CPU Usage, Network, Summary, etc) are specified by the adapter
      • Subcategory Network (3 of 11)
        • 11 – The total number of VMs associated with this ESX host. This is not the number of symptoms.
        • 3 – The total number of VMs that have one or more Network symptoms.
      Metrics will not be identical across VMs, although most will be similar. A multi-vCPU VM will have more vCPU metrics than a 1-vCPU VM.
      Different VMs will have different anomalies, as they have different workloads.
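To make the "30 out of 71" reading concrete, here is a sketch of how a per-group display limit of 5 turns a total symptom count into a displayed count. The per-group counts are invented so that they reproduce the slide's totals; only the group names CPU Usage, Network and Summary, the limit of 5 and the figures 30 and 71 come from the slide.

```python
DISPLAY_LIMIT = 5  # metrics shown per metric group, per the slide


def symptom_counts(symptoms_per_group, limit=DISPLAY_LIMIT):
    """Return (displayed, total) for a resource kind."""
    total = sum(symptoms_per_group.values())
    displayed = sum(min(count, limit) for count in symptoms_per_group.values())
    return displayed, total


# Seven metric groups with hypothetical symptom counts.
vm_groups = {
    "CPU Usage": 25,
    "Network": 15,
    "Summary": 12,
    "Memory": 9,
    "Disk": 5,
    "Datastore": 3,
    "System": 2,
}

displayed, total = symptom_counts(vm_groups)
print(f"{displayed} out of {total} Symptoms")
```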
  79. 79. vCenter Operations presents datastore with all the details79
  80. 80. Storage in vCenter Operations Automatic learning of storage performance. Calculating both Demand and Normal rate.80
  81. 81. vSphere 5 Performance Chart (fat client) Can only choose 1 component at a time, e.g. cannot show CPU and RAM at the same time.
  82. 82. vSphere 5 Performance Chart (fat client) Can only show 1 chart at a time. Hence can only show 2 units at a time.82
  83. 83. vCenter Operation charts Can show >1 charts at a time. Can combine/split charts. Can show different data type from different objects. Line is color coded, showing when threshold is breached.83
  84. 84. Capacity Management in vSphere is hard
      • CPU Optimizations: vSMP, Shares, Reservations, Limits
      • Memory Optimizations: Transparent Page Sharing, Memory Ballooning, Memory Compression
      • Storage Optimizations: Thin Provisioning, Linked-Clones
      • Clusters: DRS, HA, FT, vMotion, Storage vMotion
      • Workload Flux: VMs growing/shrinking, added/removed
      [Diagram: vSphere capacity split into Reserved Capacity, Usable Capacity, Used Capacity and an unknown (“?”) Remaining Capacity; 36 days remaining.]
  85. 85. Capacity Management
      Analyze
      • What are my historical utilization trends?
      • What resources have been requested vs. needed?
      • How many more VMs will fit in my current farm?
      Optimize
      • How can we use my resources more efficiently?
      • What VMs should be right-sized?
      • Can I reclaim over-provisioned or unused capacity?
      Forecast
      • When will I run out of capacity?
      • What if I add, remove or reconfigure capacity?
      • Can I defer infrastructure investments?
  86. 86. Understanding Behavior
      Need to understand the weekly pattern
      • Business week
      • Weekend
      • E.g. workload spike at 9am on Mondays
      Accomplished through roll-ups
      • Roll up the weeks in a month to compute the typical week for the month
      • Roll up the typical week of each month into a typical week for the quarter
      Differs from performance management roll-ups
      • Older performance data gets less granular. vCenter loses accuracy
      • Older capacity data maintains its granularity
      [Diagram: Year 1 > Quarter 1 > Month 1, Month 2, Month 3]
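The roll-up idea can be sketched as a slot-by-slot aggregation: each week is a list of values, one per time slot, and the typical week is computed across weeks for each slot. Using a plain average is an assumption (the slides do not say which aggregation VC Ops uses), and three slots stand in for a full week to keep the example short.

```python
from statistics import mean


def typical_week(weeks):
    """Combine several weeks, slot by slot, into one typical week."""
    return [mean(slot_values) for slot_values in zip(*weeks)]


# Hypothetical workload for four weeks; the slots could be
# "Sunday night", "Monday 9am" and "Monday noon".
month_1 = [
    [10, 80, 20],
    [20, 90, 30],
    [30, 70, 40],
    [20, 80, 30],
]
typical_month_1 = typical_week(month_1)
print(typical_month_1)

# The same function rolls typical months up into a typical week for the quarter.
typical_quarter = typical_week([typical_month_1, [30, 90, 40], [10, 70, 20]])
print(typical_quarter)
```

Because the slots are kept, the Monday 9am spike survives the roll-up instead of being flattened into a daily or weekly average.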
  87. 87. 87
  88. 88. 88
  89. 89. 89
  90. 90. Planning → Summary → Export
  91. 91. Planning → Summary → Resources
  92. 92. Planning → Summary → Resources
  93. 93. Planning → Summary → Resources
  94. 94. What-if
      Visualise
      • Add or remove VMs.
        • Add based on existing VMs as profiles
        • Add based on a spec you supply
      • Add, remove or update hosts.
        • Modify CPU and RAM only. No Network.
      • Add, remove or update datastores.
        • Update means increase or decrease size.
        • No IOPS yet.
      At cluster level or host level
      • Cannot be done at datacenter level or higher
      • Host level does not make sense when the host has HA & DRS turned on
      You can add multiple what-if scenarios
      • You can combine them or compare them on the same chart
      • You cannot save. Changes are lost upon log-off.
      • You can export the scenario results to an Adobe PDF or CSV file.
  95. 95. 3 choice of views95
  96. 96. Average VM Capacity (trend view)96
  97. 97. 97
  98. 98. 98
  99. 99. Modeling a what-if scenario Change Supply Change Host/Datastore Based on existing VMs Change Demand Change VM New VM spec99
  100. 100. Modeling a what-if scenario100
  101. 101. Modeling a what-if scenario – Specifying VM Configuration101
  102. 102. Modeling a what-if scenario – Using Existing VMs Columns you can see102
  103. 103. Modeling a what-if scenario – Using Existing VMs103
  104. 104. Modeling a what-if scenario – Using Existing VMs104
  105. 105. Modeling a what-if scenario – Changing hosts105
  106. 106. Modeling a what-if scenario – Changing datastores106
  107. 107. Modeling a what-if scenario107
  108. 108. 108
  109. 109. Capacity state today VM count capacity Current capacity cross-over point Actual VMs deployed109
  110. 110. Common VM distribution110
  111. 111. Datastore waste111
  112. 112. 112
  113. 113. Reclaim waste capacity113
  114. 114. VMs can appear in Stress and Waste at the Same Time Undersized for CPU Oversized for Memory114
  115. 115. Powered-Off VM and Idle VM: setting115
  116. 116. Powered-off VMs116
  117. 117. Capacity Planning: Is the VM really sized properly? Setting a threshold of under-utilisation alone is not enough We need to calculate the degree of under-utilisation.117
  118. 118. Oversized VM & Undersized VM118
  119. 119. Oversized VMs - Calculation Same concept applies to undersize. Same concept applies to idle VM.119
  120. 120. Planning → Summary tab. Planning → Views tab.
  121. 121. Tips
      Number of intervals and data points used for analysis
      • Tied to your business cycles.
      • Pick the correct number of data points and the interval type to represent a typical business cycle.
      • Match the number of intervals used for the trend view with the number of data points used for forecasting
      • Stay with the default forecasting algorithm settings
      Leverage buffer settings to accommodate unforeseen usage spikes or future business growth.
      • VC Ops 5 does not yet have a “future incoming VM” concept
      Leverage business hours to eliminate off-peak usage
      Don’t be afraid: play with the global settings
      • They are just knobs used for data analysis
      • Raw data is not modified when global settings are changed
  122. 122. Change Events Correlated with Performance Overview • Integration between vCM and vC Ops Mgr for change events • Overlay Guest OS configuration changes from vCM in vC Ops performance trend graphs • Launch in context into vCM to see full details of changes and potentially remediate them Benefits • Enable Operations to quickly understand and resolve performance issues arising from configuration changes (reduce MTTR) • Drive efficient & effective troubleshooting by correlating Guest OS configuration changes w/ VM performance degradations • In larger enterprise, help bridge gap between VMware Admin and Guest OS Admin122
  123. 123. VCM Events in vC Ops – Events Collected
      vC Ops does not pull in every event from vCenter
      • Only events that could affect health or workload (vSphere knowledge!)
      The adapter only pulls in change events for Guest OSs
      • No ESX/i host configuration changes (these come from the vCenter Adapter)
      • The Guest OS has to be managed by VCM
      Events collected: Reboot, Software Install/Uninstall, Windows Registry, IP/Networking changes, Device Driver changes, Memory/CPU changes, Windows Firewall, Patches
  124. 124. Event Types in vC Ops Mgr
      Circle events are vCM-initiated
      • Change log in vCM is updated when the change is completed
      • Time = Occurred time
      Diamond events are non-vCM-initiated
      • Change log in vCM is updated when vCM collects from the VM
      • Time = Collected time
      Always blue events – “might” have minimal impact
      vCM events follow the normal vC Ops display rules
      • vCM events appear for the VM object itself
      • vCM events appear on an ESX host if you enable Child Events
  125. 125. 125
  126. 126. 126
  127. 127. vCM Change Events Correlated with Performance A pop-up for a vCM event related to uninstalling a piece of software on the VM in question.
  128. 128. vCM Change Events Correlated with Performance 128
  129. 129. Terms
      The terms Attribute, Metric and Counter mean essentially the same thing.
      • CPU Ready Time is an attribute.
      • CPU Ready Time from the VM ABC123 is a metric.
      • vSphere uses the word Counter. VC Ops uses Attribute and Metric.
      • As there are many attributes, they are grouped together. This is called an Attribute Package.
      A Resource provides the Metrics.
      • Examples of resources: host, VM, datastore, cluster, etc.
      • So a resource provides many attributes.
      • Resources are pulled via an Adapter.
      • In VC Ops, there are many kinds of resources. So there is a term, Resource Kind, that you need to get used to.
      • VC Ops uses different adapters to talk to different sources: 1 type of adapter per source. So there is a term, Adapter Kind.
      [Diagram: an Adapter Kind contains Resources; each Resource contains Attributes.]
      Advanced terms
      • Container. Super Metrics. Application. Tier. KPI
  130. 130. Adapter, Resource, Attribute, Package
      VC Ops Adapter       Source of data
      VMware Adapter       vSphere 5
      VCM Adapter          VCM 5.4
      VC Ops Adapter       VC Ops 5
      Container Adapter
      Adapter Kind = adapter type. VMware Adapter is an example of an Adapter Kind.
      1 Adapter Kind can pull many kinds of objects from the source. Each of these is called a Resource Kind.
      To make the management of attributes easier, they are put into a Package. Inside a package, metrics are grouped for ease of use.
      Screenshot callouts:
      • This is the actual Resource Kind that is brought by the VMware Adapter.
      • Container Adapter is not actually an adapter. It is a group or container that can hold other objects.
  131. 131. Actual Resource Kinds Sample adapters with their associated resource kinds.
      • This is a special & built-in adapter. It monitors VC Ops itself! VC Ops is just an application, which also needs monitoring.
      • This is another special & built-in “adapter”. Technically, it is not actually an adapter, as it is just a container.
  132. 132. vSphere resource kinds Unlike the Advanced edition, we can utilise Folder and Resource Pool • This means you can create Super Metric at this level. • Complement vCenter. Not used? ESX Host Not used? No vApp, no Datastore Group, no vDS as at VC Ops 5.132
  133. 133. Resource Kind: default settings133
  134. 134. Attribute & Attribute Package Package • A collection of Attributes from 1 Resource with the same collection interval. That’s all! • Need to map it to objects • Super Metric must be placed into a package • A package cannot come from multiple resources. See screen below. • Cannot create a package that has both VM and ESXi • There is a default package called All Attributes.134
  135. 135. 135
  136. 136. 136
  137. 137. 137
  138. 138. 138
  139. 139. Editing a resource property139
  140. 140. 140
  141. 141. Resource Kind: Tags
      • What’s the difference between Applications and Application? It looks like Application is from the Container adapter, which is built-in.
      • Maintenance schedule contains the time a particular object is on scheduled downtime. It tells VC Ops to ignore that period; otherwise VC Ops would raise an alert, as the behaviour is unexpected and it would think the health had dropped. So in this screen, ignore maintenance schedule, as it should not be part of Resource Kind.
      • The range for Health. This is not the same as the Health badge in the VC Ops Advanced edition, as this one is universal and applies beyond vSphere. Health in the Advanced edition includes Fault, which is vSphere specific.
      • Tier is a special container. Again, this is universal, so name your tiers properly to avoid changing names later on.
      • Only 1 value here. This means the entire VC Ops.
  142. 142. Resource Kind: Tags You can control which resource kinds are shown • In the picture below, ESX was hidden.142
  143. 143. Predefined Tags143
  144. 144. Drag selected objects to the tag value144
  145. 145. Resource Kind: Tags145
  146. 146. VC Ops generated metrics146
  147. 147. Monitoring the big workload
      You have convinced your CIO to virtualise the remaining 50% of the servers. Your CIO needs you to prove, supported by performance charts, that the platform has served every VM well, meeting the SLA over the past quarter.
      • Tier 1 cluster SLA: 2% CPU Ready, 0 RAM Ballooning, 10 ms disk latency, 0 dropped packets.
      • Tier 2 cluster SLA: 4% CPU Ready, 5% RAM Ballooning, 20 ms disk latency, 0 dropped packets.
      • Tier 3 cluster SLA: 6% CPU Ready, 10% RAM Ballooning, 30 ms disk latency, 0 dropped packets.
      You have 500 VMs on 50 ESXi hosts, 8 clusters, 40 datastores and 5 RDMs. You must prove that:
      • Not a single Tier 1 VM had >2% CPU Ready in the past quarter. The underlying ESXi host also had <2% CPU contention.
      • Not a single Tier 1 VM had >10 ms disk latency in the past quarter. The underlying ESXi host also had <10 ms disk latency.
      • Etc, for each tier and each component (CPU, RAM, Disk, Net)
      What kind of charts do you need to show?
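Before picking charts, it can help to state the proof as a check: for each VM, compare the worst value seen in the quarter against its tier's limits. The sketch below uses the SLA figures from the slide; the dictionary keys, the function name and the sample VM are hypothetical, and in practice the peak values would be exported from the monitoring tool.

```python
# SLA limits per tier, taken from the slide.
SLA = {
    1: {"cpu_ready_pct": 2, "balloon_pct": 0, "disk_latency_ms": 10, "dropped_packets": 0},
    2: {"cpu_ready_pct": 4, "balloon_pct": 5, "disk_latency_ms": 20, "dropped_packets": 0},
    3: {"cpu_ready_pct": 6, "balloon_pct": 10, "disk_latency_ms": 30, "dropped_packets": 0},
}


def sla_breaches(tier, quarterly_peaks):
    """Return the counters whose worst value in the quarter exceeded the tier SLA."""
    limits = SLA[tier]
    return {
        counter: peak
        for counter, peak in quarterly_peaks.items()
        if peak > limits[counter]
    }


# Hypothetical worst-case values for one VM over the past quarter.
vm_peaks = {"cpu_ready_pct": 1.5, "balloon_pct": 0, "disk_latency_ms": 12, "dropped_packets": 0}

print(sla_breaches(1, vm_peaks))  # breaches the Tier 1 latency limit
print(sla_breaches(3, vm_peaks))  # within the Tier 3 limits
```

Checking the peak rather than the average matters here, because the claim is that not a single VM exceeded the limit at any time.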
  148. 148. Super Metrics148
  149. 149. Super Metric: Functions
      2 types:
      • Looping functions: take multiple input values
        • Average, sum, min, max, count, combine, etc.
        • More practical or useful than single functions
      • Single functions: take 1 value
        • Absolute, round up, round down, square root, etc.
      The xxxN functions do not work on just the immediate children; they look down (or up) the number of levels specified in the formula.
      • In the screenshot, the ‘2’ tells the function to look down two levels for the metric.
      • Putting -2 means look up.
  150. 150. Super Metric: hierarchy Example: a super metric for the average CPU usage of a cluster. A VM is 2 levels down from the cluster.
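The "look down N levels" behaviour can be pictured as a walk down a resource tree. This sketch is plain Python, not VC Ops super metric syntax; the Resource class, avg_n function, metric name and inventory are all hypothetical, and only the downward direction is modelled.

```python
from statistics import mean


class Resource:
    """A node in a parent/child hierarchy such as cluster > host > VM."""

    def __init__(self, name, metrics=None, children=()):
        self.name = name
        self.metrics = metrics or {}
        self.children = list(children)


def descendants_at(resource, levels):
    """Return the resources exactly `levels` levels below `resource`."""
    current = [resource]
    for _ in range(levels):
        current = [child for node in current for child in node.children]
    return current


def avg_n(resource, metric, levels):
    """Average a metric over the resources `levels` levels down."""
    values = [
        node.metrics[metric]
        for node in descendants_at(resource, levels)
        if metric in node.metrics
    ]
    return mean(values) if values else None


cluster = Resource("cluster-1", children=[
    Resource("esx-1", {"cpu_usage": 45}, [
        Resource("vm-a", {"cpu_usage": 10}),
        Resource("vm-b", {"cpu_usage": 20}),
    ]),
    Resource("esx-2", {"cpu_usage": 55}, [
        Resource("vm-c", {"cpu_usage": 30}),
        Resource("vm-d", {"cpu_usage": 60}),
    ]),
])

print(avg_n(cluster, "cpu_usage", 2))  # VMs, two levels down
print(avg_n(cluster, "cpu_usage", 1))  # hosts, one level down
```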
  151. 151. 151
  152. 152. 152
  153. 153. Super Metric: Operators
      To calculate a value for each VM based on metrics for that VM, use the ‘$This’ operator.
      Another example: max ( $This:CPUavg, ESXi-Host-003:CPUavg, VM:CPUavg)
      Finds the maximum value among these:
      • The CPUavg metric for the resource to which the super metric is assigned (so this is dynamic)
      • The CPUavg metric for a specific resource called ESXi-Host-003 (so this is hardcoded)
      • The CPUavg metric for all resources of type VM (so this is universal for all VMs)
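The three operand styles in that formula can be mimicked in plain Python to show what each one resolves to. This is an illustration of the semantics described on the slide, not the VC Ops expression language; the dictionary layout, function name and values are hypothetical, and only ESXi-Host-003 and CPUavg come from the slide.

```python
def evaluate_max(this_resource, inventory, metric="CPUavg"):
    """Mimic max($This:CPUavg, ESXi-Host-003:CPUavg, VM:CPUavg)."""
    # $This: the resource the super metric is assigned to (dynamic).
    values = [this_resource["metrics"][metric]]
    # ESXi-Host-003: one specific, hardcoded resource.
    values += [r["metrics"][metric] for r in inventory if r["name"] == "ESXi-Host-003"]
    # VM: every resource of kind VM.
    values += [r["metrics"][metric] for r in inventory if r["kind"] == "VM"]
    return max(values)


inventory = [
    {"name": "ESXi-Host-003", "kind": "Host", "metrics": {"CPUavg": 55.0}},
    {"name": "vm-a", "kind": "VM", "metrics": {"CPUavg": 40.0}},
    {"name": "vm-b", "kind": "VM", "metrics": {"CPUavg": 72.0}},
]

result = evaluate_max(inventory[1], inventory)  # super metric assigned to vm-a
print(result)
```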
  154. 154. 154
  155. 155. 155
  156. 156. Super Metric: package156
  157. 157. 157
  158. 158. 158
  159. 159. Discussion Point Think of super metrics that you need. Explain why and how you will need them.159
  160. 160. Applications and Application Tiers The app team often views things from its own application-centric viewpoint. We can create a custom dashboard showing their “Application”. It is even better if we add non-vSphere data, like Hyperic. This gives app-level and Guest OS-level info, which is not available in the vSphere adapter. Define your own hierarchy and relationships.
  161. 161. Drag selected objects to the tag value
  162. 162. Parent-Child Resource Relationships162
  163. 163. 163
  164. 164. What counters do you check?
      CPU (ESX and VM)
      • Usage or Utilisation: overall CPU utilisation (on ESX, to get the overall utilisation of the entire box)
      • Usage or Utilisation: individual core utilisation (to see the distribution and whether any particular core is maxed out)
      • Wait (wait for IO; to see if it is IO bound)
      • Ready (VM unable to run, waiting for a core)
      • Co-Stop (if there are large VMs)
      RAM (ESX and VM)
      • Ballooning
      • Active or Active Write
      Storage
      • Latency. ESX: kernel latency, device latency. VM: guest latency, device latency.
      • Throughput (ESX and VM)
      • IOPS (ESX and VM)
      Network (ESX and VM)
      • Dropped packets
      • Throughput
      Others (ESX)
      • vSphere Replication?
      • System?
      • Cluster service?
  165. 165. Test your vSphere knowledge! How are Disk, Datastore, Adapter and Path related?165
  166. 166. CPU counters Test your vSphere knowledge! Which one is ESX, which one is VM? How do you know? Test your vSphere knowledge! What can stop/block a VM from getting the CPU it was configured with? No more Collection Level limitation: VC Ops collects and analyses them all. Changing the collection level in vCenter does not impact VC Ops, as VC Ops gets its data from the “real-time” statistics.
  167. 167. %OVRLP and %SYS
      [Diagram: timeline of Run, Wait and Ready time for two worlds. World 1: %RUN, %SYS, %OVRLP. World 2: %RUN, %OVRLP.]
      %RUN continues to accumulate, but %OVRLP kicks in.
      %OVRLP: overlapping time. A world still wants CPU but is interrupted by another world. A high number normally means ESX is experiencing heavy IO.
      %USED = %RUN + %SYS - %OVRLP. As a result, the overlap value does not incorrectly inflate %USED.
      %SYS: a high number means heavy IO or interrupts.
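The %USED relationship is simple enough to check by hand. The helper below just restates the formula from the slide; the function name and sample values are hypothetical.

```python
def pct_used(run, sys, ovrlp):
    """%USED = %RUN + %SYS - %OVRLP, so overlap time is not counted twice."""
    return run + sys - ovrlp


# Hypothetical world: 60% run, 5% system time, 3% overlap.
print(pct_used(run=60.0, sys=5.0, ovrlp=3.0))
```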
  168. 168. Memory counters ESXi VM168
  169. 169. Storage counters: ESXi host Datastore Disk Storage Adapter or Storage Path169
  170. 170. ESXi: Adapter, Device and Path. 1 adapter can access many devices (LUNs). 1 device is accessed via many paths. 1 path can access only 1 device.
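The three rules above can be modelled as a small data structure. This is a sketch under stated assumptions: the `Path` class, the helper functions, the LUN names and the path names are all hypothetical; only the adapter names vmhba2 and vmhba3 come from the deck.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One storage path: it belongs to one adapter and reaches exactly one device."""
    name: str
    adapter: str
    device: str


# Hypothetical inventory: two adapters, two LUNs, each LUN reachable over two paths.
paths = [
    Path("vmhba2:C0:T0:L0", "vmhba2", "LUN-0"),
    Path("vmhba3:C0:T0:L0", "vmhba3", "LUN-0"),
    Path("vmhba2:C0:T0:L1", "vmhba2", "LUN-1"),
    Path("vmhba3:C0:T0:L1", "vmhba3", "LUN-1"),
]


def devices_by_adapter(all_paths):
    """1 adapter can access many devices."""
    result = defaultdict(set)
    for p in all_paths:
        result[p.adapter].add(p.device)
    return dict(result)


def paths_by_device(all_paths):
    """1 device is accessed via many paths."""
    result = defaultdict(list)
    for p in all_paths:
        result[p.device].append(p.name)
    return dict(result)


print(devices_by_adapter(paths))
print(paths_by_device(paths))
```

Because each `Path` holds a single `device` field, the rule that 1 path can access only 1 device is enforced by the structure itself.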
  171. 171. ESXi: Disk
  172. 172. ESXi: Adapter, Device and Path. [Diagram labels: ESXi 5.0; vmnic; Storage Adapter 1, Storage Adapter 2; vmhba2, vmhba3; Storage Paths; NFS datastore, VMFS datastores, RDM; Disks]
  173. 173. Storage counters: VM. [Diagram labels: VM with Drive 1, Drive 2, Drive 3; Virtual Disk (VMDK, RDM); vDisks scsi0:0 and scsi0:2; VMFS datastore, NFS datastore, RDM; Disks]
  174. 174. Network counters: ESXi, VM
  175. 175. Other counters: ESXi host. vSphere Replication; System (vmkernel), see the next 2 slides for info; Cluster Service; Power
  177. 177. A long list of vmkernel resources. Some are familiar, such as vMotion, FT, hostd, Vpxa, DCUI and logging.
  179. 179. Widget
  180. 180. Widget: Full List
  181. 181. Dashboard: creating a new Tab
  182. 182. Alerts
  183. 183. Application Overview and Application Detail
  186. 186. Data Distribution
  189. 189. Health Status
  190. 190. Health Status
  191. 191. Health Tree
  192. 192. Health Tree
  193. 193. Health Tree
  194. 194. Advanced Health Tree
  195. 195. Advanced Health Tree
  196. 196. Scoreboard: Health or Workload
  198. 198. Scoreboard: Generic
  200. 200. Heat Map
  203. 203. Mashup Charts
  205. 205. Mashup Chart
  206. 206. Metric Graph
  208. 208. Metric Graph (Rolling View)
  210. 210. Metric Selector
  211. 211. Metric Sparklines
  214. 214. Resources
  216. 216. Tag Selector
  217. 217. Top-N Analysis
  219. 219. Geographic
  220. 220. The VC Relationship. There are 2 widgets that are vSphere related. Use the Advanced edition instead. • The Enterprise edition can access the Advanced edition UI at the same time. Just open another window or tab.
  221. 221. Interaction between widgets. Controlled at the dashboard level, not at the individual widget. Providing widget and Receiving widget.
  222. 222. Interaction between widgets
  223. 223. Interaction between widgets
  224. 224. Practice session: creating your dashboard
       Goal: have a dashboard that helps you investigate all non-local datastores quickly
       • Be able to plot charts for all non-local datastores for comparison.
       Answer:
       • Create a tag called Storage from the Environment screen.
       • Create 1 tag value: Shared Datastore
       • Tag all the non-local datastores with this tag value
         • Done manually. Simply drag all the rows
       • Create a dashboard with 4 widgets
         • Health Status
           • This is where you show the overall health of all non-local datastores
         • Resources
           • This is where you show all the members of the non-local datastore tag
         • Metric Selector
           • All the metrics will appear here.
           • Select the metric you want
         • Metric Graph or Metric Sparklines
           • Choose Sparklines if you have lots of graphs.
  226. 226. vCenter “equivalent” dashboard
  227. 227. Configuration
  230. 230. Cross Silo
  231. 231. Fingerprint
  232. 232. Maintenance Mode
  233. 233. Maintenance Schedules
  234. 234. Major Steps in implementation
       Steps: Define who needs what → Create Super Metrics → Create Applications → Create Tags → Create Heat Maps → Create Dashboards
       Begin with the end in mind
       • Every Super Metric must serve a particular role
         • Role, not individual. A person can and will have many heat maps/dashboards.
       • Decide if you need the following non-standard info
         • Application-level and Guest-OS-level info
         • Info from physical machines (UNIX, X64, etc.)
         • Info from physical storage and network (switch, FW, router, etc.)
       Think in terms of application
       • A great way to complement vSphere, as vCenter does not have this object.
  235. 235. Who needs to see what
       Roles, from the top down:
       • CIO or CTO
       • Group Head, e.g. Head of Infra, Head of Apps
       • Dept Head, e.g. Head of Storage, Head of Server, Head of Network, Head of Databases
       • Admin/Architect, e.g. Storage Admin, Network Admin, App Owner, VM Owner
       At the CIO/CTO end:
       • Simple Dashboard. Big picture. Tends to be application focused.
       • No absolute data. Normalised to 0-100.
       • Focus on long term. Averaged data. A 30-minute spike will not show up. Updated daily.
       At the Admin/Architect end:
       • Rich Dashboard. Ideally Full HD screen.
       • Specific info.
       • Absolute data + Normalised data.
       • Focus on short term. Actual data. A 5-minute spike will be visible. Updated every 2 minutes.
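Why a 30-minute spike disappears from a daily averaged view can be shown with simple arithmetic. The sketch below assumes 5-minute samples, a 20% baseline and a 100% spike; all three numbers are illustrative, not taken from vCenter Operations.

```python
from statistics import mean

SAMPLES_PER_DAY = 24 * 60 // 5  # 288 samples, assuming one sample every 5 minutes


def daily_samples(baseline: float, spike: float, spike_samples: int) -> list:
    """Build one day of utilisation samples containing a single spike."""
    return [spike] * spike_samples + [baseline] * (SAMPLES_PER_DAY - spike_samples)


# A 30-minute spike is 6 consecutive 5-minute samples.
samples = daily_samples(baseline=20.0, spike=100.0, spike_samples=6)
daily_average = mean(samples)
peak = max(samples)

print(f"daily average: {daily_average:.2f}%")  # barely above the baseline
print(f"peak sample:   {peak:.1f}%")           # what the admin-level view shows
```

The averaged figure moves by less than 2 points, which suits a long-term view, while the raw samples still expose the spike for short-term troubleshooting.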
  236. 236. Who needs to see what (samples)
       Roles | Info presented
       CIO | Health of overall IT in the past 1 month. Health of key applications in the past 1 month.
       CTO | As above, but with more technical content, and tailored to him.
       Head of Applications | Health of all key apps in the past 1 month, with the ability to do a 1-level drill down for each app. Capacity projection for all key apps.
       Head of Infrastructure | Health of Storage. Health of Network. Health of Servers (VMware and Physical). Health of VM.
       Head of Storage | A higher level, simpler dashboard than Storage Admin
       Head of Network |
       VMware Team |
       An App Owner | The infra is providing each of the VMs in my App with the resources it needs
  237. 237. Designing Super Metric
       Leverage existing derived metrics
       Leverage objects for which vCenter cannot provide performance data
       • Application, Resource Pool, Folder and Location can now have performance counters
       Minimise static alerts. Know what a good range is for the end result
       Build a simple table to avoid super metric sprawl and duplicating existing metrics
       • Below is an example, showing 2 Super Metrics.
       Name | Purpose | Target Role | Formula | Good Range
       VM SLA | Shows that a VM gets the resources it wants from the infrastructure, based on the defined SLA. | VM Owner | VM SLA = 100% - Max (CPU, RAM, Disk, Network). CPU = CPU Contention %. RAM = RAM ballooning %. Disk = % above threshold latency. Network = Packet Drop %. Tier 1 Disk SLA is 10 ms. Tier 2 Disk SLA is 20 ms. Tier 3 Disk SLA is 30 ms. | >99% (Tier 1 cluster), >97% (Tier 2 cluster), >95% (Tier 3 cluster)
       Infra SLA | Shows that the underlying infra has the resources for all the VMs on it | VMware Admin | Infra SLA = 100% - Max (Host Cluster, Datastore Cluster) |
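The VM SLA row can be sketched in Python to make the formula concrete. This is an illustration only, not how vCenter Operations evaluates a Super Metric: the function names and sample values are hypothetical, and "% above threshold latency" is read here as the share of latency samples above the tier's disk SLA, which is an assumption.

```python
DISK_SLA_MS = {1: 10, 2: 20, 3: 30}       # disk latency SLA per tier, from the table
GOOD_RANGE = {1: 99.0, 2: 97.0, 3: 95.0}  # VM SLA must be above this, per tier


def pct_above_threshold(latencies_ms, tier):
    """Assumed reading of 'Disk': percentage of samples above the tier's disk SLA."""
    if not latencies_ms:
        return 0.0
    threshold = DISK_SLA_MS[tier]
    above = sum(1 for latency in latencies_ms if latency > threshold)
    return 100.0 * above / len(latencies_ms)


def vm_sla(cpu_contention_pct, ram_balloon_pct, disk_above_pct, packet_drop_pct):
    """VM SLA = 100% - Max(CPU, RAM, Disk, Network)."""
    return 100.0 - max(cpu_contention_pct, ram_balloon_pct, disk_above_pct, packet_drop_pct)


def in_good_range(sla_pct, tier):
    return sla_pct > GOOD_RANGE[tier]


# Hypothetical latency samples (ms) for one VM.
latencies = [4, 6, 12, 5, 7, 9, 8, 3, 25, 6]

for tier in (1, 3):
    disk = pct_above_threshold(latencies, tier)
    sla = vm_sla(1.5, 0.0, disk, 0.1)
    print(f"Tier {tier}: disk={disk:.1f}% VM SLA={sla:.1f}% good={in_good_range(sla, tier)}")
```

Taking the maximum of the four components means the worst resource alone determines the SLA, so a single contended resource cannot be hidden by three healthy ones.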
  238. 238. Custom Heat Map or Cold Map
       Component | Heat Map | Cold Map
       CPU | Resource pool: size by CPU utilisation, | Least utilised VM: size by vCPU count, color by RAM + CPU usage (a Super Metric)
       RAM | Most RAM intensive VMs, grouped by ESX. Size by RAM utilisation, color by health |
       Disk | Most disk intensive VMs, grouped by ESX. Size by disk utilisation, color by health | Least utilised disk: size by GB, color by % of free
       Network | Most network intensive VMs, grouped by ESX. Size by network utilisation, color by health | Most idle VMs, grouped by host
       Capacity | VMs with file system that will run out soon. Color by % left, size by GB left. |
       Health | VM health, grouped by cluster. Color by health, size by workload. |
       Design consideration
       • Use Super Metrics so the info is richer.
       • Group VMs by 1 consistent hierarchy only. If you group by cluster, it won’t make sense to further group by datastore, as 1 datastore can span multiple clusters.
  239. 239. vCenter: network impact of vCenter Ops
  240. 240. Choice of Tools
       vCenter Operations
       • 1-15 minutes accuracy (for other sources)
       • 5 minutes accuracy (for vSphere)
       • The problem need not be reproducible, but it should last >5 minutes, preferably 15 minutes (3 samples)
       vCenter
       • 20-300 seconds accuracy
       • Reproducible performance issue
       • Requirements: you already have some idea what causes it
       esxtop
       • 2-20 seconds accuracy. Short burst problem.
       • Reproducible performance issue
       • Requirements: you already know which ESX and VM has the problem.
       vscsiStats
       • Specific for storage, low level analysis
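One way to read the list above is as a decision rule keyed on how long the problem lasts and whether it can be reproduced. The sketch below is an assumption-laden simplification: the function name and the exact cut-offs (300 and 20 seconds) are derived from the accuracy ranges on the slide, not from any product documentation.

```python
def suggest_tool(duration_s: float, reproducible: bool) -> str:
    """Suggest a starting tool for a performance problem (illustrative rule only)."""
    if duration_s > 300:
        # Lasts longer than 5 minutes: visible in vCenter Operations without reproducing it.
        return "vCenter Operations"
    if not reproducible:
        # Too short for a 5-minute sample and cannot be replayed for the finer-grained tools.
        return "wait for a longer or repeatable occurrence"
    if duration_s >= 20:
        return "vCenter"
    return "esxtop"


print(suggest_tool(900, reproducible=False))  # 15-minute problem, 3 samples
print(suggest_tool(120, reproducible=True))
print(suggest_tool(5, reproducible=True))
```

The rule leaves out vscsiStats on purpose: the slide positions it as a storage-specific follow-up rather than a choice driven by duration.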