
Adam Grummitt - Capacity Management: Guided Practitioner Satnav


Capacity Management: Guided Practitioner Satnav

Published in: Technology, Business

  1. CapMan GPS (CMG Brazil 2011, slide 1 of 30, adam@ grummitt.com)
     Capacity Management: Guided Practitioner Satnav
     A General PostScript to Capacity Management: A Practitioner Guide, ISBN 9789087535193, published by Van
  2. CapMan GPS - Summary
     1. Where am I? Baseline, gap analysis, perception and reality
     2. Where do I want and need to get to? Defined business objectives, real infrastructure
     3. How do I get there? Fastest, shortest, cheapest, safest
     4. What has to get there? All? Most expensive? Lightest?
     5. Who do I need to travel with? Evangelist, Champion, Architects, Planners
     6. What else has to happen at the same time? SLAs, Availability, Continuity, Demand Management
     7. When will I get there? Short, medium and long term
     8. Why should I go there? - Conclusion
     With acknowledgement to Paul Wilkinson for his ABC cartoons.
  3. (1) Where am I? - GPS
  4. CSI route map
     • Where are we now? Baseline of current service levels
     • What do we want? Business vision, mission, goals
     • What do we need? External and internal drivers
     • What can we afford? Business budgets, IT specs
     • What will we get? Business budgets, IT specs
     • Deliver service
     • What did we actually get? Delivery & perception of service
     • Does it meet wants/needs? Delivery & perception of service
  5. (2) Where do I want to go?
  6. Gap analysis - kiviat
     [Kiviat (radar) chart, scale 0 to 4, two series: Now and Next. Axes: Monitors, Costs, Baselines, CMDB changes, Bottlenecks, Testing results, Patterns, Application sizes, Thresholds, SLA targets, Alarms, Capacity plans, Demands, Resource usage, Workload, Forecasts, Service drivers]
  7. Metron Metrics Matrix, e.g. Reporting
     Metrics Matrix: Reporting. Measures per application, by environment (Dev, Test, Acc, Prod) and mode (DR, Failover, Normal).
     • Business
       – Activity drivers: number of users, reports produced
       – Performance (BPI): time for report per location, frequency of reports, number of capacity related incidents
     • Service
       – SLA targets: response time, report generation time
       – SLA constraints: given number of reports, number of concurrent users
     • Component/resource, per generic (N/W, SAN, DBMS) & platform (mf, UNIX)
       – Groups of top metrics: overview relevant to domain
       – e.g. CPU: utilization, e.g. LPAR (mainframe and AIX)
       – e.g. I/O: read/write activity per sec
       – e.g. RAM: paging/swapping
     • Special metrics: #locations, #users, #reports, time per report
  8. (3) How do I get there?
     [Diagram labels: SF, (C)SI, SS, ST, SD, SO]
  9. How do I get there - ABC
     • Paul Wilkinson – ABC of ICT
     • People, product, process, partners
     • Performance depends on Attitude, Behaviour, Culture
  10. (4) What has to get there?
     • All of my belongings?
     • A selection of what is most important?
     • Needed in the short term?
     • Needed in the medium term?
     • Needed in the long term?
     • Most expensive?
     • Lightest?
     • What I am allowed to take by my service provider?
     • What level of service I am prepared to pay for?
     • Private flight, 1st class, business, premium, coach, economy?
     • Contractual agreement on service level and violations
     • Demand management…
  11. Service mapping to continuity & capman
     Service | Critical to | (%) | (%) | (%) | (level) | (level)
     Diamond | Mission | 25% | 400% | 25% | Highest | Highest
     Platinum | Regulation | 50% | 200% | 50% | Higher | Higher
     Gold | Business | 100% | 100% | 100% | High | High
     Silver | Important | 200% | 75% | 200% | Medium | Medium
     Bronze | Regular | 300% | 50% | 300% | Low | Low
     Tin | Discretionary | 400% | 25% | 400% | Very low | Very low
     [Column heading text on the slide, whose assignment to the five value columns is not recoverable from the transcript: Capacity; Performance; Continuity; Headroom; Workload; Failover; Mirror level, DR level, Backup level; Allowable degradation from baseline; Allowable change per quarter from peak; Allowable degradation % from baseline performance]
  12. Possible resource extension to mapping
     Service | Critical to | MF service class priority | CPU UNIX | CPU Wintel | N/W bandwidth | RAM GB | Storage & I/O
     Diamond | Mission | highest | 16-32 | 16-32 | highest | XXL | T0 - SSD
     Platinum | Regulation | higher | 8-16 | 8-16 | higher | XL | T0 - SSD
     Gold | Business | high | 4-8 | 4-8 | high | L | T1
     Silver | Important | medium | 2-4 | 2-4 | medium | M | T2
     Bronze | Regular | low | 1-2 | 1-2 | low | S | T3
     Tin | Discretionary | lower | 1 | 1 | Very low | CC | T4
     [Further sub-labels under the heading row on the slide: Quota - limit; Guarantee - cap; VM]
  13. (5) Who do I need to travel with?
     • Evangelist – technician who understands capman
     • Champion – manager who appreciates capman and has $
     • Architect/analyst (applications) – who know their systems
     • Planners (tools, domains) – who know their domains
     • Business users – who know their needs and constraints
     • Maybe a mentor for overall guidance
     • Maybe an expert to give initial appreciation workshops
     • Maybe a consultant to act as a catalyst with management
     • Maybe contractors to provide short term expertise
     • Not …
  14. Who do I not need?
     • Sysprogman: super-hero
     • ITIL perfectionist: paralysis by analysis
     • A BPR process perfectionist
     • A lean black belt
     • ISO 20000 top level checklists
     • Boy racer who ‘installs ITIL’ in 3 months
  15. (6) What else has to happen?
     • SLAs with respect to performance and capacity
     • Availability
     • Continuity
     • Demand Management
     • Things done for real not by rote?
     • Exception reporting leading to actions
     • Automated activities
     • Proper use of tools
  16. SLA & Performance
     • NOT – in vacuo
       – “Mandatory ave response of 3 secs; desirable 1 sec”
       – “Mandatory 8 secs; desirable 5 secs for 95 %ile”
     • MAYBE – predefined, objective, quantified, meaningful
       – “for the XYZ service, between 8am and 8pm, for a normal traffic of <1000 transactions per hour, the average response time is desirably <1 sec and mandatory <2; 95% of response times should be <3 secs and must be <5 seconds”
     • NEEDS – measurable, achievable, appropriate
       – Service catalogue/portfolio, business needs
       – Instrumentation for traffic levels and app counters
       – Agreements with teeth that can be monitored & policed
       – Normal, peak and exceptional service levels.
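The quantified "MAYBE" clause above is specific enough to be checked mechanically. The following is a minimal Python sketch of such a check, not something taken from the deck: the function names, the nearest-rank percentile method and the result strings are all assumptions made for illustration. Only the thresholds (8am to 8pm, fewer than 1000 transactions per hour, 1 and 2 seconds for the average, 3 and 5 seconds for the 95th percentile) come from the slide.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method; the slide does not name one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def assess_xyz_sla(response_times, tx_per_hour, hour_of_day):
    """Classify one measurement interval against the example SLA wording."""
    # The agreement only applies between 8am and 8pm at normal traffic levels.
    if not (8 <= hour_of_day < 20) or tx_per_hour >= 1000:
        return "outside SLA conditions"
    avg = mean(response_times)
    p95 = percentile(response_times, 95)
    # Mandatory limits: average < 2 s and 95% of responses < 5 s.
    if avg >= 2 or p95 >= 5:
        return "mandatory breached"
    # Desirable limits: average < 1 s and 95% of responses < 3 s.
    if avg >= 1 or p95 >= 3:
        return "mandatory met, desirable missed"
    return "desirable met"
```

A check like this depends on the instrumentation the slide asks for: without traffic levels per interval, the "outside SLA conditions" branch cannot be evaluated.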
  17. SLA outcomes
     Performance metric (e.g. response time) against workload metric (e.g. transaction arrival rate):
     Performance | Light, up to normal maximum | Up to peak maximum | Excessive
     Worst (beyond mandatory) | Agreement broken at low traffic rate | Depends on precise wording of SLA | Agreement does not apply
     OK (between desirable and mandatory) | Should meet desirable target at lower traffic | System is under pressure anyway | System is under excessive traffic pressure
     Best (within desirable) | System is performing as expected | System may be over-configured | System is probably over-configured
     [Other workload axis label on the slide: Std/DR/DM]
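The outcomes grid can be read as a lookup from a performance band and a traffic band to an interpretation. The sketch below is one hedged reading of the slide in Python: the band names, the function names and the choice of "less than or equal" at each boundary are assumptions, and the cell texts are copied from the slide.

```python
def performance_band(response_time, desirable, mandatory):
    """Band a response time against the desirable and mandatory targets."""
    if response_time <= desirable:
        return "best"
    if response_time <= mandatory:
        return "ok"
    return "worst"


def traffic_band(arrival_rate, normal_max, peak_max):
    """Band an arrival rate against the normal and peak maxima."""
    if arrival_rate <= normal_max:
        return "normal"
    if arrival_rate <= peak_max:
        return "peak"
    return "excessive"


# Cell texts as they appear on the slide, keyed by (performance, traffic).
OUTCOMES = {
    ("worst", "normal"): "Agreement broken at low traffic rate",
    ("worst", "peak"): "Depends on precise wording of SLA",
    ("worst", "excessive"): "Agreement does not apply",
    ("ok", "normal"): "Should meet desirable target at lower traffic",
    ("ok", "peak"): "System is under pressure anyway",
    ("ok", "excessive"): "System is under excessive traffic pressure",
    ("best", "normal"): "System is performing as expected",
    ("best", "peak"): "System may be over-configured",
    ("best", "excessive"): "System is probably over-configured",
}


def sla_outcome(response_time, arrival_rate, desirable, mandatory, normal_max, peak_max):
    """Look up the slide's interpretation for one measurement."""
    key = (
        performance_band(response_time, desirable, mandatory),
        traffic_band(arrival_rate, normal_max, peak_max),
    )
    return OUTCOMES[key]
```

For example, with hypothetical targets of 1 and 2 seconds and maxima of 1000 and 1500 transactions per hour, a 2.5 second response at 500 transactions per hour falls in the "Agreement broken at low traffic rate" cell.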
  18. Availability = (agreed service time – unplanned downtime) / agreed service time
     • Not 99.999% availability for all
     • Include period in statements
     Availability % | Downtime per annum
     99 | 87.6 hours
     99.9 | 8.8 hours
     99.99 | 53 mins
     99.999 | 5.3 mins
     Outage | Max events in period
     Up to 6 mins | 1 per week
     6-60 mins | 1 per month
     1-4 hours | 1 per quarter
     4-8 hours | 1 per year
     • Note one 8 hour period of downtime is 93.3% for a week but 99.9% for a year
     • Max downtime in hours: (8*1) + (4*4) + (12*1) + (52*1*0.1) = 41.2, so availability = 0.995 or 99.5%
     • What if ‘up’ but not for all (use potential minus actual):
       – Locations, weighted by size/staff/users
       – Users, weighted by classification
       – Transactions, weighted by significance
     • What if:
       – Too slow: check SLA for limit and percentile of traffic and performance
       – Lengthy recovery time for failover when failure: between cluster nodes, of a blade, of a RAID disk, of a network link…
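The availability arithmetic on this slide can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original deck: the function and constant names are illustrative, a year is assumed to be 8760 hours, and the 120-hour agreed service week (5 days of 24 hours) is an inference made only because it reproduces the slide's 93.3% figure.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming round-the-clock service


def availability(agreed_service_time, unplanned_downtime):
    """Availability = (agreed service time - unplanned downtime) / agreed service time."""
    return (agreed_service_time - unplanned_downtime) / agreed_service_time


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability percentage over a period."""
    return period_hours * (1 - availability_pct / 100)


# Outage budget from the slide as (longest outage in hours, events per year):
# one 8 h outage a year, one 4 h outage a quarter, one 1 h outage a month
# and one 6 minute (0.1 h) outage a week.
OUTAGE_BUDGET = [(8, 1), (4, 4), (1, 12), (0.1, 52)]
max_downtime_hours = sum(hours * events for hours, events in OUTAGE_BUDGET)

# Yearly availability if the whole budget is used.
yearly_availability = availability(HOURS_PER_YEAR, max_downtime_hours)

# One 8 hour outage in an assumed 120-hour agreed service week.
weekly_availability = availability(120, 8)
```

The last two values illustrate why the slide insists on stating the period: the same 8 hour outage looks very different measured over a week and over a year.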
  19. Continuity – DR site sizing factors
     • Data security to reduce impact of DR:
       – Backups made to tape/disk on site and sent off-site regularly
       – Data replication to an off-site location so only system sync required
       – High availability systems to keep both the data & system replicated
     • Precautionary measures:
       – Local mirrors of systems and/or data and use of RAID
       – Surge protectors, UPS and/or backup generator, fire prevention
       – Antivirus, antibot software and other security measures
     • Stand-by site at:
       – Own site with high availability
       – Own remote facilities with SAN
       – An outsourced disaster recovery provider
     • DR service
       – Priority of service determines if included in DR service
       – DR reduced performance and reduced traffic constraints as per SLA
       – Models used to justify configuration and cost of DR site.
  20. Demand Management
     • Control demand for resources to meet levels that the business is willing to support
     • Optimize and rationalize demand for the use of IT to achieve optimum provision
       – One extreme of over-provisioning without regard to cost
       – Other of under-provisioning so that there is no headroom
     • Understand and throttle/smooth peaks, if possible, in customer demand or priority
     • Control degradation of service due to peaks in demand or downtime/slowtime
     • Use budgets/priorities/chargeback/quotas for workloads and new services
     • Use ‘levels of critical’ categorization for workloads (gold/silver/bronze)
     • Plans for when business requirements cannot be fulfilled due to:
       – HW or SW failure
       – Unexpected budgetary constraints/demand increase
     • Decisions based on problems being short term or long term?
       – Short-term: only mission critical services supported
       – Long-term: management of resource constraints
     • Need to identify the critical services and the resources they use
       – Business plans, Service catalogue, Change requests, SIPs
       – Service priorities and their mapping to resources
  21. (7) When will I get there?
     • SatNav gives a typical answer in hours and minutes
     • Detailed time depends on route selected and options taken
     • Answer based on accumulated experience of many journeys
     • CapMan gives an answer typically in short/medium/long term
     • Detailed time depends dominantly on many local factors
  22. Short term Improvements (Wintel VI)
     • Assets
       – Enhance attributes in registers
       – Standardise contents for action
       – Add resource pool
       – Add profiles for levels of priority
     • ESM
       – Improve liaison - ESM and ITSM
       – Event, infrastructure & app teams
       – Better exploit resource information
       – ESM data already present
     • Metrics
       – Add extra VM metrics for tiers
       – Add extra KPI metrics - CPU utilisation/server etc
       – Add selected extra reports
     • Reports
       – Extended KPIs and trends
       – Monitoring to assess VM growth
       – Consolidate similar, retire moribund
       – Extra reports (day, week, month)
  23. Physical rating
     • Determine S (SpecInt rating of the physical server)
     • Capture U (peak % utilisation of the physical server)
     • Calculate N (normalised power rating of the physical server): N = S * (U/100)
     • e.g. server HP ProLiant DL580 has SpecInt of 40, so S = 40; captured peak utilisation of 15%, so U = 15; rating N = 40 * (15/100) = 6
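The physical rating formula is simple enough to apply across a whole server inventory. A minimal Python sketch follows; the function name is illustrative, and the inputs are the S and U values from the slide's example.

```python
def normalised_rating(specint_rating, peak_utilisation_pct):
    """N = S * (U / 100): the share of the server's SpecInt rating used at peak."""
    # Multiplying before dividing keeps the slide's example exact in floating point.
    return specint_rating * peak_utilisation_pct / 100


# The slide's example: S = 40, U = 15.
dl580_rating = normalised_rating(40, 15)
```

For the slide's example this gives a rating of 6, which is the figure carried forward into the virtual rating step.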
  24. Virtual rating
     • Determine H (SpecInt rating of the host server)
     • Estimate C (consolidation ratio, e.g. 20:1)
     • Calculate values for tiers such as: Bronze = H / C; Silver = Bronze * 2; Gold = Bronze * 4; Platinum = uncapped
     • e.g. SpecInt of VI server (HP Integrity rx8640) H = 200; estimated target consolidation ratio C = 20:1; Bronze limit = 200/20 = 10, so box needs bronze service
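The tier limits follow directly from H and C, and a server's normalised rating can then be matched to the smallest tier that covers it. The Python sketch below assumes that matching rule (the slide only states that a Bronze limit of 10 means the box needs bronze service), and the function names are illustrative.

```python
def tier_limits(host_specint, consolidation_ratio):
    """Tier caps from the slide: Bronze = H / C, Silver = 2x, Gold = 4x, Platinum uncapped."""
    bronze = host_specint / consolidation_ratio
    return {
        "Bronze": bronze,
        "Silver": bronze * 2,
        "Gold": bronze * 4,
        "Platinum": None,  # None stands for "uncapped"
    }


def smallest_fitting_tier(rating, limits):
    """Return the lowest tier whose cap covers the rating (an assumed selection rule)."""
    for tier in ("Bronze", "Silver", "Gold"):
        if rating <= limits[tier]:
            return tier
    return "Platinum"


# The slide's example: H = 200, C = 20 (a 20:1 consolidation ratio).
limits = tier_limits(200, 20)
```

With these limits, a physical server rated at 6 (as in the physical rating example) fits within the Bronze cap of 10.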
  25. Medium term Improvements
     • SPM
       – Fill vacant positions
       – Select CapMan activities
       – Establish processes
     • Proactive
       – Reactive reporting to proactive
       – Analysis of trends & pathology
       – Identify rogues and flatlines
     • Services
       – Add business liaison
       – SLAs and performance
       – More use of Availability data
     • Portal
       – Formalised reporting vehicle
       – Regular and exception reports
       – Available to all relevant parties
  26. The Capacity Portal (diagram)
     • Multiple data sources (Logica, HP, Brocade, HPOV, Performance, Availability tool) feed the CDB/CMIS, which feeds the Capacity Portal
     • Capacity exceptions: refined metrics, critical thresholds, alarms as relevant
     • Trend Reports: subset of key metrics, trended 30/60/90 days, with thresholds set
     • Daily Performance: focus on key metrics, across entire estate, regular, on web
  27. Longer term Improvements
     • CDB/CMIS & CMDB/CMS
       – Capacity management db
       – Configuration management db
     • Demand management
       – Characterise new workloads
       – Consolidate/retire more apps
     • Utility chargeback
       – Analysis of actual usage
       – Financial control of upgrades
     • Capacity plan
       – For infrastructure upgrades
       – For anticipated project demands
  28. The Capacity Plan (diagram)
     • Inputs at Component, Service and Business level: HPOV nWorks & perf, Availability, Forecasts, Brocade, SLM, Plans, Logica
     • CDB/CMIS: Monitoring, Analysis, Tuning, Demand, Sizing, Modelling
     • The Capacity Plan, with Component, Service and Business sections. Items shown: Current utilisation, Forecasts, Forecasts and changes, Response times now, Drivers, Improvement options, Track changes, Further VI req’s, Costs vs benefits, Slow time, KPI updates (CO2?), Options modelled, Utilisation trends, Data centre space?, Recommendations
  29. Procedures and work instructions
     TOR Description
     • Ownership: A clear definition of who will both own the process (and by definition sponsor the project) and ultimately manage the process and day to day activities.
     • Objectives: Prior to implementing the process it is essential to define the overall objectives of what the process is going to achieve. It is common that these objectives are quite high level, but these could initially be:
       – Establish component level monitoring for all applications with an initial focus on the “Top 5” metrics and all supported platforms.
       – Establish service based metrics for at least one application. This should include an end to end response time and the addition of relevant service metrics within the relevant SLA.
       Some of these objectives could be used as process KPIs if clearly defined.
     • Definitions: The key elements that require definition are:
       – CPM sub-processes, although initially this should be component, with service and business Capacity Management being an aspiration at this point
       – A clear definition of the current responsibilities. These should be considered more operational than process specific, i.e. who will be doing which activities and providing what data.
       – A list of deliverables can be provided here or via link to the “Information Flow Diagram”
     • Scope: A key requirement for a process definition is a clear definition of scope. It is recommended, given the variety of data sources, that the scope could be limited. As the monitoring structure becomes more mature this should be gradually increased until it covers more applications and infrastructure.
     • Benefits: These will obviously vary between businesses, but could ultimately include:
       – The Capacity Management process will ensure that a proactive approach is taken. This change of approach will ensure higher availability of critical business services due to a reduction of capacity related outages.
       – Deferred expenditure, through a reduction in the amount of excess capacity.
       – Reduced risk for existing applications as system resources are managed more effectively.
     Process Description
     • Process Flow Diagram: A high level graphical representation of how the various elements of the process relate and support one another.
     • Process Flow Descriptions: An overview of each Capacity Management procedure and the beginnings of the work instruction pack. These can be as detailed and broad as befits the environment but should initially be:
       – Daily threshold and trend review
       – Trending analysis and Capacity forecasting
       – Virtualization optimization
       – Workload characterisation
       As the process matures these would normally be expanded to include the provision of new services, modelling, exception reporting etc.
     • Process Interfaces: The long term goal would be to populate this section with a complete listing of all process interfaces that includes likely inputs/outputs. As an initial stage it is recommended that the interfaces are defined:
       – Interface to ISS for support and provision of data
       – Interface to service owners (SLM) for provision of business data
       – Interface to Configuration management for service relationship information
       – Interface to Change management
     KPI Description
     • Key Performance Indicators (KPIs):
       – Operational: total no. of capacity incidents, no. of emergency changes due to capacity requirements etc
       – Process Quality: no. of anomalies in capacity outputs, % variance in any predictions
       – Process implementation: % of services covered, % physical estate in scope etc
     Procedure Description
     • Procedures: Procedural description relating to the suggested activities. Within each of the procedures the following elements should be clearly defined:
       – Step by step guide to the procedure
       – Definition of Inputs/Outputs
       – Related procedural tree
       – Responsibilities
     VI Description
     • Top level details: Specify key configuration information regarding the highest level relating to particular environment e.g. AIX frame, VMware cluster etc
     • Pool specification: If appropriate capture any pool limits and how those relate to individual guests. More VMware specific, but most flavours of UNIX (including AIX) also offer the options to ring fence resource and assign it to groups of guests.
     • Resource rules: Here we would capture the following information:
       – # of vCPUs
       – Amount of memory (GB)
       – Disk specification
       – # of Network interfaces (inc speed)
     • Management/Continuity: Specify what resource management and continuity policies are in place e.g. HACMP, DRS/HA, Fair share scheduling etc
  30. (8) Why should I go there? - Conclusion
     • CapMan when practiced well saves money
     • CapMan GPS has been applied at a number of sites
     • Most sites are not where management thinks they are
     • Most sites have people who know the real situation
     • It takes openness & technical awareness to reveal truth
     • Demand management is often minimal
     • Project management is often uber-all
     • Performance is often an after-thought
     • Next steps are often short, medium and long term
     • Usually related to liaison as much as process
     • Often related to making more use of extant tools
     • Hopefully not all reports are filed on the shelf
     • But it needs in-house believers to carry it forward…