Capacity Planning for Virtualized Datacenters - Sun Network 2003
Presentation I made at the Sun Network conference in 2003 on how to do capacity planning for virtualized systems, tied into the N1 product that Sun was pushing at the time. This project was structured as a Design for Six Sigma (DFSS) project.

  • Note that this slide deck was written about six years ago. The N1 product no longer exists, but it can be seen as an early approach to what is now known as Cloud Computing. At the time I had just moved from Sun's performance group, where I did this work, to the HPC group, where I was focused more on Grid related technologies. I left Sun in 2004, so the email address shown won't work :-)

Transcript

  • 1. Capacity Planning for N1. Sun Network 2003 Presentation. SunSigma DFSS Project P925. Adrian.Cockcroft@sun.com, Chief Architect - High Performance Technical Computing. August 29, 2003
  • 2. Project: Capacity Planning for N1 (ID: P925). What is N1?
      - Datacenter Automation
          Manage “N” systems as if they were “1” system
          Solve the Total Cost of Ownership (TCO) problems
          Manage all the “fabrics” as one - Network/VLAN, SAN/Zone, power, consoles, cluster
      - Heterogeneous Support
          Solaris, Linux, AIX, HP-UX, Windows, EMC etc…
      - Layered Provisioning
          Platform/OS, Application, Service
      - Roadmap Includes Acquisitions
          2001 Sun internal N1 architectural definition
          2002 Terraspring platform level virtualization
          2003 CenterRun Application level provisioning
          ……….
  • 3. Voice of the Customer
      - “We want better performance at a lower price”
      - “We want higher utilization”
      - “We don’t want application performance to degrade at times of peak load”
      - “We want more and faster application changes”
      - “How do we do capacity planning with N1?”
      Scope…
  • 4. DEFINE: Capacity Planning for N1
      - Define: Project goals, scope and plan, VOC, stakeholders
      - Measure: Definition of Capacity Planning measurements
      - Analyze: Gaps, N1CP Processes Concept Design, Survey
      - Design: Prototype Use Cases
      - Verify: Stakeholder communication and transition plan
      - Monitor: N1 Capacity Planning implementation tracked as subgroup of N1 Strategic Working Group
  • 5. MEASURE: Translate VOC to Measurements
      - “We want better performance at a lower price”
          Fast, well tuned and efficient systems
          Lower Total Cost of Ownership
          Flexibility - choice of systems by price, performance, reliability, scalability, compatibility and feature set
      - “We want higher utilization”
          Consistently high utilization of expensive resources
      - “We don’t want application performance to degrade at times of peak load”
          Consistent and fast application or service response times
          Headroom needed to handle peak loads
      - “We want more and faster application changes”
          Flexible scenario planning, rapid provisioning
      Question: “My company already has capacity planning processes and tools” - do you agree or disagree with this statement?
  • 6. MEASURE: N1 as a Constraint and Opportunity
      - Centralized control and monitoring
      - Highly replicated hardware configurations
      - Well defined workload and capacity characterization
      - Arrays of load-balanced systems, structured network
      - Large SMP nodes, standardized storage layout
      - Web services workloads follow an “open system” queuing model, which is simple to plan against
      - Dynamic system domains and virtualized provisioning allow rapid capacity adjustments and pooled resources
      - Primary capacity metrics are CPU power and storage, secondary metrics (memory, network and thermal) may be over-provisioned but should be watched
  • 7. MEASURE: Utilization Definition
      - Utilization is the proportion of busy time
      - Always defined over a time interval
      - Sum over devices
      [Figures: “OnCPU Scheduling for Each CPU” (OnCPU time in microseconds per CPU, annotated with a mean CPU utilization / mean load level of 0.56) and “usr+sys CPU for Peak Period” (CPU %, 0-100, vs. time, with the utilization band marked)]
  • 8. MEASURE: Headroom Definition
      - Headroom is available usable resources
          Total Capacity minus Peak Utilization and Margin
          Applies to CPU, RAM, Net, Disk and OS
          Depends upon workload mixture
          Can be very complex to determine
      [Figure: “usr+sys CPU for Peak Period”, CPU % (0-100) vs. time, with bands marked Utilization, Headroom and Margin]
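The headroom definition on this slide is simple arithmetic, and a minimal sketch in C may make the sign convention concrete. The function name `headroom` is hypothetical (it is not from the deck), and all three inputs are assumed to share one unit, for example "processors" for CPU.

```c
#include <assert.h>

/* Headroom as defined on the slide: total capacity minus peak
 * utilization and margin. The name and signature are illustrative
 * assumptions. A negative result means the peak already eats into
 * the safety margin. */
double headroom(double capacity, double peak_utilization, double margin)
{
    assert(capacity >= 0.0 && peak_utilization >= 0.0 && margin >= 0.0);
    return capacity - peak_utilization - margin;
}
```

For example, `headroom(12.0, 7.5, 0.5)` gives 4.0 processors of usable headroom on a 12-CPU system.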
  • 9. MEASURE: CPU Capacity Measurements
      - CPU utilization is defined as busy time divided by elapsed time for each CPU
      - Number of CPUs is dynamic, so capacity at “100%” is not constant. Use units of “processors” to measure load.
      - CPU type and speed vary, so we need something like MIPS or M-Values for mixed systems
      - CPU utilization should be managed within a range that safely minimizes headroom to give stable performance at minimum cost
      - Process level CPU wait time measures the time a process spent on the run queue waiting for a free CPU
          This allows response time increase to be observed directly so that increased capacity can be provisioned before headroom is exhausted
  • 10. MEASURE: Response Time Definition
      - Service time occurs while using a resource
      - Queue time waits for access to a resource
      - Response Time = Queue time + Service time
      - Response time curves for random arrival of work from large unknown user population (e.g. the Internet!)
      [Figure: “Response Time Curves”, R = S / (1 - (U/m)^m). Response time increase factor (0-10) vs. mean CPU load level (0-4) for one, two and four CPUs]
  • 11. MEASURE: Response Time Curves
      Systems with many CPUs can run at higher utilization levels, but degrade more rapidly when they finally run out of capacity. Headroom margin should be set according to response time margin and CPU count.
      [Figure: “Response Time Curves”, R = S / (1 - (U%)^m). Response time increase factor (0-10) vs. total system utilization % (0-100) for 1, 2, 4, 8, 16, 32 and 64 CPUs, with the headroom margin marked]
  • 12. MEASURE: CPU Scalability Differences
      - SMP allows work to migrate between CPUs, “blades” don’t
          Single queue of work gives lower response time for user sessions at high utilization than arrays of uniprocessor “blades”
          Headroom margin on array of “blades” is constant as array grows
          Two to four CPU systems need much less margin than Uni-CPUs
          Measure and calibrate actual response curve per workload
      [Figure: “Response Time Curves”, SMP R = S / (1 - (U/m)^m) vs. Blade R = S / (1 - U/m). Response time increase factor (0-10) vs. CPU demand level (0-4) for 1 CPU/Blade, 2 CPU SMP, 4 CPU SMP, 2 Blades and 4 Blades]
  • 13. MEASURE: CPU Measurement System Issues
      - Clock sampled CPU usage
          Poor clock resolution at 10ms (optionally 1 ms)
          Biased sample since clock schedules jobs
          Underestimates more at lower utilization
          Creates apparent lack of scalability
      - Microstate measured CPU usage
          Measure state changes directly - “microstates”
          Per-CPU microstate based counters are not available
          Use microstates at process based workload level, sum over some or all processes as needed (can take a while on big systems)
          Microstate method simply extends to measuring services and mixed workloads
  • 14. MEASURE: N1 Capacity Planning CTQs
      CTQ Name                  Pri  Units  LSL           USL              Gauge Acc.  Budget Sigma
      CPU Utilization (TCO)     5    CPUs   30% of total                   99%         3.0
      CPU Responsiveness (SLA)  10   CPUs                 70-98% of total  99%         4.0
      Both of these Critical To Quality (CTQ) requirements are measured via the CPU load level, which can be measured accurately, with a gauge accuracy estimated at 99% and a sigma goal based on defect cost. Using sampled CPU usage, accuracy is estimated at 90%.
      For CPU Utilization a defect is unacceptable Total Cost of Ownership (TCO), and occurs if the total CPU load drops below the Lower Specification Limit (LSL) of 30% of the total configured for a sample taken during the peak load period.
      For CPU Responsiveness a defect is overload leading to a Service Level Agreement (SLA) failure, and occurs if the total CPU load goes above the Upper Specification Limit (USL), which is 70% of the total configured for uniprocessors, increasing for larger CPU counts.
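The two defect definitions on this slide can be turned into a small counting routine. This is a minimal sketch, assuming the caller passes only samples taken during the peak load period, with load, LSL and USL all expressed in CPUs; the type `defect_rates` and function `count_defects` are hypothetical names, not part of the headroom tool.

```c
#include <assert.h>
#include <stddef.h>

/* Proportion of samples that count as defects against each CTQ. */
typedef struct {
    double tco; /* load below LSL: capacity is under-used */
    double sla; /* load above USL: overload risks an SLA failure */
} defect_rates;

defect_rates count_defects(const double *load, size_t n,
                           double lsl, double usl)
{
    assert(load != NULL && n > 0 && lsl <= usl);
    size_t below = 0, above = 0;
    for (size_t i = 0; i < n; i++) {
        if (load[i] < lsl)
            below++;
        else if (load[i] > usl)
            above++;
    }
    defect_rates r = { (double)below / (double)n,
                       (double)above / (double)n };
    return r;
}
```

A sample exactly on a limit is treated as within specification here, which is a choice made for the sketch rather than something the slide states.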
  • 15. ANALYZE: Concept Design - N1CP Roles
      - Manager
      - Application Architect
          Developers
          Database Administrators
      - Systems Architect
          Systems Administrators
          Storage Administrators
          Network Administrators
      - Others?
      Question: What roles do you do?
  • 16. ANALYZE: Scenarios - Top Level Functional Breakdown
      [Flow diagram; each step provisions at the system level and the applications level]
      - Install N1 Datacenter
      - Over-Provision Applications
      - Right-size Applications (repeat infrequently)
      - Re-Allocate Resources during low load times (repeat on schedule)
      - Grow or borrow Capacity just before Overload occurs (repeat as needed)
  • 17. ANALYZE: Installation Sizing Scenario
      This scenario indicates the tasks for each role when an N1 datacenter fabric is created using currently available system level provisioning software. The set of tasks performed by each role in a scenario is called a “use case”. Future versions of N1 will configure services and policies during installation. Red arrows show the command flow between the roles.
      [Role/task diagram, time running downwards]
      - Manager: I want an N1 ready datacenter
      - Application Architect: Choose and size generic applications
      - Database Admin: Install generic database
      - Developer: Install generic application servers
      - Systems Architect: Choose systems mix and platforms
      - Systems Admin: Size systems images; build generic system images; measure capacity of generic systems
      - Network Admin: Size overall network; setup switches and VLANs for N1
      - Storage Admin: Size overall storage; setup SANs and storage for N1
  • 18. ANALYZE: Over-Provisioning Scenario
      This gives an indication of the tasks performed by each role as a new application is provisioned using the capabilities of today's N1 products. The initial goal is to over-provision the capacity for initial bring-up of the application, then later right-size it as its actual usage pattern becomes better understood. In future releases more and more of this activity will be automated, and more of the work will move to become pre-work that is related to setting up the overall N1 datacenter infrastructure.
      [Role/task diagram, time running downwards]
      - Manager: Provide an online service
      - Application Architect: Use these apps and sizing
      - Database Admin: Database versions and sizing; configure database; populate database
      - Developer: App server versions; configure app server; acceptance test
      - Systems Architect: Use these platforms; define operations policies
      - Systems Admin: Systems selection & versions; build replicable system images; use N1 GUI to over-provision initial system
      - Network Admin: Network sizing; provision Internet connection; configure access and security
      - Storage Admin: Storage sizing; provision LUNs; configure backup strategy
      - Final step: Enable user access
  • 19. ANALYZE: Rightsizing Scenario
      Rightsizing adjusts the headroom for each component of the system to make sure that the usage level falls inside the specification limits. Rightsizing can be performed during an offline maintenance window but all the technologies exist to adjust domain size for tier 3 systems, and adjust the number of tier 1 and tier 2 systems dynamically.
      [Role/task diagram, time running downwards]
      - Manager: Business level and trend plan
      - Database Admin: Monitor database headroom (memory and tables); increase headroom for bottleneck; reduce headroom for under utilized database
      - Systems Admin: Monitor CPU, network and memory headroom; increase headroom for bottleneck; reduce headroom for under utilized systems
      - Network Admin: Monitor WAN / Internet headroom; increase headroom for bottleneck; reduce headroom for under utilized bandwidth
      - Storage Admin: Monitor storage headroom; increase headroom for bottleneck; reduce headroom for under utilized storage
  • 20. ANALYZE: Re-Allocation Scenario
      Load levels vary during the day and the week. Regular times of low utilization can have other work performed - e.g. overnight batch jobs. Batch workloads that cannot run on the same systems due to configuration or security issues can run on systems (or Grids) that are provisioned each night using spare capacity from other systems.
      [Role/task diagram with columns Manager, Application Architect, Database Admin, Developer, Systems Architect, Systems Admin, Network Admin, Storage Admin; tasks listed left to right, then in time order]
      - Batch workload capacity needed
      - Define batch capable applications
      - Build or configure batch capable applications
      - Define batch mechanism
      - Determine timing and depth of capacity to re-allocate
      - Move resources to Grid after peak load time
      - Bring resources back before peak load time
  • 21. ANALYZE: Overload Scenario
      Load levels vary during the day and the week in a fairly consistent and predictable manner. Sizing for the normal load level allows high utilization levels. Higher load levels can be handled as an exception by watching for abnormally high levels before the load peaks and borrowing capacity from lower priority applications such as development environments.
      Question: “Are dynamic capacity adjustments a mature and reliable technology?”
      [Role/task diagram with columns Manager, Application Architect, Database Admin, Developer, Systems Architect, Systems Admin, Network Admin, Storage Admin; tasks listed left to right, then in time order]
      - Higher utilization needed to reduce cost of service
      - Determine normal load curve for time of day and day of week
      - Negotiate victim to steal capacity from
      - Monitor deviations above normal load level
      - Provision extra capacity before it is needed
  • 22. ANALYZE: Rightsizing Scenario
      - Detailed Design Concept via an Example
      - Large scale Internet workload
          Fairly predictable load shape
          Peaks every evening (use peak hours)
          Grows every week
      - Key CTQs
          Performance during peak hour
          Cost of maintaining performance level
          Risk of downtime
      - Tier 3 backend database server
          Primary bottleneck, over-provisioned elsewhere
          Highest cost of CPU headroom (E10K/F15K class)
          Initially 56 CPUs in domain, average 30 CPUs load
  • 23. ANALYZE: CPU Load Level
      Monitor for days or weeks to establish baseline and time of peak load, then track that timeslot daily. CPU load (units are CPUs, 56 configured) for a busy day:
      [Figure: “Summed CPU Utilization”, CPU utilization level (0-50 CPUs) vs. time of day (0:00 to 23:12), with the 2 hour peak marked]
  • 24. ANALYZE: Utilization Distribution
      Capability plot for peak time shows system is less than half utilized about 25% of the time, too much headroom. Defect rate corresponds to Sigma level of 2.18.
      [Figure: capability plot, x-axis CPU Demand Level]
  • 25. ANALYZE: Increase Utilization
      Reduce system to 40 CPUs, assume linear increase in utilization - predicted sigma = 5.2.
      Over-simplified - headroom margin and non-linearities are not included in the plan, so add a little extra headroom to compensate.
      [Figure: capability plot, x-axis CPU Demand Level]
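The "linear increase in utilization" assumption can be illustrated with a few lines of C. This is only a sketch of that assumption: it supposes the load, measured in CPUs, stays the same when the domain shrinks, and it ignores the headroom margin and non-linearities the slide warns about. The function name is hypothetical, and the 30-CPU figure is the average load quoted for the example on slide 22.

```c
#include <assert.h>

/* Utilization as a fraction of configured CPUs, assuming the load
 * (in units of CPUs) is unchanged by the reconfiguration. */
double predicted_utilization(double mean_load_cpus, int configured_cpus)
{
    assert(configured_cpus > 0 && mean_load_cpus >= 0.0);
    return mean_load_cpus / (double)configured_cpus;
}
```

With a load of 30 CPUs, `predicted_utilization(30.0, 56)` is about 0.54 and `predicted_utilization(30.0, 40)` is 0.75.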
  • 26. DESIGN: Headroom Tool Prototype
      - Solaris specific prototype
          Rapid prototype using SE Toolkit from http://www.setoolkit.com
          Shows component level headroom vs. utilization goal
          Automatic margin calculation based on CPU count
          Samples every few minutes, reports every 30-60 minutes
          Microstate based, sums over all processes
          Headroom predictor uses mean plus two standard deviations
          Text based, logs data to a daily file, 3.5 sigma headroom
      Code: p.=processor, r.=ram, n.=network, d.=disk, .st=status, .cf=configured, .ll=min lsl, .ul=limit usl, .ld=mean load, .h%=headroom, .sd=std deviation, .tco=TCO defect rate, .sla=SLA defect rate, .tK=throughput K, .rm=response time in milliseconds, .rp=response time proportional increase
      time      pll  pul   pcf  pst    ptco  psla  pld   psd   ph%  ptK   prm   prp
      17:36:04  3.6  11.6  12   Green  0.00  0.00  5.26  0.28  50   15.8  1.05  1.08
      18:06:04  3.6  11.6  12   Green  0.00  0.00  4.90  0.38  51   13.9  1.01  1.06
      18:36:04  3.6  11.6  12   Blue   0.40  0.00  4.55  2.19  23   13.0  0.93  1.09
      19:06:03  3.6  11.6  12   Blue   1.00  0.00  3.02  0.17  71   12.7  0.86  1.05
      19:36:03  3.6  11.6  12   Blue   0.93  0.00  2.82  0.53  67   12.0  0.67  1.04
      Notes on the output:
      - Samples taken every two minutes and reported every 30 minutes
      - 12 CPUs configured, Lower limit 30% = 3.6, Upper limit based on CPU count at 11.6
      - Status is based on measured defect proportion of time that CPU load level is below pll=TCO or above pul=SLA limits
      - Mean load level and standard deviation are compared to the upper limit to calculate headroom
      - CPU Throughput is based on voluntary context switches, prm is very short, but prp defines a response time curve
  • 27. DESIGN: Headroom Calculations
      - Set configured total to number of processors online
          conf = sysconf(_SC_NPROCESSORS_ONLN);
      - Set lower spec limit to 30% for TCO failures
          lsl = conf * 0.3;
      - Use response time goal of 3 times baseline on curve to determine margin for maximum load level
          rpgoal = 3.0;
      - Calculate max load level from theoretical response time curve
          /* rp = R/S, rp = 1/(1-(U^m)) so U = exp(log((rp-1)/rp)/m) */
          usl = conf * exp(log((rpgoal-1.0)/rpgoal)/conf);
      - Calculate headroom % from mean plus two standard deviations versus upper spec limit
          headp = 100.0 * (1.0 - (mean + 2.0*sd) / usl);
      - Calculate Sigma Zst
          tco_sigma = 1.5 + (mean - lsl) / sd;
          sla_sigma = 1.5 + (usl - mean) / sd;
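As a worked check, the headroom and sigma formulas from this slide can be wrapped in small functions and fed with a row of the tool output from slide 26. This is a sketch only: the function names are hypothetical, the limits are passed in as arguments rather than derived from `sysconf`, and `sd` is assumed to be non-zero for the sigma calculations.

```c
#include <assert.h>

/* Headroom % from mean plus two standard deviations vs. the USL. */
double headroom_percent(double mean, double sd, double usl)
{
    assert(usl > 0.0 && sd >= 0.0);
    return 100.0 * (1.0 - (mean + 2.0 * sd) / usl);
}

/* Sigma (Zst) against the lower limit: distance of the mean above LSL. */
double tco_sigma(double mean, double sd, double lsl)
{
    assert(sd > 0.0);
    return 1.5 + (mean - lsl) / sd;
}

/* Sigma (Zst) against the upper limit: distance of the mean below USL. */
double sla_sigma(double mean, double sd, double usl)
{
    assert(sd > 0.0);
    return 1.5 + (usl - mean) / sd;
}
```

Using the 17:36:04 row (pld 5.26, psd 0.28, pul 11.6), `headroom_percent(5.26, 0.28, 11.6)` is about 49.8, which matches the ph% of 50 that the tool printed; the 18:36:04 row (4.55, 2.19) gives about 23.0, matching its ph% of 23.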
  • 28. DESIGN: Design Optimization
      - Compare the “traditional” approach with the new design
          Run the headroom tool on a big and busy server, collect data and show how a simplistic approach compares with the method described in this project
          SunRay timesharing server monitored for several days. System is loaded to the limit at peak times, but idle out of hours, so focus on a scheduled capacity reallocation scenario.
      - Simplistic “Traditional” Approach
          Collect data using vmstat, sar, SunMC or 3rd party tools
          Plot CPU % busy - as shown on next slide
          There is spare capacity, but no indication of how many CPUs are unused
          Need extra information that this is a 12-CPU system
      - N1CP Approach
          Collect data using headroom prototype
          Plot CPU load level in CPU units, no need to guess or replot data
          Calculate margin, headroom and sigma levels
          Plan capacity reallocation and recalculate margin, headroom and sigma levels
  • 29. DESIGN: Simplistic View
      There is no indication of how many CPUs are in use, util = 59% overall.
      [Figure: “CPU Utilization Monday-Thursday”, CPU %busy (0-100%) vs. time of day]
  • 30. DESIGN: N1CP View - CPU Counts
      There are 12 CPUs, 6 to 8 are free overnight, system overloads at peak times.
      [Figure: “Mean+2sd Load vs Configured and Upper Limit”, series pcf, pul and pmd+2psd; CPU count (0-14) vs. time of day]
      Summary: Mean CPU Load 7.03, Mean Util 59%, Mean headroom 34%, Mean capacity 12.00
             DPMO     Min Sigma
      TCO    110215   -1.5 Zst
      SLA    538      2.5 Zst
  • 31. DESIGN: N1CP - Response Curve
      System is close to overload. This timeshared workload has a flatter curve than internet workloads (closed rather than open queuing model).
      [Figure: “Response Time vs Load Level”, response increase (0-2.5) vs. CPU count (0-12)]
  • 32. DESIGN: Simplistic - CPUs reallocated
      There is no indication of how many CPUs are in use.
      [Figure: “CPU Utilization with Capacity Optimization”, CPU %busy (0-100%) vs. time of day]
  • 33. DESIGN: N1CP View - Dynamic!
      Vary the CPU count and times daily, and borrow extra for the peak load.
      [Figure: “CPU mean+2sd Load vs Config and Upper Limit”, series pcf, pul and pmd+2psd; CPU count (0-14) vs. time of day; segments annotated with sigma levels 3.2s, 3.2s, 3.5s, 4.3s, 6.3s, 3.6s, 5.2s, 3.2s, 5.7s]
      Summary: Mean CPU load 7.03, Mean Util 74%, Mean headroom 16%, Mean capacity 9.52
      Predicted Min Sigma: TCO 2.0 Zst, SLA 3.2 Zst
  • 34. DESIGN: Summary
      - Performance Impact
          SLA Sigma levels improve from minimum of 2.5 Zst to 3.2 Zst
          Improvement of 0.7 Sigma by allowing for extra peak load
          Simplistic methods do not allow quality of service prediction
      - Cost Impact
          TCO Sigma levels improve from minimum of -1.5 Zst to 2.0 Zst
          Improvement of 3.5 Sigma by reducing capacity from 12 to 9.5
      - Observability Impact
          Headroom tool prototype generates all required statistics
          Sigma level is simply calculated, or headroom tool could print it
          Simplistic methods do not show what is going on
      - Complexity Impact
          Dynamic reconfiguration must be enabled
          One reconfiguration each morning and each evening
      - Applicability (Assertions, out of scope for this project!)
          CPU based example can be applied to blades, RAM, disk, net, thermal
          Method can be extended from platform level to services
  • 35. VERIFY: N1 Console Screenshots
  • 36. GRID: Capacity for Sale
      - Uses for Spare Capacity
          Carefully schedule batch work and backups
          Remotely support global timezones
          Run engineering dept. simulation jobs
      - Grid Oriented Solutions
          Project Grid - departmental cluster (Sun Grid Engine)
          Enterprise Grid - collection of clusters forming a general purpose Grid service (Sun Grid Engine Enterprise Edition)
          The Global Grid - Internet level - GT2.2, OGSA/OGSI/GT3
      - Provision an Enterprise Grid service using N1
      - Join The Global Grid and sell or share capacity
  • 37. GRID: Relationships: N1 and Grid
      N1 is about provisioning things you own, Grid is about access to things you don’t own.
                                     Infrastructure               Business Model
      Things you own and control     N1                           Utility Computing
      Things you borrow or lease     Grid Services, Web Services  Utility Computing
  • 38. GRID: Capacity Flows in a Grid Enabled N1 Datacenter
      [Flow diagram. Elements shown:]
      - Utility Computing: Capacity Requests, Purchase Capacity On Demand (C.O.D.), Purchase Capacity
      - N1 Virtualized Datacenter: Tier 0 Web Front End, Tier 1 Web Servers, Tier 2 App Servers, Tier 3 Database Storage, serving Web User / Web Services
      - Free Pool
      - Sun Grid Engine Enterprise Edition: Cluster Grid Compute and Storage Resources, Unused Grid Resources, serving Grid User / Grid Services
      - Retire Obsolete Capacity; Repair and Replace
  • 39. GRID: IT market segments by “need to share”
      Segments: Defense (spooks), Commercial (suits), Technical (geeks), Consumer (users)
      - What can be shared
          Defense: Nothing
          Commercial: Hardware
          Technical: Operating System
          Consumer: Everything
      - What is required
          Defense: Nothing, physical separation
          Commercial: N1, Server domains, VLAN and SAN Zone partitioning
          Technical: Grid, VPN, encryption, firewalls
          Consumer: P2P apps, SETI, Kazaa, Limewire, trusted People!
      - What is visible
          Defense: Local systems, Project Grids
          Commercial: Local systems and Internet
          Technical: Everything in The Global Grid community
          Consumer: Everything including other users
      - Primary constraints
          Defense: Latency. National security
          Commercial: Organizational, legal, contractual issues
          Technical: Storage. CPU cycles. Organizational issues
          Consumer: CPU cycles, Network bandwidth. Know-how
  • 40. Questions? Capacity Planning for N1 Adrian.Cockcroft@sun.com Sun Sigma DFSS Project P925