Hotsos 08 regarding_capacity_1_9c
This is a presentation I gave at the 2008 Annual Hotsos Symposium. Its punny title speaks to the obsession many performance analysts have with the analysis of utilization (U) decoupled from any observations about how a system is actually performing. The preso is a bit of a tirade, but a good rant all in all, recalling some of the habits that lead to the common practices of massively over-provisioning and tragically under-managing systems performance.

Presentation Transcript

    • Capacity: It's Not All About U! (née: "Regarding Capacity")
      Bob Sneed - Sr. Staff Engineer, Sun Microsystems, Inc.
      Performance & Applications Engineering (PAE)
      Hotsos Symposium 2008, March 2-6, Dallas
      Rev 1.9c - March 19, 2008
      Copyright © 2008, Sun Microsystems, Inc. All Rights Reserved.
    • Abstract
      When it comes to managing computer capacity, the state of the industry is wildly diverse -- but often both primitive and inconsistent in the area of enterprise computing. Indeed, most discussions regarding capacity don't even involve appropriate engineering units of measure! It's no surprise that the relationship between capacity management, performance management, and Quality of Service (QoS) management is so uneven in practice. This session will survey modern quandaries in Performance and Capacity Management, and offer some insights and abstractions aimed at stimulating constructive discussion, progressive engineering development, and intelligent practices in this area.
    • Disclaimers
      Opinions and views expressed herein are those of the author, Bob Sneed, and do not represent any official opinion of Sun Microsystems, Incorporated - or anyone else. I'm not a doctor and I don't even play one on TV - but I do regard Tom Baker and Chris Eccleston as role models. There is no warranty, expressed or implied, in the quality of the information herein, or its fitness for any given purpose. If you goof up applying this stuff and have a bad outcome or destroy a bunch of data - it's not my fault or Sun's. This is version 1.x material. Batteries not included. Your mileage may vary (YMMV).
    • Agenda
      > Motivations [10]
      > Let's Talk PerfCap [15]
      > Case Study [10]
      > Ruminations on the State of the Art [5]
      > Heterogeneity, Elasticity, and Covariance [15]
      > Concluding Remarks [5]
      (All times in Bob-minutes; YMMV ...)
    • Motivations
    • Concerns and Premises
      > Primitivism: Many customers are doing capacity wrong, with the result being variously massive over-provisioning, surprises in production, or much ado about normal!
      > I'm annoyed: Many "capacity crises" are actually either chaos in action or misunderstandings about The Way Things Work.
      > Advancing the art: Investments are required to make industry advances in managing Performance and Capacity (PerfCap).
      > Customer value: Right-sizing is a win-win scenario.
    • How widespread is "wrong"?
      > It's not that everyone is doing it wrong ... though even many who do PerfCap right are crippled by organizational behaviour and GIGO constraints.
      > In some places, PerfCap tends to get done right: technical computing (HPC, HPTC); embedded computing & realtime systems; well-defined tiers with homogeneous workloads.
      > In some places, PerfCap tends to get done wrong: commercial IT - especially around big databases; heterogeneous workloads - some inherently complex, some resulting from consolidation or virtualization.
      > Bob says: "Tiers are for people who have not discovered resource and workload management!"
    • PerfCap / Physics Metaphor
      > Primitivism, pre-science ~ state of the practice: wonder; everything is mystery and magic; underlying causes attributed to nature or deities; stagnant - "Because we've always done it that way".
      > Newtonian physics ~ state of the art: causality; testable hypotheses, repeatable outcomes; mathematical relationships determined; enables the modern era.
      > Einsteinian physics ~ the horizon: relativity; frames of reference; true nature of things theorized, but testability gets harder; propels the post-modern era.
    • Over-Provisioning; So What?
      > Pros: Hardware is cheap - Sun sells hardware - good for Bob! Feature/function time-to-market has priority. Performance expertise is scarce and inconsistent. No time for learning "new tricks". "Throwing iron" at problems has a fixed cost and a set delivery date - and it often "works".
      > Cons: Capital costs. Operational costs (power, cooling, space, administration). Stagnation: the applicable math, science, and vocabulary have ended up deferred - for nearly an entire era.
    • Let's Talk PerfCap
    • PerfCap Language: Goals
      > Business Metrics: system performance in business terms, such as transactions per second, batch run time, or percent of jobs/transactions meeting some performance criterion (Service Level Agreement, or SLA). Business objectives are typically diverse in terms of importance and resource demands.
      > Business Metrics and Indicators (BMIs): business metrics plus secondary indicator variables, such as aggregate packet rate or commit rate. These are observables one might monitor and alarm on.
    • PerfCap: Solving the Right Problem
      > "The Goal" - Goldratt. Written as a novel; an unusual approach to conveying principles from Operations Research.
      > "Are Your Lights On?" - Gause & Weinberg. A fun and easy read, from the same Weinberg as the classic "Psychology of Computer Programming".
    • PerfCap Language: Capacity
      > Some definitions. English: the ability to do a job. Technical: the maximum reliable throughput with acceptable response times. Geek: the throughput limitation of the bottleneck device.
      > Supermarket metaphors: What percent of cashiers should always be idle? What purposes do "express lanes" serve?
      > Submarine metaphor: Compare "100% underwater" with "crush depth"; which one represents capacity?
    • PerfCap Language: Capacity Planning
      > Capacity Planning defined - with footnotes: Estimating[A] capacity requirements[B] in time to be able to order, receive, provision, and deploy - before you run out of capacity.
        [A] Prognostication and prestidigitation, usually based on B.S. forecasts from marketing departments.
        [B] NOTE: Related disciplines increase capacity without capital outlays: efficiency (doing more with less; tuning; optimization) and Software Performance Engineering (SPE), the discipline of engineering to meet performance requirements.
      > It's not all about U! (Utilization) It's mostly about R (response time), X (throughput), service demands, and efficiency (which relates to U) - and The Way Things Work.
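The tie between U, X, and service demand that the slide alludes to is the standard Utilization Law, U = X * S. A minimal sketch - the throughput and service-demand numbers here are illustrative assumptions, not figures from the preso:

```python
# Utilization Law: U = X * S (throughput times per-transaction service demand).
# Numbers below are made up for illustration.

def utilization(throughput_tps, service_demand_sec):
    """Fraction of time the resource is busy."""
    return throughput_tps * service_demand_sec

# 50 tx/sec, each needing 12 ms of CPU:
print(f"U = {utilization(50, 0.012):.0%}")   # ~60% busy

# Efficiency work that halves the service demand halves U at the same X:
print(f"U = {utilization(50, 0.006):.0%}")   # ~30% busy
```

This is why the slide says efficiency "relates to U": the same business throughput can land at very different utilizations depending on service demand.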
    • PerfCap Language: Queuology
      > Queueing theory = the math used for PerfCap work. Too bad it does not have a simple one-word name like arithmetic, calculus, topology, trigonometry, or sadistics (how about "queuology"?).
      > Response time = queue wait + service time: R = W + S. NOTE: This is not Plain English. It must be taught in context to enable meaningful conversations.
      > Bottleneck = scaling constraint. NOTE: This is not Plain English. In PerfCap, this term has no negative emotional connotation.
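The R = W + S decomposition can be made concrete with a tiny numeric sketch - the 45 ms / 5 ms split below is an illustrative assumption, chosen to show a badly queued resource:

```python
# R = W + S: what a caller experiences (response time R) decomposes into
# time spent waiting in queue (W) and time actually being served (S).
# Illustrative numbers, not measurements from the presentation.

wait = 0.045      # 45 ms queued behind other requests
service = 0.005   # 5 ms of actual service
response = wait + service

print(f"R = {response * 1000:.0f} ms "
      f"({wait / response:.0%} of it is queueing, not work)")
```

The point of the vocabulary: utilization alone says nothing about how R splits between W and S, and it is W that ruins SLAs.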
    • PerfCap Language: Crazy about U!
      > Utilization (U): the percent of time a resource is not idle.
      > Physics analogy: Work = Force * Displacement - no displacement means no work.
      > Another physical metaphor: What does a helicopter's engine tachometer tell you about the helicopter's performance?
    • PerfCap Language: U is for Useless?
      > "Utilization is Virtually Useless as a Metric" - Adrian Cockcroft, CMG 2006
        http://perfcap.blogspot.com/2005/12/cmg05-trip-comments-and-utilization-is.html
        http://www.cmg.org/membersonly/2006/papers/6133.pdf
        "We have all been conditioned over the years to use utilization or %busy as the primary metric for capacity planning. Unfortunately, with increasing use of CPU virtualization and sophisticated CPU optimization techniques such as hyper-threading and power management the measurements we get from systems are 'virtually useless'. This paper will explain many of the fundamental alternatives, and express capacity in terms of headroom, in units of throughput within a response time limit."
      > Adrian wins the 2007 CMG Michelson Award
        http://perfcap.blogspot.com/2007/12/a-michelson-award-acceptance-speech.html
        "Those who ask questions about utilization don't understand that their questions have no meaning so the answers are irrelevant :-)"
    • Aggregate Utilization: U-all?
      > Business logic workload classes (e.g. OLTP, BATCH, pseudo-BATCH) vary in business priority, in relative I/O content, and in propensity to compute.
      > Per-class utilization varies based on many system factors (CPU architecture, OS scheduling, space/speed tradeoffs, efficiency tradeoffs, virtualization), and also due to often-uncontrolled competition for resources.
      > Cycles-per-instruction (CPI) varies with compile/build factors and competition factors.
      > Utilization is limited by concurrency of demand and bounded by serialization per Amdahl's Law.
      > Utilization is often largely due to bad app code and/or bugs.
    • Aggregate Utilization: U what !?@#!
      > Overhead categories ...
        ● Polling operations
        ● Lock and latch spins (adaptive)
        ● Locking and latching cache coherency
        ● Memory management (a maze of twisty passages ...)
        ● Re-work (fail-and-retry logic)
        ● Migrations & cache invalidations
        ● Context switches (voluntary and involuntary)
        ● Hardware thread-switching (some cheap, some not): SMP, VMT, SMT, CMT - all different!
        ● Performance monitoring and management tools: significant "probe effect" can occur from some tools, and the aggregate impact of tools is often a root cause of problems
        ● Bad tuning and bugs - outside of the business logic
    • PerfCap Language: Like, U-know?
      > Workload Characterization ...
        ● PerfCap definition: attribution of resource utilization to various distinct business processes or technical functionality. Essential to understanding resource usage.
        ● Engineering definition: characterization of platform response factors under a given workload. Interesting to drive systems engineering.
        ● Vernacular definition: various broad terms like OLTP, BATCH, DSS, DW, PROD, UETP, DVLP, TEST, OLAP, ERP, ETL, ad-hoc, and my personal favourite - "mixed". Suggestive of requirements, but non-quantitative.
    • Hockey Sticks and Knees 4 U
      [Chart excerpted from "Analyzing Computer System Performance" by Neil J. Gunther, Springer-Verlag 2005, ISBN 3540208658. Used with permission.]
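The "hockey stick" shape of Gunther's curves can be reproduced numerically from the textbook single-queue stretch factor R/S = 1/(1 - U). This is a generic illustration of the knee, not a reconstruction of the specific charts on the slide:

```python
# The "hockey stick": for a single open queueing center, the stretch
# factor R/S = 1/(1 - U) stays flat at low utilization, then turns a
# knee and climbs steeply as U approaches 100%.

def stretch(u):
    """Response time as a multiple of service time at utilization u."""
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - u)

for u in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
    bar = "#" * min(40, int(stretch(u)))
    print(f"U={u:4.0%}  R/S={stretch(u):6.1f}  {bar}")
```

Note how little changes between 10% and 80% busy, and how violently things change after 90% - which is exactly why U alone is a poor proxy for how close a system is to its knee.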
    • So, what do U know?
      > Do you know your overhead/work ratio?
      > Do you know your ratio of OLTP to pseudo-BATCH?
      > Do you know how these vary under load?
      > Do you know how to observe, measure, and manage these things?
    • PerfCap Language: Method Rrrrrr!
      Right:
      > Performance: response time, throughput, variance
      > Capacity: latent performance
      > Headroom: (100% capacity) - (current peak performance)
      > Utilization: (100% - %idle)
      Wrong:
      > Performance: CPU %busy, %usr/%sys ratio; IOPS, disk latency, %wio; graphs of aggregated data
      > Capacity: whatever you get at 100% utilization
      > Headroom: (100% - utilization)
      > Utilization: (100% - headroom)
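The right-versus-wrong headroom definitions give starkly different answers on the same system. A sketch with illustrative numbers loosely shaped like the case study that follows (400 and 100 tx/sec are assumptions, not measured values):

```python
# Headroom two ways, per the slide's definitions.
# "Wrong": headroom = 100% - utilization.
# "Right": headroom = capacity (max throughput within SLA) minus current peak,
#          expressed relative to the current peak.
# All numbers are illustrative.

def headroom_wrong(utilization):
    return 1.0 - utilization

def headroom_right(capacity_tps, current_peak_tps):
    return (capacity_tps - current_peak_tps) / current_peak_tps

u = 0.80            # system reads 80% CPU busy
capacity = 400.0    # max throughput still meeting the SLA (found by test-to-fail)
current = 100.0     # current peak throughput

print(f"wrong: {headroom_wrong(u):.0%} headroom")
print(f"right: {headroom_right(capacity, current):.0%} headroom")
```

Same box, same moment: 20% headroom by the utilization arithmetic, 300% by the throughput-within-SLA arithmetic.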
    • Case Study
    • Case Study: Scenario
      > Financial E10K user upgraded to an E2900. CPU power of the E2900 was 125% that of the E10K:
        ● E10K: 64 x US-II (64 "slow" cores)
        ● E2900: 12 x US-IV+ (24 "fast" cores)
      > Result: utilization on the E2900 was greater than on the E10K!
      > Impact: great angst! Management wanted %idle > 20! E2900 dissed. Move to E6900 contemplated. (Focus was on utilization (U) ... response time (R) and throughput (X) were essentially ignored.)
      > Breakthrough! Customer agreed to a test-to-fail exercise: monitor response times per transaction class; increase benchmark workload until the SLA is not met.
    • It's not all about U!
      [Chart: three stacked panels plotted against user count (0-500). Top: transaction class RTX2 response time (SLA = 600 sec, y-axis 0-600). Middle: transaction class RTX1 response time (SLA = 0.5 sec, y-axis 0-0.5). Bottom: CPU utilization (max = 100%). At 80% busy the naive reading is "OMG! 20% headroom?" - but no: the SLA curves show 300% headroom.]
    • Case Study: Experimental Results
      > The new system had plenty of latent capacity! Test-to-fail revealed 300% headroom at 80% utilization. All they needed was 1x headroom at 100 users!
      > Workload characterization revealed that a single CPU-greedy transaction of no business importance was vastly over-achieving its SLA.
      > The CPU-greedy transaction under Solaris TS scheduling automatically fell to priority 0 - thus having zero impact on real OLTP as OLTP demand ramped up to 4x the level that corresponded with 80% aggregate CPU utilization.
      > At the "tipping point", the chaos may have been due to LGWR priority dropping to 0 under Solaris TS scheduling.
    • Case Study: Business Outcome
      > Customer emergency-upgraded to an E6900. CPU power of the E6900 was 200% that of the E10K system. Rumor has it that they got a really good discount. The E6900 showed a "comforting" 20%+ idle under full test load.
      > Moral: Science is often secondary in commercial IT. Due to issues of organizational behaviour, even empirical results might fail to triumph over rules of thumb. The cost of hardware is a minor issue in many IT managers' decision-making processes. Get over it ... or develop new metrics and methods by which IT managers can be made comfortable!
    • Ruminations on the State of the Art
    • Common PerfCap Mistakes
      > Absence of business metrics: what problem are you trying to solve?
      > Equating usage with demand or requirement - in other words, assuming that demand is inelastic.
      > Failure to do performance first and often: why scale waste and inefficiency?
      > Assuming supply is inelastic - in other words, assuming service times are constant.
      > Misinterpreting "the device with the highest utilization is the bottleneck device". Hmm, what about polling loops?
      > Decisions based on intuition and rules of thumb. Sophistication can pay great rewards!
    • What's the Right Way to do PerfCap?
      1) Empirical methods (the best & most expensive): benchmarks, stress testing, test-to-scale, test-to-fail - with known Best Practices & basic performance analysis and tuning.
      2) Modeling (highly recommended & moderate cost): using tools such as TeamQuest Model (TQM), BMC Perform/Predict, HyPerformix, Gunther's PDQ, or other application of proper science and math.
      3) Expert opinions (the minimum & cheapest): listening to the right experts for Best Practices, analysis and tuning methods, and sizing.
      4) Guesswork (the norm): straight-line extrapolations, naïve use of reference benchmarks, massive over-provisioning, bogus testing, luck.
      5) Opportunism (commonplace): spend the available budget.
    • RTFM: PerfCap Resources
      > Dr. Neil Gunther - prolific, readable, digestible
        ● "The Practical Performance Analyst" - foundational. http://www.amazon.com/dp/059512674X/
        ● "Guerrilla Capacity Planning" - http://www.perfdynamics.com/Manifesto/gcaprules.html
    • RTFM: PerfCap Resources
      > Cary Millsap - digestible, practical, methodical
        ● "Optimizing Oracle Performance". Chapters 1 & 2 are a great intro to the art of PerfCap, whether or not one applies it to Oracle. Method R.
    • RTFM: PerfCap Resources
      > Raj Jain - "The Art of Computer Systems Performance Analysis". Fundamental, foundational, readable.
    • When Models Break
      > Good models break due to factors that are exogenous to the model (i.e., not considered). Examples: bus saturation, cache saturation, lock contention, covariance.
      > Bad models break because they are bad models. Examples: "straight line" projections, models that do not consider basic queueing phenomena.
    • What Breaks Existing Models
      > Heterogeneity: there is diversity in both supply and demand factors. For example, OLTP, BATCH, and DSS are classical characterizations for common workload elements.
      > Elasticity: resource supply and demand factors are each elastic. For example, per-transaction demand might diminish under increasing load, and supply might become more efficient.
      > Covariance: competition for resources impacts all competitors - sometimes adversely or pathologically.
    • Heterogeneity, Elasticity, and Covariance
    • Heterogeneity: Many Dimensions
      > Business priority: importance to the enterprise.
      > Service demand: resource requirement, including deadline constraints.
      > Technical priority: Solaris scheduling priority.
      > Quality (versus quantity): not all CPU-seconds are created equal.
      > Urgency: importance, as distinct from priority or share (example: princes and paupers).
    • Heterogeneity: Early Warning Signs
      > "ERP", "Consolidation", "RDBMS", "Ad-hoc", "Custom", "Producer/Consumer", "Client/Server", "Dispatcher thread/process"
      > Testimony to the contrary (e.g. "It's entirely homogeneous OLTP!")
    • Heterogeneity: Example(s)
      # prstat -m
        PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX  SCL SIG PROCESS/NLWP
      13632 oracle    50  50 0.0 0.0 0.0 0.0 0.0 0.0   0   0  48K   0 sqlplus/1
      13633 oracle   0.0  96 0.0 0.0 0.0 0.0  48 0.0   0   0  46K   0 sqlplus/1
      15849 oracle    92 0.1 0.0 0.0 0.0 100 100 0.1  13  45   1K   0 oracle/11
      27639 oracle    91 0.1 0.0 0.0 0.0 100 100 0.1  24  50   2K   0 oracle/11
      13601 root      18  54 0.0 0.0 0.0 0.0  36 0.0 178 178  87K   0 ps/1
      13551 root     0.0  68 0.0 0.0 0.0 0.0  39 0.0 244 195  93K   0 prstat/1
      12614 oracle    64 0.2 0.0 0.0 0.0 100 100 0.1  50  38   3K   0 oracle/11
      24020 oracle    47 0.5 0.0 0.0 0.0 100 100 0.1 190  36  10K   0 oracle/11
      [...]
      11087 oracle   9.3 0.1 0.0 0.0 0.0 0.0  90 0.0   5   6   6K   0 oracle/1
      13490 root     0.0 8.5 0.0 0.0 0.0 0.0  93 0.0 380   0  25K   0 sh/1
       2154 oracle   7.9 0.2 0.0 0.0 0.0 100 100 0.0  53   5   3K   0 oracle/11
       9656 oracle   7.1 0.1 0.0 0.0 0.0 0.0  92 0.0  37   5   2K   0 oracle/1
      24156 oracle   6.7 0.1 0.0 0.0 0.0 100 100 0.0   6   4   2K   0 oracle/11
      13496 oracle   6.2 0.0 0.0 0.0 0.0 0.0  93 0.0 341   0  19K   0 sh/1
      13488 oracle   6.0 0.0 0.0 0.0 0.0 0.0  96 0.0 330   0  19K   0 sh/1
      25478 oracle   3.9 0.1 0.0 0.0 0.0 0.0  96 0.0  46   3   2K   0 oracle/1
       8098 oracle   2.9 0.1 0.0 0.0 0.0 0.0  97 0.0  60   3   2K   0 oracle/1
      [...]
      Total: 295 processes, 2869 lwps, load averages: 11.64, 12.02, 12.05
    • Heterogeneity: Exploring
      Fun commands you can use at home ...
      # Taking U apart
      prstat -n 8192 -m    # Microstate accounting
      prstat -n 8192 -mL   # Per-thread microstate accounting
      # Thread count ...
      awk '{print $15}' < prstat-sample.1 | sort | grep oracle | uniq -c | more
      # CPU intensity ...
      grep oracle/ prstat-sample.1 | awk '{print $3}' | sort -n +1 | uniq -c | more
      # Diverse priorities ...
      ps -e -o pid,class,pri,args
    • Heterogeneity: Deal with It!
      > Identify it: this is one aspect of workload characterization in the language of PerfCap. Consider its many dimensions (business priority, service demand, technical priority, urgency, deadlines).
      > Tell the OS about it: the OS does not know your priorities, so tell it! Automating this is a good investment.
      > Model it: w.r.t. competition and covariance - TBD.
    • Elasticity: Supply Factors
      > In general, "supply" is net of competing demands. "I'm giving ya all I got, captain!" FCFS - who got in line first?
      > In a specific configuration, elastic factors abound: with mixed-speed CPUs, Q(CPU-second) = f(MHz); with CMT, Q(CPU-second) = f(core loading); Q(CPU-second) = f(ISA & pipeline sophistication).
      > Unmanaged, the probability of thread pinning will increase with increasing interrupt load.
    • Elasticity: Supply Factors
      > Priority preemption ...
        ● Good - under TS, compute hogs will drift to priority 0.
        ● Bad - unmanaged, a large population of homogeneous threads may frivolously preempt each other.
        ● Ugly - interrupts have top priority; they can even interrupt and "pin" realtime (RT) threads.
        ● Hideous - it's really tragically bad when TS demotes your highest-importance thread (e.g. Oracle LGWR).
    • Elasticity: Supply Factors
      # mpstat 5
      CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
        0    0   0  211  449  142 423    4   21   25   0   460  17   2  0  82
        1    1   0  127  155    2 296    2    6   23   0   199  13   1  0  86
        2    0   0   30   30    0  56    0    3    9   0    64   1   0  0  98
        3    0   0    0    2    0   2    0    1    4   0     0   0   0  0 100
        8    1   0  199  278    0 548    4   11   37   0   470  23   1  0  76
        9    0   0    0    2    0   2    0    1    4   0     0   0   0  0 100
       10    0   0   30   53    0 104    0    3   11   0   155   4   0  0  95
       11    0   0    0    2    0   2    0    1    3   0     0   0   0  0 100
       16    1   0  178  258    0 508    3   10   29   0   521  16   1  0  82
       17    0   0    3    5    3   4    0    1    6   0     2   0   0  0 100
      [...]
      104    1   0  222  194    4 377    1    6   28   0   281  16   1  0  83
      105    0   0    0    2    0   2    0    1    2   0     0   0   0  0 100
      106    0   0    0    3    0   4    0    1    3   0    13   0   0  0 100
      107    0   0    0    2    0   2    0    1    2   0     0   0   0  0 100
      112    1   0  141  229    1 451    2    3   23   0   289  18   1  0  81
      113    0   0    1    3    1   2    0    1    1   0     0   0   0  0 100
      114    0   0    0    6    0   9    0    2    2   0     3   0   0  0 100
      115    0   0    0    2    0   2    0    1    1   0     0   0   0  0 100
      120    4   0  397  409    3 804    4    3   44   0   450  23   3  0  74
      121    0   0    1    3    1   2    0    1    2   0     0   0   0  0 100
      122    0   0   13   15    0  28    0    2    3   0    13   1   0  0  99
      123    0   0    0    2    0   2    0    1    1   0     0   0   0  0 100
    • Elasticity: Supply Factors
      $ awk '{print $3,$4}' ps-sample.out | sort | uniq -c | sort -nr +2
         1 RT 157   <- Important!
         1 RT 140
         1 RT 100
         1 SYS 98
         1 SYS 96
         3 TS 60
         2 FX 60
         1 SYS 60
      8238 TS 59    <- Primary modality; OLTP shadows
         1 TS 58
         3 TS 54
        11 TS 53
         2 TS 52
         1 TS 51
         6 TS 50
        14 TS 49
         1 TS 36
         1 TS 34    <- LGWR: "Hey! Wait a minute! I'm really important! Why didn't anyone tell the OS?"
         1 TS 29
         1 TS 22
         1 TS 12
         3 TS 0     <- CPU hogs, punished by TS. "Help!"
      $ grep lgw ps-sample.out
      10494 1 TS 34 ora_lgwr_XYZP
      NOTE: ps-sample.out data was from: ps -e -o pid,ppid,class,pri,args
    • Elasticity: Demand Factors
      > "The mythical CPU-second" ...
        ● Sensitivity to compile options - e.g. branch mispredicts, pipelining, inlined macro-operations versus library calls.
        ● Sensitivity to link options - e.g. locality versus I$ and D$ behaviour.
        ● Sensitivity to competition - could be viewed as elasticity of demand or supply, or as covariance ... depending on one's point of view.
        ● Adaptive algorithms - e.g. decisions to yield and re-queue (rather than spin) might be made as a function of system load - and that can reduce the CPU-sec/transaction as load increases.
    • Elasticity: Demand Factors
      > Under high load, frivolous migrations should decrease, leading to improved cache utilization and reduced memory waits.
      > Demand can vary in both quality (overhead/work) and quantity (overhead+work) as load is varied: ratio of business logic to spins for locks and latches; write coalescing by LGWR; checkpoint write deferral by DBWR.
    • Elasticity: Deal with It!
      > Demand: seek out and destroy inefficiency - but keep the 80/20 rule in mind. Use Resource Management (RM) at the app, OS, and DB levels - maybe Oracle Resource Manager (ORM)? The final constraints are the speed of your components and the speed of light.
      > Supply: invest in getting required factor-level QoS to various processes in relation to their business criticality.
    • Covariance: Pigs at the Trough
      > Workloads are often unmanaged and multi-modal. The spectrum is wide, but the simple case is BATCH vs. OLTP.
      > What if your OLTP SLA outliers are due to I/O competition from your BATCH? Maybe your BATCH is being over-served for I/O? Maybe you could throttle your BATCH I/O demands?
      > What if your BATCH SLA outliers are due to CPU competition from your OLTP? Maybe your OLTP is being over-served for CPU? Maybe you could dynamically compromise on your OLTP CPU priority?
    • Covariance: Some Examples
      > "Foxes and chickens" problem: mixing incompatible species in the same cage.
      > Most famously: "batch versus OLTP". I/O demand by batch is what typically slows OLTP, but CPU demand by batch should not impact OLTP. OLTP demand for I/O or CPU might impact batch.
      > Harder to see: "cache-sensitive" versus "cache-polluting" competition. Cache-sensitive workload elements can be slowed by elements that constantly spoil the cache.
      > Heads-up! Virtualization means increased sharing!
    • Covariance: Deal with It!
      > Expensive: physical segregation and isolation. E.g. run BATCH or reports on another system; dedicate disks, channels, buses, and CPU to business or technical functions as required.
      > Primitive: temporal segregation and isolation. E.g. run BATCH at night.
      > Refined: prioritization, throttling, deadline scheduling. E.g. run BATCH at low priority, inject delays, and increase priorities as deadlines get closer.
    • Concluding Remarks
    • Parting Thoughts
      > Participate in CMG: http://www.cmg.org - "Ignorance of the law is no excuse!"
      > Go where you may not have gone before: test-to-fail; analyse; fix or manage; repeat.
      > If you are not managing to Business Metrics, you are wasting time and energy!
    • Q&A? Special Thanks to ...
      > Adrian Cockcroft, Cary Millsap, Jim Holtman, Dr. Neil Gunther - mentors and provocateurs.
      > David J. Miller, Benoit Chaffanjon - editorial services & peer review.
      > Glenn Fawcett - smoke-jumping brotherhood & cool graphics.
      > Jim Mauro - northern star.
      > Larry Klein - inspiration from "It's all about U" ... and in general.
    • Extended Discussion Slides
    • Primitivism
      "You might be a redneck if ..."
      > You think "capacity" is when you pass out.
      > You cannot imagine why anyone would model a cue.
      > You have only seen a queue on Hop Sing or David Carradine.
      > You believe chaos past 80% utilization is a law of nature.
      > You make no effort whatsoever to control what's important to you.
      [... with a tip of the hat to Jeff Foxworthy ...]
    • "Some people think that once they know the tricks of the trade, they know the trade." "A little bit of knowledge can be a dangerous thing."
    • Paths Forward
      > Increased education in PerfCap: math, science, language/vocabulary. "Do performance first, then capacity."
      > Increased usage of available tools: extract benefits, learn limitations, develop art.
      > Increased networking amongst stakeholders: build awareness of what can go wrong; seek synergy.
      > Breaking new ground: CMT and virtualization challenges; power management; automating workload management; "PerfViz" - a CMG focus area; "Regarding Capacity" - our focus for the rest of the hour ...
    • Water Glass Metaphors
      > Is it 50% full or 50% empty? CMG-speak: is it 80% busy, or 20% under-utilized?
      > "Big Rocks": demonstrates heterogeneity and priority.
    • Two Views of "Best Practices"
      > Bob Sneed's:
        ● "Best Practices are time-proven and customer-proven practices which are well-documented and believed to have little or no downside potential."
        ● "... practical workarounds for product design limitations"
        ● "... contrast with 'just works; needs no practices'"
        ● "... contrast with tuning, which implies trial and error"
      > Dr. Neil Gunther's:
        ● "Best Practices are an admission of failure."
        ● "... trading workarounds, practices, and rules of thumb does not advance the science or deepen understanding"
        ● "... contrast with decomposing, understanding, modeling, proper engineering"
        ● "... just another form of trial and error"
    • Pop Quiz #1
      > SITUATION: A system runs at 100% CPU usage for 1 hour each day completing a single compute-bound task. The SLA requires the task to complete in 4 hours.
      > Q1: How much "headroom" does this system have?
      > Q2: How can this task's resource footprint be managed to never exceed 80% CPU usage?
    • Pop Quiz #1: Answers
      > SITUATION: A system runs at 100% CPU usage for 1 hour each day completing a single compute-bound task. The SLA requires the task to complete in 4 hours.
      > Q1: How much "headroom" does this system have?
      > A1: 300% (in workload terms) or 75% (in percent-of-system terms) - it can do 4x the work it now does and remain within the SLA.
      > Q2: How can this task's resource footprint be managed to never exceed 80% CPU usage?
      > A2a: Huh? Why would anyone want to do that?
      > A2b: Resource management.
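The Pop Quiz #1 arithmetic, checked numerically (using only the numbers given in the quiz):

```python
# Pop Quiz #1: one compute-bound task runs the CPU at 100% for 1 hour/day;
# the SLA allows 4 hours. How much headroom?

busy_hours = 1.0
sla_hours = 4.0

# Workload terms: how much MORE work fits before the SLA is breached,
# relative to current work.
workload_headroom = (sla_hours - busy_hours) / busy_hours

# Percent-of-system terms: unused fraction of the SLA-bounded capacity.
system_headroom = (sla_hours - busy_hours) / sla_hours

print(f"{workload_headroom:.0%} headroom in workload terms")
print(f"{system_headroom:.0%} in percent-of-system terms")
```

Both readings agree with the slide's A1: 300% and 75% are the same fact expressed against different denominators.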
    • Pop Quiz #2
      > SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system.
      > Q1: What is the new system's projected CPU utilization?
      > Q2: How can this system's workload be managed to never exceed 75% CPU utilization?
    • Pop Quiz #2: Answers
      > SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system.
      > Q1: What is the new system's projected CPU utilization?
      > A1: 100%. Each of the four compute-bound threads will keep one CPU 100% busy.
      > Q2: How can this system's workload be managed to never exceed 75% CPU utilization?
      > A2a: Huh? Why would anyone want to do that?
      > A2b: Resource management.
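The Pop Quiz #2 reasoning in code form - the point being that a compute-bound thread saturates exactly one CPU regardless of how fast that CPU is:

```python
# Pop Quiz #2: four compute-bound threads move from an 8-way box to a
# 4-way box. A compute-bound thread always occupies one whole CPU, so
# the new box reads 100% busy before the OLTP is even counted.

new_cpus = 4
compute_bound_threads = 4

pinned = min(compute_bound_threads, new_cpus)       # CPUs fully occupied
projected_utilization = pinned / new_cpus
print(f"projected utilization >= {projected_utilization:.0%}")
```

This is the straight-line-extrapolation trap: total BogoMIPs doubled, yet utilization goes up, because the naive projection ignores the structure of the demand.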
    • Pop Quiz #3
      > SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. (Same as last quiz, OK?)
      > Q1: How will the compute-bound threads' performance be impacted by the upgrade? (Just roughly speaking - no need for precision here!)
    • Pop Quiz #3: Answers
      > SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. (Same as last quiz, OK?)
      > Q1: How will the compute-bound threads' performance be impacted by the upgrade? (Just roughly speaking - no need for precision here!)
      > A1: They should run almost 4x faster. Each new CPU is 4x faster than the old ones: (2000/4)/(1000/8) = 4. The OLTP will use some of the CPU cycles, but its service demand pales next to the compute jobs.
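The per-CPU speedup arithmetic from A1, checked:

```python
# Pop Quiz #3: per-CPU speed ratio between the new 4-way 2000-BogoMIPs
# box and the old 8-way 1000-BogoMIPs box.

old_per_cpu = 1000 / 8   # 125 BogoMIPs per CPU on the old box
new_per_cpu = 2000 / 4   # 500 BogoMIPs per CPU on the new box
speedup = new_per_cpu / old_per_cpu

print(f"each compute-bound thread runs ~{speedup:.0f}x faster")
```

Pop Quizzes #2 and #3 together are the punchline: the same upgrade makes aggregate utilization look worse while making the work that matters run 4x faster.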
    • Pop Quiz #4
      > ESSAY QUESTION: "At what point do these principles become difficult?"