CPU QoS 1.1

These slides were delivered at Computer Measurement Group (CMG) regional meetings in Richmond, Virginia, and Raleigh, North Carolina, in 2009 as a call to arms for capacity planners to look beyond the CPU-second. Way beyond!

1. CPU QoS (Quality of Service)
   Bob Sneed - Sr. Staff Engineer
   Sun Microsystems, Inc. - Systems Quality Office
   Southern Area CMG Meetings
   September 24, 2009 @ Richmond, VA
   September 25, 2009 @ Raleigh, NC
   Rev 1.1; October 27, 2009
   Copyright © 2009 by Sun Microsystems, Inc. All Rights Reserved.
2. Abstract
   This is a discussion of the qualitative aspects of a CPU-second. These low-level metrics
   are critical to understanding efficiency and capacity, yet they receive no consideration
   in mainstream CP (capacity planning) practices! Examples here are from Sun Solaris, but
   the concerns have a much broader scope.
3. Disclaimers
   Opinions and views expressed herein are those of the author, Bob Sneed, and do not
   represent any official opinion of Sun Microsystems, Inc. I'm not a doctor - and I don't
   even play one on TV, but I am a huge fan of Tom Baker and Chris Eccleston. There is no
   warranty, expressed or implied, in the quality of the information herein, or its fitness
   for any given purpose. If you goof up applying this stuff and have a bad outcome or
   destroy a bunch of data - it's not my fault or Sun's. This is version 1.0 material.
   Batteries not included. Your mileage may vary (YMMV).
4. What About Bob?
   • Bob works in the Systems Quality Office (SQO) at Sun Microsystems, Inc; 13-year veteran at Sun
     > Main focus is real-world performance and capacity issues
     > monitor root causes and their cures
     > promote Best Practices
     > work with ISVs on performance- and capacity-related matters
     > assist with performance-related service incidents
     > provide feedback to engineering and marketing
     > travel to teach/share/fix
     > SQO colleagues are among Sun's top trouble-shooters
   • See also: http://blogs.sun.com/bobs
     > (Sorry; it has been on pause for many months!)
5. Agenda
   • Context & Motivations
   • Metrics & Measurements
   • Tales of CPU QoS
   • Conclusions
6. Context & Motivations
7. End-User QoS
   At the risk of stating the obvious ...
       Q_User = Σ N_1·Q_1 + N_2·Q_2 + N_3·Q_3 + N_4·Q_4 + ...
   or  Q_User = Σ N_I/O·Q_I/O + N_Net·Q_Net + N_Mem·Q_Mem + N_CPU·Q_CPU + ...
   ... where N = Quantity, and Q = Quality
   NOTE: These equations are notional, not mathematical, so relax ... plus ... they are obvious!
   In Plain English: "Well, it depends!" or "A chain is only as strong as its weakest link."
8. CPU QoS and Capacity
   • Capacity ∝ Efficiency
     > 100% busy is 50% capacity if efficiency could be doubled
     > 100% busy is 50% capacity if SLA is exceeded by 2X
     > 100% busy is 25% capacity if both of these are true
   • Amdahl's Law
     > Scaling is limited by the serial portion of the work
     > Shouldn't serial sections get special attention for efficiency?
     > The benefit to the user of optimizing a part of the work is limited by the dominance
       of the part being optimized
     > Yeah, but a 2X reduction looks great on the bottom line!
   • Parkinson's Law
     > Work expands to fill the time (CPU) available
     > Not very 'green', eh?
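   A worked illustration of the Amdahl's Law bullet above (the parallel fraction p and the
   speedup s of the parallel part are hypothetical numbers, not from the deck):

       \[
         \text{Speedup} = \frac{1}{(1 - p) + p/s}, \qquad
         p = 0.9,\; s = 4 \;\Rightarrow\; \frac{1}{0.1 + 0.225} \approx 3.1, \qquad
         \lim_{s \to \infty} \text{Speedup} = \frac{1}{1 - p} = 10
       \]

   Even unlimited parallel capacity is capped at 10X here, which is why the serial sections
   deserve special attention for efficiency.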
9. Quality vs. Quality
   • Qualities of Time
     > Start time; early or late?
     > End time; on-time?
     > Interruptions versus focus and flow?
     > Productive versus wasted or overhead?
     > Variance; predictable versus erratic or skewed?
   • "Quality Time"
     > Doing stuff that matters ...
     > ... using time of the right qualities
10. Quality vs. Quality Analogues (Qualities of Time → Qualities of Computing)
   > Start time; early or late? → Dispatch latency
   > Interruptions versus focus and flow? → Timeslice expiration or preemption; interrupts
   > Productive versus wasted or overhead? → "Business logic" versus "overhead" and "waste"?
   > End time; on-time? → SLA; attained or missed?
   > Variance; predictable versus erratic or skewed? → Controlled on purpose, or not?
   > "Quality Time": doing stuff that matters, using time of the right qualities →
     Priorities: critical path to SLA, controlled to minimize variance
11. Controlling CPU QoS
   • Developer-level                         ["Out of my control!" ☹]
     > Algorithmic efficiency
     > Data structures and locality factors
     > Compile & link-time efficiency factors
     > Platform-specific APIs and pre-optimized libraries
     > Hints to the OS
   • Operations-level                        ["Maybe; how?"]
     > OS scheduling factors
     > ISV-provided tunables
     > Competition factors; competing workloads, virtualization
   • Architectural-level                     ["Huh?"]
     > NUMA effects
     > Cache size, organization, and usage
     > Specific CPU considerations
     > Specific system architectural factors
12. Metrics & Measurements
13. A Thread in Heaven: Ideal CPU QoS
   • The scheduler is not interrupting me often
     > I have a nice big quantum, and my priority is high
     > My CPU pipeline is not being flushed by any sort of context switches
     > The scheduler is not migrating me to a cold cache
   • My compute is highly register-to-register
   • No hardware interrupts are interrupting me
   • I'm not doing things that cause global cache coherency events
   • My memory references are hitting nicely in L1 or L2 caches
     > My performance-tuned data structures are paying off!
     > No other threads sharing my caches are spoiling them for me
     > Data and instruction pre-fetching is working well for me
     > My partner threads are leaving our shared data in my cache
   • My L2 misses have tight locality in large pages, so I'm not waiting much on TLB remaps
   • My L2 misses are in low-latency local memory in this NUMA architecture
   • My branch predictions are working out really well, statistically speaking
   • My programmer inlined a few frequently-used small functions
     > That's saving me some function calls and keeping my I$ locality tight
   • That new compiler I came through taught me a lot of new tricks!
     > I'm keeping multiple instruction units busy at the same time
     > I'm getting multiple instruction completions per cycle on this superscalar CPU
   • It must be my birthday; I want a pony!
14. Scheduling: Some Useful Metrics
   • With Solaris microstate accounting, prstat -mL shows per-thread, among other things ...
     > Scheduling Latency (LAT) - wait time for a compute-ready thread to execute
     > Involuntary Context Switches (ICX) - rate of thread interruption by threads of higher
       priority or for exceeding their scheduling quantum

       PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
     19798 oracle    49  46 3.5 0.0 0.0 0.0 0.0 0.9 496  1K .4M   0 oracle/1
     19800 oracle    31  63 6.1 0.0 0.0 0.0 0.0 0.4   0  1K .8M   0 oracle/1
     19788 oracle    35  30 4.1 0.0 0.0 0.0  19  12  4K  2K .3M   0 oracle/1
     19790 oracle    36  26 1.9 0.0 0.0 0.0  27 8.6  4K  2K .3M   0 oracle/1
     19796 oracle    35  28 5.3 0.0 0.0 0.0  27 4.9 818  1K .2M   0 oracle/1
      4172 oracle   3.8  41  20 0.0 0.0 0.0  27 8.4  5K  2K 33K   0 tnslsnr/1
      1779 root     0.1 1.2 0.1 0.0 0.0 0.0  98 0.4 169  27  1K   0 init.cssd/1
         1 root     0.3 0.8 0.0 0.0 0.0 0.0  99 0.1 549  13 19K 573 init/1
      2893 oracle   0.8 0.2 0.0 0.0 0.0 0.0  98 1.3  1K  37 14K  1K oracle/1
     ...
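   A minimal way to sample these per-thread scheduling-quality metrics over time; the
   interval, count, and PID below are arbitrary examples, not values from the deck:

       $ prstat -mL 5 12              # microstate columns, one line per LWP, 12 five-second samples
       $ prstat -mL -p 19798 5 12     # the same, restricted to a single process of interest

   Watching LAT and ICX trend across samples, rather than in one snapshot, is what makes
   them useful as quality indicators.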
15. Scheduling: Solaris Preemption Control
   • Preemption Control API
     > See: schedctl_init(3C) et al - allows a process to tell the OS it's in a critical
       section and should get extra time if its quantum expires
     > If the thread does not yield the CPU after getting a reprieve, future requests are ignored
     > "Baked in" to some products, including Oracle
     > No user action required; it's a programmed-in thing
   • Challenge questions:
     > What if someone's capacity model is based on data from a system where preemption
       control has been disabled for some key process?
     > How could one measure if that had happened?
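   A minimal sketch of how an application opts in to preemption control through the
   schedctl(3C) interfaces named above; the lock and the critical section are hypothetical
   placeholders, and error handling is omitted:

       #include <schedctl.h>
       #include <pthread.h>

       static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

       void critical_update(void)
       {
           static __thread schedctl_t *sc;     /* one schedctl handle per LWP */

           if (sc == NULL)
               sc = schedctl_init();           /* register this LWP for preemption control */

           schedctl_start(sc);                 /* hint: please hold off preemption briefly */
           pthread_mutex_lock(&lock);
           /* ... short critical section ... */
           pthread_mutex_unlock(&lock);
           schedctl_stop(sc);                  /* hint ends; yield now if the quantum already expired */
       }

   The slide's point stands: this is compiled into the application, so operations staff see
   its effect only indirectly, in metrics such as ICX.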
16. Scheduling: What Matters Most?

   $ ps -e -o class,pri | sort | uniq -c | sort -nr +2
      1 RT  157
      1 RT  140
      1 RT  100
      1 SYS  98
      1 SYS  96                  Important!
      3 TS   60
      2 FX   60                  Oracle's log writer?
      1 SYS  60
   8238 TS   59   \
      1 TS   58   |
      3 TS   54   |              Primary modality;
     11 TS   53   |              OLTP processes
      2 TS   52   |
      1 TS   51   |
      6 TS   50   |
     14 TS   49   /
      1 TS   36   \
      1 TS   34   |              CPU hogs;
      1 TS   29   |              demoted by the
      1 TS   22   |              TS scheduler
      1 TS   12   |
      3 TS    0   /

   "Hey! Wait a minute! That's really important! Why didn't anyone tell the OS? Help!"

   $ ps -e -o pid,ppid,class,pri,args | grep lgw
   10494     1 TS  34 ora_lgwr_XYZP

   SOLUTION: Force LGWR into FX 60 as a Best Practice!
   NOTE: Snapshot 'ps -e -o pid,ppid,class,pri,args' to a file for analysis; these details
   change rapidly!
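   A hedged sketch of the "force LGWR into FX 60" remedy; the PID is the one from the
   slide's example output and would of course differ on a real system:

       $ ps -e -o pid,ppid,class,pri,args | grep lgwr
       $ priocntl -s -c FX -m 60 -p 60 -i pid 10494    # fixed-priority class, priority 60, for that PID

   The same priocntl invocation with '-i uid' or '-i projid' can sweep up a whole group of
   processes, which is how such a Best Practice is usually institutionalized.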
17. Memory QoS
   • CPU QoS is tightly coupled with memory QoS
     > "If Mama ain't happy, ain't nobody happy" - Dr. Phil
   • Memory QoS metrics are not widely known
     > Popular naiveté underlies the belief that memory is "flat"
18. Memory QoS: Metrics
   • Hardware-level memory quality factors
     > Latency; varies with technologies and backplane
     > Base technology (eg: DDRn vs. FB-DIMM)
     > Locality; NUMA proximity
     > Coherency events
     > Memory interleave
   • OS-level factors
     > TLB remap rate
     > segmap remap rate (newer mechanisms now in use)
     > Faults: major, minor
     > ISM vs. DISM
     > Hardware page size
   • Swap page-in latency
     > ... when this matters, you're in trouble!
19. Memory QoS: Observability
   • Lots of instrumentation in the Solaris OS ...
     > Some well-documented and end-user accessible
       > trapstat - TLB maintenance overhead
       > pmap -x - page sizes and other qualities
       > ppgsz - platform-specific page sizes
       > ipcs - IPC allocations & characteristics
       > ps - footprint; RSS ... wait ... that's quantity, not quality
     > Some very propeller-headed
       > DTrace - able to probe kernel and userland
       > kstat - many memory-related counters
       > cpustat - performance of caches
       > busstat - memory controllers and cache coherency events
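   A few of the end-user-accessible tools above can be exercised with no special setup; the
   target here is the current shell, purely as an example:

       $ pagesize -a          # hardware page sizes this platform supports
       $ pmap -xs $$          # per-mapping sizes plus the hardware page size in use
       $ ipcs -a              # System V IPC allocations, including shared memory segments

   The propeller-headed tools (DTrace, kstat, cpustat, busstat) generally need root or
   specific privileges, and their counter names vary by platform.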
20. Memory QoS: High-Order Bits
   • TLB management generally shows up as %usr
     > Mapping a page involves a table-walk lookup, a TLB 'shootdown', and instantiation of
       the new mapping in the MMU
     > High %usr is not necessarily a good thing!
   • Rate of TLB misses varies with the HW page sizes used
     > Mapping larger pages less often is less overhead than mapping small pages more often
   • TLB observability varies
     > On most SPARC processors, use 'trapstat -T' in Solaris
     > Many modern processors feature "hardware table walk", making remaps far more
       efficient, but less observable
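   A minimal sampling run for the SPARC case above (requires sufficient privilege; the
   interval and count are arbitrary):

       $ trapstat -T 5 3      # per-page-size TLB miss activity, three 5-second samples

   The %tim columns are the quickest read: they estimate how much CPU time is going to TLB
   miss handling, i.e. quality lost inside what utilization reports as busy.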
21. OS-Level Counters: ISA Emulation
   • Instruction Emulation: not all instructions in an Instruction Set Architecture (ISA,
     or 'CPU family') are implemented on all CPU models
     > Chip designers save transistors by not implementing certain complex, rarely-used,
       or deprecated opcodes
     > Unimplemented opcodes trap into emulation code, costing many more cycles than native
       instructions
     > Emulator traps are counted in Solaris using the kstat facility
     > Common candidates for emulation include ...
       > Visual Instruction Set (VIS) opcodes
       > Certain Floating-Point (FP) opcodes
     > Suggested reading ...
       > http://www.sun.com/blueprints/1205/819-5144.pdf
       > http://blogs.sun.com/travi/entry/corestat_for_ultrasparc_t2
       > http://docs.sun.com/source/817-6702/ncg_sparc.html (historical)
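   The kstat statistic names for emulation-trap counters vary by platform; an exploratory
   search is one way to discover what your system exposes (the grep patterns below are
   guesses, not documented names):

       $ kstat -p | egrep -i 'emul|unimpl' | head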
22. OS-Level Counters: ISA Augmentation
   • Special-purpose extensions to an ISA may be designed into some CPUs for acceleration
     of certain tasks, like encryption
     > Leveraging such extensions typically requires a vendor-supplied, platform-optimized
       library - and some application-specific configuration
     > OS-level counters and chip-level counters may both be available to assess the
       utilization of such extensions
     > As with multiple CPU functional units, such features blur the concept of CPU
       utilization as a simple percentage
     > Example: Cryptographic acceleration extensions on Sun's CoolThreads(TM) series CMT CPUs
       > See: "Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors"
23. Chip-Level Counters: Typical
   • Specifics vary enormously by chip ISA and model
     > Clock rate (may vary with power management)
     > Cycles (1/clock_rate)
     > Instructions
     > DERIVED: Cycles-per-Instruction (CPI)
     > DERIVED: Millions of Instructions per Second (MIPS)
     > Branch mispredictions
     > L1 cache misses
     > L2 cache misses
     > NOTE: These often explain or correlate with CPI and MIPS
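   A worked example of the two derived metrics above (the clock rate and CPI values are
   illustrative only):

       \[
         \mathrm{CPI} = \frac{\text{cycles}}{\text{instructions}}, \qquad
         \mathrm{MIPS} = \frac{\text{clock rate}}{\mathrm{CPI} \times 10^{6}}
       \]

   A 1.2 GHz hardware thread observed at CPI = 3 is retiring 1.2x10^9 / 3 = 400 million
   instructions per second, i.e. 400 MIPS, regardless of what %busy says.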
24. CPI: Bigger isn't Better
   • High cycles-per-instruction (CPI) implies memory waits and/or long-running/complex
     instructions
   • Low CPI isn't necessarily better ...

         while (!white_of_their_eyes)
             ;                          // Hard poll
         fire_our_guns();

   • Context is everything!
   • Profiling tells you where the time is going; low-level metrics can give important
     clues about why
25. Chip-Level Counters: Constraints
   • In general, not accessible from virtual environments
     > Access usually requires privilege in the primary host context or domain (dom0,
       control domain, global zone, etc.)
   • In general, limited counters available per sample
     > On-chip counters tend to be plentiful, but need to be mapped for retrieval through
       limited windows
   • They are counters, not rates; post-processing is required
     > perl and awk scripts and spreadsheets are popular
   • Counter names can vary on the chip designer's whims
     > They are neither stable nor standard, even between chips in the same family
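   As a sketch of the post-processing point above, assuming you have already reduced a
   counter tool's output to a two-column text file of "timestamp cumulative-count" samples
   (the file name and format are hypothetical):

       $ awk 'NR > 1 { print ($2 - c) / ($1 - t) } { t = $1; c = $2 }' samples.txt

   Each output line is the event rate (counts per second) between successive samples.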
26. Chip-Level Counters: Access in Solaris
   • cpustat - basic counter access
   • cputrack - counters as they change for a process
   • busstat - counters on non-CPU components
   • kstat - scalable mechanism for OS-level counters
   • DTrace with libcpc extensions - potential to correlate low-level measurements with
     anything else
     > per-thread, per-vcpu
     > per schedule interval
     > per transaction (with some work)
     > per transaction class (with a lot of work)
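   Typical invocations look like the following; EVENT_NAME is a placeholder, since the
   names valid on your CPU must be taken from 'cpustat -h':

       $ cpustat -h                                    # list the counter events this CPU supports
       $ cpustat -c EVENT_NAME 5 12                    # whole-system: 12 samples at 5-second intervals
       $ cputrack -T 5 -N 12 -c EVENT_NAME -p 19798    # the same counters, attributed to one process

   Both emit raw counter values rather than derived rates or ratios, so computing CPI or
   miss rates still falls to your own scripts or spreadsheets.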
27. Chip-Level Counters: Bedtime Reading
   • 'cpustat -h' lists the counters available on the current CPU, but counter details vary
     from chip to chip
   • SPARC CPUs
     > [PDF] UltraSPARC IV+ Processor User's Manual Supplement
     > [PDF] SPARC64™ VI Extensions
     > [PDF] SPARC64™ VII Extensions
     > [HTML] Using busstat to Monitor Performance Counters for UltraSPARC T2 Plus External
       Coherency Hub Architecture
   • x64 CPUs
     > Intel and AMD implementations are different
     > See chip-specific manufacturer documents
28. Chip-Level Counters: Rollup Tools
   • har - Hardware Activity Reporter - reports MIPS and more on selected CPUs - SPARC and x64
     > http://blogs.sun.com/openomics/entry/cpu_hardware_counter_stats
   • corestat - for SPARC CoolThreads CPUs, shows actual core utilizations
     > http://cooltools.sunsource.net/corestat
   • EMON - for Intel x64, an Intel-proprietary toolkit for observing low-level counters
     > http://software.intel.com/en-us/articles/code-downloads/
29. Tales of CPU QoS
30. Common Themes
   • Capacity thinking is done in "MIPS", but the real problem is often elsewhere
     > latency
     > scheduling
     > algorithm scalability
   • Based on simple capacity models, upgrades often disappoint
     > "More iron" was not the best strategy
     > "Work smarter, not harder" - always worth investigating
31. Case Study: A Famous Compute-Hog
   • SPARC-specific Oracle Bug #6814520
     > DSS workload was exhibiting disappointing bandwidth
     > Profiling revealed huge elapsed time in a checksum routine
     > Low-level counters revealed severe memory waits
   • Diagnosis: Old "hand-rolled" assembler code for checksum validation had no data
     prefetch hints, resulting in really high memory-wait for data!
   • Remedy & Payoff: Added a prefetch hint for a 4X speedup on some CPUs. Used a standard
     optimized compile of generic C code to get a 16X speedup on other CPUs.
   • Challenge: Might this have first been seen via low-level metrics?
     > Bonus Q: How long did this issue go unnoticed, and why?
32. Case Study: Partner Pairs
   • Problem: Disappointing throughput on a messaging app with many producer/consumer pairs
     on a large SMP
   • Diagnosis: Messages written by producers suffered high latency being read by
     consumers, which were migrating freely around the system
     > Cache-to-cache copies were inefficient on the host architecture
     > Co-locating producers and consumers yielded a 4X gain
   • Remedy & Payoff: A custom daemon was written to dynamically co-locate
     producer/consumer pairs; the result was 3X improved aggregate throughput.
   • Challenge: What would capacity planners have done without this analysis?
33. Case Study: Foxes and Hens
   • Problem: Disappointing throughput on a messaging app with many producer/consumer pairs
     on a small SMP
   • Diagnosis: Consumers and producers were invalidating each other's cache contents,
     resulting in high rates of cache misses
   • Remedy & Payoff: Segregating producers and consumers into distinct processor sets
     yielded a 4X gain by keeping the respective caches warmed to each task
   • Challenge: How would one diagnose this without low-level metrics?
34. Checkpoint ...
   • Regarding the last two examples ...
     > Same approximate problem description ... but completely opposite problem resolution
       (segregation versus integration)
     > Shared concepts: proximity or juxtaposition of ...
       > processes to CPU resources
       > processes to memory resources
       > processes to other processes
     > Real estate: "Location, location, location."
     > Comedy: "Timing."
35. Juxtaposition Games You Can Play
   • How well does it run when bound to a dedicated CPU?
     > In Solaris, a processor set (psrset) can be used
     > A psrset can be set 'nointr' for immunity from hardware interrupts
     > A psrset can be made to contain all HW threads associated with a pipeline, core,
       socket, or system board
     > Memory will tend to be local with the Memory Placement Optimization (MPO) policy
       'set lgrp_mem_pset_aware=1' in /etc/system
   • If the answer is "a lot better", hypotheses include ...
     > Excessive migrations were spoiling its caches
     > Remote memory latency was a problem
     > Interrupt handlers were preempting it; possibly even "pinning" it (ie: preventing it
       from migrating to an idle thread)
   • Challenge questions:
     > How useful is "utilization" versus "CPU QoS" in such cases?
     > How might future tools automate such sensitivity analyses?
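   A hedged sketch of the "bind it to a dedicated CPU" experiment; the processor IDs, set
   id, and PID are placeholders for whatever psrinfo, psrset, and ps report on your system,
   and these commands need root privilege:

       $ psrinfo                    # list processor IDs and their state
       $ psrset -c 8 9 10 11        # create a processor set from chosen HW threads; note the set id
       $ psrset -f 1                # make set 1 'nointr' - shielded from hardware interrupts
       $ psrset -b 1 19798          # bind the process under test into set 1
       $ psrset -d 1                # delete the set when the experiment is done

   Comparing the workload's throughput and its LAT/ICX/migration behavior inside and
   outside the set is the sensitivity analysis the slide is asking about.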
36. Case Study: How Much Concurrency?
   • Problem: Oracle Parallel Query produced disappointing results on a 64-way CMT chip
   • Diagnosis: An excessive Degree-of-Parallelism (DOP) was being used, causing CPU to go
     increasingly to overhead categories (context switches, migrations, excessive spins on
     mutexes, etc.)
     > The 'corestat' utility was used to observe low-level utilization while varying DOP
   • Remedy & Payoff: Stop increasing DOP when the CPUs' theoretical MIPS limits are
     reached and deploy other ...
   • Challenge: ISV defaults did not anticipate the thread density of modern CMT systems
37. Case Study: How Much Concurrency?
   • Problem: Oracle Parallel Query produced non-linear scaling results on a large VMT
     system with a DSS workload
   • Diagnosis: A classic low-CPI DSS workload saturated a CPU core running only its
     'primary' hardware thread
     > 'cpustat' showed many thread-switching operations and L2 cache saturation
   • Remedy & Payoff: Turning off the secondary thread on each CPU eliminated the negative
     scaling
   • Challenge: The architectural differences between different multi-core, multi-thread
     CPUs can be highly significant!
38. Checkpoint ...
   • Regarding the last two examples ...
     > Same approximate problem description ... but the resolution was highly
       architecture-dependent
     > Shared concept: workload-specific architectural impact
     > "If it hurts when you do that, don't do that!"
     > "Work smarter, not harder!"
     > "Size matters."
39. Case Study: The Sneaky Leak
   • Problem: A batch application takes far too long; only 1/4 of a month's data can be
     processed per month!
   • Diagnoses: Numerous
     > The sub-batch hash was terrible; the longest-running sub-batch took weeks for only a
       portion of the month's data!
     > Memory leak! Longer-running jobs develop ever-worse memory locality, shown by a
       trend in cache misses.
   • Remedies & Payoffs: Implement a better hash; run many more, smaller jobs.
     Shorter-running jobs suffer less from the memory leak and keep all SMP threads busy.
     One month's data now processes in two days!
40. "... ay, there's the rub ..."
   • Developers have many options ...
     > Application architecture
     > Algorithm selection
     > Data structure design
     > Hinting the execution environment
     > Compile-time optimizations
       > Minimal target chip architecture
       > Selective function inlining
       > Conditional compilation (eg: #ifdef)
       > Compiler version: newer is better!
     > Link-time optimizations
       > Feedback-optimized linking
       > Platform-optimized libraries
   • ... but Capacity Planners are all too often far removed from the developers, ISVs, or
     application teams
41. Controlling CPU QoS
   • Developer-level                         ["Out of my control!" ☹]
     > Algorithmic efficiency
     > Data structures and locality factors
     > Compile & link-time efficiency factors
     > Platform-specific APIs and pre-optimized libraries
     > Hints to the OS
   • Operations-level                        ["Maybe; how?"]
     > OS scheduling factors
     > ISV-provided tunables
     > Competition factors; competing workloads, virtualization
   • Architectural-level                     ["Huh?"]
     > NUMA effects
     > Cache size, organization, and usage
     > Specific CPU considerations
     > Specific system architectural factors
42. Closing Remarks & Call to Action
43. So, Here's the Thing ...
   • There must be many undiagnosed cases in the wild of the kinds of issues illustrated by
     the cases cited here!
     > Where so, by implication, the prevailing capacity plan may be 2X - 8X inflated due
       to unexploited latent capacity
   • Bob says ...
     > "How can anyone confine themselves to the realm of 'utilization' and 'cpu-seconds'
       and truly believe they are properly - or optimally - managing the high-order aspects
       of capacity?"
     > "Low-level metrics and cycle accounting are the frontier of performance analysis and
       QoS management."
44. Call to Action!
   • Find tools for observing low-level metrics on your critical production platforms
   • Poke around on those systems to see what's 'normal' at the system and workload level
   • See what correlations you can find between low-level metrics and end-user QoS
   • See if you can diagnose some mechanisms linking your low-level and high-level QoS
   • Let us know what you discover! Bring it to CMG!
   • EXTRA CREDIT: What can tool vendors and modelers do with these metrics?
45. Q&A?

   Copyright © 2007 by Sun Microsystems, Inc. All rights reserved.
