Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Runtime Methods to Improve Energy Efficiency in HPC Applications

109 views

Published on

The principal means to increase performance of modern high performance computing (HPC) applications is to use more processors in parallel. However, energy use increases linearly with the number of processing cores. Energy costs for top supercomputers already exceed 10M EUR per year, and the operating costs and carbon footprint of next generation systems (exaflop scale) are a major concern.

HPC applications use a parallel programming paradigm like the Message Passing Interface (MPI) to coordinate computation and communication among thousands of processors. With dynamically-changing factors both in hardware and software affecting energy usage of processors, there exists an opportunity for power monitoring and regulation at runtime to achieve savings in energy.

In this talk, an adaptive runtime framework is described that enables processors with core-specific power control to reduce power with little or no performance impact. Two opportunities to improve the energy efficiency of processors running MPI applications are identified - computational workload imbalance and memory system saturation.

Published in: Engineering
  • Be the first to comment

  • Be the first to like this

Runtime Methods to Improve Energy Efficiency in HPC Applications

  1. 1. Two short talks on current topics in Computer Science Jan Prins Department of Computer Science University of North Carolina at Chapel Hill 1 Runtime Methods to Improve Energy Efficiency in Supercomputing Applications 2 Computational Methods in Transcriptome Analysis
  2. 2. Runtime Methods to Improve Energy Efficiency in HPC Applications Sridutt Bhalachandra1, Robert Fowler2, Stephen Olivier3, Allan Porterfield2, and Jan Prins1 1Department of Computer Science, University of North Carolina at Chapel Hill 2Renaissance Computing Institute, Chapel Hill 3Sandia National Laboratories May 8, 2018
  3. 3. computing performance: 120 years of exponential growth! Runtime methods to improve energy efficiency 2
  4. 4. What is driving performance growth? Moore’s “law” - transistor density doubles every 18-24 months Runtime methods to improve energy efficiency 3
  5. 5. What is driving performance growth? Moore’s “law” - transistor density doubles every 18-24 months Dennard scaling - total power remains the same and maximum operating frequency increases Runtime methods to improve energy efficiency 3
  6. 6. What is driving performance growth? Moore’s “law” - transistor density doubles every 18-24 months Dennard scaling - total power remains the same and maximum operating frequency increases but look what has happened over the past two decades Runtime methods to improve energy efficiency 3
  7. 7. The end of Dennard scaling and faster transistors Consequences additional transistors require additional area power and heat increase commensurately parallel computing is the only route to scaling performance multicore processors multiprocessor nodes interconnection networks Runtime methods to improve energy efficiency 4
  8. 8. Scalable parallel computing Message Passing Interface (MPI) is used to coordinate computation and communication among all processor cores Runtime methods to improve energy efficiency 5
  9. 9. Current largest parallel computer Source: http://www.nsccwx.cn/wxcyw Sunway Taihulight 40,960 nodes 10,649,600 cores (256+4 per node) at 1.45GHz 20PB storage $273 million Top500 #1 93.01 PFLOPS @ 15.4MW 1 PetaFLOPS (PFLOPS) = 1015 Floating Point Operations Per Second 1 MegaWatt (MW) can roughly power 1000 homes Runtime methods to improve energy efficiency 6
  10. 10. Exascale (1018 FLOPS) power requirements System/Site Performance (PFLOPS) Power (MW) Energy Efficiency (GFLOPS/W) Exascale 1000 ? ? Taihulight 93 15 6 Tianhe 2 34 18 2 Piz Daint 20 2 9 Runtime methods to improve energy efficiency 7
  11. 11. Exascale (1018 FLOPS) power requirements System/Site Performance (PFLOPS) Power (MW) Energy Efficiency (GFLOPS/W) Exascale 1000 ? ? Taihulight 93 15 6 Tianhe 2 34 18 2 Piz Daint 20 2 9 TSUBAME 3.0 2 0.14 14 kukai 0.46 0.03 14 AIST AI Cloud 0.96 0.08 13 Runtime methods to improve energy efficiency 7
  12. 12. Exascale (1018 FLOPS) power requirements System/Site Performance (PFLOPS) Power (MW) Energy Efficiency (GFLOPS/W) Exascale 1000 20 50 Taihulight 93 15 6 Tianhe 2 34 18 2 Piz Daint 20 2 9 TSUBAME 3.0 2 0.14 14 kukai 0.46 0.03 14 AIST AI Cloud 0.96 0.08 13 Runtime methods to improve energy efficiency 8
  13. 13. Exascale (1018 FLOPS) power requirements System/Site Performance (PFLOPS) Power (MW) Energy Efficiency (GFLOPS/W) Exascale 1000 20 50 Taihulight 93 15 6 Tianhe 2 34 18 2 Piz Daint 20 2 9 TSUBAME 3.0 2 0.14 14 kukai 0.46 0.03 14 AIST AI Cloud 0.96 0.08 13 5x - 10x improvement in energy efficiency required Runtime methods to improve energy efficiency 8
  14. 14. breakdown of power use in a large parallel computer Source: Use Case: Quantifying the Energy Efficiency of a Computing System -Hsu et al. Runtime methods to improve energy efficiency 9
  15. 15. Opportunity to save energy “Race to the end” in parallel regions each processor core operates on data in its node each processor maximizes speed while staying within thermal limit all processors spinwait on lock at end of the region last processor to arrive releases the lock Runtime methods to improve energy efficiency 10
  16. 16. Computational workload imbalance could be inherent in application could be due to system heterogeneity exacerbated by the race to the end Runtime methods to improve energy efficiency 11
  17. 17. Saving energy by mitigating workload imbalance Runtime methods to improve energy efficiency 12
  18. 18. Saving energy by mitigating workload imbalance Challenges each core is set to operate at a suitable frequency based on previous phase observation the frequency can change at every phase Runtime methods to improve energy efficiency 12
  19. 19. Fine grained power control Dynamic Duty Cycle Modulation (DDCM) – T-states − Actual clock rate is not changed, DVFS and TurboBoost still operational − Modulation range constant across architecture - 100% to 6.25% − IA32_CLOCK_MODULATION MSR Runtime methods to improve energy efficiency 13
  20. 20. Fine grained power control Dynamic Duty Cycle Modulation (DDCM) – T-states − Actual clock rate is not changed, DVFS and TurboBoost still operational − Modulation range constant across architecture - 100% to 6.25% − IA32_CLOCK_MODULATION MSR DVFS - core specific (Haswell) – P-states − Can slow only non-critical cores − Operational range machine-dependent even for the same architecture − acpi_cpufreq kernel module Runtime methods to improve energy efficiency 13
  21. 21. Runtime control policy Core-specific control − match a core’s effective duty cycle to its workload Duty cycle = Time core in active state Total time (clock cycles) ∗Change core active time using DDCM or clock cycles using DVFS Work = Compute time Compute time + Idle time (constant frequency) Effective Work = Compute time Compute time + Idle time ∗ Max frequency Current frequency Runtime methods to improve energy efficiency 14
  22. 22. Runtime policy Assumes similar behavior across successive phases Policy calculation local to core, no communication Runtime methods to improve energy efficiency 15
  23. 23. Runtime policy Assumes similar behavior across successive phases Policy calculation local to core, no communication Combined policy (PowerDVFS < PowerDDCM ) − Use DVFS policy until lowest frequency reached − Thereafter, use DDCM policy Runtime methods to improve energy efficiency 15
  24. 24. Adaptive Core-specific Runtime (ACR) ACR = Runtime Policy + User Options 1 Can monitor performance degradation at the end of every phase − Rudimentary method to detect phase change 2 Can induce minimum phase length limit − Useful in skipping start-up phases 3 Support for user-annotations − However, not used in current experimentation ∗ Runtime is transparent, eliminating the need for code changes to MPI applications Runtime methods to improve energy efficiency 16
  25. 25. Experimental Setup Mini-apps & Applications Unstructured grids – MiniFE, HPCCG, AMG Structured grids – MiniGhost Mesh Refinement – MiniAMR Hydrodynamics – CloverLeaf − mini-apps representative of key production HPC applications Dislocation Dynamics – ParaDis System 32 Haswell node partition (Sandia Shepard) = 1024 cores − Dell M420: two 16-cores Xeon E5-2698v3 128GB at 2.3GHz − RHEL6.8, Slurm 2.3.3-1.18chaos and Linux 3.17.8 kernel − Mpich 3.2 Results are average of 12 runs taken at stable temperatures (to promote reproducibility) Runtime methods to improve energy efficiency 17
  26. 26. ParaDis results Runtime methods to improve energy efficiency 18
  27. 27. ParaDis results Runtime methods to improve energy efficiency 18
  28. 28. ParaDis critical path on 24 nodes (768 cores) - Default 0.51.01.52.02.5 Phase ComputeTime(s) 0 200 400 600 800 1000 1200 250030003500 AverageFrequency(MHz) Runtime methods to improve energy efficiency 19
  29. 29. ParaDis critical path on 24 nodes (768 cores) - Default 0.51.01.52.02.5 Phase ComputeTime(s) 0 200 400 600 800 1000 1200 250030003500 AverageFrequency(MHz) Bimodal distribution of critical path times < 1.0s and > 1.0s Successive phases are similar, with only occasional jumps Average critical path frequency (Default) = 2507.4MHz Runtime methods to improve energy efficiency 19
  30. 30. ParaDis critical path on 24 nodes (768 cores) - DVFS 0.51.01.52.0 Phase ComputeTime(s) 0 200 400 600 800 1000 1200 2200260030003400 AverageFrequency(MHz) Average critical path frequency (Default) = 2467.3MHz Runtime methods to improve energy efficiency 20
  31. 31. ParaDis critical path on 24 nodes (768 cores) - DDCM 0.51.01.52.0 Phase ComputeTime(s) 0 200 400 600 800 1000 1200 250030003500 AverageFrequency(MHz) Very low frequency on non-critical cores for prolonged periods reduces variation, and increases available thermal headroom for critical cores Average critical path frequency (Default) = 2784.8MHz Runtime methods to improve energy efficiency 21
  32. 32. Mitigating workload balance average results across all experiments Policy %Power reduced %Energy saved %Time increase Temp decrease (C) DDCM 19.3 15.1 5.3 3.2 DVFS 20.5 20.2 0.5 3.3 Combined 24.9 22.6 2.9 4.2 ACR demonstrates that dynamic control of power at runtime is possible Runtime methods to improve energy efficiency 22
  33. 33. Mitigating workload balance average results across all experiments Policy %Power reduced %Energy saved %Time increase Temp decrease (C) DDCM 19.3 15.1 5.3 3.2 DVFS 20.5 20.2 0.5 3.3 Combined 24.9 22.6 2.9 4.2 ACR demonstrates that dynamic control of power at runtime is possible At Exascale, runtimes such as ACR will allow − more work to be run at one time by using less power − individual applications to run faster by allowing a higher thermal headroom on critical cores Runtime methods to improve energy efficiency 22
  34. 34. Mitigating workload balance average results across all experiments Policy %Power reduced %Energy saved %Time increase Temp decrease (C) DDCM 19.3 15.1 5.3 3.2 DVFS 20.5 20.2 0.5 3.3 Combined 24.9 22.6 2.9 4.2 ACR demonstrates that dynamic control of power at runtime is possible At Exascale, runtimes such as ACR will allow − more work to be run at one time by using less power − individual applications to run faster by allowing a higher thermal headroom on critical cores Energy optimization can also be performance optimization Runtime methods to improve energy efficiency 22
  35. 35. Saving energy in memory-bound applications Many HPC applications are memory-bound Memory operations are seldom visible to OS/runtime − Power wasted in CPU while waiting on memory Approach − sample table of request occupancy in memory subsystem − use DVFS to slow non-critical cores Published work 1 Improving Energy Efficiency in Memory-constrained Applications Using Core-specific Power Control (E2SC 2017) Runtime methods to improve energy efficiency 23
  36. 36. Acknowledgements Ph.D. research of Sridutt Bha- lachandra (Argonne National Labs Exascale Group) Published work 1 An Adaptive Core-specific Runtime for Energy Efficiency (IPDPS 2017) 2 Using Dynamic Duty Cycle Modulation to improve energy efficiency in High Performance Computing (HPPAC 2015) Runtime methods to improve energy efficiency 24

×