A review of power & energy
consumption optimization in HPC

                   Rishi Pathak
                 riship@cdac.in
 National PARAM Supercomputing Facility, C-DAC, Pune


    Symposium on HPC Applications – IIT Kanpur
                 March 12 - 14, 2012
Top 10 – Top500
Top 10 – Green 500
[Bar chart: GF per Watt for ranks 1-10 of the Green 500 vs. the Top 500 lists (series: "Green 500, Rank 1-10 (GF per Watt)" and "Top 500, Rank 1-10 (GF per Watt)"); GPU-accelerated systems are marked. Green 500 values run roughly 0.95-2.02 GF/W, Top 500 values roughly 0.25-0.85 GF/W.]
Exascale system
• Likely to be feasible by 2017±2
• 10-100 Million processing elements (cores or mini-
  cores)
• Chips perhaps as dense as 1,000 cores per socket
• Clock rates will grow more slowly
• Large-scale optics-based interconnects
• 10-100 PB of aggregate memory
• Sustained performance per watt of ~100 GF/W
• Total system power of 10 – 100 MW (worked out below)
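The 10 – 100 MW figure follows directly from the efficiency target; a minimal back-of-the-envelope check (all inputs are the slide's round targets, not measurements):

```python
# Back-of-the-envelope check of the exascale power figures on this slide.
# All inputs are the round targets quoted above, not measured values.

target_flops = 1e18            # 1 exaflop/s sustained
efficiency_gf_per_watt = 100   # ~100 GF/W target

power_watts = target_flops / (efficiency_gf_per_watt * 1e9)
print(f"Power at 100 GF/W: {power_watts / 1e6:.0f} MW")            # -> 10 MW

# At ~2 GF/W (the 2012 Green 500 leader), the same machine would need:
print(f"Power at   2 GF/W: {target_flops / 2e9 / 1e6:.0f} MW")     # -> 500 MW
```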
Power & Energy
• E = P × T
• Energy (E) consumed in time (T) at average power (P)
• For a given average power, minimizing the time interval limits the energy (worked example below)
• The minimum achievable T for an application depends on:
   – Mapping of the application onto the cluster system
   – Scalability & system bottlenecks
• Beyond that – power management approaches
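A minimal numeric illustration of E = P × T (the power and runtime values are invented for illustration): a better mapping that shortens the run saves energy even if average power rises slightly, which is why minimizing T comes before any power-management knob.

```python
# Illustrative only: average node power and runtimes are invented numbers.
def energy_kwh(avg_power_kw, runtime_hours):
    """E = P * T, with P taken as the average power over the run."""
    return avg_power_kw * runtime_hours

# Same job, two mappings of the application onto the cluster:
poor_mapping = energy_kwh(avg_power_kw=300, runtime_hours=10)  # 3000 kWh
good_mapping = energy_kwh(avg_power_kw=310, runtime_hours=7)   # 2170 kWh

print(poor_mapping, good_mapping)
# Even at slightly higher average power, the shorter run consumes far less energy.
```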
Power management techniques
• Static Power Management (SPM)
   – Low-power CPUs
   – Local flash storage
   – Suitable for data-centric applications
• Dynamic Power Management (DPM)
   – Software & power-scalable components
   – Dynamically adjust power consumption
   – Frequency & voltage scaling for CPU & memory
DVFS
• Dynamic Voltage & Frequency Scaling
• Dynamic power: P = C × V² × f (illustrated below)
• Throttle when the workload is:
   – Not CPU bound
   – Not very CPU intensive
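A rough illustration of why DVFS pays off: because voltage must roughly track frequency, dynamic power falls close to cubically with f, while a non-CPU-bound phase slows down much less than linearly. The capacitance value and the (V, f) operating points below are invented for illustration, not real processor data.

```python
# Dynamic CPU power: P = C * V^2 * f  (activity factor folded into C here).
# C and the (V, f) operating points are illustrative assumptions.
C = 1.0e-9  # effective switched capacitance in farads (illustrative)

def dynamic_power(v_volts, f_hz):
    return C * v_volts**2 * f_hz

high = dynamic_power(1.20, 2.6e9)   # nominal operating point
low  = dynamic_power(0.95, 1.6e9)   # throttled point for a memory-/IO-bound phase

print(f"high: {high:.2f} W, low: {low:.2f} W, ratio: {high/low:.2f}x")
# Frequency drops ~1.6x but power drops ~2.6x, so when the phase is not
# CPU bound the energy per unit of work goes down.
```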
DVFS Scheduling
• Off-line, trace-based scheduling
   – Source code instrumentation for performance profiling
   – Execution with profiling
   – Determination of appropriate processor frequencies for each phase (sketched below)
   – Source code instrumentation for DVFS scheduling


S. Huang & W. Feng – Proc. Cluster Computing [IEEE/ACM] (2009)
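A minimal sketch of the off-line idea (a simplified illustration, not the exact published algorithm): the profiling run labels each phase with how CPU-bound it is, and the scheduler then picks, per phase, the lowest available frequency whose predicted slowdown stays inside a user-given bound; DVFS calls are instrumented at the phase boundaries.

```python
# Hypothetical sketch of off-line, trace-based frequency selection.
# The phase data and the slowdown model are illustrative assumptions.

FREQS_GHZ = [1.6, 2.0, 2.4, 2.6]   # available P-states (assumed)
MAX_SLOWDOWN = 0.05                # user-specified performance-loss bound

def predicted_slowdown(cpu_bound_fraction, f, f_max):
    # Simple model: only the CPU-bound fraction of the phase stretches when
    # the clock is lowered; memory/communication time is unaffected.
    return cpu_bound_fraction * (f_max / f - 1.0)

def pick_frequency(cpu_bound_fraction):
    f_max = max(FREQS_GHZ)
    for f in sorted(FREQS_GHZ):    # try the lowest frequency first
        if predicted_slowdown(cpu_bound_fraction, f, f_max) <= MAX_SLOWDOWN:
            return f
    return f_max

# Phases extracted from an instrumented profiling run (illustrative numbers):
phases = [("dense solve", 0.95), ("halo exchange", 0.15), ("checkpoint I/O", 0.05)]
for name, frac in phases:
    print(f"{name:15s} -> {pick_frequency(frac)} GHz")
```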
DVFS Scheduling
• Run-time, profiling-based scheduling
   – Time-window-based performance prediction model (sketched below)
   – No a priori information about application phases
   – Mispredictions directly hurt performance or energy efficiency
   – Metrics:
      · MIPS & CPU utilization
      · Interception of MPI communication calls
      · File I/O calls
      · MPI receive wait cycles
   – Shown to reduce energy within a pre-specified performance-loss constraint
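A generic sketch of a time-window, run-time DVFS governor in the spirit described above (not the CPU MISER implementation): each window it measures how compute-bound the workload was, predicts the next window from recent history, and lowers the frequency only while the predicted performance loss stays under the constraint. The measurement hooks are hypothetical placeholders for hardware counters and MPI/I/O interception.

```python
# Generic sketch of run-time, window-based DVFS (illustrative only).
from collections import deque

FREQS_GHZ = [1.6, 2.0, 2.4, 2.6]
LOSS_BOUND = 0.05          # pre-specified performance-loss constraint
history = deque(maxlen=4)  # recent per-window CPU-boundedness samples

def choose_frequency(predicted_cpu_bound):
    f_max = max(FREQS_GHZ)
    for f in sorted(FREQS_GHZ):
        if predicted_cpu_bound * (f_max / f - 1.0) <= LOSS_BOUND:
            return f
    return f_max

def on_window_end(measured_cpu_bound):
    """Called every time window with the fraction of cycles that were
    compute-bound (i.e. not MPI wait or file I/O)."""
    history.append(measured_cpu_bound)
    predicted = sum(history) / len(history)   # simple moving-average predictor
    return choose_frequency(predicted)

# Example: a phase change from compute-heavy to MPI-wait-heavy windows.
for sample in [0.9, 0.9, 0.2, 0.1, 0.1]:
    print(f"measured {sample:.1f} -> next window at {on_window_end(sample)} GHz")
```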
DVFS Implementations
• Memory MISER (Management Infra-Structure for Energy Reduction)
• CPU MISER
• Linux CPUSPEED (sysfs interface sketched below)
• Ecod
• Beta-Algorithm

M. E. Tolentino, J. Turner & K. W. Cameron – Proc. of the 4th International Conference on Computing Frontiers (2007)
S. Huang & W. Feng – Proc. Cluster Computing [IEEE/ACM] (2009)
C. Hsu & W. Feng – Proc. of the 2005 ACM/IEEE Conference on Supercomputing
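For reference, userspace tools such as CPUSPEED ultimately drive the standard Linux cpufreq sysfs interface; a minimal sketch is below. It assumes root privileges, a cpufreq-enabled kernel with the "userspace" governor, and a driver (e.g. acpi-cpufreq) that exposes scaling_available_frequencies.

```python
# Minimal sketch of driving DVFS through the Linux cpufreq sysfs interface.
# Requires root and a kernel/driver exposing the attributes used below.
CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/{attr}"

def read(cpu, attr):
    with open(CPUFREQ.format(cpu=cpu, attr=attr)) as f:
        return f.read().strip()

def write(cpu, attr, value):
    with open(CPUFREQ.format(cpu=cpu, attr=attr), "w") as f:
        f.write(str(value))

if __name__ == "__main__":
    cpu = 0
    print("available:", read(cpu, "scaling_available_frequencies"))  # in kHz
    print("current:  ", read(cpu, "scaling_cur_freq"))
    # Switch to the userspace governor and request the lowest frequency:
    write(cpu, "scaling_governor", "userspace")
    lowest_khz = min(int(khz) for khz in
                     read(cpu, "scaling_available_frequencies").split())
    write(cpu, "scaling_setspeed", lowest_khz)
```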
Enhancements in DVFS
• Dynamic frequency scaling per core
   – Each core runs at its own clock
   – Power is linear with frequency (at a fixed voltage)
   – Power savings are relatively small
• Separate power planes for the core and "uncore" parts of the CPU
   – Cores can go to sleep (C-states); residency can be inspected as sketched below
   – The memory controller remains operational for external devices (e.g. via DMA)
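Whether cores actually reach these sleep states can be checked from userspace; a small sketch reading the stock Linux cpuidle sysfs counters (state names and residencies depend on the CPU, idle driver, and kernel):

```python
# Sketch: report per-C-state residency for one core via the Linux cpuidle
# sysfs interface. Output depends on the CPU, idle driver, and kernel.
import glob, os

def cstate_residency(cpu=0):
    base = f"/sys/devices/system/cpu/cpu{cpu}/cpuidle"
    for state_dir in sorted(glob.glob(os.path.join(base, "state*"))):
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(state_dir, "time")) as f:
            usec = int(f.read().strip())    # total time spent in this state (us)
        print(f"{name:8s} {usec / 1e6:10.1f} s")

if __name__ == "__main__":
    cstate_residency(cpu=0)
```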
Enhancements in DVFS
• Clock gating
   – Clock-disabled sleep states (AMD: C1/C1E; Intel: C0/C1/C3/C6)
   – At the CPU block level
   – At the core level
   – Reduces dynamic power
• Power gating
   – Power to the CPU/core cut off (~0 V)
   – Reduces both dynamic and static (leakage) power
Nehalem core sleep states
AMD's and Intel's techniques
Power optimization at NPSF
• Scheduler capable of:
   – Powering off a node after a pre-specified period of idleness (no job); policy sketched below
   – Power optimization subject to QoS (turnaround time)
   – Node power-on time (2-3 min) is an additional overhead
• Targeted power policies
   – Aggressive optimization without regard to QoS
   – Power capping
   – Power budget
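A minimal sketch of the idle-node power-off policy above. The data structures are hypothetical, not the NPSF scheduler's code; NODEIDLEPOWERTHRESHOLD is the scheduler parameter that appears in the simulation table later, and the power-on overhead is the 2-3 minutes quoted on this slide.

```python
# Hypothetical sketch of the idle-node power-off policy; numbers and
# structures are illustrative.
from dataclasses import dataclass

NODE_IDLE_POWER_THRESHOLD_MIN = 8   # "NODEIDLEPOWERTHRESHOLD" (Case I below)
POWER_ON_OVERHEAD_MIN = 3           # node power-on time from the slide

@dataclass
class Node:
    name: str
    idle_minutes: float             # time since the last job finished

def should_power_off(node, minutes_until_next_expected_start):
    """Power off only if the node has idled past the threshold AND the next
    queued job that could use it will not start before the node could be
    booted back, protecting the QoS / turnaround-time constraint."""
    if node.idle_minutes < NODE_IDLE_POWER_THRESHOLD_MIN:
        return False
    return minutes_until_next_expected_start > POWER_ON_OVERHEAD_MIN

print(should_power_off(Node("cn042", idle_minutes=12), 30))  # True
print(should_power_off(Node("cn043", idle_minutes=12), 1))   # False
```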
Power optimization at NPSF
• Node packing via checkpointing, migration & restart
   – MPI with BLCR – one approach
   – Virtualization – another approach
   – Considerations:
      · Remaining walltime of the job being migrated
      · Remaining walltime of jobs on the node under consideration
      · Cost of migration versus the expected power savings (decision sketched below)
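A minimal cost-benefit sketch of the node-packing decision. All quantities and the linear cost model are illustrative assumptions; in practice the checkpoint cost and node power would come from measurement.

```python
# Illustrative node-packing decision: migrate only when the energy freed by
# vacating and powering off a node outweighs the checkpoint/restart overhead.
def migration_pays_off(remaining_walltime_hr, freed_node_power_kw,
                       checkpoint_restart_hr, migration_energy_kwh):
    # Energy the freed node would otherwise draw for the rest of the job:
    saved_kwh = freed_node_power_kw * remaining_walltime_hr
    # Cost: extra runtime for checkpoint/restart plus the transfer itself:
    cost_kwh = migration_energy_kwh + freed_node_power_kw * checkpoint_restart_hr
    return saved_kwh > cost_kwh

# A long-running job is usually worth moving; a nearly finished one is not.
print(migration_pays_off(10.0, 0.25, 0.2, 0.3))   # True
print(migration_pays_off(0.5,  0.25, 0.2, 0.3))   # False
```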
Saving Potential
Simulation Results – Plot
Simulation Results – Table
Parameter \ Case                       Case I    Case II    Case III
Power saving (%)                         4.05       4.22        9.29
NODEIDLEPOWERTHRESHOLD (minutes)            8          6           4
Power optimization at NPSF
• Feedback-driven policy engine
   – Speculative power on/off of nodes at any given time (sketched below)
   – Metrics/deciding factors:
      · Function of job arrival times & resource requirements
      · How many nodes are needed, and when
      · Current and probable cluster utilization at a given time
   – Expected start time of jobs in the queue
   – Minimize the impact on job turnaround time
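A minimal sketch of the feedback loop described above. It is hypothetical: the arrival-rate forecast, the lead time, and all names are illustrative, not the NPSF policy engine. The idea is to estimate how many nodes must be online within the node power-on lead time, from jobs already queued plus a simple arrival forecast, and boot nodes just ahead of need so turnaround time is not hurt.

```python
# Hypothetical feedback-driven policy sketch; prediction model is illustrative.
import math

POWER_ON_LEAD_MIN = 3   # node power-on time; boot this far ahead of need

def nodes_needed_soon(queued_jobs, horizon_min, arrival_rate_per_min,
                      avg_nodes_per_job):
    """Nodes required within the horizon: jobs already queued and expected to
    start, plus a simple forecast of jobs arriving meanwhile."""
    queued = sum(j["nodes"] for j in queued_jobs
                 if j["expected_start_min"] <= horizon_min)
    forecast = arrival_rate_per_min * horizon_min * avg_nodes_per_job
    return queued + math.ceil(forecast)

def plan(online_nodes, powered_off_nodes, queued_jobs,
         arrival_rate_per_min, avg_nodes_per_job):
    need = nodes_needed_soon(queued_jobs, POWER_ON_LEAD_MIN,
                             arrival_rate_per_min, avg_nodes_per_job)
    if need > online_nodes:
        return ("power_on", min(need - online_nodes, powered_off_nodes))
    return ("hold", 0)

queue = [{"nodes": 8, "expected_start_min": 2},
         {"nodes": 16, "expected_start_min": 45}]
print(plan(online_nodes=6, powered_off_nodes=20, queued_jobs=queue,
           arrival_rate_per_min=0.1, avg_nodes_per_job=4))
# -> ('power_on', 4): 8 queued nodes needed within the lead time, plus ~2 forecast.
```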
Job Arrival Time
PARAM Yuva – Access & Account
https://yuva.cdac.in/
Technical Affiliation Scheme
Thank You
npsfhelp@cdac.in
