Optimizing Energy for High Performance Applications
Discovering when to Compute Green
What is HPC? Welcome to our world
• Aerospace and Space
• Automotive
• Oil and Gas
• EDA
• Weather and climate
• Financial
• Defence
• Government Labs
• Life sciences
• Academic
Energy in HPC
The world’s top 500 supercomputers cost 400M€ annually in energy alone
If software reduces its energy footprint … payback could be enormous
Solution
Enable developers and users to improve application energy consumption
Our tools
Debug Tune
Profile
Develop
Two Key Questions
• Can developers optimize code for energy?
• Can owners and users tune applications for energy?
What is energy?
Approximations for Energy
Heuristics
• Floating point, vector operations, memory access
• L1 or L2 misses vs main memory: orders of magnitude apart in energy
Low level measurement
• Real data from some processors, memory subsystems, accelerators
• Available in kernel - Intel RAPL
Server level monitoring
• PDU and server level readings
• Real data – real energy
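As a rough illustration of the low level measurement route, the sketch below reads the Intel RAPL package counter that Linux exposes through the powercap sysfs interface. The domain path `intel-rapl:0`, the helper names and the single-wraparound handling are assumptions made for this example, not part of the slides; reading `energy_uj` may also need elevated privileges on some systems.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain (Linux powercap interface).
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(domain, name):
    """Read an integer microjoule value from a powercap attribute file."""
    return int((Path(domain) / name).read_text())


def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Counter difference in microjoules, allowing for at most one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    return end_uj + max_range_uj - start_uj


def measure_package_energy(workload, domain=RAPL_DOMAIN):
    """Run workload() and return (joules, seconds) for the CPU package."""
    max_range = read_uj(domain, "max_energy_range_uj")
    e0 = read_uj(domain, "energy_uj")
    t0 = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - t0
    e1 = read_uj(domain, "energy_uj")
    return energy_delta_uj(e0, e1, max_range) / 1e6, elapsed
```

Calling `measure_package_energy(lambda: sum(range(10**7)))` on a machine with RAPL support would return the package energy and wall time for that loop. It covers the CPU package only, so it is still an approximation of the server level figure.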
Optimizing Time
Capture performance
• Profiler creates application profile
• Allinea MAP records multiple processes
Find bottlenecks
• Source code viewer pinpoints key consumers
• Timelines find unusual patterns
Optimize
• Rewrite key loops
• Reorganize memory access patterns
• Change algorithms
CPU Package and System Metrics
Whole System
Power Usage
CPU Package
Power Usage
Coprocessor Metrics
• Coprocessors and accelerators
– NVIDIA CUDA GPU
– INTEL XEON PHI
• Devices provide kernel access to power
– HIGH POWER CONSUMPTION WHEN ACTIVE
– LOW POWER CONSUMPTION WHEN IDLE
– VERY EFFICIENT IN FLOPS PER WATT
• System now has variable energy usage to consider
– OPTIMIZATION FOR TIME - IS THE GPU ROUTE QUICKER?
– OPTIMIZATION FOR ENERGY - WHICH IS MOST EFFICIENT?
• (GPU + SERVER power) × GPU time
• Or SERVER power × CPU time?
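The comparison above comes down to energy = average power × runtime for each route. A minimal sketch follows, assuming the server draws the same average power in both routes (a simplification) and using made-up wattages and runtimes purely for illustration.

```python
def energy_joules(avg_power_watts, runtime_seconds):
    """Energy is average power multiplied by the time it is drawn."""
    return avg_power_watts * runtime_seconds


def compare_routes(server_watts, gpu_watts, gpu_time_s, cpu_time_s):
    """Return (GPU-route energy, CPU-route energy) in joules."""
    gpu_route = energy_joules(server_watts + gpu_watts, gpu_time_s)
    cpu_route = energy_joules(server_watts, cpu_time_s)
    return gpu_route, cpu_route


# Hypothetical figures: 300 W server, 200 W GPU, 40 s on GPU, 100 s on CPU.
gpu_j, cpu_j = compare_routes(300, 200, 40, 100)
print(gpu_j, cpu_j)  # 20000 30000
```

With these invented numbers the GPU route draws more power but finishes soon enough to use less energy overall. A smaller speed-up would reverse the result.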
Two Key Questions
• Can developers optimize code for energy? YES
• Can owners and users tune applications for energy?
Tuning Time
No instrumentation needed
No source code needed
No recompilation needed
Less than 5% runtime overhead
Fully scalable
Explicit and usable output
Allinea Performance Reports
Example Report
Run details
Visual breakdown chart
Clear categorization
Explanation of figures and advice for follow-up
Breakdown of resource usage across CPU, MPI, I/O
Integrated Energy Information
Key Observation: In a Nutshell
• For many HPC workloads
– THE FASTER AN APPLICATION COMPLETES, THE LOWER ITS ENERGY CONSUMPTION
– OR … OPTIMIZE FOR SPEED AND YOU ARE (USUALLY) ALREADY OPTIMIZING FOR ENERGY
• But for some HPC and non-HPC cases
– FREQUENCY SCALING SAVES ENERGY
Two Key Questions
• Can developers optimize code for energy? YES
• Can owners and users tune applications for energy? YES
… But should they?
• Are we counting all energy?
• Are we considering all costs?
What is energy?
Approximations for Energy
Heuristics
• Floating point, vector operations, memory access
• L1 or L2 misses vs main memory: orders of magnitude apart in energy
Low level measurement
• Real data from some processors, memory subsystems
• Available in kernel - Intel RAPL
Server level monitoring
• PDU and server level readings
• Real data – real energy
Full system monitoring
• Air-con
• Servers, switches, storage…
Two Key Questions
• When should developers optimize code for energy?
• When should owners and users tune applications for energy?
Frequency Scaling
Some workloads have low compute requirement, but high data volume
Data crunching vs number crunching
Processor is over-powered for the speed of memory, disk or network
CPU frequency can be scaled down in software
Providing information to developer, user and system owner
Allinea MAP
Allinea Performance Reports
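One way the frequency can be scaled down in software on Linux is through the cpufreq sysfs files, which take values in kHz. The sketch below caps the maximum frequency of every core. The function name and the `cpu_root` parameter are assumptions for this example, writing to the real files normally needs root, and some systems restrict or ignore the request.

```python
from pathlib import Path

# Standard Linux location of per-CPU directories; overridable for dry runs.
CPU_ROOT = Path("/sys/devices/system/cpu")


def cap_max_frequency(ghz, cpu_root=CPU_ROOT):
    """Write a frequency ceiling (given in GHz) to each core's scaling_max_freq.

    Returns the value written in kHz and the list of files that were updated.
    """
    khz = round(ghz * 1_000_000)  # cpufreq sysfs values are expressed in kHz
    written = []
    pattern = "cpu[0-9]*/cpufreq/scaling_max_freq"
    for max_freq_file in sorted(Path(cpu_root).glob(pattern)):
        max_freq_file.write_text(str(khz))
        written.append(max_freq_file)
    return khz, written
```

Calling `cap_max_frequency(1.7)` as root would request a 1.7 GHz ceiling on all cores. Pointing `cpu_root` at a scratch directory allows the logic to be exercised without touching the machine.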
A lot of codes are memory-bound
Multiple cores share bandwidth
[Diagram: Core 1, Core 2, Core 3, Core 4, … connect through "Lots of clever technology" to main memory]
Can we tune them for energy efficiency?
[Diagram: Core 1, Core 2, Core 3, Core 4, … connect through "Lots of clever technology" to main memory]
How can we improve energy efficiency?
Buy a new cluster with ambient warm water cooling and an integrated espresso machine
Reduce CPU frequency
Run on fewer cores per node
How can we improve energy efficiency?
Buy a new cluster with ambient warm water cooling and an integrated espresso machine
Reduce CPU frequency?
Run on fewer cores per node?
The Experiment
One simple code
• A well-understood wave equation solver
One compute node
• Minimize effect of MPI communications
Change CPU frequency and #cores
• Measure the results with Allinea Performance Reports
4 PPN @ 2.1 GHz, 30 seconds
4 PPN @ 1.3 GHz, 34 seconds
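The slowdown plotted in the charts is runtime relative to the 4 PPN @ 2.1 GHz baseline. A minimal sketch of that arithmetic, using the two runtimes quoted above (the function name is just for illustration):

```python
def slowdown(baseline_seconds, run_seconds):
    """Fractional increase in runtime relative to the baseline run."""
    return run_seconds / baseline_seconds - 1.0


# 4 PPN @ 2.1 GHz took 30 s; 4 PPN @ 1.3 GHz took 34 s.
print(f"{slowdown(30, 34):.1%}")  # 13.3%
```

So cutting the clock by roughly 38% (2.1 GHz to 1.3 GHz) cost only about 13% in runtime for this run, which is what one would expect from a memory-bound code.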
[Chart: Slowdown relative to 4 PPN @ 2.1 GHz, for 2, 4, 6 and 8 PPN at 1.3, 1.7 and 2.1 GHz; bands from 0% to 70% in 10% steps. Data gathered with Performance Reports’ CSV export]
1.7 GHz run completes as quickly as at 2.1 GHz
[Chart: Energy savings relative to 4 PPN @ 2.1 GHz, for 2, 4, 6 and 8 PPN at 1.3, 1.7 and 2.1 GHz; bands from -10% to 20% in 5% steps. Data gathered with Performance Reports’ CSV export]
5-10% energy savings with zero performance impact
[Chart: Slowdown relative to 4 PPN @ 2.1 GHz, for 2, 4, 6 and 8 PPN at 1.3, 1.7 and 2.1 GHz; bands from 0% to 70% in 10% steps. Data gathered with Performance Reports’ CSV export]
15% energy savings with 20% performance impact
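Because energy is average power times runtime, a 15% energy saving alongside a 20% longer run implies a specific drop in average power. The short sketch below works that out; the function name is illustrative and the only inputs are the two percentages quoted on the slide.

```python
def implied_power_ratio(energy_saving, runtime_increase):
    """Average power relative to baseline, from E = P * t.

    E_new / E_base = 1 - energy_saving and t_new / t_base = 1 + runtime_increase,
    so P_new / P_base is the quotient of the two.
    """
    return (1.0 - energy_saving) / (1.0 + runtime_increase)


ratio = implied_power_ratio(0.15, 0.20)
print(f"{ratio:.3f}")  # 0.708
```

In other words the node would be drawing about 71% of its baseline average power during that run. The saving only appears because power fell by more than runtime grew.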
The Results
[Chart: Energy savings relative to baseline, for 2, 4, 6 and 8 PPN at 1.3, 1.7 and 2.1 GHz; bands from -10% to 20% in 5% steps]
2 PPN: 15% energy savings, 20% increased runtime
[Chart: Slowdown relative to baseline, for 2, 4, 6 and 8 PPN at 1.3, 1.7 and 2.1 GHz; bands from 0% to 70% in 10% steps]
1.7 GHz: 6% energy savings for free
So… should we run every job at a reduced clock speed?
Or only ever use half the cores on each node?
Improving energy efficiency
• Each application and system has different characteristics
– TOOLS CAN SHOW IF THE APPLICATION WASTES POWER UNNECESSARILY
– DEVELOPERS CAN SEE WHERE TO OPTIMIZE AND CHANGE CODE
– USERS CAN IMPROVE EFFICIENCY WITHOUT CHANGING CODE
• Don’t forget the opportunity cost
– IN HPC SLOWING DOWN APPLICATIONS COSTS SCIENCE
– MACHINES AND PHDS HAVE FINITE LIFETIME – AND THEIR COST DOMINATES
• Time and energy are not the same
– OPTIMIZE FOR TIME BEFORE OPTIMIZING FOR ENERGY

Optimizing High Performance Computing Applications for Energy
