Dynamic Tuning of HPC Applications
Overview
Methodology
• Motivation and Introduction
• READEX project overview
• What can you achieve with static and dynamic tuning
• Tuning of the hardware parameters
• Effect on the energy consumption
• Effect of hardware parameter tuning on kernels with various arithmetic intensity
• Evaluation of complex HPC applications
• BEM4I
• OpenFOAM
• Scalability tests with ESPRESO
• Tuning of the application parameters
Motivation and Introduction
READEX Project & Motivation
• Energy efficiency is critical to current and future systems
• Applications exhibit dynamic behavior
• Changing resource requirements
• Computational characteristics
• Changing load on processors over time
The goal was to create a tools-aided methodology for the automatic tuning of parallel applications:
dynamically adjust system parameters to the actual resource requirements
What is dynamic tuning?
[Diagram: a phase region contains several significant regions; dynamic tuning assigns each significant region its own configuration, e.g. FREQ = 2 GHz for one region and FREQ = 1.5 GHz for another.]
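As an illustration of the mechanism only, the sketch below switches the core frequency before each significant region. It assumes the Linux "userspace" cpufreq governor is active and root privileges; the two region callables are placeholders. READEX itself performs such switches through its runtime library, described on the following slides.

# Minimal sketch of per-region core frequency switching (illustration, not the READEX code)
import glob

def set_core_frequency(khz):
    # Write the requested frequency to every core's cpufreq interface
    # (requires the "userspace" governor and write access to sysfs).
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_setspeed"):
        with open(path, "w") as f:
            f.write(str(khz))

def run_phase(region_a, region_b):
    set_core_frequency(2_000_000)   # FREQ = 2 GHz for the first significant region
    region_a()
    set_core_frequency(1_500_000)   # FREQ = 1.5 GHz for the second significant region
    region_b()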
READEX Tool Suite
1. Instrument the application
• Score-P provides different kinds of instrumentation
2. Detect dynamism
• Check whether runtime situations could benefit from tuning
3. Detect the energy saving potential and configurations – design-time analysis (DTA)
• Use the tuning plugin and the power measurement infrastructure to search for the optimal configuration
• Create the tuning model
4. Runtime application tuning (RAT)
• Apply the tuning model, use the optimal configuration
[Diagram: READEX Tool Suite architecture – Periscope Tuning Framework, READEX Tuning Plugin, Application Tuning Model, Score-P (Online Access Interface, Substrate Plugin Interface), READEX Runtime Library, Parameter Control Plugin, Energy Measurements (HDEEM).]
READEX Test Suite
Consists of benchmarks, proxy apps and complex production applications
Key features:
• Full set of scripts allows reproducibility of experiments on
• TUD Taurus HSW (HDEEM) and BDW partitions
• the IT4I Salomon machine (RAPL)
• Support for Slurm and PBS schedulers
• Automatic savings evaluation
• Performs evaluation of
• hardware and system parameter tuning
• application parameter tuning
• Contains manual instrumentation of significant regions
• using a header file → can be adapted to test other tools
Application type | Application name
benchmarks or proxy apps | AMG2013, Blasbench, Kripke, Lulesh, NPB3.3
production applications | BEM4I, ESPRESO, INDEED, OpenFOAM
What can you expect from static tuning
Manual static tuning across the test suite: 12.6% average energy savings (test suite minimum 4.3%, maximum 17.6%)
Software | Static tuning savings
AMG2013 12.5 %
Blasbench 7.4 %
Kripke 11.5 %
Lulesh 17.6 %
NPB3.3 11.0 %
BEM4I 15.7 %
INDEED 17.6 %
ESPRESO 4.3 %
OpenFOAM 15.9 %
Average 12.6 %
What can you expect from dynamic tuning
Manual dynamic tuning across the test suite: 17.5% average energy savings (test suite minimum 8.2%, maximum 34.1%); proposal goal: up to 30%
Software | Dynamic tuning savings
AMG2013 12.5 %
Blasbench 15.3 %
Kripke 18.5 %
Lulesh 18.7 %
NPB3.3 11.0 %
BEM4I 34.1 %
INDEED 19.5 %
ESPRESO 8.2 %
OpenFOAM 20.1 %
Average 17.5 %
Energy savings achieved by static and dynamic tuning
Application (default is Intel compiler; * uses GCC compiler) | HW parameters | Static tuning savings (node energy / time) | Dynamic tuning savings (node energy / time) | READEX tuning savings (node energy / time)
AMG2013 | CF, UCF, threads | 12.5% / −0.9% | N/A | 7.0% / −14.0%
Blasbench | CF, UCF, threads | 7.4% / −0.9% | 15.3% / −18.1% | 9.9% / −9.2%
Kripke | CF, UCF | 11.5% / −28.3% | 18.8% / −18.7% | 10.5% / −28.9%
Lulesh | CF, UCF, threads | 17.6% / −8.9% | 18.7% / −11.7% | 18.2% / −25.7%
NPB3.3-BT-MZ | CF, UCF, threads | 11.0% / −11.3% | N/A | 10.8% / −12.0%
BEM4I | CF, UCF, threads | 15.7% / −6.2% | 34.1% / 10.9% | 34.0% / 10.9%
INDEED | CF, UCF, threads | 17.6% / −12.8% | 19.5% / −14.2% | 19.1% / −17.3%
ESPRESO | CF, UCF, threads | 4.3% / −8.9% | 8.2% / −10.1% | 7.1% / −12.3%
OpenFOAM | CF, UCF | 15.9% / −10.5% | 20.1% / 11.5% | 9.8% / −9.8%
Evaluation of READEX Tool Suite on TUD Taurus Haswell system with HDEEM energy measurements
Key findings:
• Best savings achieved with BEM4I application – up to 34% for energy and 11% for runtime
• In general, energy savings are "paid for" by extra runtime
Tuning of the hardware parameters
Hardware parameter tuning
Investigation of the impact of CPU uncore frequency tuning on memory bound code:
• find the optimal frequency with low energy consumption and a small performance impact
Evaluation using STREAM Copy benchmark
Results by TU Dresden under READEX project
Hardware parameter tuning
Effect of changing core frequencies on uncore performance using memory bound code
• Only a small impact on bandwidth and energy
Evaluation using STREAM Copy benchmark
Results by TU Dresden under the READEX project
Heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies.
The data array does not fit in the processor's L3 cache.
Hardware parameter tuning
L3 cache energy efficiency and bandwidth:
• Different optimal uncore frequencies
Evaluation using STREAM Copy benchmark
Results by TU Dresden under the READEX project
Heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies.
The data array fits in the processor's L3 cache.
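The kind of exploration behind these heatmaps can be sketched as a brute-force sweep over core/uncore frequency pairs. In the sketch below, set_core_freq, set_uncore_freq and read_energy are hypothetical callables (e.g. wrapping cpufreq sysfs, the uncore frequency control and an HDEEM or RAPL reading); the frequency grids are illustrative, not the exact ranges used by TU Dresden.

# Brute-force energy sweep over (core, uncore) frequency pairs (illustrative sketch)
import itertools

CORE_FREQS_GHZ   = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
UNCORE_FREQS_GHZ = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]

def sweep(benchmark, set_core_freq, set_uncore_freq, read_energy):
    # Run the benchmark (e.g. STREAM Copy) once per frequency pair and record
    # the consumed energy, producing the data behind one heatmap.
    results = {}
    for cf, uf in itertools.product(CORE_FREQS_GHZ, UNCORE_FREQS_GHZ):
        set_core_freq(cf)
        set_uncore_freq(uf)
        start = read_energy()
        benchmark()
        results[(cf, uf)] = read_energy() - start
    # The optimum is the pair with the lowest energy; a runtime constraint can be
    # added to keep the performance impact small.
    return min(results, key=results.get), results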
Effect of hardware parameter tuning on kernels with various arithmetic intensity
Arithmetic intensity
[Charts: static tuning results for kernels with arithmetic intensity ratios of 1:9, 2:8, 3:7, 4:6, 5:5, 6:4, 7:3, 8:2 and 9:1.]
Hardware parameter tuning
Behavior of a simple application with two kernels
• Low computational intensity – DGEMV
• High computational intensity – DGEMM
• Tuning of three parameters
• Core frequency
• Uncore frequency
• Number of OpenMP threads
• Visualized by RADAR
Optimal configurations found:
• Low CI (DGEMV): 10 threads, 2.2 GHz UCF, 1.2 GHz CF
• High CI (DGEMM): 12 threads, 1.2 GHz UCF, 2.5 GHz CF
• Static tuning for both kernels: 12 threads, 2.2 GHz UCF, 2.4 GHz CF
[Charts: compute node energy consumption [J] vs. CPU core frequency [GHz] for each of the three configurations.]
Note: runtime of both kernels was equal for default settings
Two kernels with 1:1 workload ratio | Energy consumption | Energy savings
Default settings | 2017 J | –
Static optimal | 1833 J | 179 J (9%)
Dynamic optimal | 1612 J | 221 J (12%)
Total savings | – | 400 J (20%)
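The dynamic schedule behind these numbers can be sketched as follows. apply_config, dgemv and dgemm are hypothetical callables standing in for the hardware-parameter switch and the two kernels; the values are the per-region optima listed above (a minimal sketch, not the READEX implementation).

# Per-region hardware configurations taken from the slide above
LOW_CI_CONFIG  = {"threads": 10, "uncore_ghz": 2.2, "core_ghz": 1.2}   # DGEMV (memory bound)
HIGH_CI_CONFIG = {"threads": 12, "uncore_ghz": 1.2, "core_ghz": 2.5}   # DGEMM (compute bound)
STATIC_CONFIG  = {"threads": 12, "uncore_ghz": 2.2, "core_ghz": 2.4}   # best single setting

def run_iteration(apply_config, dgemv, dgemm):
    # Dynamic tuning: switch the configuration right before each significant region
    apply_config(**LOW_CI_CONFIG)    # low core / high uncore frequency for the memory bound kernel
    dgemv()
    apply_config(**HIGH_CI_CONFIG)   # high core / low uncore frequency for the compute bound kernel
    dgemm()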
Core and uncore frequency tuning under power cap
Experiments description and testbed parameters
Testbed: Broadwell partition of the Galileo supercomputer at CINECA
• dual socket server
• two 18-core Intel Xeon E5-2697v4 processors
• 2.3 GHz nominal frequency
• 2.7 GHz turbo frequency when all 18 cores are utilized
• 145 W TDP
Key tunable parameters of the 18-core Intel Xeon E5-2697v4 processor and their respective ranges and steps.
A set of experiments performed on Intel Broadwell Architecture
Tuning of COMPUTE bound workload
• behavior of the platform when running a compute bound workload
• under 145 W (TDP level, no power cap)
• and under three power cap levels: 100 W, 80 W and 60 W (a sketch of applying such a cap follows below)
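One way to apply package power caps like the 100 W / 80 W / 60 W levels used here is the Linux intel_rapl powercap interface. The sketch below assumes that driver is loaded and root privileges are available; the exact sysfs path may differ between systems, and this is not necessarily how the experiments were configured.

# Illustrative sketch: set a package power cap via the intel_rapl powercap driver
def set_package_power_cap(watts, package=0):
    # Long-term (constraint_0) package power limit; the interface expects microwatts.
    path = ("/sys/class/powercap/intel-rapl/intel-rapl:%d/constraint_0_power_limit_uw"
            % package)
    with open(path, "w") as f:
        f.write(str(int(watts * 1_000_000)))

for cap in (100, 80, 60):      # the three power cap levels used in these experiments
    set_package_power_cap(cap)
    # ... run the workload and measure energy/runtime at this cap ...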
[Chart: tuning of the compute bound region under an 80 W power cap – compute node energy consumption [J] and runtime [s] vs. CPU core frequency [GHz] for EXP0 (default), EXP1 (default under power cap), EXP5 (DVFS under power cap), EXP6 (UCF under power cap) and EXP7 (DVFS & UCF with min and max UCF).]
• 8.5% energy savings, 8.4% time savings
• 12.9% energy savings, 10.9% time extension
[Chart: tuning of the compute bound region under a 100 W power cap – energy consumption [J] and runtime [s] vs. core frequency [GHz] for EXP0, EXP1, EXP5, EXP6 and EXP7 (min/max UCF).]
• 14.9% energy savings, 4.5% time savings
• 21.3% energy savings, 21.1% time extension
[Chart: tuning of the compute bound region under a 60 W power cap – energy consumption [J] and runtime [s] vs. core frequency [GHz] for EXP0, EXP1, EXP5, EXP6 and EXP7 (min/max UCF).]
• 9.1% energy savings, 9.4% time savings
Observations for COMPUTE bound workload
• To achieve the best possible performance, the uncore frequency must be reduced to its minimum
• up to 9.4% performance gain
• and 14.9% lower energy consumption
• If further energy savings are required, use DVFS and lower the core frequency
• up to 21% energy savings
• up to 21% runtime penalty
• this effect is more visible at higher power cap levels
Tuning of memory bound workload
• behavior of the platform when running a memory bound workload
• under 145 W (TDP level, no power cap)
• and under three power cap levels: 100 W, 80 W and 60 W
[Chart: tuning of the memory bound region under a 100 W power cap – compute node energy consumption [J] and runtime [s] vs. CPU core frequency [GHz] for EXP0 (default), EXP1 (default under power cap), EXP5 (DVFS under power cap), EXP6 (UCF under power cap) and EXP7 (DVFS & UCF with UCF = 2.2 GHz and with max UCF).]
• 38.7% energy savings, 3.6% time extension
[Chart: tuning of the memory bound region under an 80 W power cap – energy consumption [J] and runtime [s] vs. core frequency [GHz] for EXP0, EXP1, EXP5, EXP6 and EXP7 (UCF = 2.2 GHz / max UCF).]
• 21.6% energy savings, 3.6% time extension
[Chart: tuning of the memory bound region under a 60 W power cap – energy consumption [J] and runtime [s] vs. core frequency [GHz] for EXP0, EXP1, EXP5, EXP6 and EXP7 (UCF = 2.2 GHz / max UCF).]
• 22.2% energy savings, 22.2% time savings
Observations for both workloads
Observations for memory bound workload
• Under a power budget lower than 80 W
• DVFS should be set to the minimum value
• this boosts the performance of the uncore part by 22%
• Tuning of the uncore frequency
• has little effect on performance
• but a major effect on energy consumption
• between 21% (60 W and 80 W) and 38% (100 W)
Observations for compute bound workload
• To achieve the best possible performance, the uncore frequency must be reduced to its minimum
• up to 9.4% performance gain
• and 14.9% lower energy consumption
• If further energy savings are required, use DVFS and lower the core frequency
• up to 21% energy savings
• up to 21% runtime penalty
• this effect is more visible at higher power cap levels
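These observations can be condensed into a toy decision rule; this is an illustration only, not the READEX tuning logic, and the frequency bounds are parameters of the sketch.

# Toy heuristic distilled from the observations above (illustration only)
def choose_config(region_is_memory_bound, cf_min, cf_max, uf_min, uf_max):
    if region_is_memory_bound:
        # Memory bound: give up core frequency so the freed power budget can keep
        # the uncore (and thus memory bandwidth) fast.
        return {"core_freq": cf_min, "uncore_freq": uf_max}
    # Compute bound: keep cores fast and reduce the uncore frequency to its minimum;
    # lower the core frequency only if further energy savings are worth the runtime penalty.
    return {"core_freq": cf_max, "uncore_freq": uf_min}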
Evaluation of complex HPC applications
BEM4I Application
Application runtime | assemble_k [s] | assemble_v [s] | gmres_solve [s] | print_vtu [s] | main [s]
default runtime | 5.4 | 5.9 | 10.2 | 5.6 | 27.3
static tuning runtime | 9.8 | 10.6 | 6.1 | 2.4 | 29.0
dynamic tuning runtime | 7.0 | 7.2 | 7.9 | 2.1 | 24.3
static savings [%] | −82.3% | −79.1% | 40.5% | 56.8% | −6.2%
dynamic savings [%] | −30.6% | −20.9% | 23.2% | 62.9% | 10.9%
Hardware: dual socket system with 2x12 CPU cores – "standard HW" in HPC centres
Region description:
• assemble_k and assemble_v – high utilization of vector units, extreme level of optimization – fully compute bound, great utilization of both sockets and all cores
• gmres_solve – uses DGEMV from MKL – memory bound, suffers from the NUMA effect; this routine is more efficient on a single socket
• print_vtu – single-threaded, I/O and network bound region which stores data to a file on a LUSTRE file system
"static": {
  "FREQUENCY": "25",           <--------- 2.5 GHz
  "NUM_THREADS": "12",         <--------- 12 OpenMP threads
  "UNCORE_FREQUENCY": "22" }   <--------- 2.2 GHz
"assemble_k": {
  "FREQUENCY": "23",
  "NUM_THREADS": "24",
  "UNCORE_FREQUENCY": "16"
},
"assemble_v": {
  "FREQUENCY": "25",
  "NUM_THREADS": "24",
  "UNCORE_FREQUENCY": "14"
},
"gmres_solve": {
  "FREQUENCY": "17",
  "NUM_THREADS": "8",
  "UNCORE_FREQUENCY": "22"
},
"print_vtu": {
  "FREQUENCY": "25",
  "NUM_THREADS": "6",
  "UNCORE_FREQUENCY": "24"
}
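At runtime (the RAT step) a configuration like the one above is looked up and applied on entry to each instrumented significant region. The sketch below assumes the model is stored as a JSON file with one object per region and uses a hypothetical apply_config helper; it is not the READEX Runtime Library API.

# Illustrative sketch: apply a per-region configuration from a tuning-model file
import json

def load_tuning_model(path):
    with open(path) as f:
        return json.load(f)

def on_region_enter(model, region_name, apply_config):
    # Fall back to the static optimum for regions without their own entry
    cfg = model.get(region_name, model["static"])
    apply_config(
        core_freq_ghz=int(cfg["FREQUENCY"]) / 10.0,           # "25" -> 2.5 GHz
        num_threads=int(cfg["NUM_THREADS"]),
        uncore_freq_ghz=int(cfg["UNCORE_FREQUENCY"]) / 10.0,  # "22" -> 2.2 GHz
    )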
Compute node energy | assemble_k [J] | assemble_v [J] | gmres_solve [J] | print_vtu [J] | main [J]
default energy | 1476 | 1484 | 2733 | 1142 | 6872
static tuning energy | 1962 | 2015 | 1366 | 420 | 5792
dynamic tuning energy | 1467 | 1462 | 1259 | 293 | 4531
static savings [%] | −33.8% | −35.8% | 50.0% | 63.2% | 15.7%
dynamic savings [%] | 0.6% | 1.5% | 53.9% | 74.3% | 34.1%
BEM4I Application
The large energy savings come from the combination of optimal HW settings and runtime savings due to the mitigation of the NUMA effect by an optimal setting of the OpenMP threading
• Without the runtime savings, a similar application would achieve
• energy savings of approx. 15 – 20%
• runtime savings of approx. −15% (i.e. a runtime extension)
OpenFOAM Application
• Computational fluid dynamics
• Finite volume + multigrid solver
OpenFOAM | Energy consumption | Energy savings
Default settings | 14 231 J | –
Static tuning | 12 264 J | 2 264 J (15.9%)
Dynamic tuning | 11 370 J | 597 J (4.8%)
Total savings | – | 2 861 J (20.1%)
ESPRESO Application
• 33% energy savings
• 22% time savings and improved strong scalability
• Structural mechanics code
• Finite element + sparse FETI solver
• Different tuning models are needed for different numbers of nodes to preserve strong scalability – the workload per node varies
• Includes dynamic switching overheads
Energy savings analysis for the strong scalability test of the ESPRESO library when running the cube benchmark
Application parameters tuning
Application parameter tuning of the ESPRESO library
• 50% – 66% savings against "reasonable" settings
• 86% against the worst case
[Chart: energy consumption [kJ] for each of the 3840 configuration indices, highlighting the "reasonable" settings and the optimal settings.]
9 parameters
3840 combinations
• FETI METHOD 2x
• PRECONDITIONER 5x
• ITERATIVE SOLVER TYPE 2x
• HFETI type 2x
• NON-UNIFORM PARTS 6x
• REDUNDANT LAGRANGE 2x
• SCALING 2x
• B0_TYPE 2x
• ADAPTIVE PRECISION 2x
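A sketch of the exhaustive search over this space follows: the nine option counts above multiply to 2·5·2·2·6·2·2·2·2 = 3840 combinations, and each combination requires its own application start. run_and_measure is a hypothetical helper standing in for writing the input file, launching ESPRESO and reading the energy measurement.

# Exhaustive application-parameter search (illustrative sketch)
import itertools

# Option counts per parameter, as listed above (indices stand in for the concrete values)
PARAMETER_OPTIONS = {
    "FETI_METHOD": 2, "PRECONDITIONER": 5, "ITERATIVE_SOLVER_TYPE": 2,
    "HFETI_TYPE": 2, "NON_UNIFORM_PARTS": 6, "REDUNDANT_LAGRANGE": 2,
    "SCALING": 2, "B0_TYPE": 2, "ADAPTIVE_PRECISION": 2,
}

def search(run_and_measure):
    names = list(PARAMETER_OPTIONS)
    best_config, best_energy = None, float("inf")
    for values in itertools.product(*(range(n) for n in PARAMETER_OPTIONS.values())):
        config = dict(zip(names, values))          # one of the 3840 combinations
        energy = run_and_measure(config)           # individual application start per setting
        if energy < best_energy:
            best_config, best_energy = config, energy
    return best_config, best_energy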
Application parameters tuning
Application parameter tuning is very promising
• application configuration parameters are given in the input file
• each setting requires an individual start of the application
• the tool performs an automatic search of the application parameter space
Application | Number of parameters tested / total number of options | Energy savings compared to the worst settings | Energy savings compared to default or reasonable settings
ESPRESO | 9 / 3840 | 86% | 50 – 66%
ELMER | 1 / 40 | 97% | 50 – 75%
OpenFOAM | 2 / 12 | 24% | 8%
INDEED | 3 / 12 | 35% | 25%
Thank you
  • 10. Energy savings achieved by static and dynamic tuning
    Evaluation of the READEX Tool Suite on the TUD Taurus Haswell system with HDEEM energy measurements (default compiler is Intel, * marks GCC).

    Application | HW parameters | Static tuning savings (node energy / time) | Dynamic tuning savings (node energy / time) | READEX tuning savings (node energy / time)
    AMG2013 | CF, UCF, threads | 12.5% / −0.9% | N/A | 7.0% / −14.0%
    Blasbench | CF, UCF, threads | 7.4% / −0.9% | 15.3% / −18.1% | 9.9% / −9.2%
    Kripke | CF, UCF | 11.5% / −28.3% | 18.8% / −18.7% | 10.5% / −28.9%
    Lulesh | CF, UCF, threads | 17.6% / −8.9% | 18.7% / −11.7% | 18.2% / −25.7%
    NPB3.3-BT-MZ | CF, UCF, threads | 11% / −11.3% | N/A | 10.8% / −12%
    BEM4I | CF, UCF, threads | 15.7% / −6.2% | 34.1% / 10.9% | 34.0% / 10.9%
    INDEED | CF, UCF, threads | 17.6% / −12.8% | 19.5% / −14.2% | 19.1% / −17.3%
    ESPRESO | CF, UCF, threads | 4.3% / −8.9% | 8.2% / −10.1% | 7.1% / −12.3%
    OpenFOAM | CF, UCF | 15.9% / −10.5% | 20.1% / 11.5% | 9.8% / −9.8%

    Key findings:
    • Best savings achieved with the BEM4I application: up to 34% for energy and 11% for runtime
    • In general, energy savings are "paid" for by extra runtime
  • 11. Tuning of the hardware parameters
  • 12. Hardware parameter tuning
    Investigation of the impact of CPU uncore frequency tuning on memory-bound code:
    • Optimal frequency with low energy consumption and a small performance impact
    Evaluation using the STREAM Copy benchmark; results by TU Dresden under the READEX project.
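    For context, the STREAM Copy pattern referred to above is a pure streaming copy whose performance is limited by memory bandwidth rather than by the cores. A minimal sketch (the array size is an illustrative choice so that the working set exceeds the L3 cache):

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        // ~1 GiB of data so the working set exceeds the L3 cache and the kernel
        // is limited by memory bandwidth, not by the cores.
        const std::size_t n = 64 * 1024 * 1024;          // illustrative size
        std::vector<double> a(n, 1.0), c(n, 0.0);

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i];                                  // STREAM "Copy": c = a
        double t1 = omp_get_wtime();

        // 16 bytes moved per element (read a, write c)
        std::printf("Copy bandwidth: %.1f GB/s\n", 16.0 * n / (t1 - t0) / 1e9);
        return 0;
    }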
  • 13. Hardware parameter tuning
    Effect of changing the core frequency on uncore performance for memory-bound code:
    • Only a small impact on bandwidth and energy
    Evaluation using the STREAM Copy benchmark; results by TU Dresden under the READEX project.
    [Figure: heatmap of the energy consumption of the STREAM benchmark for different core and uncore frequencies; the data array does not fit in the processor's L3 cache.]
  • 14. Hardware parameter tuning
    L3 cache energy efficiency and bandwidth:
    • Different optimal uncore frequencies
    Evaluation using the STREAM Copy benchmark; results by TU Dresden under the READEX project.
    [Figure: heatmap of the energy consumption of the STREAM benchmark for different core and uncore frequencies; the data array fits in the processor's L3 cache.]
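    The knobs swept in these heatmaps are the per-core frequency (DVFS) and the uncore frequency. READEX drives them through its parameter control plugins (e.g., x86_adapt on Taurus); the following is only a hand-rolled sketch of the same effect on a generic Linux system, assuming the cpufreq sysfs knobs, the msr kernel module, and the Haswell-EP/Broadwell-EP MSR layout (MSR 0x620 as the uncore ratio limit), all of which require root:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdint>
    #include <fstream>
    #include <string>

    // Pin a core's frequency through the cpufreq sysfs interface (value in kHz).
    void set_core_khz(int cpu, long khz) {
        std::string base = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cpufreq/";
        std::ofstream(base + "scaling_max_freq") << khz;
        std::ofstream(base + "scaling_min_freq") << khz;   // min == max -> fixed frequency
    }

    // Limit the uncore ratio via MSR 0x620 (UNCORE_RATIO_LIMIT on Haswell-EP /
    // Broadwell-EP); the ratio is in multiples of 100 MHz and applies per socket.
    void set_uncore_ratio(int cpu_in_socket, std::uint64_t ratio) {
        std::string dev = "/dev/cpu/" + std::to_string(cpu_in_socket) + "/msr";
        int fd = open(dev.c_str(), O_WRONLY);
        if (fd < 0) return;
        std::uint64_t val = (ratio & 0x7f) | ((ratio & 0x7f) << 8);  // max and min ratio fields
        pwrite(fd, &val, sizeof(val), 0x620);
        close(fd);
    }

    int main() {
        for (int cpu = 0; cpu < 24; ++cpu) set_core_khz(cpu, 2500000);  // 2.5 GHz cores
        set_uncore_ratio(0, 22);                                        // 2.2 GHz uncore, socket 0
        return 0;
    }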
  • 15. Effect of hardware parameter tuning on kernels with various arithmetic intensity
  • 17. Static tuning for various arithmetic intensities (ratio 1:9)
  • 18. Static tuning for various arithmetic intensities (ratio 2:8)
  • 19. Static tuning for various arithmetic intensities (ratio 3:7)
  • 20. Static tuning for various arithmetic intensities (ratio 4:6)
  • 21. Static tuning for various arithmetic intensities (ratio 5:5)
  • 22. Static tuning for various arithmetic intensities (ratio 6:4)
  • 23. Static tuning for various arithmetic intensities (ratio 7:3)
  • 24. Static tuning for various arithmetic intensities (ratio 8:2)
  • 25. Static tuning for various arithmetic intensities (ratio 9:1)
  • 26. Hardware parameter tuning
    Behavior of a simple application with two kernels:
    • Low computational intensity: DGEMV
    • High computational intensity: DGEMM
    Tuning of three parameters:
    • Core frequency
    • Uncore frequency
    • Number of OpenMP threads
    Visualized by RADAR.
    Best configurations found:
    • Low CI (DGEMV): 10 threads, 2.2 GHz UCF, 1.2 GHz CF
    • High CI (DGEMM): 12 threads, 1.2 GHz UCF, 2.5 GHz CF
    • Static tuning for both kernels: 12 threads, 2.2 GHz UCF, 2.4 GHz CF
    [Figure: compute node energy consumption [J] versus CPU core frequency [GHz] for each of the three cases.]
    Note: the runtime of both kernels was equal for the default settings.

    Two kernels with 1:1 workload ratio | Energy consumption | Energy savings
    Default settings | 2017 J | -
    Static optimal | 1833 J | 179 J (9%)
    Dynamic optimal | 1612 J | 221 J (12%)
    Total savings | - | 400 J (20%)
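    The two kernels above can be reproduced with any BLAS; a minimal sketch assuming a CBLAS interface (e.g., MKL) is linked. The thread counts and frequencies in the comments come from the slide; switching core/uncore frequency at the region boundaries would use mechanisms like the ones sketched earlier and is not shown here:

    #include <vector>
    #include <omp.h>
    #include <cblas.h>

    int main() {
        const int n = 8192;
        std::vector<double> A(static_cast<std::size_t>(n) * n, 1.0),
                            B(static_cast<std::size_t>(n) * n, 1.0),
                            C(static_cast<std::size_t>(n) * n, 0.0);
        std::vector<double> x(n, 1.0), y(n, 0.0);

        // Region 1: DGEMV, low arithmetic intensity (memory bound).
        // Preferred configuration per the slide: 10 threads, 1.2 GHz CF, 2.2 GHz UCF.
        omp_set_num_threads(10);   // the thread count is the only knob shown here
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, A.data(), n, x.data(), 1, 0.0, y.data(), 1);

        // Region 2: DGEMM, high arithmetic intensity (compute bound).
        // Preferred configuration per the slide: 12 threads, 2.5 GHz CF, 1.2 GHz UCF.
        omp_set_num_threads(12);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
        return 0;
    }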
  • 27. Core and uncore frequency tuning under power cap
  • 28. Experiments description and testbed parameters
    A set of experiments performed on the Intel Broadwell architecture.
    Testbed: Broadwell partition of the Galileo supercomputer at CINECA
    • dual-socket server
    • two 18-core Intel Xeon E5-2697v4 processors
    • 2.3 GHz nominal frequency
    • 2.7 GHz turbo frequency when all 18 cores are utilized
    • 145 W TDP
    [Table: key tunable parameters of the 18-core Intel Xeon E5-2697v4 processor and their respective ranges and steps.]
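    The power cap levels used in the following experiments (145, 100, 80, and 60 W) can be enforced in several ways; one common mechanism on Intel platforms is RAPL, exposed on Linux through the powercap sysfs tree. A minimal sketch, assuming the intel_rapl driver is loaded and package domains intel-rapl:0/1 exist (the slides do not state which mechanism was actually used):

    #include <fstream>
    #include <string>

    // Write a RAPL package power limit (in microwatts) through the Linux
    // powercap sysfs tree. Paths and domain numbering are illustrative and
    // machine dependent; requires root.
    void set_package_power_cap_uw(int pkg, long microwatts) {
        std::string base = "/sys/class/powercap/intel-rapl:" + std::to_string(pkg) + "/";
        std::ofstream(base + "constraint_0_power_limit_uw") << microwatts;  // long-term limit
        std::ofstream(base + "constraint_0_time_window_us") << 1000000;    // 1 s averaging window
    }

    int main() {
        set_package_power_cap_uw(0, 80L * 1000 * 1000);  // 80 W cap on socket 0
        set_package_power_cap_uw(1, 80L * 1000 * 1000);  // 80 W cap on socket 1
        return 0;
    }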
  • 29. Tuning of COMPUTE bound workload
    Behavior of the platform when running a compute-bound workload:
    • under 145 W (TDP level, no power cap)
    • under three power cap levels: 100 W, 80 W, and 60 W
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz] for the compute-bound region under the 100 W, 80 W, and 60 W power caps; the individual charts and the highlighted savings are repeated on the following three slides.]
  • 30. Tuning of COMPUTE bound workload under 100 W power cap
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz]; series: EXP0 default (3.268 s, 363.4 J), EXP1 default Pcap, EXP5 DVFS under Pcap, EXP6 UCF under Pcap, EXP7 DVFS & UCF with min/max UCF.]
    • Highlighted results: 14.9% energy savings with 4.5% time savings; 21.3% energy savings at the cost of a 21.1% time extension
  • 31. Tuning of COMPUTE bound workload under 80 W power cap
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz]; series: EXP0 default (3.268 s, 363.4 J), EXP1 default Pcap, EXP5 DVFS under Pcap, EXP6 UCF under Pcap, EXP7 DVFS & UCF with min/max UCF.]
    • Highlighted results: 8.5% energy savings with 8.4% time savings; 12.9% energy savings at the cost of a 10.9% time extension
  • 32. Tuning of COMPUTE bound workload under 60 W power cap
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz]; series: EXP0 default (3.268 s, 363.4 J), EXP1 default Pcap, EXP5 DVFS under Pcap, EXP6 UCF under Pcap, EXP7 DVFS & UCF with max/min UCF.]
    • Highlighted result: 9.1% energy savings with 9.4% time savings
  • 33. Observations for COMPUTE bound workload
    • To achieve the best possible performance, the uncore frequency must be reduced to its minimum:
      • up to 9.4% performance gain and
      • 14.9% lower energy consumption
    • If further energy savings are required, use DVFS and lower the core frequency:
      • up to 21% energy savings
      • up to 21% runtime penalty
      • this effect is more visible at higher power cap levels
  • 34. Tuning of memory bound workload
    Behavior of the platform when running a memory-bound workload:
    • under 145 W (TDP level, no power cap)
    • under three power cap levels: 100 W, 80 W, and 60 W
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz] for the memory-bound region under the 100 W, 80 W, and 60 W power caps; the individual charts and the highlighted savings are repeated on the following three slides.]
  • 35. Tuning of memory bound workload under 100 W power cap
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz]; series: EXP0 default (1.886 s, 197.6 J), EXP1 default Pcap, EXP5 DVFS under Pcap, EXP6 UCF under Pcap, EXP7 DVFS & UCF (UCF = 2.2 GHz / max UCF).]
    • Highlighted result: 38.7% energy savings with a 3.6% time extension
  • 36. Tuning of memory bound workload under 80 W power cap
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz]; series: EXP0 default (1.886 s, 197.6 J), EXP1 default Pcap, EXP5 DVFS under Pcap, EXP6 UCF under Pcap, EXP7 DVFS & UCF (UCF = 2.2 GHz / max UCF).]
    • Highlighted result: 21.6% energy savings with a 3.6% time extension
  • 37. Tuning of memory bound workload under 60 W power cap
    [Figure: energy consumption [J] and runtime [s] versus core frequency [GHz]; series: EXP0 default (1.886 s, 197.6 J), EXP1 default Pcap, EXP5 DVFS under Pcap, EXP6 UCF under Pcap, EXP7 DVFS & UCF (UCF = 2.2 GHz / max UCF).]
    • Highlighted result: 22.2% energy savings with 22.2% time savings
  • 38. Observations for both workloads
    Observations for memory bound workload:
    • Under a power budget lower than 80 W, DVFS (the core frequency) should be set to the minimum value; this boosts the performance of the uncore part by 22%
    • Tuning of the uncore frequency has little effect on performance but a major effect on energy consumption: between 21% (60 W and 80 W) and 38% (100 W)
    Observations for compute bound workload:
    • To achieve the best possible performance, the uncore frequency must be reduced to its minimum: up to 9.4% performance gain and 14.9% lower energy consumption
    • If further energy savings are required, use DVFS and lower the core frequency: up to 21% energy savings at up to a 21% runtime penalty; this effect is more visible at higher power cap levels
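    Taken together, these observations amount to a simple per-region rule of thumb; a minimal sketch, where the arithmetic-intensity threshold and the concrete frequency values are illustrative rather than taken from the slides:

    #include <cstdio>

    struct Config { double core_ghz; double uncore_ghz; };

    // Heuristic distilled from the observations above: memory-bound regions run
    // the cores slowly and keep the uncore fast, compute-bound regions do the
    // opposite. Threshold and frequencies are example values only.
    Config choose_config(double flops_per_byte, double power_cap_w) {
        if (flops_per_byte < 0.1) {                     // memory bound
            double core = (power_cap_w < 80.0) ? 1.2 : 1.6;
            return {core, 2.7};                         // keep the uncore high
        }
        return {2.3, 1.2};                              // compute bound: nominal core, minimum uncore
    }

    int main() {
        Config c = choose_config(0.05, 60.0);
        std::printf("core %.1f GHz, uncore %.1f GHz\n", c.core_ghz, c.uncore_ghz);
        return 0;
    }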
  • 39. Evaluation of complex HPC applications
  • 40. BEM4I Application
    Hardware: dual-socket system with 2x12 CPU cores ("standard" hardware in HPC centres).

    Application runtime | assemble_k [s] | assemble_v [s] | gmres_solve [s] | print_vtu [s] | main [s]
    default runtime | 5.4 | 5.9 | 10.2 | 5.6 | 27.3
    static tuning runtime | 9.8 | 10.6 | 6.1 | 2.4 | 29.0
    dynamic tuning runtime | 7.0 | 7.2 | 7.9 | 2.1 | 24.3
    static savings [%] | -82.3% | -79.1% | 40.5% | 56.8% | -6.2%
    dynamic savings [%] | -30.6% | -20.9% | 23.2% | 62.9% | 10.9%

    Region description:
    • assemble_k and assemble_v: high utilization of vector units, extreme level of optimization; fully compute bound with great utilization of both sockets and all cores
    • gmres_solve: uses DGEMV from MKL; memory bound, suffers from NUMA effects; this routine is more efficient on a single socket
    • print_vtu: single-threaded, I/O- and network-bound region which stores data to a file on a LUSTRE file system

    Tuning model (core and uncore frequency given in multiples of 100 MHz):
    "static":      { "FREQUENCY": "25" (2.5 GHz), "NUM_THREADS": "12" (12 OpenMP threads), "UNCORE_FREQUENCY": "22" (2.2 GHz) }
    "assemble_k":  { "FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16" }
    "assemble_v":  { "FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14" }
    "gmres_solve": { "FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22" }
    "print_vtu":   { "FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24" }
  • 41. BEM4I Application
    Compute node energy | assemble_k [J] | assemble_v [J] | gmres_solve [J] | print_vtu [J] | main [J]
    default energy | 1476 | 1484 | 2733 | 1142 | 6872
    static tuning energy | 1962 | 2015 | 1366 | 420 | 5792
    dynamic tuning energy | 1467 | 1462 | 1259 | 293 | 4531
    static savings [%] | -33.8% | -35.8% | 50.0% | 63.2% | 15.7%
    dynamic savings [%] | 0.6% | 1.5% | 53.9% | 74.3% | 34.1%

    The large energy savings are a combination of optimal hardware settings and runtime savings coming from mitigating the NUMA effect through optimal OpenMP thread settings. For a similar application without these runtime savings one can expect:
    • energy savings of approx. 15–20%
    • runtime "savings" of approx. -15% (i.e., a runtime extension)
    (Tuning model identical to the previous slide.)
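    In the READEX tool suite this tuning model is applied automatically by the READEX Runtime Library through Score-P instrumentation; the sketch below only illustrates the idea with hypothetical helper names (enter_region and apply_config are not the actual RRL API):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <omp.h>

    // Hypothetical per-region configuration mirroring the tuning model above
    // (core and uncore frequency in multiples of 100 MHz).
    struct RegionConfig { int core_100mhz; int uncore_100mhz; int threads; };

    static const std::map<std::string, RegionConfig> tuning_model = {
        {"assemble_k",  {23, 16, 24}},
        {"assemble_v",  {25, 14, 24}},
        {"gmres_solve", {17, 22,  8}},
        {"print_vtu",   {25, 24,  6}},
    };

    void apply_config(const RegionConfig& c) {
        omp_set_num_threads(c.threads);   // the thread count is the easy knob
        // Core/uncore switching would go through cpufreq / MSR 0x620 (or x86_adapt),
        // as in the earlier sketches; here the target is only reported.
        std::printf("CF %d00 MHz, UCF %d00 MHz, %d threads\n",
                    c.core_100mhz, c.uncore_100mhz, c.threads);
    }

    void enter_region(const std::string& name) {
        auto it = tuning_model.find(name);
        if (it != tuning_model.end()) apply_config(it->second);
    }

    void exit_region() { apply_config({25, 22, 12}); }   // fall back to the static optimum

    int main() {
        enter_region("gmres_solve");   // ... memory-bound solver region would run here ...
        exit_region();
        return 0;
    }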
  • 42.–43. OpenFOAM Application
    • Computational fluid dynamics
    • Finite volume + multigrid solver

    OpenFOAM | Energy consumption | Energy savings
    Default settings | 14 231 J | -
    Static tuning | 12 264 J | 2 264 J (15.9%)
    Dynamic tuning | 11 370 J | 597 J (4.8%)
    Total savings | - | 2 861 J (20.1%)
  • 44. ESPRESO Application
    • Structural mechanics code
    • Finite element + sparse FETI solver
    • 33% energy savings and 22% time savings, together with improved strong scalability
    • Different tuning models are needed for different numbers of nodes, because in the strong scalability test the workload per node varies
    • Results include the dynamic switching overheads
    [Figure: energy savings analysis for the strong scalability test of the ESPRESO library when running the cube benchmark.]
  • 46. Application parameter tuning of ESPRESO
    Energy savings: 50%–66% against "reasonable" settings, 86% against the worst case.
    9 parameters, 3840 combinations:
    • FETI METHOD: 2 options
    • PRECONDITIONER: 5 options
    • ITERATIVE SOLVER TYPE: 2 options
    • HFETI type: 2 options
    • NON-UNIFORM PARTS: 6 options
    • REDUNDANT LAGRANGE: 2 options
    • SCALING: 2 options
    • B0_TYPE: 2 options
    • ADAPTIVE PRECISION: 2 options
    [Figure: energy consumption [kJ] over the configuration index, with the "reasonable" and the optimal settings highlighted.]
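    Since each combination needs its own application start, the search is essentially a sweep over the Cartesian product of the options (2·5·2·2·6·2·2·2·2 = 3840); a minimal sketch, where run_and_measure_energy_kj is a placeholder rather than a real ESPRESO or READEX call:

    #include <array>
    #include <cstdio>
    #include <vector>

    // Placeholder for one ESPRESO run with a given parameter combination; in the
    // real setup each combination means a separate application start with an
    // edited input file and an energy measurement (HDEEM/RAPL).
    double run_and_measure_energy_kj(const std::vector<int>& cfg) { return 0.0; }

    int main() {
        // Option counts per parameter, in the order listed on the slide.
        const std::array<int, 9> options = {2, 5, 2, 2, 6, 2, 2, 2, 2};

        long total = 1;
        for (int o : options) total *= o;                 // 3840 combinations

        std::vector<int> cfg(options.size(), 0), best_cfg;
        double best = 1e300;
        for (long idx = 0; idx < total; ++idx) {
            long rest = idx;                              // decode idx into one combination
            for (std::size_t p = 0; p < options.size(); ++p) {
                cfg[p] = static_cast<int>(rest % options[p]);
                rest /= options[p];
            }
            double e = run_and_measure_energy_kj(cfg);
            if (e < best) { best = e; best_cfg = cfg; }
        }
        std::printf("evaluated %ld configurations\n", total);
        return 0;
    }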
  • 47. Application parameter tuning
    Application parameter tuning is very promising:
    • application configuration parameters are given in the input file
    • each setting requires an individual start of the application
    • the tool performs an automatic search of the application parameter space

    Application | Number of parameters tested / total number of options | Energy savings vs. worst settings | Energy savings vs. default or reasonable settings
    ESPRESO | 9 / 3840 | 86% | 50–66%
    ELMER | 1 / 40 | 97% | 50–75%
    OpenFOAM | 2 / 12 | 24% | 8%
    INDEED | 3 / 12 | 35% | 25%