Low-Energy Heterogeneous Computing Workshop – @FPL 2020 – September 4, 2020
Energy Efficiency in Multicore CPUs:
Harnessing Voltage Margins
Dimitris Gizopoulos
University of Athens
@LEGaTO/FPL – September 2020
U Athens
CPUs: Power, Energy, Performance
• P_dynamic = ½ × Capacitance × frequency × Voltage²
• E_dynamic = P_dynamic × Time
• Nominal Voltage & Frequency
= Worst Case (Workload, Conditions, Variability, Aging)
• i.e. Power, Energy, Performance Costs
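A minimal sketch of why the Voltage² term makes under-volting attractive, assuming capacitance and frequency stay fixed while only the supply voltage changes. The function names and the 980 mV to 915 mV example are illustrative assumptions, not figures claimed by the talk, and the model ignores static (leakage) power.

```python
def dynamic_power(capacitance_f, frequency_hz, voltage_v):
    """Dynamic power in watts: 1/2 * C * f * V^2 (formula from the slide)."""
    return 0.5 * capacitance_f * frequency_hz * voltage_v ** 2


def dynamic_power_saving(v_nominal, v_scaled):
    """Fractional dynamic-power saving when only the voltage is lowered.

    C and f cancel out, so the saving depends on the voltage ratio alone.
    """
    return 1.0 - (v_scaled / v_nominal) ** 2


if __name__ == "__main__":
    # Illustrative voltages only (assumption): 0.980 V lowered to 0.915 V.
    saving = dynamic_power_saving(0.980, 0.915)
    print(f"first-order dynamic power saving: {saving:.1%}")
```

Because time is unchanged at a fixed frequency, the same fraction applies to dynamic energy in this first-order model.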
In a Nutshell – Beyond Margins
CPU is under-volted (supply voltage under-scaling), or CPU over-clocked, or DRAM under-refreshed
• Margins? (how low can you safely/unsafely go?)
• Behavior? (what happens in the danger zone?)
• Variability? (among cores/chips/workloads?)
• Faster? (less time to characterize?)
• Model? (simulation models?)
• Predict? (correlate to run-time stats?)
• Monitor/Expose? (log/report to software?)
This work is on ARMv8 CPUs and their voltage scaling
Margins Characterization Landscape
o First study on ARMv8-based micro-server CPU chips
ISA | Processor/Chip | Technology | Reference
POWER 7 / 7+ | IBM Power 750, 780 | 45 / 32 nm | IBM (MICRO '11), UT Austin (MICRO '15)
IA-64 | Intel Itanium 9560 | 32 nm | Ohio State U (ISCA '13, MICRO '14)
x86-64 | Intel i7-3970X, i5-4200U | 32 / 22 nm | University of Athens (IOLTS '17)
Nvidia Fermi / Kepler | GTX 480, 580, 680, 780 | 40 / 28 nm | IBM, UT Austin (MICRO '15)
Xilinx FPGAs | Virtex-7, Zynq7000, Kintex-7 | 28 nm | BSC/UPC (MICRO '18)
ARMv8 (8 cores) | APM X-Gene 2 | 28 nm | U Athens (MICRO '17, ISPASS '18)
ARMv8 (32 cores) | APM (Ampere) X-Gene 3 | 16 nm | U Athens (HPCA '19)
Ampere’s (Applied Micro’s) X-Gene 2 & X-Gene 3
Parameter | X-Gene 2 | X-Gene 3
ISA | ARMv8 | ARMv8
Pipeline | 64-bit OoO (4-issue) | 64-bit OoO (4-issue)
CPU | 8 cores | 32 cores
Core clock | 2.4 GHz | 3 GHz
L1I $ | 32KB per core (Parity) | 32KB per core (Parity)
L1D $ | 32KB per core (Parity) | 32KB per core (Parity)
L2 $ | 256KB per PMD (SECDED) | 256KB per PMD (SECDED)
L3 $ | 8MB (SECDED) | 32MB (SECDED)
Technology | 28 nm | 16 nm
Voltage Domains | PMD & PCP/SoC | PMD & PCP/SoC
Freq. Domains | per PMD (pair of cores) | per PMD (pair of cores)
PMD = Processor module (2 cores), PCP = Processor complex (all cores)
System-Level Voltage Scaling Characterization
(MICRO 2017)
• Running many different workloads at different voltage levels
Automated Framework: Example Output
Benchmark @ 2.4 GHz (rows: supply voltage in mV; columns: cores 0 to 7)
mV 0 1 2 3 4 5 6 7
980 + + + + + + + +
: + + + + + + + +
915 + + + + + + + +
910 SDC SDC + + + + + +
905 SDC 1L1CE-SDC SDC SDC + + + +
900 SDC SDC 1L1CE-SDC 2L1CE-SDC + + + +
895 1L1CE-SDC SDC 7L1CE-SDC 8L1CE-SDC + + SDC SDC
890 5L1CE-X SDC 14L1CE-3L2CE-SDC X + + SDC 1L1CE-SDC
885 - 3L1CE-1L2UE-X X - SDC SDC 3L1CE-SDC 3L1CE-1L2CE-SDC
880 - - - - SDC SDC 13L1CE-4L2CE-SDC $
875 - - - - 9L1CE-SDC 3L1CE-SDC 5L1CE-2L2CE-1L2UE-$ 1L1CE-$
870 - - - - 11L1CE-2L2CE-1L2UE-$ $ 1L1CE-1L2UE-$ X
865 - - - - 1L1CE-1L2UE-$ X X -
860 - - - - X - - -
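A rough sketch of how one cell of the output above could be split into counted events and a final outcome, for anyone post-processing such a table. The slide gives no legend, so the code treats labels such as `L1CE`, `L2UE`, `SDC`, `X` and `$` as opaque strings; `parse_cell` is a hypothetical helper, not part of the authors' framework.

```python
import re

# A counted event looks like "14L1CE" or "1L2UE": a count followed by a label.
_COUNTED = re.compile(r"^(\d+)(L\d(?:CE|UE))$")


def parse_cell(cell):
    """Split a cell such as '14L1CE-3L2CE-SDC' into (counts, outcome).

    counts maps labels like 'L1CE' to integers; outcome is the trailing
    uncounted token ('+', 'SDC', 'X', '$', '-'), or None if there is none.
    """
    if cell in {"+", "-"}:
        # Single-symbol cells carry no counted events.
        return {}, cell
    counts = {}
    outcome = None
    for token in cell.split("-"):
        match = _COUNTED.match(token)
        if match:
            label = match.group(2)
            counts[label] = counts.get(label, 0) + int(match.group(1))
        elif token:
            outcome = token
    return counts, outcome


if __name__ == "__main__":
    for cell in ["+", "SDC", "5L1CE-X", "14L1CE-3L2CE-SDC", "-"]:
        print(cell, "->", parse_cell(cell))
```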
Core-to-Core & Workload-to-Workload Variation
[Figure: per-core voltage regions (Safe / Unsafe / Crash) and average Vmin for cores 0 to 7 of three chips (TTT, TFF, TSS); y-axis 850 to 930 mV; panels for cactusADM, soplex and bwaves. Annotations as extracted: 25 mV, 20 mV, 25 mV; max / min power savings 18.4% / 14.7%, 22.1% / 17.5%, 21.2% / 16.6%.]
also – variability across different chips, TTT/TFF/TSS (not shown)
Faster Characterization using Micro-viruses
(ISPASS 2018)
Voltage regions, from nominal voltage downwards:
• Vnominal down to Vmin: Safe (nothing abnormal). Energy savings.
• Vmin down to Vcrash: Unsafe (errors/SDCs, but no crashes). Uncertain/potential energy savings.
• Vcrash and below: Crash (crashes happen). Keep out!
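The three regions (Safe, Unsafe, Crash) can be expressed as a small classifier. This is a sketch under assumed boundary conventions: Vmin is taken as the lowest voltage that is still safe, and Vcrash as the highest voltage at which a crash occurs. The function name and the example thresholds are hypothetical.

```python
def classify_voltage(v_mv, vmin_mv, vcrash_mv):
    """Return 'safe', 'unsafe' or 'crash' for a supply voltage in mV.

    Assumed conventions: v >= Vmin is safe, Vcrash < v < Vmin is unsafe,
    and v <= Vcrash is in the crash region.
    """
    if vcrash_mv > vmin_mv:
        raise ValueError("Vcrash must not exceed Vmin")
    if v_mv >= vmin_mv:
        return "safe"
    if v_mv > vcrash_mv:
        return "unsafe"
    return "crash"


if __name__ == "__main__":
    # Hypothetical thresholds for one core and one workload.
    vmin, vcrash = 915, 890
    for v in (980, 915, 900, 890, 860):
        print(v, "mV ->", classify_voltage(v, vmin, vcrash))
```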
How much time is needed?
[Bar chart: characterization time in days (axis 0 to 36), single-thread and multi-thread, SPEC CPU2006 vs. micro-viruses. Data labels as extracted: 37.6, 20.7, 1.5, 1.9.]
Multicore/Multithread CPUs Voltage Margins
(HPCA 2019)
o Single-core and multi-core executions
o Different frequencies
  - X-Gene 2: 2.4 GHz, 1.2 GHz, and 0.9 GHz
  - X-Gene 3: 3.0 GHz and 1.5 GHz
o Different thread scaling options
  - Max threads (32T in X-Gene 3, 8T in X-Gene 2)
  - Half threads (16T in X-Gene 3, 4T in X-Gene 2)
  - Quarter threads (8T in X-Gene 3, 2T in X-Gene 2)
o Thread allocation policies (when threads < max)
  - Spreaded: one thread per PMD, so more PMDs are utilized
  - Clustered: threads fill both cores of a PMD, so fewer PMDs are utilized
[Diagram: spreaded vs. clustered placement of threads on cores 0 to 3]
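The two allocation policies can be sketched as a mapping from thread count to core IDs. This is an illustration only: it assumes cores 2k and 2k+1 form PMD k and that threads are placed on the lowest-numbered PMDs first. Neither detail is stated on the slide, and `allocate_cores` and `utilized_pmds` are hypothetical helpers.

```python
CORES_PER_PMD = 2  # from the slides: a PMD is a pair of cores


def allocate_cores(num_threads, num_pmds, policy):
    """Return the core IDs used by num_threads threads under a policy.

    'clustered' fills both cores of a PMD before moving to the next PMD;
    'spreaded' places one thread per PMD.
    Assumption: cores 2k and 2k+1 belong to PMD k.
    """
    total_cores = num_pmds * CORES_PER_PMD
    if not 1 <= num_threads <= total_cores:
        raise ValueError("thread count out of range")
    if policy == "clustered":
        return list(range(num_threads))
    if policy == "spreaded":
        if num_threads > num_pmds:
            raise ValueError("spreaded allows at most one thread per PMD")
        return [pmd * CORES_PER_PMD for pmd in range(num_threads)]
    raise ValueError(f"unknown policy: {policy}")


def utilized_pmds(core_ids):
    """Count the distinct PMDs touched by a set of core IDs."""
    return len({core // CORES_PER_PMD for core in core_ids})


if __name__ == "__main__":
    for policy in ("clustered", "spreaded"):
        cores = allocate_cores(4, 16, policy)
        print(policy, cores, "PMDs used:", utilized_pmds(cores))
```

Under these assumptions, 4 threads occupy 2 PMDs when clustered and 4 PMDs when spreaded.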
Impact of Frequency and Core Allocation on Vmin
Droop Magnitude | Utilized PMDs | Thread Scaling and Core Allocation | Vmin @ 3.0 GHz | Vmin @ 1.5 GHz
[25 mV, 35 mV) | 1, 2 PMDs | 1T, 2T, 4T (clustered) | 780 mV | 770 mV
[35 mV, 45 mV) | 4 PMDs | 8T (clustered), 4T (spreaded) | 800 mV | 780 mV
[45 mV, 55 mV) | 8 PMDs | 16T (clustered), 8T (spreaded) | 810 mV | 790 mV
[55 mV, 65 mV) | 16 PMDs | 32T, 16T (spreaded) | 830 mV | 820 mV
Vmin values for all X-Gene 3 execution combinations
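Since Vmin in this table depends only on the number of utilized PMDs and the frequency, it can be held in a lookup keyed by those two values. The sketch below copies the X-Gene 3 figures from the slide; the function name and the choice to reject PMD counts or frequencies that the slide does not list are assumptions.

```python
# Vmin in mV, keyed by utilized PMDs, then by core frequency in MHz.
# Values copied from the X-Gene 3 table on the slide.
VMIN_MV = {
    1: {3000: 780, 1500: 770},
    2: {3000: 780, 1500: 770},
    4: {3000: 800, 1500: 780},
    8: {3000: 810, 1500: 790},
    16: {3000: 830, 1500: 820},
}


def vmin_mv(utilized_pmds, freq_mhz):
    """Look up Vmin for a PMD count and frequency listed on the slide.

    Raises ValueError for combinations the slide does not report, rather
    than interpolating between rows.
    """
    try:
        return VMIN_MV[utilized_pmds][freq_mhz]
    except KeyError:
        raise ValueError("combination not reported on the slide") from None


if __name__ == "__main__":
    # 4 threads: clustered uses 2 PMDs, spreaded uses 4 PMDs (per the table).
    print("4T clustered @ 3.0 GHz:", vmin_mv(2, 3000), "mV")
    print("4T spreaded  @ 3.0 GHz:", vmin_mv(4, 3000), "mV")
```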
Impact of Workload on Freq. and Core Allocation
[Chart: energy (J, axis 0 to 1600) per benchmark for 4T (clustered) and 4T (spreaded), with the percentage difference between the two (axis -15% to 18%); benchmarks ordered from CPU-intensive to memory-intensive. Label: Energy calculations.]
Difference per benchmark (values paired with benchmark names in the order extracted):
Benchmark | Difference
bodytrack | -9.6%
IS | -7.5%
EP | -7.5%
hmmer | -6.7%
LU | -5.9%
cactusADM | -5.7%
namd | -4.5%
zeusmp | -4.4%
h264ref | -4.0%
swaptions | -3.7%
MG | -3.6%
gromacs | -3.3%
bwaves | -3.2%
gcc | -2.3%
blackscholes | -1.7%
dealII | -1.4%
bzip2 | -1.3%
fluidanimate | 0.0%
leslie3d | 0.7%
mcf | 1.4%
canneal | 4.4%
milc | 6.6%
CG | 10.1%
FT | 13.0%
dedup | 14.2%
Full Workload Execution Results
X-Gene 2
Metric | Baseline | Safe Vmin | Placement | Optimal
Time (s) | 3707 | 3707 | 3829 | 3829
Avg. Power (W) | 6.90 | 6.10 | 5.46 | 5.00
Energy (J) | 25578.30 | 22612.07 | 20906.34 | 19145.00
Energy Savings | -- | 11.6% | 18.3% | 25.2%
ED2P (workload) | 351 × 10^9 | 311 × 10^9 | 307 × 10^9 | 281 × 10^9
ED2P Savings | -- | 11.6% | 12.8% | 20.1%

X-Gene 3
Metric | Baseline | Safe Vmin | Placement | Optimal
Time (s) | 3748 | 3748 | 3846 | 3846
Avg. Power (W) | 36.49 | 32.51 | 30.78 | 27.63
Energy (J) | 136773.2 | 121847.48 | 118379.88 | 106283.5
Energy Savings | -- | 10.9% | 13.4% | 22.3%
ED2P (workload) | 19 × 10^11 | 17 × 10^11 | 17 × 10^11 | 15 × 10^11
ED2P Savings | -- | 10.9% | 8.9% | 18.2%
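The derived rows in these results follow from time and average power. The sketch below assumes the usual definitions, energy = average power × time and ED2P = energy × time², which reproduce the X-Gene 2 figures on the slide to the precision shown. The function names are illustrative.

```python
def energy_j(avg_power_w, time_s):
    """Energy in joules from average power and execution time."""
    return avg_power_w * time_s


def ed2p(energy, time_s):
    """Energy-delay-squared product: energy * time^2."""
    return energy * time_s ** 2


def saving(baseline, value):
    """Fractional reduction of value relative to baseline."""
    return (baseline - value) / baseline


if __name__ == "__main__":
    # X-Gene 2 figures taken from the slide: baseline vs. optimal.
    base_e, base_t = 25578.30, 3707
    opt_e, opt_t = 19145.00, 3829
    print(f"baseline energy from power x time: {energy_j(6.90, base_t):.2f} J")
    print(f"energy saving: {saving(base_e, opt_e):.1%}")
    print(f"ED2P saving:   {saving(ed2p(base_e, base_t), ed2p(opt_e, opt_t)):.1%}")
```

The ED2P saving is smaller than the energy saving because the placement-based configurations run slightly longer (3829 s vs. 3707 s), and delay is squared.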
Thank you!
• Consolidated review paper: CPUs, GPUs, FPGAs voltage margins,
U Athens, BSC, Harvard, SJTU
