Energy efficiency opportunities with long vector architectures
Sep. 10, 2020
Presentation given by Oscar Palomar (BSC) at the LEGaTO Final Event: Low-Energy Heterogeneous Computing Workshop on 4 September 2020
This event was collocated with FPL 2020
Contents of this talk
Intro to Vector architectures
Energy Efficiency & Vector Processors
– Inherently energy efficient
– Related past research
Designing a VPU for the European Processor Initiative
– Overview
– Opportunities to save energy
Vector Processor
Processors that operate on a vector of data
with a SINGLE instruction
EXAMPLE:
for (i=0;i<512;i++)
c[i]=a[i]+b[i];
SCALAR (loop body runs x512)
LD r0, @a
LD r1, @b
ADD r2, r0, r1
ST r2, @c
ADD //counter
CMP
BR
VECTOR (loop body runs x2; each instruction operates on 256 elements)
VLD V0, @a
VLD V1, @b
VADD V2, V1, V0
VST V2, @c
ADD //counter
CMP
BR
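The instruction-count difference in this example can be made concrete with a small C sketch. It is only a software model, not real vector code: the inner loop stands in for one group of vector instructions, and the figure of 7 instructions per loop pass is read off the slide's listings. The names `scalar_add`, `vector_add`, `N` and `VLEN` are introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 512     /* elements in the slide's loop */
#define VLEN 256  /* elements per vector instruction, as on the slide */

/* Scalar loop: every element costs 7 instructions
 * (2 loads, add, store, counter add, compare, branch). */
static long scalar_add(const int *a, const int *b, int *c, size_t n)
{
    long instructions = 0;
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        instructions += 7;
    }
    return instructions;
}

/* Strip-mined loop: every pass handles up to VLEN elements and still
 * costs 7 instructions (VLD, VLD, VADD, VST, counter add, compare, branch). */
static long vector_add(const int *a, const int *b, int *c, size_t n)
{
    long instructions = 0;
    for (size_t i = 0; i < n; i += VLEN) {
        size_t vl = (n - i < VLEN) ? (n - i) : VLEN;
        assert(vl > 0 && vl <= VLEN);
        /* Models the element-wise work done by one vector instruction group. */
        for (size_t j = 0; j < vl; j++)
            c[i + j] = a[i + j] + b[i + j];
        instructions += 7;
    }
    return instructions;
}
```

For the 512-element loop, `scalar_add` counts 7 x 512 = 3584 instructions and `vector_add` counts 7 x 2 = 14, while both produce the same result array.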
An Efficient Solution
Traditionally designed for high performance
– The “Vectorian” age of supercomputing
But vector processors are inherently
energy-efficient for workloads that exhibit
Data-Level Parallelism (DLP)
– Instruction Fetch, Decode, Dispatch
– Memory and Register file access
– Vector data can be correlated → lower activity
Lemuet, C. et al. (2006). “The potential energy efficiency of
vector acceleration”. In SC 2006
Vector extensions for DBMS
From Timothy Hayes. «Novel Vector Architectures
for Data Management», PhD thesis, 2016
Also, «VSR Sort: A Novel Vectorised Sorting
Algorithm and Architecture Extensions for Future
Microprocessors», HPCA’15
Design space exploration of vector FU
Lowest EDP 1-, 2-, and 4-lane configurations
With Clock Gating
Averaged benchmarks
Longer vectors provide more energy efficiency
vL=16
vL=128
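For reference, the energy-delay product (EDP) used to rank the configurations is conventionally defined as the energy consumed multiplied by the execution time. The symbols E and T are notation introduced here, not taken from the slides.

```latex
\mathrm{EDP} = E \cdot T
```

A lower EDP is better: a configuration ranks well only if it saves energy without a disproportionate increase in run time.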
Clock gating techniques for vectors (FP FMA unit)
Scalar Operand Clock-Gating (ScalarCG)
Implicit Scalar Operand Clock-Gating (ImplCG)
Vector Masking and Vector Multi-Lane-Aware Clock-Gating (MaskCG)
Input Data Aware Clock-Gating (InputCG)
Idle Unit Clock-Gating (IdleCG)
ISLPED’16: Ivan Ratković, Oscar Palomar, Milan Stanić, Osman Unsal, Adrian Cristal, and Mateo Valero, “A Fully
Parameterizable Low Power Design of Vector Fused Multiply-Add Using Active Clock-Gating Techniques”
Scalar Operand Clock-Gating (ScalarCG)
Fixed operands during the instruction
Gating information derived from OPCODE
Impl: implicit operand
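The idea behind ScalarCG can be illustrated with a rough software model. It assumes that a vector-scalar FMA keeps one operand fixed for the whole instruction, so that operand's input register only needs to be clocked for the first element, and that the opcode alone says which operand is fixed. The opcode names and the `clock_events_t` type are hypothetical, and the actual circuit in the ISLPED'16 paper may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcodes for illustration only; not a real encoding. */
typedef enum {
    OP_FMA_VVV, /* all three operands are vectors */
    OP_FMA_VVS  /* third operand is a scalar, fixed for the instruction */
} opcode_t;

/* Number of cycles in which each operand input register is clocked. */
typedef struct {
    long a;
    long b;
    long c;
} clock_events_t;

/* Gating information is derived from the opcode alone. */
static bool operand_c_is_scalar(opcode_t op)
{
    return op == OP_FMA_VVS;
}

static clock_events_t count_operand_clocks(opcode_t op, int vl)
{
    clock_events_t ev = {0, 0, 0};
    assert(vl >= 0);
    for (int i = 0; i < vl; i++) {
        ev.a++; /* vector operands change every element */
        ev.b++;
        /* A scalar operand is latched once, then its register is gated. */
        if (!operand_c_is_scalar(op) || i == 0)
            ev.c++;
    }
    return ev;
}
```

In this model, a 128-element vector-scalar FMA clocks the third operand register once instead of 128 times, while the two vector operand registers are unaffected.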
EPI VPU
European Processor Initiative
https://www.european-processor-initiative.eu/
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 826647
VECTOR LANE MICROARCHITECTURE (SIMPLIFIED VIEW)
To/from Load/Store Unit
Diagram labels: Buffer A, Buffer B, Buffer C, Mem. Store Buffer, Mem. Load Buffer, WriteBack Buffer, … Register file banks
Load buffer: holds data coming from memory. The load buffer is replicated and managed by a dedicated load unit to handle out-of-order element transfers.
WriteBack Buffer: holds operation results.
Buffers A, B, C: operand buffers. They are doubled (shadowed) to allow a single-cycle full refill while the shadow buffer is being consumed.
Store Buffer: holds data to be sent to memory.
Energy efficiency of current implementation?
First implementation, focus on functionality, not on efficiency
● Which is great news, we have a lot of things to do for a second version :)
What is there: helps performance AND energy
● Tail zeroing optimisation
● Parallel reductions with reduced inter-lane communication (integer, unordered)
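One way to picture a reduction with reduced inter-lane communication is sketched below. It is an assumption about the general approach, not a description of the EPI VPU hardware: each lane first sums the elements it holds, and only the per-lane partial sums cross lanes. The lane count `LANES`, the interleaved element-to-lane mapping and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8 /* hypothetical lane count */

typedef struct {
    long sum;
    int inter_lane_transfers;
} reduce_result_t;

static reduce_result_t lane_local_reduce(const int *v, size_t n)
{
    long partial[LANES] = {0};

    /* Step 1: element i is assumed to live in lane i % LANES.
     * Each lane accumulates its own elements; nothing crosses lanes. */
    for (size_t i = 0; i < n; i++)
        partial[i % LANES] += v[i];

    /* Step 2: combine the partial sums into lane 0,
     * one inter-lane transfer per remaining lane. */
    reduce_result_t r = { partial[0], 0 };
    for (int lane = 1; lane < LANES; lane++) {
        r.sum += partial[lane];
        r.inter_lane_transfers++;
    }
    return r;
}
```

The reordering is safe for integers because integer addition is associative and commutative. For floating point it would change the rounding, which fits the slide restricting the optimisation to integer and unordered reductions.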
Against energy efficiency (some increase performance, others simplify hardware)
● Speculation
● Renaming
● Retries
● Buffering
● Many small implementation details
Issues and Future work
ISA (RISC-V V-extension) issues
● Context: embedded vs. HPC, short vs. long vectors
● Tail handling (requires copying old values when operating with short vector length)
● Mask layout (requires moving mask registers across lanes)
● Lack of negated mask (requires additional instructions)
● Mandatory LMUL support (complicates hardware)
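The tail-handling cost can be sketched with a simplified model. It assumes a renamed destination register, so that keeping the old tail values means copying them into the new physical register, whereas zeroing the tail never reads the old register. This is an illustration of the two policies, not the semantics of any particular version of the RISC-V V specification, and `VLMAX` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VLMAX 256 /* hypothetical maximum vector length in elements */

/* Tail kept: elements [vl, VLMAX) must be copied from the old destination
 * register into the newly allocated one. Returns the number of copies. */
static size_t vadd_tail_undisturbed(const int *a, const int *b,
                                    const int *old_dst, int *new_dst,
                                    size_t vl)
{
    assert(vl <= VLMAX);
    for (size_t i = 0; i < vl; i++)
        new_dst[i] = a[i] + b[i];

    size_t copied = 0;
    for (size_t i = vl; i < VLMAX; i++) {
        new_dst[i] = old_dst[i];
        copied++;
    }
    return copied;
}

/* Tail zeroed: the old destination register is never read. */
static void vadd_tail_zeroing(const int *a, const int *b,
                              int *new_dst, size_t vl)
{
    assert(vl <= VLMAX);
    for (size_t i = 0; i < vl; i++)
        new_dst[i] = a[i] + b[i];
    for (size_t i = vl; i < VLMAX; i++)
        new_dst[i] = 0;
}
```

With a vector length of 16 out of 256, the first variant copies 240 old elements to produce 16 new results, which is why short vectors make this policy expensive on a long-vector machine.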
Leverage vector ISA for clock gating as presented above
General optimisations
Evaluate design decisions (ring, load handling)
● Inter-lane communication: ring vs. crossbar? Pattern dependent