Energy efficiency opportunities with long vector architectures

Presentation given by Oscar Palomar (BSC) at the LEGaTO Final Event: Low-Energy Heterogeneous Computing Workshop on 4 September 2020
This event was collocated with FPL 2020

  1. Energy-efficient vector architectures
     LEGaTO Workshop, 04.09.2020
     Oscar Palomar, Barcelona Supercomputing Center
     oscar.palomar@bsc.es
     www.bsc.es
  2. Contents of this talk
     Intro to Vector architectures
     Energy Efficiency & Vector Processors
       – Inherently energy efficient
       – Related past research
     Designing a VPU for the European Processor Initiative
       – Overview
       – Opportunities to save energy
  3. VECTOR ARCHITECTURES
  4. Vector Processor
     Processors that operate on a vector of data with a SINGLE instruction
     EXAMPLE:
       for (i=0;i<512;i++) c[i]=a[i]+b[i];
     SCALAR (loop body executed x512):
       LD r0, @a
       LD r1, @b
       ADD r2, r0, r1
       ST r2, @c
       ADD //counter
       CMP
       BR
     VECTOR (loop body executed x2; each instruction operates on 256 elements):
       VLD V0, @a
       VLD V1, @b
       VADD V2, V1, V0
       VST V2, @c
       ADD
       CMP
       BR
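The instruction-count difference on slide 4 can be sketched in plain C. This is only an illustrative model, not code from the talk: the function names are made up, the inner loop of `vector_add` stands in for what a single sequence of vector instructions does in hardware, and the count of 7 instructions per loop iteration is read off the two listings on the slide.

```c
#include <assert.h>
#include <stddef.h>

enum { MVL = 256 };            /* elements per vector instruction, as on the slide */
enum { INSTRS_PER_ITER = 7 };  /* both listings on the slide have 7 instructions   */

/* Scalar loop: one element per iteration.
 * Returns the modelled dynamic instruction count. */
long scalar_add(const int *a, const int *b, int *c, size_t n)
{
    long instrs = 0;
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        instrs += INSTRS_PER_ITER;
    }
    return instrs;
}

/* Strip-mined loop: each outer iteration models one pass through the
 * vector listing (VLD, VLD, VADD, VST, ADD, CMP, BR) covering up to MVL
 * elements. Returns the modelled dynamic instruction count. */
long vector_add(const int *a, const int *b, int *c, size_t n)
{
    long instrs = 0;
    for (size_t base = 0; base < n; base += MVL) {
        size_t vl = (n - base < MVL) ? n - base : MVL;  /* last strip may be short */
        assert(vl > 0 && vl <= MVL);
        for (size_t e = 0; e < vl; e++)                 /* work done by the vector unit */
            c[base + e] = a[base + e] + b[base + e];
        instrs += INSTRS_PER_ITER;
    }
    return instrs;
}
```

For n = 512 this model gives 7 x 512 = 3584 instructions for the scalar loop and 7 x 2 = 14 for the vector loop, while both produce the same `c`.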
  5. VADD V2<-V0,V1
     Vector register size – Maximum Vector Length = 8
  6. VADD V2<-V0,V1
  7. VADD V2<-V0,V1
  8. Vector Multi Lane
     Partition vector registers, lock-stepped multiple ALUs
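A rough C model of the multi-lane idea on slide 8 follows. It assumes an interleaved partition (element i is held by lane i mod lanes) and one element per lane per cycle; the slide does not state the mapping, so both are assumptions, and all function names are hypothetical.

```c
#include <assert.h>

/* Assumed interleaved partition of a vector register across lanes. */
unsigned lane_of(unsigned elem, unsigned lanes)
{
    assert(lanes > 0);
    return elem % lanes;
}

/* Lock-stepped lanes finish a vl-element instruction in ceil(vl / lanes)
 * steps, assuming each lane processes one element per cycle. */
unsigned lockstep_cycles(unsigned vl, unsigned lanes)
{
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;
}

/* VADD V2<-V0,V1 executed as groups of `lanes` elements per step. */
void vadd_lanes(const int *v0, const int *v1, int *v2,
                unsigned vl, unsigned lanes)
{
    unsigned cycles = lockstep_cycles(vl, lanes);
    for (unsigned cycle = 0; cycle < cycles; cycle++) {
        for (unsigned lane = 0; lane < lanes; lane++) {
            unsigned e = cycle * lanes + lane;  /* element handled by this lane now */
            if (e < vl)
                v2[e] = v0[e] + v1[e];
        }
    }
}
```

With the maximum vector length of 8 from slide 5 and 4 lanes, `lockstep_cycles(8, 4)` is 2, compared with 8 steps for a single lane.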
  9. ENERGY EFFICIENCY & VECTOR PROCESSORS
  10. An Efficient Solution
      Traditionally designed for high-performance
        – The “Vectorian” age of supercomputing
      But vector processors are inherently energy-efficient for workloads that exhibit Data-Level Parallelism (DLP)
        – Instruction Fetch, Decode, Dispatch
        – Memory and Register file access
        – Vector data can be correlated → lower activity
      Lemuet, C. et al. (2006). “The potential energy efficiency of vector acceleration”. In SC 2006
  11. Vector extensions for DBMS
      From Timothy Hayes. «Novel Vector Architectures for Data Management», PhD thesis, 2016
      Also, «VSR Sort: A Novel Vectorised Sorting Algorithm and Architecture Extensions for Future Microprocessors», HPCA’15
  12. Design space exploration of vector FU
      Lowest EDP 1-, 2-, and 4-lane configurations, with clock gating, averaged benchmarks
      (plot labels: vL=16, vL=128)
      Longer vectors provide more energy efficiency
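Slide 12 ranks configurations by EDP without defining it. Assuming it is the usual energy-delay product, the metric for a run that consumes energy E over execution time t (with average power P-bar) is:

```latex
\mathrm{EDP} = E \cdot t = \bar{P} \cdot t^{2}, \qquad \text{since } E = \bar{P} \cdot t
```

Lower is better. Because delay enters squared, a configuration that draws more power can still have a lower EDP if it shortens execution time enough.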
  13. Clock gating techniques for vectors (FP FMA unit)
      Scalar Operand Clock-Gating (ScalarCG)
      Implicit Scalar Operand Clock-Gating (ImplCG)
      Vector Masking and Vector Multi-Lane-Aware Clock-Gating (MaskCG)
      Input Data Aware Clock-Gating (InputCG)
      Idle Unit Clock-Gating (IdleCG)
      ISLPED’16: Ivan Ratković, Oscar Palomar, Milan Stanić, Osman Unsal, Adrian Cristal, and Mateo Valero, “A Fully Parameterizable Low Power Design of Vector Fused Multiply-Add Using Active Clock-Gating Techniques”
  14. Scalar Operand Clock-Gating (ScalarCG)
      Fixed operands during the instruction
      Gating information derived from OPCODE
      Impl: implicit operand
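One way to read slide 14: if the opcode says an FMA operand is a scalar, that operand stays fixed for the whole vector instruction, so its input register only needs to be clocked once instead of once per element. The C sketch below is a simple counting model of that idea, not the ISLPED'16 design; the opcode names and the "one clock event per operand register per element" accounting are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical FMA opcodes: V = vector operand, S = scalar operand. */
typedef enum { OP_FMA_VVV, OP_FMA_VVS, OP_FMA_VSS } opcode_t;

/* Gating information derived from the opcode alone:
 * how many of the three FMA inputs are fixed scalars. */
static unsigned scalar_operands(opcode_t op)
{
    switch (op) {
    case OP_FMA_VVS: return 1;
    case OP_FMA_VSS: return 2;
    default:         return 0;   /* OP_FMA_VVV: nothing to gate */
    }
}

/* Count operand-register clock events for one vector instruction of
 * length vl. Without gating, every operand register is clocked for
 * every element. With ScalarCG, a scalar operand is latched once. */
unsigned long operand_clock_events(opcode_t op, unsigned vl, bool scalar_cg)
{
    const unsigned operands = 3;          /* a, b, c of a*b+c */
    unsigned s = scalar_operands(op);
    if (!scalar_cg || vl == 0)
        return (unsigned long)operands * vl;
    return (unsigned long)(operands - s) * vl + s;
}
```

Under this model, a 128-element vector-vector-scalar FMA goes from 3 x 128 = 384 operand clock events to 2 x 128 + 1 = 257 with gating enabled.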
  15. EPI VPU
      European Processor Initiative
      https://www.european-processor-initiative.eu/
      This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 826647
  16. EPAC ARCHITECTURE VIEW
       Up to 8 vector processors
       VPU (x8 vector lanes) acts as tightly coupled acceleration unit to the scalar core in the vector processor
       RISC-V vector extension compliant
      (block diagram labels: ACCELERATOR TILE; GPP NoC Bridge; NOC CrossPoint; L2$; Vector Processor with Scalar core, L1$, LSU, VLSU, VRF, Vector lanes, LANE INTERCONNECT; Specialized Units: SPM, NTX, STX, DMA; VaRiable Precision Unit: Variable precision co-proc, Scalar core, L1$, LSU)
      Copyright © European Processor Initiative 2019. RISC-V Scientific Day / Paris / 03-10-2019
  17. VECTOR LANE MICROARCHITECTURE (SIMPLIFIED VIEW)
      Load buffer: holding data coming from memory. Load buffer is replicated and managed by a dedicated load unit to handle out-of-order element transfers
      WriteBack Buffer: holding operation results
      Buffer A, B, C: operand buffers. Buffers A, B, C are doubled (shadowed) to allow single-cycle full refill while the shadow buffer is being consumed.
      Store Buffer: holding data to be sent to memory
      (diagram labels: To/from Load/Store Unit; Buffer A; Buffer B; Buffer C; Mem. Store Buffer; Mem. Load Buffer; WriteBack Buffer; Register file banks)
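The doubled (shadowed) operand buffers on slide 17 resemble a classic ping-pong buffer: one copy feeds the functional unit while the other is refilled, and the roles flip in a single step. The C sketch below shows only that generic pattern; the buffer depth, type, and function names are hypothetical and the real EPI VPU control logic is not described on the slide.

```c
#include <assert.h>
#include <string.h>

enum { BUF_ELEMS = 8 };  /* hypothetical buffer depth */

typedef struct {
    int bank[2][BUF_ELEMS];  /* the buffer and its shadow copy */
    int active;              /* index of the bank currently being consumed */
    int next;                /* read position inside the active bank */
} operand_buffer;

/* Refill the bank that is NOT being consumed, e.g. from the register file. */
void refill_shadow(operand_buffer *b, const int *src)
{
    memcpy(b->bank[1 - b->active], src, sizeof b->bank[0]);
}

/* Flip roles: the freshly filled bank becomes active in one step. */
void swap_banks(operand_buffer *b)
{
    b->active = 1 - b->active;
    b->next = 0;
}

/* Hand the next operand to the functional unit. */
int consume(operand_buffer *b)
{
    assert(b->next < BUF_ELEMS);
    return b->bank[b->active][b->next++];
}
```

The point of the pattern is that `refill_shadow` and `consume` never touch the same bank, so the consumer does not stall while new operands arrive.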
  18. Energy efficiency of current implementation?
      First implementation, focus on functionality, not on efficiency
        ● Which is great news, we have a lot of things to do for a second version :)
      What is there: helps performance AND energy
        ● Tail zeroing optimisation
        ● Parallel reductions with reduced inter-lane communication (integer, unordered)
      Against energy efficiency (some increase performance, others simplify hardware)
        ● Speculation
        ● Renaming
        ● Retries
        ● Buffering
        ● Many small implementation details
  19. Issues and Future work
      ISA (RISC-V V-extension) issues
        ● Context: Embedded vs. HPC // short vs. long
        ● Tail handling (requires copying old values when operating with short vector length)
        ● Mask layout (requires moving mask registers across lanes)
        ● Lack of negated mask (requires additional instructions)
        ● Mandatory LMUL support (complicates hardware)
      Leverage vector ISA for clock gating as presented above
      General optimisations
      Evaluate design decisions (ring, load handling)
        ● Inter-lane communication: ring vs crossbar? Pattern dependent
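The tail-handling issue on slide 19 can be illustrated with a small C model. It assumes the result is written to a freshly allocated destination register (as a renaming design might do), so keeping the tail "undisturbed" means copying the old destination's tail elements, whereas zeroing the tail needs no old data. The register length, type, and function names are hypothetical; the slides do not describe the actual mechanism.

```c
#include <assert.h>

enum { MVL = 8 };  /* hypothetical maximum vector length */

typedef enum { TAIL_UNDISTURBED, TAIL_ZEROED } tail_policy;

/* Vector add of the first vl elements into a fresh destination register.
 * Elements vl..MVL-1 are the tail. Returns how many old destination
 * elements had to be copied to honour the tail policy. */
unsigned vadd_tail(const int *vs1, const int *vs2,
                   const int *old_vd, int *new_vd,
                   unsigned vl, tail_policy policy)
{
    assert(vl <= MVL);
    unsigned copied = 0;
    for (unsigned e = 0; e < vl; e++)        /* active (body) elements */
        new_vd[e] = vs1[e] + vs2[e];
    for (unsigned e = vl; e < MVL; e++) {    /* tail elements */
        if (policy == TAIL_UNDISTURBED) {
            new_vd[e] = old_vd[e];           /* extra read and write of old data */
            copied++;
        } else {
            new_vd[e] = 0;                   /* no dependence on the old value */
        }
    }
    return copied;
}
```

With vl = 3 the undisturbed policy copies 5 old elements and the zeroing policy copies none, which matches the slide's point that the cost grows as the vector length gets shorter relative to the register size.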
  20. Thank you!
      For further information please contact oscar.palomar@bsc.es
      www.bsc.es
