Energy efficiency opportunities with long vector architectures

  1. Energy-efficient vector architectures. Oscar Palomar, Barcelona Supercomputing Center. LEGaTO Workshop, 04.09.2020.
  2. Contents of this talk
     ● Intro to Vector architectures
     ● Energy Efficiency & Vector Processors
       – Inherently energy efficient
       – Related past research
     ● Designing a VPU for the European Processor Initiative
       – Overview
       – Opportunities to save energy
  4. Vector Processor: processors that operate on a vector of data with a SINGLE instruction.
     EXAMPLE: for (i=0;i<512;i++) c[i]=a[i]+b[i];
     SCALAR (loop body executed x512): LD r0, @a; LD r1, @b; ADD r2, r0, r1; ST r2, @c; ADD //counter; CMP; BR
     VECTOR (loop body executed x2, each instruction operates on 256 elements): VLD V0, @a; VLD V1, @b; VADD V2, V1, V0; VST V2, @c; ADD; CMP; BR
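The scalar/vector comparison on this slide can be sketched in plain C. This is a minimal emulation, not real vector code: the inner loop of `add_strip_mined` stands in for what a single VLD/VLD/VADD/VST group would do in hardware, and the names `add_scalar`, `add_strip_mined` and `mvl` are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar version: one element per loop iteration.
 * Returns the number of loop iterations executed. */
size_t add_scalar(const int *a, const int *b, int *c, size_t n)
{
    size_t iterations = 0;
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        iterations++;
    }
    return iterations;
}

/* Strip-mined version: each outer iteration stands in for one
 * VLD/VLD/VADD/VST group working on up to mvl elements.
 * The inner loop only emulates the work of one vector instruction.
 * Returns the number of outer loop iterations executed. */
size_t add_strip_mined(const int *a, const int *b, int *c,
                       size_t n, size_t mvl)
{
    assert(mvl > 0);
    size_t iterations = 0;
    for (size_t i = 0; i < n; i += mvl) {
        size_t vl = (n - i < mvl) ? (n - i) : mvl; /* elements this pass */
        for (size_t j = 0; j < vl; j++)
            c[i + j] = a[i + j] + b[i + j];
        iterations++;
    }
    return iterations;
}
```

With n = 512 and mvl = 256, the scalar loop runs 512 times and the strip-mined loop runs twice. At seven instructions per iteration, as listed on the slide, that is 7 x 512 = 3584 dynamic instructions against 7 x 2 = 14.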
  5. VADD V2<-V0,V1. Vector register size: Maximum Vector Length = 8
  6. VADD V2<-V0,V1
  7. VADD V2<-V0,V1
  8. Vector Multi Lane: partitioned vector registers, multiple lock-stepped ALUs
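A rough way to picture the multi-lane organisation is to emulate VADD with a configurable number of lanes and count lock-step cycles. The sketch below assumes an interleaved mapping (element i lives in lane i % lanes), which the slides do not state; `vadd_multilane` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates VADD V2 <- V0, V1 on a VPU with `lanes` lock-stepped ALUs.
 * Assumption: element i lives in lane i % lanes (interleaved mapping).
 * Returns the number of lock-step cycles the addition takes. */
size_t vadd_multilane(const int *v0, const int *v1, int *v2,
                      size_t vl, size_t lanes)
{
    assert(lanes > 0);
    size_t cycles = 0;
    for (size_t base = 0; base < vl; base += lanes) {
        /* In hardware, all lanes would perform this step in the same cycle. */
        for (size_t lane = 0; lane < lanes && base + lane < vl; lane++)
            v2[base + lane] = v0[base + lane] + v1[base + lane];
        cycles++;
    }
    return cycles;
}
```

For the Maximum Vector Length of 8 shown earlier, one lane needs 8 cycles, four lanes need 2, and eight lanes need 1, while the instruction stream stays the same.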
  10. An Efficient Solution
      ● Traditionally designed for high performance
        – The “Vectorian” age of supercomputing
      ● But vector processors are inherently energy-efficient for workloads that exhibit Data-Level Parallelism (DLP)
        – Instruction Fetch, Decode, Dispatch
        – Memory and Register file access
        – Vector data can be correlated → lower activity
      Lemuet, C. et al. (2006). “The potential energy efficiency of vector acceleration”. In SC 2006.
  11. Vector extensions for DBMS. From Timothy Hayes, «Novel Vector Architectures for Data Management», PhD thesis, 2016. Also, «VSR Sort: A Novel Vectorised Sorting Algorithm and Architecture Extensions for Future Microprocessors», HPCA’15.
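The "correlated data → lower activity" point can be illustrated by counting bit flips between consecutive values on a datapath. Using the toggle count as a stand-in for switching activity is an assumption of this sketch, not a model taken from the talk; `popcount32` and `bus_toggles` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in x (portable, no compiler builtins). */
unsigned popcount32(uint32_t x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Total number of bit flips seen on a 32-bit bus that carries
 * v[0], v[1], ..., v[n-1] in consecutive cycles. */
unsigned long bus_toggles(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    unsigned long toggles = 0;
    for (size_t i = 1; i < n; i++)
        toggles += popcount32(v[i] ^ v[i - 1]);
    return toggles;
}
```

A stream of neighbouring values such as 0, 1, 2, 3 flips only 4 bits in total, whereas alternating between 0 and 0xFFFFFFFF flips 96 bits over the same four cycles.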
  12. Design space exploration of vector FU. [Figure: lowest-EDP 1-, 2-, and 4-lane configurations, with clock gating, averaged over benchmarks, for vL=16 and vL=128.] Longer vectors provide more energy efficiency.
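EDP here is the energy-delay product, energy multiplied by execution time, so a lower value is better. Selecting the lowest-EDP configuration from a set of measurements could look like the sketch below; the function names are hypothetical, and any input values are placeholders to be supplied by the reader, not results from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Energy-delay product of one configuration. */
double edp(double energy_joules, double delay_seconds)
{
    return energy_joules * delay_seconds;
}

/* Index of the configuration with the lowest EDP.
 * energy[i] and delay[i] describe configuration i. */
size_t lowest_edp(const double *energy, const double *delay, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (edp(energy[i], delay[i]) < edp(energy[best], delay[best]))
            best = i;
    return best;
}
```

Because EDP weighs energy and delay equally, a configuration that spends somewhat more energy can still win if it finishes sufficiently sooner.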
  13. Clock gating techniques for vectors (FP FMA unit)
      ● Scalar Operand Clock-Gating (ScalarCG)
      ● Implicit Scalar Operand Clock-Gating (ImplCG)
      ● Vector Masking and Vector Multi-Lane-Aware Clock-Gating (MaskCG)
      ● Input Data Aware Clock-Gating (InputCG)
      ● Idle Unit Clock-Gating (IdleCG)
      ISLPED’16: Ivan Ratković, Oscar Palomar, Milan Stanić, Osman Unsal, Adrian Cristal, and Mateo Valero, “A Fully Parameterizable Low Power Design of Vector Fused Multiply-Add Using Active Clock-Gating Techniques”
  14. Scalar Operand Clock-Gating (ScalarCG)
      ● Fixed operands during the instruction
      ● Gating information derived from OPCODE
      ● Impl: implicit operand
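One possible reading of ScalarCG, sketched in C: in a vector-scalar multiply-add the scalar operand stays fixed for the whole instruction, so the register that holds it only needs to be clocked once. The sketch counts clock events on that register with and without gating. It uses integers instead of floating point to keep results exact, and `vfma_scalar` and its parameters are hypothetical; the actual circuit in the cited paper may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Emulates a vector-scalar multiply-add d[i] = a[i] * s + c[i] and
 * returns how many times the input register holding s is clocked.
 * scalar_cg == false: the register is re-latched for every element.
 * scalar_cg == true : the register is latched once, then gated,
 *                     because s cannot change during the instruction. */
size_t vfma_scalar(const int *a, int s, const int *c, int *d,
                   size_t vl, bool scalar_cg)
{
    assert(vl == 0 || (a && c && d));
    size_t latch_events = 0;
    int s_reg = 0;
    for (size_t i = 0; i < vl; i++) {
        if (!scalar_cg || i == 0) {
            s_reg = s;       /* clock edge reaches the operand register */
            latch_events++;
        }
        d[i] = a[i] * s_reg + c[i];
    }
    return latch_events;
}
```

For a vector length of 8 the ungated version clocks the scalar operand register 8 times and the gated version once, with identical results. In this model the saving grows linearly with vector length.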
  15. EPI VPU: European Processor Initiative. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 826647.
  16. EPAC ARCHITECTURE VIEW
      ● Up to 8 vector processors
      ● VPU (x8 vector lanes) acts as a tightly coupled acceleration unit to the scalar core in the vector processor
      ● RISC-V vector extension compliant
      [Block diagram: accelerator tile containing the vector processor (scalar core, L1$, LSU, VLSU, VRF, vector lanes, lane interconnect), specialized units (STX, NTX, SPM, DMA), the VaRiable Precision Unit (scalar core, L1$, LSU, variable precision co-proc), L2$, GPP NoC bridges, and NoC crosspoints.]
      Copyright © European Processor Initiative 2019. RISC-V Scientific Day / Paris / 03-10-2019
  17. VECTOR LANE MICROARCHITECTURE (SIMPLIFIED VIEW)
      [Diagram labels: register file banks; Buffer A, Buffer B, Buffer C; Mem. Store Buffer; Mem. Load Buffer; WriteBack Buffer; to/from Load/Store Unit.]
      ● Load buffer: holds data coming from memory. It is replicated and managed by a dedicated load unit to handle out-of-order element transfers.
      ● WriteBack buffer: holds operation results.
      ● Buffers A, B, C: operand buffers. They are doubled (shadowed) to allow single-cycle full refill while the shadow buffer is being consumed.
      ● Store buffer: holds data to be sent to memory.
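The doubled (shadowed) operand buffers follow the classic double-buffering pattern: one bank is consumed while the other is filled, and swapping roles is a constant-time operation. Below is a minimal software sketch of that idea; the type `operand_buffer`, the depth `BUF_ELEMS` and the function names are hypothetical and do not describe the actual EPAC RTL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BUF_ELEMS = 8 };  /* hypothetical buffer depth */

typedef struct {
    int bank[2][BUF_ELEMS];
    int active;            /* bank currently read by the functional unit */
} operand_buffer;

/* Fill the shadow bank while the active bank is still being consumed. */
void buffer_fill_shadow(operand_buffer *buf, const int *src, size_t n)
{
    assert(n <= BUF_ELEMS);
    memcpy(buf->bank[1 - buf->active], src, n * sizeof src[0]);
}

/* Swap roles: the freshly filled shadow bank becomes active at once. */
void buffer_swap(operand_buffer *buf)
{
    buf->active = 1 - buf->active;
}

/* Read element i from the bank that is currently active. */
int buffer_read(const operand_buffer *buf, size_t i)
{
    assert(i < BUF_ELEMS);
    return buf->bank[buf->active][i];
}
```

Filling the shadow bank never disturbs values still being read from the active bank, which is what lets the consumer switch to a full set of new operands in a single step.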
  18. Energy efficiency of current implementation?
      First implementation: focus on functionality, not on efficiency
      ● Which is great news, we have a lot of things to do for a second version :)
      What is there: helps performance AND energy
      ● Tail zeroing optimisation
      ● Parallel reductions with reduced inter-lane communication (integer, unordered)
      Against energy efficiency (some increase performance, others simplify hardware)
      ● Speculation
      ● Renaming
      ● Retries
      ● Buffering
      ● Many small implementation details
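The "parallel reductions with reduced inter-lane communication" item can be pictured as a two-phase sum: each lane first accumulates its own elements locally, and only the per-lane partial results cross lane boundaries. The sketch assumes element i is stored in lane i % lanes and uses hypothetical names (`reduce_sum`, `MAX_LANES`); it is not the EPAC implementation. Reordering the additions this way is only valid when order does not matter, which matches the slide's restriction to integer and unordered reductions.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_LANES = 8 };

/* Sums v[0..vl) on a VPU with `lanes` lanes.
 * Assumption: element i is stored in lane i % lanes.
 * Each lane first reduces its own elements locally (no communication),
 * then the per-lane partial sums are combined.
 * *transfers receives the number of values that cross lane boundaries. */
long reduce_sum(const int *v, size_t vl, size_t lanes, size_t *transfers)
{
    assert(lanes > 0 && lanes <= MAX_LANES);
    long partial[MAX_LANES] = {0};

    for (size_t i = 0; i < vl; i++)
        partial[i % lanes] += v[i];      /* lane-local accumulation */

    long total = partial[0];
    *transfers = 0;
    for (size_t lane = 1; lane < lanes; lane++) {
        total += partial[lane];          /* one value leaves its lane */
        (*transfers)++;
    }
    return total;
}
```

With 8 lanes, summing 16 elements moves 7 values between lanes in this model, and that count stays at lanes - 1 no matter how long the vector is.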
  19. Issues and Future work
      ISA (RISC-V V-extension) issues
      ● Context: Embedded vs. HPC // short vs. long
      ● Tail handling (requires copying old values when operating with short vector length)
      ● Mask layout (requires moving mask registers across lanes)
      ● Lack of negated mask (requires additional instructions)
      ● Mandatory LMUL support (complicates hardware)
      Leverage vector ISA for clock gating as presented above
      General optimisations
      Evaluate design decisions (ring, load handling)
      ● Inter-lane communication: ring vs crossbar? Pattern dependent
  20. Thank you! For further information please contact
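The "lack of negated mask" issue shows up in vectorised if/else code: when a masked instruction can only execute where the mask is true, the else-branch needs an inverted copy of the mask, which costs an extra instruction. The C emulation below illustrates this; `vadd_masked`, `mask_not` and `select_add` are hypothetical helpers, not RISC-V intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Masked add: dst[i] = a[i] + b[i] where mask[i] is true; other
 * elements of dst are left untouched. Only "mask true" is supported,
 * mirroring an ISA without a negated-mask encoding. */
void vadd_masked(int *dst, const int *a, const int *b,
                 const bool *mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = a[i] + b[i];
}

/* Extra instruction needed to obtain the inverted mask. */
void mask_not(bool *dst, const bool *src, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = !src[i];
}

/* if (m[i]) d[i] = a[i] + b[i]; else d[i] = a[i] + c[i];
 * Returns the number of emulated vector instructions issued. */
size_t select_add(int *d, const int *a, const int *b, const int *c,
                  const bool *m, bool *scratch, size_t vl)
{
    assert(scratch != m);
    vadd_masked(d, a, b, m, vl);        /* then-branch */
    mask_not(scratch, m, vl);           /* additional instruction */
    vadd_masked(d, a, c, scratch, vl);  /* else-branch */
    return 3;
}
```

If the ISA could encode "execute where the mask is false", the `mask_not` step and its scratch mask register would be unnecessary, and the sequence would shrink from three vector instructions to two.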