Energy Efficient Coarse-Grain Reconfigurable Array for Accelerating Digital Signal Processing


  Pasquale Corsonello...
RCIM 2008 - - UniCal


  1. Energy Efficient Coarse-Grain Reconfigurable Array for Accelerating Digital Signal Processing
     Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza, Stefania Perri, Paolo Zicari
     Department of Electronics, Computer Science and Systems (DEIS), University of Calabria, Rende (CS)
  2. Outline
     Motivation
     The proposed Coarse Grain Reconfigurable Array (CGRA)
       Architectural overview
       Computational model
     Post Layout Results
     Comparison
     Conclusion
  3. The Challenge
     Digital Signal Processing (DSP) is now used extensively in several applications
       Multimedia
       Image analysis and processing
       Speech processing
       Wireless communication
     These applications impose strict hardware requirements
       High performance
         Real-time operations
         High computational load
         Intensive arithmetic operations (add, sub, shift, mult, mult-acc)
       Energy efficiency
         Portable devices
       Flexibility
         Support multiple applications
         Keep pace with the rapid evolution of the algorithms
  4. Executing DSP on various architectures
     [Figure: spectrum of architectures. Labels: Full Custom Solutions; Reconfigurable Computing (CGRA, FPGA); General Purpose Processors & Programmable Digital Signal Processors. Arrows: Increasing Flexibility, Increasing Performance]
     Reconfigurable computing architectures provide an intermediate tradeoff between flexibility and performance
  5. Reconfigurable Computing
     FPGAs are very flexible, ...
       Gate-level functions
       General routing
     ... but the flexibility is very expensive
       FPGAs are slower than ASICs, have lower logic density and are inefficient for word operations
       Long reconfiguration time
     CGRAs use multi-bit-wide PEs and routing structures that are more speed-, area- and power-efficient
       Compromise between programmability and fixed functionality
       Flexible and efficient within an application domain
  6. Architectural Overview
     [Figure: array block diagram. Labels: External Memory Interface, Host Interface, I/O Data & Configuration Central Controller, Addr., Data, Config. & Elab. Data, Config. Data, Elab. Data, Latched Programmable Switches, and a grid of Reconfigurable Cells, each made of a RAM and a PE]
     Distributed small RAMs and a purpose-designed interconnection scheme to achieve high performance
     Run-time reconfigurable cells to achieve high flexibility within the target application domain
     Distributed control logic to reduce control complexity and enhance data parallelism
  7. The Reconfigurable Cell
     I/O interface similar to a conventional RAM
       2 input/output data ports
       2 input address ports
       1 output address port
       I/O control signals
     Dual Port SRAM (256*8-bit) data memory
     Reconfigurable 8-bit PE
     Internal Control Unit
     Two operative states
       Loading
       Executing
     [Figure: RC block diagram. Labels: AddrA/B_ext, Data_InA/B_ext, Input Stage, RAM Interface, Config. Data, Dual Port SRAM (256*8-bit), Control Unit, Control Signals, Config. Mem, PE (8-bit), Output Stage, Addr_Out_ext, Data_OutA/B_ext]
  8. Functionality of the RC in the executing state
     [Figure: four RAM/PE configurations, panels (a) to (d)]
       a) feed-forward mode
       b) feed-back mode
       c) route-through mode
       d) route-through mode (double throughput)
  9. The Processing Element
     Single clock cycle operations
       ADD, SUB, ACC, INC, DEC, MUL, MUL-ACC, SHIFT
     Fast and low-cost
     [Figure: PE datapath. Labels: A-Register (8-bit), B-Register (8-bit), constant inputs (0001, 00000001, 00000000, 0000), selectors S0 to S7, MULT1 (8x4-bit), MULT2 (8x4-bit), HA-based Compressor (4-bit), 3:2 (FA-based) Compressor (8-bit), Adder1 (4-bit), Adder2 (8-bit), Adder3 (4-bit), carries co1 and co2, S7=cin, output Registers (4-bit, 8-bit, 4-bit) driving O[15:12], O[11:4], O[3:0]]
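
The PE diagram shows two 8x4-bit multipliers (MULT1, MULT2) feeding compressors and adders that produce a 16-bit output. The arithmetic behind such a split can be sketched as below. This is only an illustration: it assumes unsigned operands and that the B operand is the one split into two 4-bit halves, neither of which the slide states, and the function name is hypothetical.

```python
def mul8x8_from_8x4(a: int, b: int) -> int:
    """Unsigned 8x8-bit product built from two 8x4-bit partial products."""
    assert 0 <= a < 256 and 0 <= b < 256
    b_low = b & 0xF           # low 4 bits of B (assumed input of one multiplier)
    b_high = (b >> 4) & 0xF   # high 4 bits of B (assumed input of the other)
    partial_low = a * b_low    # 8x4-bit product, at most 12 bits
    partial_high = a * b_high  # 8x4-bit product, weighted by 2**4
    # Summing the aligned partial products gives the full 16-bit result.
    return (partial_low + (partial_high << 4)) & 0xFFFF
```

In hardware the final sum would be carried out by the compressor and adder chain rather than a single addition; the sketch only checks that two 8x4-bit products are enough to form the 8x8-bit result.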
  10. The Control Unit
     Instructions define the execution of vector/block operations on a large data stream
     Each instruction consists of several fields: op_code | #ops | Address Descriptors
       op_code specifies the operation code
       #ops specifies the number of operations to be performed in the current instruction
       address descriptors specify the data organization in the memory
     [Figure: Control Unit block diagram. Labels: Configuration Data, Config. Memory, Instr. Counter, Instruction Decoder, Handshake & Elab. Control, Addresses Generator, AddrA_int, AddrB_int, Addr_ext, Handshake Signals, PE & I/O control signals]
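
To make the three instruction fields concrete, here is a minimal sketch of one vector instruction being executed. Everything beyond the field names op_code and #ops is an assumption made for illustration: the class and function names are hypothetical, the address descriptor is reduced to a base and a step, and results are truncated to 16 bits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AddressDescriptor:
    base: int  # first address accessed
    step: int  # increment applied after every access

@dataclass
class Instruction:
    op_code: str                 # operation name, e.g. "ADD"
    num_ops: int                 # the "#ops" field
    addr_a: AddressDescriptor    # operand A in the local RAM (AddrA_int)
    addr_b: AddressDescriptor    # operand B in the local RAM (AddrB_int)
    addr_ext: AddressDescriptor  # destination address sent out (Addr_ext)

# Hypothetical op_code table; results truncated to 16 bits (an assumption).
OPERATIONS: Dict[str, Callable[[int, int], int]] = {
    "ADD": lambda a, b: (a + b) & 0xFFFF,
    "SUB": lambda a, b: (a - b) & 0xFFFF,
    "MUL": lambda a, b: (a * b) & 0xFFFF,
}

def execute(instr: Instruction, ram: List[int], out: Dict[int, int]) -> None:
    """Apply one operation num_ops times, walking the three address streams."""
    op = OPERATIONS[instr.op_code]
    a, b, d = instr.addr_a.base, instr.addr_b.base, instr.addr_ext.base
    for _ in range(instr.num_ops):
        out[d] = op(ram[a], ram[b])
        a += instr.addr_a.step
        b += instr.addr_b.step
        d += instr.addr_ext.step

ram = [1, 2, 3, 4, 10, 20, 30, 40]
out: Dict[int, int] = {}
vector_add = Instruction("ADD", 4, AddressDescriptor(0, 1),
                         AddressDescriptor(4, 1), AddressDescriptor(0, 1))
execute(vector_add, ram, out)
```

A single instruction therefore stands for a whole vector operation, which is why no per-element instruction fetch is needed.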
  11. The Address Generator
     [Figure: address generator datapath. Labels: base_address, step, skip, subset, step_register, skip_register, down counter, =0, end_subset, control_signal, addr_register, address_calculation_adder, current_address]
     Supported scan modes
       Continuous vector forward scan (Step=1, Subset=8, Skip=0)
       Continuous vector reverse scan (Step=-1, Subset=8, Skip=0)
       Sparse vector forward scan (Step=2, Subset=4, Skip=0)
       Sparse vector reverse scan (Step=-2, Subset=4, Skip=0)
       Rotating vector forward scan (Step=1, Subset=8, Skip=-7)
       Rotating vector reverse scan (Step=-1, Subset=8, Skip=+7)
       Continuous vector (column mode) forward/reverse scan (Step=n/-n, Subset=8, Skip=0)
       Sparse vector (column mode) forward/reverse scan (Step=2n/-2n, Subset=4, Skip=0)
       Block scan (forward/reverse mode) (Step=1/-1, Subset=3, Skip=n-3/-n+3)
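
The step/subset/skip parameters can be illustrated with a short generator. The slide does not spell out the exact update rule, so this sketch assumes one reading that is consistent with the block-scan parameters (Step=1, Subset=3, Skip=n-3): the step is added after every access, and the skip is added on top of it whenever the down counter reaches zero. The function name is hypothetical.

```python
from itertools import islice
from typing import Iterator, List

def address_scan(base_address: int, step: int, subset: int, skip: int) -> Iterator[int]:
    """Yield addresses forever, following the assumed step/subset/skip rule."""
    current_address = base_address
    down_counter = subset
    while True:
        yield current_address
        down_counter -= 1
        current_address += step
        if down_counter == 0:        # end_subset: apply the extra skip
            current_address += skip
            down_counter = subset

def first(count: int, scan: Iterator[int]) -> List[int]:
    """Collect the first `count` addresses of a scan."""
    return list(islice(scan, count))

n = 8  # assumed row length of the data block stored in the RAM
block = first(6, address_scan(0, 1, 3, n - 3))  # 3-wide block scan
sparse = first(5, address_scan(0, 2, 4, 0))     # sparse vector forward scan
column = first(4, address_scan(0, n, 8, 0))     # column-mode forward scan
```

With n = 8 the block scan visits addresses 0, 1, 2 and then jumps to 8, 9, 10, that is, the first three elements of two consecutive rows.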
  12. The Interconnection Topology
     [Figure: a cell and its eight neighbors (NW, N, NE, W, E, SW, S, SE). Labels: neighbor interconnections, interleaved interconnections, N-bit, 2N-bit, Programmable Latched Switches]
  13. Applications Mapping: Block-level pipelining
     [Figure: chain RAM(i-1), PE(i-1), RAM(i), PE(i), RAM(i+1), PE(i+1) with a timeline in which each cell starts one phase after the previous one]
       RC(i-1): Load Execute Load Execute Load Execute Load
       RC(i):        Load Execute Load Execute Load Execute
       RC(i+1):           Load Execute Load Execute Load
     The computation is organized in concurrently executing kernels
     Each kernel is implemented by an RC
     A kernel consumes a set of input data, performs one or more computations, and produces a set of output data
     RCs communicate by sending addressed packets of data
     Memory data loading of each cell overlaps with data production by the previous cell
     An execution starts as soon as all necessary input data are available
     Data synchronization is realized by handshake signals
     No explicit temporal scheduling of execution is required
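
The Load/Execute timeline of the block-level pipeline can be reproduced with a small scheduling sketch. It assumes that Load and Execute phases take one time slot each and that every cell starts one slot after its predecessor. On the real array the handshake signals decide when a phase starts, so equal phase lengths are a simplification. The function name is hypothetical.

```python
from typing import List

def pipeline_schedule(num_cells: int, num_slots: int) -> List[List[str]]:
    """Return, for every cell, the phase it is in during each time slot."""
    schedule = []
    for cell in range(num_cells):
        row = []
        for slot in range(num_slots):
            if slot < cell:
                row.append("-")        # cell has not received data yet
            elif (slot - cell) % 2 == 0:
                row.append("Load")     # filling the local RAM
            else:
                row.append("Execute")  # PE produces data for the next cell
        schedule.append(row)
    return schedule

for cell, row in enumerate(pipeline_schedule(3, 7)):
    print(f"RC({cell}): " + " ".join(f"{phase:8s}" for phase in row))
```

In every slot where a cell executes, the next cell is loading, which is the overlap described in the slide.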
  14. Applications Mapping: Flexible computational load balancing
     [Figure: two mappings of RAM(1) to RAM(4) and PE(1) to PE(4), labeled Data parallel and Function parallel]
     Parallelism in both vertical/temporal and horizontal/spatial directions
     Horizontal computational load balancing is achieved via data parallelism
     Vertical computational load balancing is achieved by increasing the number of pipeline stages
  15. Architecture evaluation
     Hardware-assisted simulation environment developed using a XILINX XC4VLX200 device
       The implemented system includes 64 RCs organized in 4x4 quadrants
       The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr RGB, 2D-DCT, 2D-FIR)
     Physical evaluation for the ST 90nm CMOS technology
       Reconfigurable Cell
         Synthesis done with Synopsys Design Compiler
         Physical design done with Cadence SoC Encounter, also considering manufacturing issues (such as DRCs and antennas) and Signal Integrity (SI) issues
       Interconnections
         Preliminary electrical simulations were performed
     Obtained results were compared to the 90nm CMOS Virtex-4 FPGA
  16. RC Layout
     [Figure: RC layout. Labeled regions: Input Stage, Dual Port SRAM (256*8-bit), RAM Interface, Configuration Memory, PE, Control Unit, Output Stage]
       Technology                 CMOS 90nm
       Supply voltage             1.0 V
       Frequency                  1 GHz
       Core Area                  79.52 um2
       Avg. Dyn. Power @1 GHz     20 mW
       Leakage Power              627.6 uW
  17. Resource usage/energy/performance trade-off comparisons: proposed array vs. Xilinx Virtex-4
     Throughput and energy efficiency refer to an 8*8-image block.

     Proposed Reconfigurable Array
       Algorithm                 Resources              Area [mm2]   Throughput [MOPS]   Energy Efficiency [MOPS/W]
       Color Space Conversion    13 RCs                 1.034        13.3                45.9
       2D separable 4x4 FIR      20 RCs                 1.590        10.5                23.9
       2D-DCT (8x8)              22 RCs                 1.749        10.2                20.8

     Virtex-4 FPGA (CORE Generator)
       Algorithm                 Resources              Area [mm2]   Throughput [MOPS]   Energy Efficiency [MOPS/W]
       Color Space Conversion    436 Slices + 2 BRAM    1.572        1.7                 29.1
       2D separable 4x4 FIR      440 Slices + 2 BRAM    1.657        1.3                 18.4
       2D-DCT (8x8)              786 Slices + 3 BRAM    2.919        2.1                 14.2

     Speedups ranging from 4.8X to 8X
     Energy efficiency improvement ranging from 24% to 58%
     Area saving up to 40%
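
The summary figures can be cross-checked against the table with a few ratios. The sketch below uses only the numbers in the table. The formulas (throughput ratio, relative gain in MOPS/W, relative area reduction) are the natural definitions but are an assumption, since the slide does not state how its percentages were derived, so the results may differ slightly from the rounded values it quotes.

```python
# (RCs, area mm2, MOPS, MOPS/W) for the proposed array, then (area mm2, MOPS, MOPS/W) for Virtex-4
TABLE = {
    "Color Space Conversion": ((13, 1.034, 13.3, 45.9), (1.572, 1.7, 29.1)),
    "2D separable 4x4 FIR":   ((20, 1.590, 10.5, 23.9), (1.657, 1.3, 18.4)),
    "2D-DCT (8x8)":           ((22, 1.749, 10.2, 20.8), (2.919, 2.1, 14.2)),
}

def derived_metrics(cgra, fpga):
    """Compute ratios between the proposed array and the FPGA for one benchmark."""
    rcs, area, mops, efficiency = cgra
    fpga_area, fpga_mops, fpga_efficiency = fpga
    return {
        "speedup": mops / fpga_mops,
        "energy_gain_pct": 100.0 * (efficiency / fpga_efficiency - 1.0),
        "area_saving_pct": 100.0 * (1.0 - area / fpga_area),
        "area_per_rc_mm2": area / rcs,
    }

results = {name: derived_metrics(cgra, fpga) for name, (cgra, fpga) in TABLE.items()}
for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Computed this way, the smallest speedup (2D-DCT) is just under 4.9X, the largest (FIR) just over 8X, the largest area saving (2D-DCT) about 40%, and the area per RC about 0.0795 mm2 in all three rows.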
  18. Conclusion
     Presented the VLSI implementation of a new coarse-grain reconfigurable architecture optimized for high-throughput DSP applications
       Performance improvement at a low cost
         Exploits spatial and temporal parallelism
         High arithmetic processing capability
         High-bandwidth, low-latency memory access
     Performance/energy/area evaluations for representative tasks belonging to the target application domain
     Obtained results demonstrate significant advantages with respect to a conventional FPGA
       Speedups ranging from 4.8X to 8X
       Energy efficiency improvement ranging from 24% to 58%
       Area saving up to 40%
