Energy Efficient Coarse-Grain Reconfigurable Array
for Accelerating Digital Signal Processing

Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza, Stefania Perri, Paolo Zicari



Department of Electronics, Computer Science and Systems (DEIS)
University of Calabria, Rende (CS)
Outline

  Motivation

  The proposed Coarse Grain Reconfigurable
  Array (CGRA)
    Architectural overview
    Computational model
    Post Layout Results
    Comparison

  Conclusion
The Challenge
Nowadays, Digital Signal Processing (DSP) is extensively used for
   several applications

    Multimedia
    Image analysis and processing
    Speech processing
    Wireless communication

These applications impose strict hardware requirements

    High performance
       Real-time operations
       High computational load
           Intensive arithmetic operations
              (add, sub, shift, mult, mult-acc)
    Energy-efficiency
       Portable devices
    Flexibility
       Support multiple applications
       Keep pace with rapidly evolving algorithms
Executing DSP on various architectures
  [Figure: spectrum of architectures for DSP. Full-custom solutions,
  reconfigurable computing (CGRA, FPGA), and general-purpose processors &
  programmable digital signal processors, placed along two axes: increasing
  flexibility and increasing performance.]

  Reconfigurable computing architectures provide an
  intermediate trade-off between flexibility and performance
Reconfigurable Computing

  FPGAs are very flexible, …
    Gate-level functions
    General routing
  …, but this flexibility is very expensive
    FPGAs are slower than ASICs, have lower logic density,
    and are inefficient for word-level operations
    Long reconfiguration time
  CGRAs use multi-bit-wide PEs and more
  speed-, area-, and power-efficient routing
  structures
    A compromise between programmability and fixed functionality
    Flexible and efficient within an application domain
Architectural Overview
  [Figure: array of Reconfigurable Cells (each a RAM plus a PE) linked by
  latched programmable switches. A central I/O data & configuration
  controller connects the array to a host interface (configuration and
  elaboration data) and to an external memory interface (address and data).]

  Distributed small RAMs and a purpose-designed interconnection
  scheme to achieve high performance
  Run-time reconfigurable cells to achieve high flexibility within the
  target application domain
  Distributed control logic to reduce control complexity and enhance
  data parallelism
The Reconfigurable Cell
  I/O interface similar to a conventional RAM
    2 input/output data ports
    2 input address ports
    1 output address port
    I/O control signals
  Dual Port SRAM (256*8-bit) data memory
  Reconfigurable 8-bit PE
  Internal Control Unit
    Two operative states
       Loading
       Executing

  [Figure: RC block diagram. Input Stage (AddrA/B_ext, Data_InA/B_ext),
  RAM Interface, Dual Port SRAM (256*8-bit), PE (8-bit), Output Stage
  (Data_OutA/B_ext, Addr_Out_ext), and a Control Unit with configuration
  memory that receives the configuration data and drives the control signals.]
Functionality of the RC in the executing state

  [Figure: four RAM/PE datapath configurations, labeled (a) to (d).]

   a)   feed-forward mode;
   b)   feed-back mode;
   c)   route-through mode;
   d)   route-through mode (double throughput)
The Processing Element
  Single clock cycle operations
    ADD, SUB, ACC, INC, DEC, MUL, MUL-ACC, SHIFT
  Fast and low-cost

  [Figure: PE datapath. A-Register and B-Register (8-bit each) feed, through
  multiplexers with select signals S0 to S6, two 8X4-bit multipliers (MULT1,
  MULT2). Their outputs pass through an HA-based compressor (4-bit) and a
  3:2 FA-based compressor (8-bit) into Adder1 (4-bit, carry-in S7),
  Adder2 (8-bit), and Adder3 (4-bit), chained by carries co1 and co2.
  Output registers of 4, 8, and 4 bits hold O[3:0], O[11:4], and O[15:12].]
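
The slide shows two 8X4-bit multipliers (MULT1, MULT2) and a 16-bit result split into O[3:0], O[11:4], and O[15:12]. A minimal sketch of how an 8x8-bit product could be assembled from two 8x4-bit partial products is given below. It assumes that each multiplier takes one 4-bit half of the same operand, which the slide does not state; the function names `mul8x4`, `mul8x8`, and `split_output` are hypothetical, and the compressor/adder tree is replaced by a plain addition.

```python
def mul8x4(a8: int, b4: int) -> int:
    """Model of one 8x4-bit multiplier: 8-bit operand times 4-bit operand."""
    if not (0 <= a8 < 256 and 0 <= b4 < 16):
        raise ValueError("operands out of range")
    return a8 * b4  # fits in 12 bits


def mul8x8(a: int, b: int) -> int:
    """Unsigned 8x8-bit product built from two 8x4-bit partial products.

    Assumption: one multiplier handles the low nibble of b, the other
    the high nibble; the high partial product is weighted by 2**4.
    """
    low = mul8x4(a, b & 0xF)
    high = mul8x4(a, (b >> 4) & 0xF)
    return (low + (high << 4)) & 0xFFFF


def split_output(o: int) -> tuple[int, int, int]:
    """Split a 16-bit result into the fields O[15:12], O[11:4], O[3:0]."""
    return (o >> 12) & 0xF, (o >> 4) & 0xFF, o & 0xF


if __name__ == "__main__":
    product = mul8x8(200, 123)
    print(product, split_output(product))
```

The field split mirrors the three output registers (4, 8, and 4 bits) drawn under the three adders.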
The Control Unit

 Instructions define the execution of vector/block
 operations on a large data stream
 Each instruction consists of several fields
     op_code specifies the operation code;
     #ops specifies the number of operations
     to be performed in the current instruction;
     address descriptors specify the data
     organization in the memory.

 [Figure: control unit. Configuration data fills the configuration memory,
 which is indexed by an instruction counter. Each instruction (op_code,
 #ops, address descriptors) feeds the Instruction Decoder (PE & I/O control
 signals), the Addresses Generator (AddrA_int, AddrB_int, Addr_ext), and
 the Handshake & Elab. Control block (handshake signals).]
The Address Generator
  [Figure: address generator datapath. base_address, step_register, and
  skip_register feed the address_calculation_adder; addr_register holds
  current_address; a subset down counter compared with 0 raises end_subset;
  a control_signal steers the update.]

  Scan modes (address descriptor settings)
    Continuous vector forward scan (Step=1, Subset=8, Skip=0)
    Continuous vector reverse scan (Step=-1, Subset=8, Skip=0)
    Sparse vector forward scan (Step=2, Subset=4, Skip=0)
    Sparse vector reverse scan (Step=-2, Subset=4, Skip=0)
    Rotating vector forward scan (Step=1, Subset=8, Skip=-7)
    Rotating vector reverse scan (Step=-1, Subset=8, Skip=+7)
    Continuous vector (column mode) forward/reverse scan (Step=n/-n, Subset=8, Skip=0)
    Sparse vector (column mode) forward/reverse scan (Step=2n/-2n, Subset=4, Skip=0)
    Block scan (forward/reverse mode) (Step=1/-1, Subset=3, Skip=n-3/-n+3)
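
The Step/Subset/Skip settings listed on this slide can be explored with a small software model. The sketch below is one reading of the descriptors, not the documented hardware behavior: it assumes the address advances by Step after every access and that, when the subset down counter reaches zero, Skip is added on top of that Step. This reading reproduces a 3-wide block scan over rows of length n with Skip=n-3; how the rotating modes behave in the real design may differ. The names `address_stream` and `scan` are hypothetical, and no wrap-around at the 256-word memory boundary is modeled.

```python
from itertools import islice
from typing import Iterator


def address_stream(base: int, step: int, subset: int, skip: int) -> Iterator[int]:
    """Yield an endless address sequence from a (step, subset, skip) descriptor.

    Assumed rule: add `step` after every access; after each group of
    `subset` accesses, additionally add `skip`.
    """
    addr = base
    remaining = subset
    while True:
        yield addr
        addr += step
        remaining -= 1
        if remaining == 0:      # end_subset: counter reached zero
            addr += skip
            remaining = subset  # reload the down counter


def scan(base: int, step: int, subset: int, skip: int, count: int) -> list[int]:
    """Return the first `count` addresses of the stream."""
    return list(islice(address_stream(base, step, subset, skip), count))


if __name__ == "__main__":
    n = 8  # hypothetical row length of an image stored row by row
    print("continuous forward:", scan(0, 1, 8, 0, 8))
    print("sparse forward:    ", scan(0, 2, 4, 0, 4))
    print("column mode:       ", scan(0, n, 8, 0, 8))
    print("block scan 3x3:    ", scan(0, 1, 3, n - 3, 9))
```

With n = 8, the block scan visits three consecutive addresses in each of three successive rows, which is the access pattern a 3x3 neighborhood operator needs.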
The Interconnection Topology
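  [Figure: a cell and its eight surrounding cells (NW, N, NE, W, E, SW, S,
  SE). Legend: neighbor interconnections and interleaved interconnections;
  link widths are labeled N-bit and 2N-bit.]

  Programmable Latched Switches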
Applications Mapping: Block-level pipelining

  [Figure: a chain of cells RC(i-1), RC(i), RC(i+1), each a RAM plus a PE,
  with a timeline in which every cell alternates Load and Execute phases and
  each cell is offset by one phase from the cell before it.]

  The computation is organized in concurrently executing kernels
       Each kernel is implemented by an RC

  A kernel consumes a set of input data, performs one or more
  computations, and produces a set of output data
       RCs communicate by sending addressed packets of data.
       The memory loading of each cell overlaps with the data production of
       the previous cell

  An execution is performed as soon as all necessary input data are
  available
       Data synchronization is realized by handshake signals
       No explicit temporal scheduling of execution is required
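
The staggered Load/Execute timeline of this slide can be reproduced with a few lines of Python. This is an idealized sketch: it assumes that every phase lasts exactly one time slot and that cell k starts k slots after the first cell, whereas in the architecture the alignment results from handshaking rather than from a precomputed schedule. The function name `pipeline_timeline` and the phase labels are hypothetical.

```python
def pipeline_timeline(num_cells: int, num_blocks: int) -> list[list[str]]:
    """Return one row of phases per cell for a chain of cells.

    Assumption: cell k loads block b in slot 2*b + k and executes it in
    slot 2*b + k + 1, so its load coincides with the execute phase of
    cell k-1, which produces the data being loaded.
    """
    total_slots = 2 * num_blocks + num_cells - 1
    rows = []
    for k in range(num_cells):
        row = ["idle"] * total_slots
        for b in range(num_blocks):
            row[2 * b + k] = "load"
            row[2 * b + k + 1] = "execute"
        rows.append(row)
    return rows


if __name__ == "__main__":
    for k, row in enumerate(pipeline_timeline(num_cells=3, num_blocks=3)):
        print(f"RC({k})", " ".join(f"{phase:8s}" for phase in row))
```

Each printed row is shifted by one slot relative to the row above it, which is the overlap between loading in one cell and producing in the previous one.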
Applications Mapping: Flexible computational load balancing

  Parallelism in both vertical/temporal and horizontal/spatial directions
     Horizontal comp. load balancing achieved via data parallelism
     Vertical comp. load balancing achieved by increasing the
     number of pipeline stages

  [Figure: two mappings of cells RC(1) to RC(4), each a RAM plus a PE. The
  function-parallel mapping stacks the cells as a vertical pipeline; the
  data-parallel mapping places cells side by side within a stage.]
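
The two balancing options can be compared with a toy throughput model. The sketch below is an illustration under simple assumptions, not a result from the presentation: a pipeline is limited by its slowest stage, replicating a stage across r cells divides its effective time by r (horizontal balancing), and splitting a stage into shorter stages shortens the bottleneck (vertical balancing). The cycle counts and the function name `bottleneck_cycles` are hypothetical.

```python
from typing import Optional, Sequence


def bottleneck_cycles(stage_cycles: Sequence[float],
                      replicas: Optional[Sequence[int]] = None) -> float:
    """Cycles per data block of the slowest stage in a pipeline.

    stage_cycles: cycles each stage needs for one data block.
    replicas: number of cells working in parallel on each stage
              (defaults to one cell per stage).
    """
    if replicas is None:
        replicas = [1] * len(stage_cycles)
    return max(c / r for c, r in zip(stage_cycles, replicas))


if __name__ == "__main__":
    unbalanced = [4, 8, 4]  # hypothetical kernel costs in cycles
    print("unbalanced:", bottleneck_cycles(unbalanced))
    # Horizontal: duplicate the slow stage so two cells share its data.
    print("horizontal:", bottleneck_cycles(unbalanced, [1, 2, 1]))
    # Vertical: split the slow stage into two shorter pipeline stages.
    print("vertical:  ", bottleneck_cycles([4, 4, 4, 4]))
```

In this example both options use four cells and remove the same bottleneck; which one fits depends on whether the slow kernel can be split by data or by function.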
Architecture evaluation
  Hardware-assisted simulation environment
  developed using a XILINX XC4VLX200 device
    The implemented system includes 64 RCs organized in 4x4
    quadrants
    The number of required clock cycles was precisely
    evaluated for different DSP benchmarks (YCbCr RGB,
    2D-DCT, 2D-FIR).
  Physical Evaluation for the ST 90nm CMOS
  technology
    Reconfigurable Cell
       Synthesis done with Synopsys Design Compiler
       Physical Design done with Cadence SoC Encounter, also considering
       manufacturing (such as DRCs and antennas) and Signal Integrity (SI)
       issues.
    Interconnections
       Preliminary electrical simulations were performed
  Obtained results were compared to 90nm CMOS
  Virtex-4 FPGA
RC Layout
  Technology: CMOS 90nm
  Supply voltage: 1.0 V
  Frequency: 1 GHz
  Core Area: 79.52 x 10^3 um2 (0.07952 mm2)
  Avg. Dyn. Power @1 GHz: 20 mW
  Leakage Power: 627.6 uW

  [Figure: RC layout floorplan showing the Input Stage, Dual Port SRAM
  (256*8-bit), RAM Interface, Configuration Memory, PE, Control Unit, and
  Output Stage.]
Resource usage/energy/performance trade-off comparison:
proposed array vs. Xilinx Virtex-4

  Throughput and energy efficiency refer to an 8*8 image block.

                          Proposed Reconfigurable Array                  Virtex-4 FPGA (CORE Generator)
  Algorithm               Resources / Area  Throughput  Energy Eff.     Resources / Area             Throughput  Energy Eff.
                          [mm2]             [MOPS]      [MOPS/W]        [mm2]                        [MOPS]      [MOPS/W]
  Color Space Conversion  13 RCs / 1.034    13.3        45.9            436 Slices + 2 BRAM / 1.572  1.7         29.1
  2D separable 4x4 FIR    20 RCs / 1.590    10.5        23.9            440 Slices + 2 BRAM / 1.657  1.3         18.4
  2D-DCT (8x8)            22 RCs / 1.749    10.2        20.8            786 Slices + 3 BRAM / 2.919  2.1         14.2

  Speedups ranging from 4.8X to 8X
  Energy efficiency improvement ranging from 24% to 58%
  Area saving up to 40%
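
The summary figures can be cross-checked against the table with a short script. The sketch below recomputes throughput ratio, energy-efficiency gain, and area saving per benchmark from the table entries as printed; since those entries are rounded, the recomputed values need not match the quoted ranges exactly. The names `ROWS` and `compare` are hypothetical.

```python
# (algorithm, CGRA area mm2, CGRA MOPS, CGRA MOPS/W,
#  FPGA area mm2, FPGA MOPS, FPGA MOPS/W), copied from the table above
ROWS = [
    ("Color Space Conversion", 1.034, 13.3, 45.9, 1.572, 1.7, 29.1),
    ("2D separable 4x4 FIR",   1.590, 10.5, 23.9, 1.657, 1.3, 18.4),
    ("2D-DCT (8x8)",           1.749, 10.2, 20.8, 2.919, 2.1, 14.2),
]


def compare(row):
    """Return (speedup, energy-efficiency gain, area saving) for one row.

    speedup      = CGRA throughput / FPGA throughput
    energy gain  = CGRA efficiency / FPGA efficiency - 1
    area saving  = 1 - CGRA area / FPGA area
    """
    _, a_cgra, t_cgra, e_cgra, a_fpga, t_fpga, e_fpga = row
    return t_cgra / t_fpga, e_cgra / e_fpga - 1.0, 1.0 - a_cgra / a_fpga


if __name__ == "__main__":
    for row in ROWS:
        speedup, energy_gain, area_saving = compare(row)
        print(f"{row[0]:24s} speedup {speedup:4.1f}X  "
              f"energy +{energy_gain:5.1%}  area -{area_saving:5.1%}")
```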
Conclusion
  Presented VLSI implementation of a new coarse-grain
  reconfigurable architecture optimized for high throughput
  DSP applications
       Performance improvement at a low cost
           Exploit spatial and temporal parallelism
           High arithmetic processing capability
           High-bandwidth, low-latency memory access

  Performance/energy/area evaluations for representative
  tasks belonging to the target application domain

  Obtained results demonstrate significant advantages
  over a conventional FPGA
       Speedups ranging from 4.8X to 8X
       Energy efficiency improvement ranging from 24% to 58%
       Area saving up to 40%

RCIM 2008 - - UniCal

  • 1. Energy Efficient Coarse-Grain Reconfigurable Array for Accelerating Digital Signal Processing Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza, Stefania Perri, Paolo Zicari. Department of Electronics, Computer Science and Systems (DEIS) University of Calabria, Rende (CS)
  • 2. Outline Motivation The proposed Coarse Grain Reconfigurable Array (CGRA) Architectural overview Computational model Post Layout Results Comparison Conclusion
  • 3. The Challenge Nowadays, Digital Signal Processing (DSP) is extensively used for several applications Multimedia Image analysis and processing Speech processing Wireless communication These applications impose strict hardware requirements High performance Real-time operations High computational load Intensive arithmetic operations (add, sub, shift, mult, mult-acc) Energy-efficiency Portable devices Flexibility Support multiple applications Match the rapid evolving of the algorithms
  • 4. Executing DSP on various architectures
      [Figure: spectrum of architectures. Full custom solutions; reconfigurable computing (CGRA, FPGA); general-purpose processors and programmable digital signal processors. Axes: increasing flexibility, increasing performance.]
      Reconfigurable computing architectures provide an intermediate trade-off between flexibility and performance.
  • 5. Reconfigurable Computing
      FPGAs are very flexible:
      - Gate-level functions
      - General routing
      But that flexibility is very expensive:
      - FPGAs are slower than ASICs, have lower logic density, and are inefficient for word-level operations
      - Long reconfiguration time
      CGRAs use multi-bit-wide PEs and routing structures that are more speed-, area- and power-efficient:
      - A compromise between programmability and fixed functionality
      - Flexible and efficient within an application domain
  • 6. Architectural Overview
      [Figure: array of reconfigurable cells, each a RAM plus a PE, drawn as a 6x6 grid and linked by latched programmable switches that carry configuration data and elaboration data. A central controller handles I/O data and configuration, with a host interface and an external memory interface (address, data).]
      - Distributed small RAMs and a purpose-designed interconnection scheme to achieve high performance
      - Run-time reconfigurable cells to achieve high flexibility within the target application domain
      - Distributed control logic to reduce control complexity and enhance data parallelism
  • 7. The Reconfigurable Cell
      I/O interface similar to a conventional RAM:
      - 2 input/output data ports
      - 2 input address ports
      - 1 output address port
      - I/O control signals
      Internal blocks:
      - Dual-port SRAM (256*8-bit) data memory
      - Reconfigurable 8-bit PE
      - Internal control unit
      Two operative states:
      - Loading
      - Executing
      [Figure: block diagram of the cell. Input stage (AddrA/B_ext, Data_InA/B_ext), RAM interface, dual-port SRAM (256*8-bit), control unit with configuration memory, configuration data and control signals, 8-bit PE, output stage (Addr_Out_ext, Data_OutA/B_ext).]
  • 8. Functionality of the RC in the executing state
      [Figure: four RAM/PE configurations, (a) to (d).]
      a) feed-forward mode
      b) feed-back mode
      c) route-through mode
      d) route-through mode (double throughput)
  • 9. The Processing Element
      - Single-clock-cycle operations: ADD, SUB, ACC, INC, DEC, MUL, MUL-ACC, SHIFT
      - Fast and low-cost
      [Figure: PE datapath. 8-bit A-Register and B-Register; input multiplexers with select signals S0 to S6 and constant inputs; two 8x4-bit multipliers (MULT1, MULT2); an HA-based compressor (4-bit) and a 3:2 FA-based compressor (8-bit); Adder1 (4-bit), Adder2 (8-bit) and Adder3 (4-bit) with carry signals co1 and co2 and S7 = cin; output registers of 4, 8 and 4 bits holding O[3:0], O[11:4] and O[15:12].]
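
The slide shows two 8x4-bit multipliers and a 16-bit output split into O[3:0], O[11:4] and O[15:12]. A minimal Python sketch of how an 8x8 product can be assembled from two 8x4 partial products is given here. The way the partial products are combined is an assumption based on standard multiplier decomposition, not a description of the actual compressor and adder wiring, and the function names `mult8x4`, `mul8x8` and `split_output` are hypothetical.

```python
def mult8x4(a: int, nibble: int) -> int:
    """Model of one 8x4-bit multiplier: 8-bit operand times 4-bit operand."""
    assert 0 <= a < 256 and 0 <= nibble < 16
    return a * nibble  # fits in 12 bits


def mul8x8(a: int, b: int) -> int:
    """Assumed decomposition: a*b = a*b[3:0] + (a*b[7:4] << 4)."""
    low = mult8x4(a, b & 0xF)    # partial product from the low nibble of b
    high = mult8x4(a, b >> 4)    # partial product from the high nibble of b
    return (low + (high << 4)) & 0xFFFF


def split_output(o: int) -> tuple:
    """Split a 16-bit result into the three output fields named on the slide."""
    return (o >> 12) & 0xF, (o >> 4) & 0xFF, o & 0xF  # O[15:12], O[11:4], O[3:0]


if __name__ == "__main__":
    product = mul8x8(200, 123)
    print(product, split_output(product))
```

Under this decomposition the low four bits of the result depend only on the low-nibble partial product, which is consistent with a separate 4-bit output field for O[3:0].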
  • 10. The Control Unit
      Instructions define the execution of vector/block operations on a large data stream.
      Each instruction consists of several fields:
      - op_code specifies the operation code
      - #ops specifies the number of operations to be performed in the current instruction
      - address descriptors specify the data organization in the memory
      [Figure: control unit. Blocks: instruction counter, configuration memory (loaded with configuration data), instruction decoder, handshake and elaboration control, address generator. Signals: AddrA_int, AddrB_int, Addr_ext, handshake signals, PE and I/O control signals.]
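
To make the three instruction fields concrete, here is a small Python sketch of one instruction driving a vector operation. It is illustrative only: the slide gives no field widths or encodings, so the `Instruction` and `AddressDescriptor` classes, the base/step form of the descriptors, and the `execute` helper are all hypothetical, and word widths are ignored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AddressDescriptor:
    """Hypothetical descriptor: where a vector starts and how it is strided."""
    base: int
    step: int


@dataclass
class Instruction:
    op_code: str            # operation to perform, e.g. "ADD"
    n_ops: int              # the "#ops" field: how many operations to run
    src_a: AddressDescriptor
    src_b: AddressDescriptor
    dst: AddressDescriptor


# Only two operations are modeled; plain Python integers stand in for PE words.
OPS: Dict[str, Callable[[int, int], int]] = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
}


def execute(instr: Instruction, ram_a: List[int], ram_b: List[int], out: List[int]) -> None:
    """Run one instruction: n_ops element-wise operations over two memories."""
    op = OPS[instr.op_code]
    for i in range(instr.n_ops):
        a = ram_a[instr.src_a.base + i * instr.src_a.step]
        b = ram_b[instr.src_b.base + i * instr.src_b.step]
        out[instr.dst.base + i * instr.dst.step] = op(a, b)


if __name__ == "__main__":
    ram_a = list(range(16))
    ram_b = [10] * 16
    out = [0] * 16
    add4 = Instruction("ADD", 4, AddressDescriptor(0, 2), AddressDescriptor(0, 1), AddressDescriptor(0, 1))
    execute(add4, ram_a, ram_b, out)
    print(out[:4])
```

A single instruction therefore stands for a whole loop, which is how a small configuration memory can describe work on a large data stream.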
  • 11. The Address Generator
      [Figure: address generator datapath. Inputs base_address, step, skip and subset; step_register and skip_register; a down counter whose control_signal flags end_subset when it reaches 0; an address_calculation_adder; addr_register holding current_address.]
      Supported scan patterns:
      - Continuous vector forward scan (Step=1, Subset=8, Skip=0)
      - Continuous vector reverse scan (Step=-1, Subset=8, Skip=0)
      - Continuous vector (column mode) forward/reverse scan (Step=n/-n, Subset=8, Skip=0)
      - Sparse vector forward scan (Step=2, Subset=4, Skip=0)
      - Sparse vector reverse scan (Step=-2, Subset=4, Skip=0)
      - Sparse vector (column mode) forward/reverse scan (Step=2n/-2n, Subset=4, Skip=0)
      - Rotating vector forward scan (Step=1, Subset=8, Skip=-7)
      - Rotating vector reverse scan (Step=-1, Subset=8, Skip=+7)
      - Block scan (forward/reverse mode) (Step=1/-1, Subset=3, Skip=n-3/-n+3)
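
The step/subset/skip parameters can be explored with a short Python model. This is a sketch under one stated assumption: the address advances by `step` after every access, and `skip` is added on top of that once the down counter signals the end of a subset. The slide does not spell out this timing, and a design in which `skip` replaces `step` at the subset boundary would produce different sequences. The function name `scan_addresses` and the `count` parameter are hypothetical.

```python
from typing import Iterator


def scan_addresses(base: int, step: int, subset: int, skip: int, count: int) -> Iterator[int]:
    """Yield `count` addresses for one scan pattern.

    Assumption: `skip` is added in addition to `step` at the end of each subset.
    """
    addr = base
    remaining = subset              # models the down counter
    for _ in range(count):
        yield addr
        addr += step                # normal advance after every access
        remaining -= 1
        if remaining == 0:          # end_subset reached
            addr += skip            # extra jump to the next subset
            remaining = subset


if __name__ == "__main__":
    n = 8  # assumed row length of the data stored in the cell RAM
    print(list(scan_addresses(0, 1, 8, 0, 8)))       # continuous forward scan
    print(list(scan_addresses(0, 2, 4, 0, 4)))       # sparse forward scan
    print(list(scan_addresses(0, 1, 3, n - 3, 9)))   # block scan, three wide
```

With this reading, the block scan setting (Step=1, Subset=3, Skip=n-3) walks three consecutive addresses and then lands at the start of the next row, which is the behavior one would expect for scanning a 3-wide block in row-major memory.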
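  • 12. The Interconnection Topology
      [Figure: interconnection scheme around one cell.]
      - Neighbor interconnections in the N, NE, E, SE, S, SW, W and NW directions
      - Interleaved interconnections
      - Programmable latched switches
      - Link widths shown: N-bit and 2N-bit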
  • 13. Applications Mapping: Block-level pipelining
      [Figure: timing diagram for three chained cells RC(i-1), RC(i) and RC(i+1), each made of a RAM and a PE. Every cell alternates Load and Execute phases, and each cell starts one phase later than the one before it.]
      - The computation is organized in concurrently executing kernels
      - Each kernel is implemented by an RC
      - A kernel consumes a set of input data, performs one or more computations, and produces a set of output data
      - RCs communicate by sending addressed packets of data
      - Loading data into each cell's memory is overlapped with data production by the previous cell
      - An execution starts as soon as all necessary input data are available
      - Data synchronization is realized by handshake signals
      - No explicit temporal scheduling of execution is required
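
The Load/Execute overlap in the timing diagram can be reproduced with a few lines of Python. This is a simplified sketch: it assumes every phase takes one equal time slot, whereas the real cells synchronize through handshake signals and phases may differ in length. The function name `pipeline_schedule` and the "idle" label are hypothetical.

```python
from typing import List


def pipeline_schedule(n_cells: int, n_slots: int) -> List[List[str]]:
    """Return, per cell, the phase occupied in each time slot.

    Assumption: each cell starts one slot after its predecessor and then
    alternates Load and Execute, so a cell loads while the previous one executes.
    """
    rows = []
    for cell in range(n_cells):
        row = []
        for t in range(n_slots):
            if t < cell:
                row.append("idle")       # pipeline not yet filled
            elif (t - cell) % 2 == 0:
                row.append("Load")
            else:
                row.append("Execute")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for cell, row in enumerate(pipeline_schedule(3, 7)):
        print(f"RC({cell}): " + " ".join(f"{phase:8s}" for phase in row))
```

In this model a cell is always loading exactly when its predecessor is executing, which matches the statement that memory loading is overlapped with data production by the previous cell.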
  • 14. Applications Mapping: Flexible computational load balancing
      Parallelism in both vertical/temporal and horizontal/spatial directions:
      - Horizontal computational load balancing is achieved via data parallelism
      - Vertical computational load balancing is achieved by increasing the number of pipeline stages
      [Figure: two mappings of RAM/PE cells 1 to 4, labeled "Data parallel" and "Function parallel".]
  • 15. Architecture evaluation
      Hardware-assisted simulation environment developed using a Xilinx XC4VLX200 device:
      - The implemented system includes 64 RCs organized in 4x4 quadrants
      - The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr/RGB color space conversion, 2D-DCT, 2D-FIR)
      Physical evaluation for the ST 90nm CMOS technology:
      - Reconfigurable cell: synthesis done with Synopsys Design Compiler; physical design done with Cadence SoC Encounter, also considering manufacturing (such as DRCs and antennas) and signal integrity (SI) issues
      - Interconnections: preliminary electrical simulations were performed
      - Obtained results were compared to a 90nm CMOS Virtex-4 FPGA
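  • 16. RC Layout
      [Figure: cell layout showing the input stage, dual-port SRAM (256*8-bit), RAM interface, configuration memory, PE, control unit and output stage.]
      - Technology: CMOS 90nm
      - Supply voltage: 1.0 V
      - Frequency: 1 GHz
      - Core area: 79.52 um2 as printed (the array areas on slide 17 work out to about 0.0795 mm2 per RC)
      - Avg. dynamic power @ 1 GHz: 20 mW
      - Leakage power: 627.6 uW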
  • 17. Resource usage/energy/performance trade-off comparison: proposed array vs. Xilinx Virtex-4
      Throughput and energy efficiency refer to an 8*8 image block.

      Proposed Reconfigurable Array
      Algorithm                 Resources / Area [mm2]          Throughput [MOPS]   Energy efficiency [MOPS/W]
      Color Space Conversion    13 RCs / 1.034                  13.3                45.9
      2D separable 4x4 FIR      20 RCs / 1.590                  10.5                23.9
      2D-DCT (8x8)              22 RCs / 1.749                  10.2                20.8

      Virtex-4 FPGA (CORE Generator)
      Algorithm                 Resources / Area [mm2]          Throughput [MOPS]   Energy efficiency [MOPS/W]
      Color Space Conversion    436 Slices + 2 BRAM / 1.572     1.7                 29.1
      2D separable 4x4 FIR      440 Slices + 2 BRAM / 1.657     1.3                 18.4
      2D-DCT (8x8)              786 Slices + 3 BRAM / 2.919     2.1                 14.2

      - Speedups ranging from 4.8X to 8X
      - Energy efficiency improvement ranging from 24% to 58%
      - Area saving up to 40%
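
The summary figures can be cross-checked against the table values with a short Python script. The numbers are copied from the slide; the definitions used here (throughput ratio, relative gain in MOPS/W, relative area reduction) are assumptions, since the slide does not state how its percentages were derived. The names `RESULTS` and `compare` are hypothetical.

```python
# (area in mm2, throughput in MOPS, energy efficiency in MOPS/W), from the slide
RESULTS = {
    "Color Space Conversion": {"cgra": (1.034, 13.3, 45.9), "fpga": (1.572, 1.7, 29.1)},
    "2D separable 4x4 FIR": {"cgra": (1.590, 10.5, 23.9), "fpga": (1.657, 1.3, 18.4)},
    "2D-DCT (8x8)": {"cgra": (1.749, 10.2, 20.8), "fpga": (2.919, 2.1, 14.2)},
}


def compare(cgra, fpga):
    """Derive relative figures of merit under the assumed definitions."""
    area_c, thr_c, eff_c = cgra
    area_f, thr_f, eff_f = fpga
    return {
        "speedup": thr_c / thr_f,                       # throughput ratio
        "energy_gain_pct": 100 * (eff_c / eff_f - 1),   # gain in MOPS/W
        "area_saving_pct": 100 * (1 - area_c / area_f), # reduction in mm2
    }


SUMMARY = {name: compare(**row) for name, row in RESULTS.items()}

if __name__ == "__main__":
    for name, s in SUMMARY.items():
        print(f"{name}: speedup {s['speedup']:.1f}X, "
              f"energy gain {s['energy_gain_pct']:.0f}%, "
              f"area saving {s['area_saving_pct']:.0f}%")
```

Under these definitions the speedups span roughly 4.9X to 8.1X and the largest area saving is about 40%, in line with the slide. The energy-efficiency gains come out at roughly 30% to 58%, so the 24% lower bound on the slide is not reproduced by this particular definition.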
  • 18. Conclusion
      - Presented the VLSI implementation of a new coarse-grain reconfigurable architecture optimized for high-throughput DSP applications
      - Performance improvement at a low cost
          - Exploits spatial and temporal parallelism
          - High arithmetic processing capability
          - High-bandwidth, low-latency memory access
      - Performance/energy/area evaluations for representative tasks belonging to the target application domain
      - Obtained results demonstrate significant advantages with respect to a conventional FPGA
          - Speedups ranging from 4.8X to 8X
          - Energy efficiency improvement ranging from 24% to 58%
          - Area saving up to 40%