Address generation unit for multimedia applications
on application specific instruction set processors
Marc Moreno-Berengue, Guillermo Talavera Velilla, Aitor Rodriguez-Alsina, Jordi Carrabina
Universitat Autònoma de Barcelona (Spain)




                            IECON 2010
                  7–10 November – Phoenix, AZ, USA
Motivation

➢   Design a custom Address Generation Unit (AGU)
       ➢   Connected to an ASIP data-path


➢   Benefits of custom AGU design
       ➢   Previous software optimizations.
       ➢   Multimedia applications



Structure
➢   Introduction
➢   Design
➢   Work Flow
➢   Results
➢   Conclusions


Introduction

Multimedia applications features
➢   Multimedia applications
        ➢   Complex index manipulation
        ➢   Large number of data accesses
➢   Require
        ➢   High performance 
        ➢   Low energy consumption


    It is crucial to reduce these data accesses and the related
    address computations in an effective way

SW optimizations
The Data Transfer and Storage Exploration (DTSE)* methodology
is oriented to:
 ➢   Reduce data transfers between memories and processor
 ➢   Improve the energy efficiency
 ➢   Reduce the execution time


     SW transformations create high overhead in the address 
     generation and control flow

                      *Methodology developed at the IMEC research center

SW optimizations
Initial code:
...
for (x=1; x<=N-2; ++x)
 for (y=1; y<=N-2; ++y)
  for (k=-1; k<=1; ++k){
    A[x][y] += B[x+k][y]*C[abs(k)];
    A[x][y] /=tot;
  }
...

Code after the SW optimizations:
...
for (y=0; y<=M+2; ++y){
  for (x=0; x<=N+2; ++x) {
    if (x>=0&&x<N &&y>=1&&y<=M-2)
      D[x%3] = B[(y*N+x)%8704+(y*N+x)/8704*16384+7680];
    if (x-1>=1&&x-1<=N-2&&y>=1&&y<=M-2) {
      for (k=-1; k<=1; ++k)
        acc += D[(x-1+k)%3]*C[abs(k)];
    }
    acc /= tot;}
}
...

SW optimizations
The code above is marked "Need to be optimized": the SW transformations
leave a high overhead in address generation and control flow.

Address Generation Unit
 The Address Generation Unit (AGU) is a co-processor that uses
 the address equation (AE) to generate the address sequence (AS).


                             &X[AE]=AS 


 Example:
 B[(y*N+x)%8704+(y*N+x)/8704*16384+7680]
 AE = (y*N+x) % 8704 + (y*N+x) / 8704*16384+7680
   AS = 7680,7681,7682,7683, ...
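
The example can be checked with a few lines of C. This is only a sketch: the function names, the row width passed as n and the x-fastest traversal order are assumptions made for illustration, and only the equation itself comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Address equation (AE) from the slide, evaluated for one (y, x) pair.
 * n plays the role of N; its value is not given in the slides. */
static unsigned address_equation(unsigned y, unsigned x, unsigned n)
{
    unsigned linear = y * n + x;
    return linear % 8704u + linear / 8704u * 16384u + 7680u;
}

/* Writes up to `count` addresses into `out`, visiting x fastest and then y
 * (assumed order). Returns the number of addresses written. */
static size_t generate_sequence(unsigned *out, size_t count,
                                unsigned rows, unsigned n)
{
    size_t written = 0;

    assert(out != NULL);
    for (unsigned y = 0; y < rows && written < count; ++y)
        for (unsigned x = 0; x < n && written < count; ++x)
            out[written++] = address_equation(y, x, n);
    return written;
}
```

With any n of at least 4, the first four values are 7680, 7681, 7682, 7683, which matches the AS shown above. The modulo and division in every evaluation are the kind of work the AGU takes away from the processor.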
Design

Application specific instruction set processor
Application specific instruction set processor (ASIP)
     ➢   Its instruction set can be extended
     ➢   Fast interface to read/write data from/to specific hardware
              ➢   1 instruction
              ➢   1 cycle


AGU design

➢   An AGU attached to the ASIP data-path saves execution time
        ●   1 instruction
        ●   1 cycle




AGU skeleton
The AGU has one control unit, one processing unit and one FIFO

  ➢   CI (custom instruction) unit
      •   AE configuration & read FIFO
  ➢   CO (co-processor) unit
      •   Calculates the AE to generate the AS and stores all values
          in the FIFO

[Figure: AGU block diagram. The custom instruction interface drives the CI unit (change AE values, read AS values); the CO unit performs the AS generation.]
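
The split between the two units can be pictured with a small software model. It is a sketch under stated assumptions, not the hardware described in the talk: the type AguModel, the function names, the FIFO depth of 8 and the simple AE form base + (i * step) % modulo are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 8u   /* assumed depth; the slides do not give one */

typedef struct {
    unsigned base, step, modulo;   /* AE parameters (assumed AE form) */
    unsigned next_index;           /* next iteration the CO unit computes */
    unsigned fifo[FIFO_DEPTH];     /* generated addresses waiting to be read */
    size_t head, count;
} AguModel;

/* CI unit, "change AE values": load new AE parameters and empty the FIFO. */
static void agu_configure(AguModel *agu, unsigned base,
                          unsigned step, unsigned modulo)
{
    assert(agu != NULL && modulo != 0);
    agu->base = base;
    agu->step = step;
    agu->modulo = modulo;
    agu->next_index = 0;
    agu->head = 0;
    agu->count = 0;
}

/* CO unit, "AS generation": evaluate the AE ahead of the processor
 * until the FIFO is full. */
static void agu_co_run(AguModel *agu)
{
    while (agu->count < FIFO_DEPTH) {
        size_t tail = (agu->head + agu->count) % FIFO_DEPTH;
        agu->fifo[tail] =
            agu->base + (agu->next_index * agu->step) % agu->modulo;
        agu->next_index++;
        agu->count++;
    }
}

/* CI unit, "read AS values": pop the oldest address. In hardware the CO
 * unit runs in parallel; this model refills on demand instead. */
static unsigned agu_read(AguModel *agu)
{
    unsigned value;

    if (agu->count == 0)
        agu_co_run(agu);
    value = agu->fifo[agu->head];
    agu->head = (agu->head + 1) % FIFO_DEPTH;
    agu->count--;
    return value;
}
```

After agu_configure(&agu, 100, 1, 3), successive agu_read calls cycle through 100, 101, 102. The FIFO is what lets address generation run ahead of the loop that consumes the addresses.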

AGU Creator




Web-based application

Work Flow




Init.c
int A[70],B[70],C=0;
...
for (i=7; i<70; i++)
{
  B[i]=A[i-7]+B[i-7];
  A[i]=i;
  C+=B[i];
}
...

Opt.c (Init.c after SW Opt. (DTSE))
int A[7],B[7],C=0;
...
for (i=7; i<70; i++)
{
  B[i%7]=A[(i-7)%7]+B[(i-7)%7];
  A[i%7]=i;
  C+=B[i%7];
}
...

CI_code.c (Opt.c after adding the AGUs)
int A[7],B[7],C=0,ix,x;
initAGU(); initAGU2();
...
for (i=7; i<70; i++)
{
  x=readAGU();
  ix=readAGU2();
  B[x]=A[ix]+B[ix];
  A[x]=i;
  C+=B[x];
}
...
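
The three versions can be compared on a host machine with the sketch below. It rests on assumptions: the seed values for the first seven elements are invented (the slide elides the initialisation), and AguStub, stub_init and stub_read are hypothetical software stand-ins for initAGU/readAGU, whose real implementation is a custom instruction.

```c
#include <assert.h>

/* Software stand-in for one AGU: returns i % 7 for i = start, start+1, ... */
typedef struct { int i; } AguStub;

static void stub_init(AguStub *agu, int start)
{
    assert(agu != 0 && start >= 0);
    agu->i = start;
}

static int stub_read(AguStub *agu)
{
    return agu->i++ % 7;
}

/* Assumed initial contents of the first seven elements of A and B. */
static void seed(int *a, int *b)
{
    for (int j = 0; j < 7; ++j) {
        a[j] = j + 1;
        b[j] = 2 * j;
    }
}

static int run_init(void)            /* Init.c: full-size arrays */
{
    int A[70], B[70], C = 0;
    seed(A, B);
    for (int i = 7; i < 70; i++) {
        B[i] = A[i - 7] + B[i - 7];
        A[i] = i;
        C += B[i];
    }
    return C;
}

static int run_opt(void)             /* Opt.c: arrays folded to 7 elements */
{
    int A[7], B[7], C = 0;
    seed(A, B);
    for (int i = 7; i < 70; i++) {
        B[i % 7] = A[(i - 7) % 7] + B[(i - 7) % 7];
        A[i % 7] = i;
        C += B[i % 7];
    }
    return C;
}

static int run_ci(void)              /* CI_code.c: indices read from AGUs */
{
    int A[7], B[7], C = 0, ix, x;
    AguStub agu, agu2;
    seed(A, B);
    stub_init(&agu, 7);              /* generates i % 7       */
    stub_init(&agu2, 0);             /* generates (i - 7) % 7 */
    for (int i = 7; i < 70; i++) {
        x = stub_read(&agu);
        ix = stub_read(&agu2);
        B[x] = A[ix] + B[ix];
        A[x] = i;
        C += B[x];
    }
    return C;
}
```

All three functions return the same C for the same seed. The loop body of run_ci no longer contains any modulo operation, which is the overhead that moves into the AGUs.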
Results

Test environment 
➢   NIOS II soft-core processor (Altera)
    ●   32-bit RISC processor
    ●   Harvard memory architecture
    ●   Data and instruction caches
    ●   256 custom instructions (fast data-path interface)


➢   Cyclone II EP2C35 Altera FPGA




Test Applications

➢   Cavity Detector
    Medical imaging application to detect cavities in tomography scans


➢   Quad-tree Structured Difference Pulse Code Modulation
    (QSDPCM)
    An inter-frame compression technique for video.




Speedup
[Figure: two bar charts, "Speedup (Cavity)" and "Speedup (QSDPCM)", vertical axis from 0 to 1.4. Bar labels recovered from the slide: Init, DTSE, AGU inclusion, HW AGU inclusion.]

      Speedup (Cavity): 1.26               Speedup (QSDPCM): 1.19

Energy improvements
[Figure: two bar charts, "Energy (Cavity)" and "Energy (QSDPCM)", vertical axis from 0 to 1. Bar labels recovered from the slide: Init, DTSE, AGU inclusion, HW AGU inclusion.]

      Energy reduction (Cavity): 27%       Energy reduction (QSDPCM): 21%

Area penalties

                     Cavity (LEs)   QSDPCM (LEs)
NIOS-F                  2644            2644
NIOS-F + AGU            3596            3592

  The AGU inclusion in the NIOS II architecture uses
     2.9% of the total FPGA resources (33216 LEs)
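
The 2.9% figure is consistent with the table if it is read as the extra logic elements divided by the device total. That reading is an interpretation, since the slide does not spell out the calculation:

```latex
\begin{align*}
\Delta_{\text{Cavity}} &= 3596 - 2644 = 952\ \text{LEs}, &
  \frac{952}{33216} &\approx 2.87\% \\
\Delta_{\text{QSDPCM}} &= 3592 - 2644 = 948\ \text{LEs}, &
  \frac{948}{33216} &\approx 2.85\%
\end{align*}
```

Both values round to 2.9%.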


Conclusions
➢   Extending an ASIP with AGUs is an efficient way to meet the
    performance/energy requirements of multimedia applications
    after some SW optimizations

➢   The innovation of connecting the AGU to the processor data-path,
    working in parallel with the main processor, allows a wide range
    of values to be calculated before the processor needs them

➢   Using an AGU skeleton and a wizard decreases the design and
    implementation time.


Future Work
➢   Improve the AGU wizard in order to:

    ●   Automatically detect AEs and show relevant information
        about each AE for a given C file.
    ●   Generate the appropriate AGU for a specific set of AEs
    ●   Generate AGUs for more than one ASIP


➢   Extend the set of applications used in this work

Thank you!!


Questions?
