Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

Gert Goossens, Target Compiler Technologies

  1. Software Parallelisation and Platform Generation for Heterogeneous Multicore Architectures
     Sven Wuytack, Erik Brockmeyer, Wouter Vermaelen, Gert Goossens
     Target Compiler Technologies, Leuven, Belgium
     gert.goossens@retarget.com
     May 2, 2012. © 2012 Target Compiler Technologies NV
  2. Multi-Core SoCs for Smart Devices
     [Figure: SoC block diagram; legible labels: two "ARM core" blocks]
     • ASIP: Application-Specific Processor
        • Anything between general-purpose µP and hardwired data-path
        • Flexibility through programmability and design-time reconfigurability
        • High throughput, low energy through parallelism and specialization
     • ASIP is foundation of heterogeneous multi-core SoC
        • Balanced SoC architecture offers best performance at lowest energy and cost
  3. No MPSoC Design Without Tools
     • Tools at IP level (ASIP cores)
        • Architectural exploration
        • SDK generation: C compiler, ISS, debugger…
        • RTL generation
       → IP Designer*
     * Commercially deployed since 1999
  4. No MPSoC Design Without Tools
     • Tools at IP level (ASIP cores)
        • Architectural exploration
        • SDK generation: C compiler, ISS, debugger…
        • RTL generation
       → IP Designer*
     • Tools at IP subsystem level (multicore)
        • Code parallelisation
        • Communication and synchronization
        • Multicore platform generation
       → MP Designer**
     * Commercially deployed since 1999
     ** Beta now; commercial release later in 2012
  5. MP Designer Tool Suite
  6. Multicore Programming Language?
     • We prefer sequential C
        • Fits human way of thinking
        • Abundantly used
     • MP Designer performs C source-to-source transformation
        • Formatting preserved as much as possible
     • MP Designer guarantees dependencies in parallelised code
        • Validation only required on sequential C code
        • Reduced design complexity and validation effort
  7. State-of-the-Art
     • MP Designer (Target)
        • Architecture: heterogeneous (multi-ASIP) and homogeneous; point-to-point & globally shared memory
        • Parallelisation: user-guided; pragmas separate from C
        • Dependency analysis: static global analysis (data & pointers); correctness guaranteed
        • Performance analysis: processor loads derived from ISS, shown in task graphs
  8. State-of-the-Art
     • MP Designer (Target)
        • Architecture: heterogeneous (multi-ASIP) and homogeneous; point-to-point & globally shared memory
        • Parallelisation: user-guided; pragmas separate from C
        • Dependency analysis: static global analysis (data & pointers); correctness guaranteed
        • Performance analysis: processor loads derived from ISS, shown in task graphs
     • OpenMP
        • Architecture: homogeneous; globally shared memory
        • Parallelisation: user-guided; pragmas in C
        • Dependency analysis: user's responsibility
  9. State-of-the-Art
     • MP Designer (Target)
        • Architecture: heterogeneous (multi-ASIP) and homogeneous; point-to-point & globally shared memory
        • Parallelisation: user-guided; pragmas separate from C
        • Dependency analysis: static global analysis (data & pointers); correctness guaranteed
        • Performance analysis: processor loads derived from ISS, shown in task graphs
     • OpenMP
        • Architecture: homogeneous; globally shared memory
        • Parallelisation: user-guided; pragmas in C
        • Dependency analysis: user's responsibility
     • vfEmbedded (Vector Fabrics)
        • Architecture: homogeneous (multi-ARM or multi-x86); globally shared memory
        • Parallelisation: user-guided; via code GUI
        • Dependency analysis: mixed static/dynamic analysis; correctness depends on data-set used
        • Performance analysis: processor loads derived from execution, shown in code GUI
     • MPA (IMEC)
        • Architecture: heterogeneous; point-to-point
        • Parallelisation: user-guided; pragmas separate from C
        • Dependency analysis: static global analysis for arrays ("Clean-C")
  10. Platform Model
     • Preferred
        • Point-to-point links between communicating processors
        • Source processor writes to destination processor's local data memory, avoiding local copies
        • Write conflicts resolved through local buffering
        • Address decoding logic
     • Additionally supported
        • Globally shared memory
        • In case of caches, coherency is assumed to be resolved in hardware
     [Figure: single tile (Core, local memory DM, ITC) and connected Core/DM tiles]
  11. Example: High-Res JPEG Encoding
     [Figure: JPEG encoding pipeline. Inputs: R, G, B, Ctl. Blocks: RGB2YUV, Sub-Sample (4:2:0), Column DCT, Row DCT, Quantise, VLC, Write Output. Additional labels: Zigzag Reorder, Last Coef. Stage groups: LD, DCT, Q, VLC]
  12. User-Guided Parallelisation
     • Label C code blocks for parallelisation

     int main(int argc, char *argv[]) {
       init_all();
       parsection: {
         jpg_open: {
           jpg_fopen(JPG_filename);
           writeword(0xFFD8); //SOI
           write_APP0info();
         }
         main_encoder(&in_img);
         jpg_close: {
           writeword(0xFFD9); //EOI
           jpg_fclose();
         }
       }
       free(in_img.RGB_buffer);
       return 0;
     }

     void main_encoder(struct image* img) {
       vlc_init: { DCY=0; DCCb=0; DCCr=0; }
       for (ypos=0..height) {
         for (xpos=0..width) {
           for (blk=0..5) {
             SBYTE DU[64];
             loading:
               load_data_unit_from_RGB_buffer(img, xpos, ypos, blk, DU);
             process_DU(DU, blk);
           }
         }
       }
       vlc_fini: {
         // Bit-alignment of EOI marker
         if (bytepos>=0) {
           writebits((1<<(bytepos+1))-1, bytepos+1);
         }
       }
     }

     void process_DU(SBYTE* CDU, int blk) {
       SWORD DU[64];
       fdct_main: {
         int* int_fdtbl = blk_fdtbl[blk];
         fdct_and_quantization(CDU, int_fdtbl, DU);
       }
       vlc_main: {
         while (i<=end0pos) { ... }
         if (end0pos!=63)
           writebits(EOB);
       }
     }
  13. User-Guided Parallelisation
     • Parallelisation pragmas
     • Sample parallelisation on 3-core architecture

     processor P0 type dlx
     processor P1 type dlx
     processor P2 type dlx
     parallel ParRegion lbl main::parsection
       task LOAD target P0
         include lbl main_encoder::loading
       task DCT target P1
         include lbl process_DU::fdct_main
       task VLC target P2
         include lbl main::jpg_open
         include lbl main_encoder::vlc_init
         include lbl process_DU::vlc_main
         include lbl main_encoder::vlc_fini
         include lbl main::jpg_close
  14. FIFO Communication Model

     Original code:

     int A[10];
     int* pA = A;
     int* qA = A;
     int* ptr;
     int foo(int i) {
       return qA[i];
     }
     parsection: {
       producer: {
         A[i] = ...;
         ptr = (...) ? &A[x] : &A[y];
         ... = pA[j];
       }
       consumer: {
         ... = A[s];
         ... = foo(*ptr);
       }
     }

     Transformed code (producer side):

     int* A;                                 // transformed
     int* pA = A;                            // pA = NULL !
     int* ptr;
     DEFINE_SRC_FIFO(FIFO_A, int, 10);       // src fifo def
     DEFINE_SRC_FIFO(FIFO_ptr, int, 1);      // src fifo def
     parsection: {
       producer: {
         int offs_ptr;                       // declaration
         A = FIFO_A.acq_put();               // acq put
         pA = A;                             // ptr copy
         A[i] = ...;
         ptr = (...) ? &A[x] : &A[y];
         offs_ptr = ptr - A;                 // offset calc
         FIFO_ptr.put(offs_ptr);             // offset put
         ... = pA[j];                        // uses pA, A
         FIFO_A.rel_put();                   // rel put
       }
     }

     • Acquire/release interface enables use of other processor's data memory for storage of arrays (avoiding local copies)
     • FIFO communication implemented in comm. library produced by platform generator
     • Synchronisation implemented by polling on FIFO queue's status (empty/full)
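The acquire/release idea on slide 14 can be illustrated with a small queue of array buffers that are filled and read in place. The following is a minimal single-threaded sketch, not the communication library that MP Designer generates: the type `array_fifo`, every function name, and the queue depth are hypothetical, and a real multicore version would also need volatile or atomic accesses to the two counters.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 2u   /* number of buffers in the queue (assumed value) */
#define BUF_WORDS  10   /* matches int A[10] in the slide example */

/* Hypothetical queue of array buffers; conceptually the storage lives in
 * the consumer's local data memory and the producer writes into it. */
typedef struct {
    int      slot[FIFO_DEPTH][BUF_WORDS];
    unsigned head;  /* buffers released by the consumer so far */
    unsigned tail;  /* buffers released by the producer so far */
} array_fifo;

void fifo_init(array_fifo *f) { f->head = 0; f->tail = 0; }

int fifo_full(const array_fifo *f)  { return f->tail - f->head == FIFO_DEPTH; }
int fifo_empty(const array_fifo *f) { return f->tail == f->head; }

/* Producer: get a buffer to fill in place, or NULL if the queue is full.
 * A caller that must wait simply polls until the result is non-NULL. */
int *fifo_try_acq_put(array_fifo *f)
{
    return fifo_full(f) ? NULL : f->slot[f->tail % FIFO_DEPTH];
}

/* Producer: hand the filled buffer over to the consumer. */
void fifo_rel_put(array_fifo *f)
{
    assert(!fifo_full(f));
    f->tail++;
}

/* Consumer: get the oldest filled buffer, or NULL if the queue is empty. */
const int *fifo_try_acq_get(const array_fifo *f)
{
    return fifo_empty(f) ? NULL : f->slot[f->head % FIFO_DEPTH];
}

/* Consumer: give the buffer back so the producer can reuse it. */
void fifo_rel_get(array_fifo *f)
{
    assert(!fifo_empty(f));
    f->head++;
}
```

Because the producer writes directly into the buffer returned by `fifo_try_acq_put`, no copy is made when the data is handed over; only the counters change. Passing `ptr - A` as an offset, as the transformed code on the slide does, keeps a pointer into such a buffer meaningful on the receiving side.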
  15. Exploration
     • For each parallelisation, MP Designer shows task graph with estimated processor loads
     • E.g.: task graph for JPEG encoding on 3-DLX architecture
     [Figure: task graph with three tasks]
        • Task 0 "LOAD", Proc 0 "P0" (dlx): main_encoder::loading 35.8 %. IN: <none>. OUT: <none>.
        • Task 1 "DCT", Proc 1 "P1" (dlx): process_DU::fdct_main 22.2 %, process_DU::quant_main 25.0 %, *TOTAL* 47.1 %. IN: in_img.height, in_img.width. OUT: <none>.
        • Task 2 "VLC", Proc 2 "P2" (dlx): main::jpg_open 0.0 %, main_encoder::vlc_init 0.0 %, process_DU::vlc_main 16.9 %, main_encoder::vlc_fini 0.0 %, main::jpg_close 0.0 %, *TOTAL* 16.9 %. IN: JPG_filename, SOF0info.height, SOF0info.width, in_img.height, in_img.width. OUT: <none>.
        • [NC dep T0 -> T1 @ b20]: DU (64) [FF]
        • [NC dep T1 -> T2 @ b22]: DU_ZZ (128) [FF], end0pos (4)
        • Block nesting per task: par_region [b0], main_encoder() [b17], for [b18], for [b19], for [b20], process_DU() [b22]
  16. Exploration
     • E.g.: task graph for JPEG encoding on 5-DLX architecture
     • Global dependency analysis automatically ensures correct communication & synchronisation
     • Manual design would be error-prone
  17. Exploration
     • E.g.: task graph for H263 encoding on 8 cores
     • Global dependency analysis automatically ensures correct communication & synchronisation
     • Manual design would be error-prone
  18. Exploration
     • JPEG encoding on multi-DLX architecture

     Algorithm    #Cores  Parallelisation           Mcycles*     Speed-up  Load (%)                 Efficiency (%)
                                                    seq   par              P0   P1   P2   P3   P4
     Original     1                                 7.1   -      1         100                      100
     Original     2       ld+dct+q | vlc            7.1   4.1    1.7       100  76                  86
     Original     3       ld | dct+q | vlc          7.1   3.4    2.1       64   60   91             69
     Original     4       ld | dct | q | vlc        7.1   3.4    2.1       77   40   24   91        52
     Optimised    2       ld+dct | q+vlc            4.1   2.4    1.5       100  72                  85
     Optimised    3       ld | dct+q | vlc          4.1   2.0    2.0       74   100  40             68
     Optimised    3       ld | dct | q+vlc          4.1   1.7    2.4       86   56   100            78
     Optimised    4       ld | dct | q | vlc        4.1   1.5    2.8       100  65   75   54        69
     Split quant  3       ld | dct+q0 | q1+vlc      4.3   1.6    2.6       92   100  79             87
     Dual load    5       ld0 | ld1 | dct | q | vlc 4.1   1.1    3.7       71   61   95   99   71   74

     * Cycles for 256x160-pixel image
     • Objective: obtain efficient load balancing
     • Entire exploration in only days of time
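The slide does not define the speed-up and efficiency columns. Most rows are consistent with speed-up being the sequential cycle count divided by the parallel one, and efficiency being speed-up divided by the number of cores. Under that assumption (the function names below are hypothetical), the figures can be recomputed with a short sketch:

```c
#include <assert.h>

/* Speed-up of a parallel run relative to the sequential run.
 * Both arguments are cycle counts in the same unit (e.g. Mcycles). */
double speed_up(double seq_mcycles, double par_mcycles)
{
    assert(seq_mcycles > 0.0 && par_mcycles > 0.0);
    return seq_mcycles / par_mcycles;
}

/* Efficiency in percent, assumed here to be speed-up per core. */
double efficiency_pct(double seq_mcycles, double par_mcycles, int cores)
{
    assert(cores > 0);
    return 100.0 * speed_up(seq_mcycles, par_mcycles) / cores;
}
```

For the "Dual load" row (4.1 Mcycles sequential, 1.1 Mcycles parallel, 5 cores) this gives a speed-up of about 3.7 and an efficiency of about 74.5 %, against 3.7 and 74 in the table. Small differences elsewhere are expected because the table's cycle counts are rounded to one decimal.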
  19. Exploration
     • JPEG encoding on multi-DLX architecture (same table as slide 18)
     • 3-core analysis suggests splitting of "q"
  20. Exploration
     • JPEG encoding on multi-DLX architecture (same table as slide 18)
     • 4-core analysis suggests splitting of "ld"
  21. MP Designer - Highlights
     • Homogeneous and heterogeneous multicore SoCs
     • User-guided parallelization (pragmas)
     • Global dataflow analysis to check correctness of chosen parallelization
     • Validation only required on sequential C code
     • Graphical feedback (task graphs) enables exploration
     • C source-to-source transformation
     • Software code for (FIFO) communication and synchronization inserted automatically
     • Communication fabric (platform) generated automatically, if needed
  22. Alternative Heterogeneous Implementation (IP Designer)
     [Figure: JPEG encoding pipeline (RGB2YUV, Sub-Sample (4:2:0), Column DCT, Row DCT, Quantise, Zigzag Reorder, Last Coef, VLC, Write Output) mapped onto two ASIPs, JEMA and JEMB, with datapath diagrams showing register files, data memories (DM, DM1, DM2, DM3), AGUs, ALUs, vector units VU1/VU2 and an output port]
     • JEMA: vector ASIP
        • 2 vector units: vadd, vsub, vscale, vmul, vaddsub
        • Transposable register file: write columns, read rows
     • JEMB: scalar ASIP
        • Multiple data memories, special AGUs
        • ALU: bit concat, Huffman addr, code calc
        • Instruction-level parallelism
     Source: G. Goossens et al, Multicore Expo 2009
  23. Alternative Heterogeneous Implementation (IP Designer)
     • Performance
        • Heterogeneous dual-ASIP (IP Designer): 1 cycle/pixel
           • Gate count: 76K
        • Homogeneous 5-core (MP Designer): 9 cycles/pixel
           • Data splitting (i.e. process loop iterations in parallel) will result in 1 cycle/pixel on ~45 cores
           • Consistent with literature, e.g. Plurality requires 64 cores*
           • Estimated gate count: 500K
     • Tradeoff between flexibility and area/power
     * Source: Plurality, MC Expo 2009
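The ~45-core figure on slide 23 matches a simple estimate: if the 5-core pipeline at 9 cycles/pixel is replicated and each replica processes a different share of the loop iterations, nine replicas are needed to reach 1 cycle/pixel, which is 45 cores. The sketch below encodes that reading. It assumes ideal linear scaling with no communication or load-imbalance overhead, and the function name is hypothetical; the slides do not state how the estimate was derived.

```c
#include <assert.h>

/* Cores needed to reach target_cpp cycles/pixel by replicating a pipeline
 * of base_cores cores that runs at base_cpp cycles/pixel.
 * Assumption: throughput scales linearly with the number of replicas. */
int cores_for_target(int base_cores, int base_cpp, int target_cpp)
{
    assert(base_cores > 0 && base_cpp > 0 && target_cpp > 0);
    /* Ceiling division: a partial replica still costs a full pipeline. */
    int replicas = (base_cpp + target_cpp - 1) / target_cpp;
    return base_cores * replicas;
}
```

Calling `cores_for_target(5, 9, 1)` returns 45, in line with the slide's "~45 cores".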
  24. Conclusions
     • Multicore SoCs are here to stay
     • No (efficient) multicore SoC design without tools
     • Design and programming of individual ASIP cores
        • Architectural exploration, SDK generation, RTL generation
     • Multicore parallelisation and platform generation
        • Parallelisation from sequential C code
        • Static global dependency analysis
        • Exploration for efficient load balancing
