Design of Programmable Accelerators for SoCs Gert Goossens CEO Target Compiler Technologies
Chip Ex2010 Gert Goossens


Design of programmable accelerators for multicore SoCs - Gert Goossens, Target Compiler Tech.


  1. Design of Programmable Accelerators for SoCs. Gert Goossens, CEO, Target Compiler Technologies
  2. Abstract
       • For new wireless standards like 3GPP-LTE, general-purpose processors are running out of steam. The conventional wisdom is that accelerators must be added in the form of hardwired datapaths to deliver the required performance. However, a hardwired datapath offers zero flexibility, which limits the ability to support evolving or multiple standards.
       • We discuss how C-programmable application-specific processors (ASIPs) can replace fixed-function accelerators without sacrificing performance (throughput, power and gate count). We review different approaches to ASIP design. We illustrate our performance claims with examples from the data plane of wireless baseband modems.
  3. Agenda
       • ASIPs as accelerators in SoCs
       • How to design ASIPs
       • Programmable datapath examples
           • WLAN
           • FFT
       • Conclusions
  4. SoC Design
       • What do you do when the performance of your main processor is insufficient?
           • Go multicore?
               • Application mapping difficult, resource utilisation unbalanced
           • Add hardwired accelerators?
               • Balanced but inflexible SoC
  5. SoC Design
       • What do you do when the performance of your main processor is insufficient?
           • ASIPs: application-specific processors
               • Anything between general-purpose uP and hardwired datapath
               • Flexibility through programmability and design-time reconfigurability
               • High throughput and low energy, through parallelism and specialisation
               • Balanced and flexible SoC
  6. Agenda
       • ASIPs as accelerators in SoCs
       • How to design ASIPs
       • Programmable datapath examples
           • WLAN
           • FFT
       • Conclusions
  7. How to Design ASIPs?
       • IP Designer tool-suite
  8. How to Design ASIPs? (each design step is followed by its benefits)
       • Algorithm defined in C
           • Raise abstraction level from RTL to ESL
           • Connect hardware and algorithm design teams
       • Datapath structure defined in nML
           • Much faster than RTL design, enables rapid architectural exploration
           • Designer is in control; can use architectural knowledge
       • C compiler maps algorithm onto datapath structure; ISS simulates generated code
           • Tools validate designer’s assumptions and performance reached
           • Profiling tool guides architectural exploration
           • Easily reprogrammable in case of bug or spec changes
       • RTL generated automatically
           • Error-free
           • Quick feedback on gate count for every design iteration
           • Low-power optimisations inserted automatically
  9. How to Design ASIPs?
       • Benefits
           • Speed-up design: few weeks per ASIP
           • Design exploration: wide architectural scope, based on processor description language
           • Formal approach increases correctness: 40 production chips, 0 bugs
           • Automatic generation of RTL: competitive to hand-coded RTL
           • Automatic generation of SDK: C compiler, “no-assembly-required”
  10. Tool Comparison
       • Retargetable ASIP design tools
           • Example vendors: Target (IP Designer), CoWare (Processor Designer)
           • Architectural style: flexible, using processor description language
           • Programmable: yes
           • Architectural specialisation: high
           • Resource sharing: yes
           • Business model: EDA license
       • Configurable ASIP templates
           • Example vendors: Tensilica, ARC, ASIP Solutions, SiliconHive
           • Architectural style: configurable ASIP template + extension instructions
           • Programmable: yes
           • Architectural specialisation: low (within template boundaries)
           • Resource sharing: yes
           • Business model: royalties
       • High-level synthesis from C
           • Example vendors: Mentor (CatapultC), Forte, Synfora, Cadence (C2S)
           • Architectural style: hardwired datapath, no programmability
           • Programmable: no
           • Architectural specialisation: high
           • Resource sharing: depends on tool
           • Business model: EDA license
       (*) No strong focus for CoWare?
  11. Agenda
       • ASIPs as accelerators in SoCs
       • How to design ASIPs
       • Programmable datapath examples
           • WLAN
           • FFT
       • Conclusions
  12. Programmable Datapath Examples
       [Figure: overview of programmable datapath examples. Legend: examples shown; served by IP Designer]
  13. What is a Programmable Datapath?
       • Hardwired datapath
           • Datapath structure (hardware operators and connectivity) mimics the algorithm’s data flow
       • Hardwired datapath with resource sharing
           • Superposition of multiple data-flow patterns
           • Hardware saving benefit, if permitted by throughput spec
           • Requires local modifications to datapath structure and addition of small amounts of control
               • Modification of connectivity → multiplexers
               • Modification of operator behaviour → programmable instead of fixed operators
               • Store intermediate data → local register files instead of registers
           • Controlled from FSM
       • Programmable datapath
           • Datapath with resource sharing, controlled from software
           • Microcode in ROM (design-time programmable), or RAM/flash (post-silicon programmable)
       [Figure: datapath with SEQ, PM and DEC blocks and control signals s0, s1, s2; example statements: d+=(a+b)*c; g+=(e-f)*f;]
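The resource-sharing idea above can be illustrated with a small behavioural model in C. This is only a sketch: the names `dp_op`, `dp_step` and `run_both` are hypothetical and do not come from the slides. It assumes one programmable adder/subtractor feeding one shared multiplier and accumulator, so that the two statements from the figure, `d+=(a+b)*c` and `g+=(e-f)*f`, become two "instructions" for the same hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical control word: selects the behaviour of the shared operator. */
typedef enum { OP_ADD_MAC, OP_SUB_MAC } dp_op;

/* One step of a resource-shared datapath: a programmable adder/subtractor
 * feeds a single multiplier and accumulator.
 *   OP_ADD_MAC returns acc + (x + y) * z
 *   OP_SUB_MAC returns acc + (x - y) * z */
static int32_t dp_step(dp_op op, int32_t acc, int32_t x, int32_t y, int32_t z)
{
    assert(op == OP_ADD_MAC || op == OP_SUB_MAC);
    int32_t pre = (op == OP_ADD_MAC) ? (x + y) : (x - y); /* programmable operator */
    return acc + pre * z;                                 /* shared multiply-accumulate */
}

/* The two statements from the slide, issued as a two-instruction program. */
static void run_both(int32_t a, int32_t b, int32_t c, int32_t e, int32_t f,
                     int32_t *d, int32_t *g)
{
    *d = dp_step(OP_ADD_MAC, *d, a, b, c); /* d += (a + b) * c; */
    *g = dp_step(OP_SUB_MAC, *g, e, f, f); /* g += (e - f) * f; */
}
```

In a hardwired design the sequence of `dp_op` values would come from an FSM; in a programmable datapath it would come from microcode, which is the only difference the model needs to capture.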
  14. Prog. Datapath Example: WLAN
       • Algorithm
           • Design by Motorola Labs [1]
           • 802.11n, equalisation
           • Characteristics
               • Matrix calculations
               • Specialised operators in complex domain: cmpy, conjugate, sqmod
           • Equalisation matrix: multiple dataflow patterns depending on MIMO scheme
               • SDM
               • Symmetric SDM + STBC
               • SDM + STBC
       [Figure: algorithm blocks: matrix inversion, address computations, complex conjugate, square modulus]
       [1] Medea+ project “Uppermost”
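As a rough illustration of the specialised complex-domain operators named above, the following C sketch defines integer versions of `cmpy`, `ccnj` and `sqmod`. The `cplx` type, the 32-bit component width and the absence of scaling or saturation are assumptions made for this example; the slides do not specify the number formats used in the WLAN design.

```c
#include <stdint.h>

/* Assumed representation: complex number with 32-bit integer components. */
typedef struct { int32_t re, im; } cplx;

/* Complex conjugate: negate the imaginary part. */
static cplx ccnj(cplx z)
{
    cplx r = { z.re, -z.im };
    return r;
}

/* Complex multiply: (a.re + j a.im) * (b.re + j b.im). */
static cplx cmpy(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im,
               a.re * b.im + a.im * b.re };
    return r;
}

/* Square modulus: |z|^2 = re^2 + im^2, a real value. */
static int32_t sqmod(cplx z)
{
    return z.re * z.re + z.im * z.im;
}
```

The square modulus equals the real part of `cmpy(z, ccnj(z))`, whose imaginary part is zero, which is one reason a datapath with a multiplier and a conjugate operator can cover several of the listed patterns.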
  15. Prog. Datapath Example: WLAN
       • Programmable datapath design
           • Sample expressions: equalisation matrix
           • Sample expression: matrix inversion
           • 4 identical datapaths in SIMD unit
       [Figure: channel estimation ASIP with common program control, four GMAC units (GMAC 0 to GMAC 3) and four dual-port memories]
  16. Prog. Datapath Example: WLAN
       • nML code of gmac instruction

           reg R[8] <vcmpl> read(tR0, tR1, tR2, tR3, tR4, tR5);
           reg ACC <vcmpl>;
           pipe P0 <vcmpl>;
           pipe P1 <vcmpl>;
           trn tC0 <vcmpl>;
           trn tC1 <vcmpl>;
           trn tM0 <vcmpl>;
           trn tM1 <vcmpl>;

           enum gmac_op {mpy_mpy_mac, mac, sq_sq_mac, minv, ...};

           opn gmac(g:gmac_op, r0:c3, r1:c3, r2:c3, r3:c3, r4:c3, r5:c3)
           {
             action {
               stage E1:
                 switch (g) {
                   case mpy_mpy_mac:
                     tC0 = ccnj(tR2 = R[r2]);
                     P0 = cmpy(tR1 = R[r1], tC0);
                     tC1 = ccnj(tR3 = R[r3]);
                     P1 = cmpy(tR4 = R[r4], tC1);
                   case mac:
                     P0 = tR0 = R[r0];
                     P1 = tR5 = R[r5];
                   case sq_sq_mac:
                     P0 = cmpy(tR1 = R[r1], tR2 = R[r1]);
                     P1 = cmpy(tR4 = R[r4], tR3 = R[r4]);
                   case minv:
                     P0 = tR0 = R[r0];
                     tM0 = cmpy(tR1 = R[r1], tR2 = R[r2]);
                     tM1 = cmpy(tR4 = R[r4], tR3 = R[r3]);
                     P1 = csub(tM0, tM1);
                   case ...
                 }
               stage E2:
                 tM = cmpy(P0, P1);
                 ACC = cadd(tM, ACC);
             }
           }

       [Slide annotations on the listing: resources; instruction-set grammar]
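To make the behaviour of the gmac instruction easier to follow, here is a behavioural model in C of the four modes that appear in the nML listing. It is a sketch under several assumptions: the `vcmpl` type is modelled as a single complex value with 32-bit integer components (one lane only), the two pipeline stages are collapsed into one function call, and the names `gmac_state` and the upper-case enum constants are introduced for this example. Modes elided in the slide are not modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t re, im; } cplx;   /* assumed stand-in for one vcmpl lane */

static cplx cadd(cplx a, cplx b) { return (cplx){ a.re + b.re, a.im + b.im }; }
static cplx csub(cplx a, cplx b) { return (cplx){ a.re - b.re, a.im - b.im }; }
static cplx ccnj(cplx a)         { return (cplx){ a.re, -a.im }; }
static cplx cmpy(cplx a, cplx b)
{
    return (cplx){ a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
}

typedef enum { MPY_MPY_MAC, MAC, SQ_SQ_MAC, MINV } gmac_op;

typedef struct {
    cplx R[8];   /* register file R[8] */
    cplx ACC;    /* accumulator */
} gmac_state;

/* One gmac instruction: stage E1 forms P0 and P1 according to the mode,
 * stage E2 computes ACC = P0 * P1 + ACC. */
static void gmac(gmac_state *s, gmac_op g,
                 int r0, int r1, int r2, int r3, int r4, int r5)
{
    cplx p0, p1;
    switch (g) {                                   /* stage E1 */
    case MPY_MPY_MAC:
        p0 = cmpy(s->R[r1], ccnj(s->R[r2]));
        p1 = cmpy(s->R[r4], ccnj(s->R[r3]));
        break;
    case MAC:
        p0 = s->R[r0];
        p1 = s->R[r5];
        break;
    case SQ_SQ_MAC:
        p0 = cmpy(s->R[r1], s->R[r1]);
        p1 = cmpy(s->R[r4], s->R[r4]);
        break;
    case MINV:
        p0 = s->R[r0];
        p1 = csub(cmpy(s->R[r1], s->R[r2]), cmpy(s->R[r4], s->R[r3]));
        break;
    default:
        assert(0 && "mode not modelled");
        return;
    }
    s->ACC = cadd(cmpy(p0, p1), s->ACC);           /* stage E2 */
}
```

All four modes share the final multiply-accumulate and differ only in how its two operands are prepared, which is the superposition of dataflow patterns that the slides describe.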
  17. Prog. Datapath Example: WLAN
       • C compiler uses advanced graph matching techniques to map dataflow patterns on programmable datapath
       [Figure: compiler flow. Inputs: application (C) via the C front-end to a CDFG, and processor model (nML) via the nML front-end to an ISG. Compilation engine (phase coupling): source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (ELF/DWARF).]
  18. Prog. Datapath Example: FFT
       • Algorithm
           • Decimation in time
           • Radix-2, radix-4, mixed radix
           • Coefficients: complex (16,16)
           • Data: complex (24,24)
  19. Prog. Datapath Example: FFT
       • Programmable datapath design
           • Datapath structure for CMPY and BFLY can be described in nML and exposed to C compiler
           • CMPY and BFLY each implement a single, fixed dataflow pattern, which can alternatively be hidden in intrinsic function
           • Intrinsic’s behaviour is modelled in C, automatically converted to RTL
       [Figure: datapath with memories Mdata and Mcoef, register files A[4] and B[4], and CMPY and BFLY units; operations: ld A/B, ld C, st A/B]
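A C model of such intrinsics might look like the sketch below. It is not the code from the actual design: the type names `cdata` and `ccoef`, the Q15 interpretation of the 16-bit coefficients, the use of `int32_t` to hold 24-bit data, and the truncating division in place of hardware rounding or saturation are all assumptions made for illustration.

```c
#include <stdint.h>

typedef struct { int32_t re, im; } cdata;  /* data: 24-bit components, held in int32_t */
typedef struct { int16_t re, im; } ccoef;  /* coefficient: 16-bit components, assumed Q15 */

/* CMPY: complex data times complex coefficient, rescaled from Q15.
 * Products are formed in 64 bits so they cannot overflow. */
static cdata cmpy(cdata d, ccoef c)
{
    int64_t re = (int64_t)d.re * c.re - (int64_t)d.im * c.im;
    int64_t im = (int64_t)d.re * c.im + (int64_t)d.im * c.re;
    cdata r = { (int32_t)(re / 32768), (int32_t)(im / 32768) };
    return r;
}

/* BFLY: two-input butterfly producing the sum and the difference. */
static void bfly(cdata x, cdata y, cdata *sum, cdata *diff)
{
    sum->re  = x.re + y.re;
    sum->im  = x.im + y.im;
    diff->re = x.re - y.re;
    diff->im = x.im - y.im;
}
```

Each function has a single fixed dataflow pattern, which is what makes it a natural candidate for an intrinsic rather than a multi-mode instruction.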
  20. Prog. Datapath Example: FFT
       • Instruction-level parallelism: ILP=5
           • Efficient register allocation, scheduling and SW pipelining needed
           • E.g. inner-loop for radix-4 FFT
           • Compiled code
               • 4 cycles / iteration
               • 100% resource utilisation

           /* 0 */ DO cnt,LE
           /* 1 */ /* delay slot */
           /* 2 */ md=*pa(next_bfly) | *pb(+s)=b1        | mc=*pr(next_bfly_rdx4) | a2=md*mc | b3,b2=bfly(a2,a3)
           /* 3 */ md=*pa(+s)        | *pb(+s)=b3        | mc=*pr(+s)             | a3=md*mc | b1,a2=bfly(a1,a2)
           /* 4 */ md=*pa(+s)        | *pb(+s)=b0        | mc=*pr(+s)             | a1=md*mc | b0,a3=bfly(a0,a3)
           /* 5 */ md=*pa(+s)        | *pb(next_bfly)=b2 | mc=*pr(+s)             | a0=md*mc | b1,b0=bfly(b1,b0)

       [Figure: software-pipeline diagram with overlapping LDA, LDC, MPY, BFLY and STB operations]
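The schedule above issues four bfly operations per loop iteration. One way to see why four are enough is the textbook decomposition of a radix-4 butterfly into four two-input butterflies plus one multiplication by -j, sketched below in C. The function names are hypothetical, twiddle multiplication is assumed to have been applied to the inputs already, and the slides do not say how the actual BFLY unit handles the -j rotation, so this shows the arithmetic only.

```c
#include <stdint.h>

typedef struct { int32_t re, im; } cplx;

/* Two-input butterfly: sum and difference. */
static void bfly2(cplx x, cplx y, cplx *sum, cplx *diff)
{
    sum->re  = x.re + y.re;
    sum->im  = x.im + y.im;
    diff->re = x.re - y.re;
    diff->im = x.im - y.im;
}

/* Multiply by -j: (re + j im) * (-j) = im - j re. */
static cplx mul_neg_j(cplx z)
{
    cplx r = { z.im, -z.re };
    return r;
}

/* Radix-4 decimation-in-time butterfly on inputs that already include
 * their twiddle factors:
 *   y0 = x0 +  x1 + x2 +  x3
 *   y1 = x0 - jx1 - x2 + jx3
 *   y2 = x0 -  x1 + x2 -  x3
 *   y3 = x0 + jx1 - x2 - jx3 */
static void radix4_bfly(const cplx x[4], cplx y[4])
{
    cplx s0, d0, s1, d1;
    bfly2(x[0], x[2], &s0, &d0);              /* first level  */
    bfly2(x[1], x[3], &s1, &d1);
    bfly2(s0, s1, &y[0], &y[2]);              /* second level */
    bfly2(d0, mul_neg_j(d1), &y[1], &y[3]);
}
```

With one butterfly unit and one two-input butterfly issued per cycle, this decomposition is consistent with the 4 cycles per iteration quoted on the slide.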
  21. Prog. Datapath Example: FFT
       • C compiler uses advanced graph search techniques to
           • optimise register utilisation
           • schedule instructions
       • on programmable datapath
       [Figure: compiler flow. Inputs: application (C) via the C front-end to a CDFG, and processor model (nML) via the nML front-end to an ISG. Compilation engine (phase coupling): source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (ELF/DWARF).]
  22. Prog. Datapath Example: FFT
       • Results
           • Performance
               • Radix-4: 4 cycles/butterfly, radix-2: 2 cycles/butterfly
               • 4096-point FFT (radix-4): 24,671 cycles
               • 2048-point FFT (2x 1024-pt radix-4 + 1x 2048-pt radix-2): 12,288 cycles
           • RTL metrics
               • 26K gates, 123 MHz clock, 130 nm, DesignWare Basic
           • 600 lines of nML code
               • Custom data path, complex butterfly unit
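The reported cycle counts can be cross-checked against the per-butterfly figures with a short calculation, sketched here in C. The helper names are hypothetical, and the model counts butterfly cycles only, ignoring loop setup and any other overhead.

```c
#include <assert.h>

/* Number of FFT stages: log base `radix` of n, for n an exact power of radix. */
static unsigned stages(unsigned n, unsigned radix)
{
    unsigned s = 0;
    while (n > 1) {
        assert(n % radix == 0);   /* n must be an exact power of radix */
        n /= radix;
        ++s;
    }
    return s;
}

/* Ideal cycles for one stage: n / radix butterflies per stage. */
static unsigned long stage_cycles(unsigned n, unsigned radix, unsigned cycles_per_bfly)
{
    return (unsigned long)(n / radix) * cycles_per_bfly;
}

/* Ideal cycles for a full single-radix FFT. */
static unsigned long fft_cycles(unsigned n, unsigned radix, unsigned cycles_per_bfly)
{
    return stages(n, radix) * stage_cycles(n, radix, cycles_per_bfly);
}
```

For the 4096-point radix-4 case this gives 6 stages of 1024 butterflies at 4 cycles each, or 24,576 cycles; the reported 24,671 is 95 cycles higher, which is presumably overhead outside the inner loop (the slides do not say). For the 2048-point case, two 1024-point radix-4 FFTs (2 x 5,120) plus one radix-2 stage of 1024 butterflies at 2 cycles (2,048) give 12,288 cycles, matching the reported figure exactly.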
  23. Agenda
       • ASIPs as accelerators in SoCs
       • How to design ASIPs
       • Programmable datapath examples
           • WLAN
           • FFT
       • Conclusions
  24. Conclusion
       • ASIPs make it possible to build programmable accelerators in SoCs
       • With the IP Designer tool-suite, ASIPs can be designed quickly and programmed efficiently
       • “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
           • IP Designer as an alternative to high-level synthesis
       • With ASIPs, multicore SoC architectures become even more prolific
