Towards Automated Design Space Exploration and Code Generation for FPGAs using Type Transformations

The increasing use of diverse architectures resulting in heterogeneous platforms for High-Performance Computing (HPC) presents a significant programming challenge. The resultant design productivity gap is a bottleneck to achieving the maximum possible performance. Our current work aims to address this design productivity gap specifically for FPGAs, where it is a major obstacle to their wider adoption in HPC.

We will present the TyTra design flow, which is being developed in the context of our larger project that aims to create a turn-key compiler for heterogeneous target platforms.

We will discuss an evolving custom high-level language, the TyTra language, that facilitates generation of different correct-by-construction program variants through type- transformations.

We will then talk about the custom target intermediate language for the high-level TyTra language, the Tytra-IR, which is similar to LLVM, but is extended to include explicit parallelization semantics that enable it to describe the different configurations associated with each program variant. It also allows direct association of each of them with an accurate estimate of cost and performance. We will briefly discuss this cost model and our on-going work with an estimator and code generator for FPGAs.

  1. Towards Automated Design Space Exploration and Code Generation for FPGAs using Type Transformations (Syed) Waqar Nabi Wim Vanderbauwhede ENDS talk, 01 Apr 2015 www.tytra.org.uk
  2. Our aim is…  Make high-performance computing on heterogeneous platforms more accessible for scientists  A compiler that can target heterogeneous platforms  Ideally, starting from legacy (mostly Fortran) code  Realistically, a custom HLL (discussing today) or Single Assignment C (SAC)  Focus is on FPGAs  Match architecture to algorithm  Can provide the best (FL)OPS/W  Not exactly programmer-friendly!
  3. (image-only slide)
  4. (image-only slide)
  5. So, FPGAs… Why all the fuss?  Have a lot of promise  For the “right kind” of application, power-efficiency improvements of 50x over GPUs are possible (and have been reported)  Reconfigurable architecture: match the architecture to the algorithm, a balance between custom chips and microprocessors  Problem: the promise has been there for a long time now! Why aren’t they mainstream already…  GPUs have arrived, and provide a combination of programmability and power-efficiency that can be tough to beat  But NOT impossible
  6. FPGAs – The (Programming) Challenge  Design-productivity gap  Programming requires architectural insight and a hardware perspective  Not familiar to the typical software engineer  (Typically) lots of coding for very little work  E.g. 358 LOC in C → 9686 LOC in Verilog  Lack of standards, templates  Device-specific specializations → implementations  Fine-grained configurability  It is the good, the bad, and the ugly  Complex design space  Many possible solutions for implementing one algorithm on an FPGA  Require a different programming paradigm  FPGAs perform best with deep, custom, optimized pipelines
  7. Yet, FPGAs – Why they can still win  Low power consumption  Massive fine-grained parallelism  Huge internal memory bandwidth (TB/s)  High power-efficiency (GFLOPS/W)
  8. Programming Options for FPGAs  Use HDLs like Verilog or VHDL  Very low-level (the assembly language of FPGAs)  Use a familiar/standard parallel programming language like OpenCL  Both Altera and Xilinx now support it  Use a custom high-level language, friendly to software developers  TyTra Current Focus  e.g. MaxJ  Compile from legacy code directly to FPGA  TyTra Future Focus  e.g. LegUp
  9. The TyTra Project
  10. The flow we will discuss today… Our bottom-up approach to raising programming abstraction for FPGAs has gotten us this far – more or less. But we want to automate this. And, start from legacy code?
  11. THE DESIGN SPACE AND THE ESTIMATION SPACE getting our heads round the problem
  12. The cunning plan…
  13. The cunning plan…  Create an abstraction for the design-space  Create a light-weight cost-model that maps a point in the design-space to the estimation-space  Create an IR that:  Is able to capture the design-space  Allows a light-weight cost-model to be built around it  Make it a convenient target for a front-end compiler  And, while we mull over a route from legacy code (or SAC) to IR  Design a toy HLL that provides a route from HLL → IR → HDL for proof of concept
  14. And you may very well ask… Well, the jury is still out on this one…
  15. The design space
  16. The estimation space
  17. What we need…  …is a way to:  Generate program-variants that correspond to different design configurations on the FPGA  Find a way to cost them so that an automated flow can choose the best configuration  Have a route to automated code-generation of the chosen configuration  Enter:  TyTra HLL  TyTra IR
  18. ENTER: THE TYTRA HIGH-LEVEL LANGUAGE
  19. The TyTra-HLL with Higher-Order Functions map/reduce  Similar in syntax and semantics to C++  Differences:  Extensive use of higher-order functions  Use of dependent data types  size of the data expressed explicitly in the type
  20. Vector template type  To describe computations on finite-size ordered sets of data  vector < float, (ip+3)*(jp+3)*(kp+2) > p1D;  Multi-dimensional vectors are obtained through nesting:  vector < vector < float, ip+3 >, (jp+3)*(kp+2) > p2D;
  21. Higher order functions  map  vector <T2,SZ> map( T2 Function(T1), vector <T1,SZ>);  reduce  T2 reduce ( T2 Function(T2,T1), T2, vector <T1,SZ>);  Generate program variants via type-transformations
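The map/reduce signatures above can be sketched in plain C++, with std::array standing in for the HLL's size-carrying vector type. This is an illustrative analogue only, not the real TyTra-HLL: the point is that the element count SZ lives in the type, so transformations on it can be checked statically.

```cpp
#include <array>
#include <cstddef>

// Sketch of the TyTra-HLL map/reduce primitives in C++ (illustrative only).
// The size SZ is a template parameter, mimicking the dependent-size vector.
template <typename T1, typename T2, std::size_t SZ>
std::array<T2, SZ> map(T2 (*f)(T1), const std::array<T1, SZ>& v) {
    std::array<T2, SZ> out{};
    for (std::size_t i = 0; i < SZ; ++i)
        out[i] = f(v[i]);  // elements are independent: parallelizable
    return out;
}

template <typename T1, typename T2, std::size_t SZ>
T2 reduce(T2 (*f)(T2, T1), T2 acc, const std::array<T1, SZ>& v) {
    for (std::size_t i = 0; i < SZ; ++i)
        acc = f(acc, v[i]);  // sequential fold over the stream
    return acc;
}
```

Because map has no cross-element dependence, it is the construct the flow can freely re-map onto pipelined or replicated hardware; reduce carries a dependence through acc and constrains the schedule.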
  22. Exemplar: Successive Over-Relaxation
  23. Example – The Successive Over-Relaxation (SOR) Kernel  Prepare Vectors  pps = prepare_vectors (p, rhs, cn1, cn2l, ...);  Given the prepared vector of tuples, the actual SOR computation becomes  ps = map( p_sor, pps );  Where p_sor computes the new value for pressure: float p_sor( ... ) { float reltmp = omega*(cn1*( cn2l_x*p_i_p1 + cn2s_x*p_i_m1 + cn3l_x*p_j_p1 + cn3s_x*p_j_m1 + cn4l_x*p_k_p1 + cn4s_x*p_k_m1 ) - rhs_c) - p_c; return (p_c + reltmp); }
  24. The Successive Over-Relaxation (SOR) Kernel – Type Transformations  1-D vector (single execution thread)  vector <T, (im*jm*km) > pps;  Transformation to 2-D vector (two concurrent execution threads)  vector < vector <T, (im*jm) >, km > ppst;  The corresponding program  ps = map( p_sor, pps );  Becomes  pst = map( map (p_sor), ppst );  where  ppst = reshapeTo (km, pps);
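The reshapeTo transformation above can be sketched in the same C++ analogue (again illustrative, not the real HLL): a flat vector of IM*KM elements becomes KM rows of IM elements, so a single map can be rewritten as a map of maps, with each inner map a candidate for its own concurrent lane.

```cpp
#include <array>
#include <cstddef>

// Illustrative C++ analogue of the HLL's reshapeTo: split a flat N-element
// vector into KM rows of N/KM elements. The row count KM is given
// explicitly, as in ppst = reshapeTo(km, pps) on the slide.
template <std::size_t KM, typename T, std::size_t N>
std::array<std::array<T, N / KM>, KM> reshapeTo(const std::array<T, N>& v) {
    static_assert(N % KM == 0, "row count must divide the vector size");
    std::array<std::array<T, N / KM>, KM> out{};
    for (std::size_t k = 0; k < KM; ++k)
        for (std::size_t i = 0; i < N / KM; ++i)
            out[k][i] = v[k * (N / KM) + i];  // row-major split
    return out;
}
```

The transformation only re-indexes the data, so map(f, pps) and map(map(f), reshapeTo(km, pps)) compute the same values — which is what makes the program variants correct by construction.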
  25. Uses of HLL  Program variants that can map to different architectural configurations on the FPGA  Correct-by-construction transformations  High-level design entry point
  26. THE TYTRA INTERMEDIATE LANGUAGE No, this is not a toy…
  27. The platform model for TyTra (FPGA)
  28. Why do we need a new IR language? What are its requirements?  Expressiveness to explore the design-space  Convenient target for a front-end compiler  Address the entire communication hierarchy  Custom number representations  Low-abstraction enough to have a straightforward translation to HDL  A light-weight cost-model associated with it
  29. The TyTra-IR  Strongly and statically typed  All computations expressed in SSA (Static Single Assignment) form  Largely (and deliberately) based on the LLVM-IR  Use of metadata to create a more semantically loaded program
  30. Two components of the TyTra-IR  Manage-IR  Deals with:  memory objects (arrays)  streams (loops over arrays)  stream-windows  repeated calls to the kernel (outer loops)  block-memory transfers  Compute-IR  Primarily follows a data-flow model  Only deals with stream abstractions  We can, if we want, hide instruction-processors inside PEs that still have streaming port abstractions
  31. The Manage-IR; Memory Objects
      TyTra-IR | OpenCL view | LLVM-SPIR view | Hardware (FPGA)
      Cmem | Constant Memory | Constant Memory (3: Constant) | —
      Imem | Instruction Memory | Constant Memory | DistRAM / BRAM
      Pipemem | Pipeline registers | — | DistRAM
      Pmem | Private Memory (Data Mem for Instruc’ Proc’) | Private Memory (0: Private) | DistRAM
      Cachemem | Data (and Constant) Cache | — | DistRAM / BRAM
      Lmem | Local (shared) memory | Local Memory (4: Local) | M20K (BRAM) or DistRAM
      Gmem | Global memory | Global Memory (1: Global) | On-board DRAM
      Hmem | Host memory | Host Memory | Host communication
  32. The Manage-IR; Stream Objects  Can have a 1-1 or many-1 relation with memory objects  Have a 1-1 relation with arguments to pipe functions (i.e. port connections to compute-cores)
  33. The Manage-IR; repeat blocks  Repeatedly call a kernel without referring back to the host (outer loop)  May involve block memory transfers between iterations
  34. The Manage-IR; stream windows  Access offsets in streams  Use on-chip buffers for storing data read from memory
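The idea behind a stream window can be sketched as a small shift buffer in C++ (a hypothetical model, not TyTra-IR syntax): each element is read from memory once, and nearby offsets are then served from on-chip registers rather than by re-reading memory.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of a stream window: a W-deep shift buffer that exposes
// offset taps over a stream. buf models a chain of pipeline registers.
template <typename T, std::size_t W>
class StreamWindow {
    std::array<T, W> buf{};
public:
    void push(T v) {  // one new stream element arrives per cycle
        for (std::size_t i = W - 1; i > 0; --i)
            buf[i] = buf[i - 1];  // shift older elements down
        buf[0] = v;
    }
    T tap(std::size_t offset) const { return buf[offset]; }  // 0 = newest
};
```

After pushing element p[i], tap(0) is p[i] and tap(1) is p[i-1] — the kind of neighbour access a stencil kernel like SOR needs, at one memory read per element.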
  35. The Compute-IR  Structural semantics  @function_name (…args…) par  @function_name (…args…) seq  @function_name (…args…) pipe  @function_name (…args…) comb  Nesting these functions gives us the expressiveness to explore various parallelism configurations  Streaming ports  Counters and nested counters  SSA data-path instructions
  36. Example: Simple Vector Operation – The Kernel
  37. Version 1 – Single Pipeline (C2)
  38. Version 1 – Single Pipeline (C2) [diagram: a single Core_Compute pipeline — Rd, mul, add, add, add, Wr — fed by StreamControl from lmem buffers a, b, c and writing to lmem y]
  39. Version 1 – Single Pipeline [same diagram] The parser can also automatically find ILP and schedule in an ASAP fashion
  40. Version 2 – 4 Parallel Pipelines (C1)
  41. Version 2 – 4 Parallel Pipelines [diagram: four replicated Core_Compute pipelines — each Rd, mul, add, add, add, Wr — sharing StreamControl over lmem a, b, c and lmem y]
  42. Version 2 – 4 Parallel Pipelines [same diagram]
  43. Version 3 – Scalar Instruction Processor (C4)
  44. Version 3 – Scalar Instruction Processor (C4) [diagram: one Core_Compute containing a PE (Instruction Processor) with program { add add mul add } and an ALU, between StreamControl over lmem a, b, c and lmem y] The ALU would be customized for the instructions mapped to this PE at compile-time
  45. Version 3 – Single Sequential Processor [diagram: as above, with a Generic PE]
  46. Version 4 – Multiple Processors / Vectorization (C5)
  47. Version 4 – Multiple Processors / Vectorization (C5) [diagram: four Core_Compute blocks, each a Generic PE with program { add add mul add } and an ALU, sharing StreamControl over lmem a, b, c and lmem y]
  48. Version 4 – Multiple Sequential Processors (Vectorization) [same diagram] Note the continued use of stream abstractions even though the PEs are Instruction Processors now
  49. ESTIMATION AND CODE-GENERATION
  50. Estimation Flow
  51. Resource Estimate  Device-specific experiments  Look-up tables or simple formulas, along with (mostly linear) interpolation  The regular structure of the FPGA helps  Some formulas may be portable to other FPGA families, some may be very specific
  52. Resource Estimate Example – Integer Division  Non-linear interpolation  Fitted curve: y = 0.9978x² + 3.683x − 10.571 (R² = 1)  Input data points at 18, 32, and 64 bits  Estimated ALUT usage for 24 bits = 654  Actual ALUT usage for 24 bits = 652
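The fitted curve above can be evaluated directly; the coefficients below are the rounded values displayed on the slide's chart, so the result differs slightly from the slide's 654 (which comes from the full-precision fit).

```cpp
// The slide's fitted resource model for an integer divider on the target
// device: ALUT usage as a quadratic in operand width. Coefficients are as
// displayed on the chart, i.e. rounded.
double divider_aluts(double bits) {
    return 0.9978 * bits * bits + 3.683 * bits - 10.571;
}
```

divider_aluts(24) gives about 652.6 with these rounded coefficients, against the slide's estimate of 654 and an actual usage of 652 ALUTs — close enough for design-space pruning, which is all the cost model needs.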
  53. Performance (Throughput) Estimate  EWGT: Effective Workgroup Throughput  A coarser performance parameter than OPS/FLOPS  Should capture the entire design-space
  54. Performance (Throughput) Estimate
  55. Performance (Throughput) Estimate This is a generic formula for the entire design space (C0). By setting appropriate parameters, configuration-specific formulas can be derived. E.g. for C2, limited to one pipeline lane, set NR = 1, TR = 0, NI = 1, DV = 1, L = 1
  56. Why it works (till now at least!)  TyTra-IR is sufficiently low-level to expose the parameters needed for the EWGT calculation  TyTra-IR has sufficient structural information to associate it directly with resources on an FPGA  But it does depend on some upfront homework for every FPGA device  Because TyTra-IR is our language, we can ensure that:  All legal instructions (and constructions) have a cost-model associated with them  As long as the front-end compiler can map the HLL onto the TyTra-IR, we can cost HL program variants
  57. The TyTra Back-end Compiler (TyBEC)  On-going work  Uses Perl’s Parse::RecDescent to parse the IR  Preliminary EWGT and Resource Estimator operational  Preliminary code-generator almost done
  58. Results Estimated (E) vs actual (A) cost and throughput for C2 and C1 configurations of a very simple dummy kernel
  59. Results Estimated (E) vs actual (A) cost and throughput for C2 and C1 configurations of a more realistic Successive Over-Relaxation kernel
  60. WHERE DO WE GO FROM HERE…
  61. Quite a few avenues…  Experiment with more kernels, their program-variants, estimated vs actual costs, (correct) code-generation; use (CHStone) benchmarks  Computation-aware caches, optimized for halo-based scientific computations  Integrate with the Altera-OpenCL platform for host-device communication  Back-end optimizations, LLVM passes, LLVM → TyTra-IR translation  Route to TyTra-IR from SAC  Integrate the TyTra-FPGA flow with the SAC → GPU (OpenCL) flow for heterogeneous targets  Use of Multi-party Session Types to ensure correctness of transformations  Even code-generation for clusters?  Abstract descriptions of target hardware  SystemC-TLM model to profile applications and high-level partitioning in a heterogeneous environment
  62. Quite a few avenues… (same list as the previous slide) etcetera, etcetera, etcetera
  63. The woods are lovely, dark and deep, But I have promises to keep, And lines to code before I sleep, And lines to code before I sleep.
  64. Arigato! Waqar Room 405, SAWB Ext 2074 Syed.Nabi@glasgow.ac.uk
