A High-Level Programming Approach for using FPGAs in HPC using Functional Description, Vector Type-Transformations and Cost-Modelling
1. A HIGH-LEVEL PROGRAMMING APPROACH FOR USING FPGAS IN HPC USING FUNCTIONAL DESCRIPTION, VECTOR TYPE-TRANSFORMATIONS AND COST-MODELLING
S WAQAR NABI & WIM VANDERBAUWHEDE
www.tytra.org.uk
School of Informatics, University of Edinburgh, 25 Feb 2016
2. Using Safe Transformations and a
Cost-model For HPC On FPGAs
• The TyTra project context
• Our approach, blue-sky target, down-to-earth target, where
we are now, how we are different
• Key contributions
• (1) Type transformations to create design-variants, (2) a
new Intermediate Language, and (3) an FPGA Cost model
• The cost model
• Performance and resource-usage estimates, some results
Using safe transformations and an associated light-weight cost-model opens the
route to a fully automated design-space exploration flow
3. THE CONTEXT
Our approach, blue-sky target, down-to-earth target, where we are now,
how we are different
5. Blue Sky Target
[Diagram labels: Legacy Scientific Code; Heterogeneous HPC Target Description; Cost Model; Optimized HPC solution!]
The goal that keeps us motivated!
(The pragmatic target is somewhat more modest…)
6. The Cunning Plan…
A performance-portable code-base that builds on a purely software programming
paradigm.
7. The Cunning Plan…
1. Use a functional programming paradigm and
(auto-)generate correct-by-construction
program-variants through vector
transformations
• which translate to design-variants on the
FPGA.
2. Create an Intermediate Language:
• captures design-space
• light-weight cost-model
• target for front-end compiler
3. Create a fast and accurate cost-model
that can estimate the performance and
resource-utilization for each variant.
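To make step 1 concrete, here is a minimal Python sketch of a vector type-transformation. It is an illustration under assumptions, not the TyTra front-end: the names `reshape`, `flatten`, `baseline`, `variant` and `kernel` are hypothetical. A map over a flat vector of size N is rewritten as a map over `lanes` chunks of size N/lanes. The result is unchanged by construction, and each lane count corresponds to one candidate design-variant.

```python
from typing import Callable, List


def reshape(xs: List[int], lanes: int) -> List[List[int]]:
    """Split a flat vector of size N into `lanes` contiguous chunks of size N/lanes."""
    if lanes <= 0 or len(xs) % lanes != 0:
        raise ValueError("vector length must be a positive multiple of the lane count")
    chunk = len(xs) // lanes
    return [xs[i * chunk:(i + 1) * chunk] for i in range(lanes)]


def flatten(xss: List[List[int]]) -> List[int]:
    """Inverse of reshape: concatenate the chunks back into one vector."""
    return [x for xs in xss for x in xs]


def baseline(f: Callable[[int], int], xs: List[int]) -> List[int]:
    """Reference program: a single map over the whole vector."""
    return [f(x) for x in xs]


def variant(f: Callable[[int], int], xs: List[int], lanes: int) -> List[int]:
    """Transformed program: outer map over lanes, inner map within each lane."""
    return flatten([[f(x) for x in lane] for lane in reshape(xs, lanes)])


def kernel(x: int) -> int:
    """Hypothetical per-element computation."""
    return 3 * x + 1


if __name__ == "__main__":
    data = list(range(16))
    reference = baseline(kernel, data)
    for lanes in (1, 2, 4, 8, 16):
        # Every variant computes the same result as the reference program.
        assert variant(kernel, data, lanes) == reference
```

The outer map is the part that could be mapped to parallel lanes on the FPGA, while the inner map could be mapped to a pipeline; that mapping is the reading assumed here.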
9. And You May Very Well Ask…
The jury is still out…
10. Where We Are Now
Working with small but real scientific code
11. Where We Are Now
[Diagram label: Legacy Fortran Scientific Code]
14. IR AND COST MODEL
(1) A custom Intermediate Language, and (2) a fast and accurate Cost
Model
15. Pre-requisite: Models
Of Abstraction
1. Platform model
2. Memory hierarchy model
3. Execution model
4. Design-space and cost-space model
5. Memory execution model
6. Data access pattern model
16. Pre-requisite: Models
Of Abstraction
1. Platform model
2. Memory hierarchy model
3. Execution model
4. Design-space model
5. Memory execution model
6. Data access pattern model
(More or less) based on the OpenCL standard
21. Performance Estimate
Dependence On Memory Execution Model
[Timing diagram: Activity vs. Time. Activities: Host → Device-DRAM; Device-DRAM → Device-Buffers; Device-Buffers → Offset-Buffers; Kernel Pipeline Execution.]
23. Performance Estimate
Dependence On Memory Execution Model
Form A
[Timing diagram: Activity vs. Time over work-instance iterations. Activities: Host → Device-DRAM; Device-DRAM → Device-Buffers; Device-Buffers → Offset-Buffers; Kernel Pipeline Execution. Label: All iterations.]
24. Performance Estimate
Dependence On Memory Execution Model
Form B
[Timing diagram: Activity vs. Time over work-instance iterations. Activities: Host → Device-DRAM; Device-DRAM → Device-Buffers; Device-Buffers → Offset-Buffers; Kernel Pipeline Execution. Labels: First Iteration only; Last Iteration only; All other iterations.]
25. Performance Estimate
Dependence On Memory Execution Model
Form C
[Timing diagram: Activity vs. Time over work-instance iterations. Activities: Host → Device-DRAM; Device-DRAM → Device-Buffers; Device-Buffers → Offset-Buffers; Kernel Pipeline Execution. Labels: First Iteration only; Last Iteration only; All other iterations.]
Once a design-variant is categorized, performance can be estimated accordingly
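As a rough sketch of how the category could feed the estimate, the Python below totals the time over all work-instance iterations for each form. The mapping of stages to iterations is an assumption made for illustration, because the diagrams do not survive in this transcript. Form A is assumed to pay every transfer on every iteration. Form B is assumed to pay the host transfers once (in on the first iteration, out on the last). Form C is assumed to pay the DRAM-to-buffer transfer once as well. Stages are also assumed to run back-to-back with no overlap. `StageTimes` and `total_time` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    """Stage times in seconds for one work-instance (hypothetical inputs)."""
    host_dram: float     # host <-> device-DRAM transfers (input in plus results out)
    dram_buffers: float  # device-DRAM <-> on-chip buffer transfers
    kernel: float        # kernel pipeline execution, including offset-buffer fill


def total_time(form: str, t: StageTimes, iterations: int) -> float:
    """Total time over all work-instance iterations for an assumed form A, B or C."""
    if iterations < 1:
        raise ValueError("at least one iteration is required")
    if form == "A":
        # Assumed: every stage is repeated on every iteration.
        return iterations * (t.host_dram + t.dram_buffers + t.kernel)
    if form == "B":
        # Assumed: host transfers happen once; the rest repeats.
        return t.host_dram + iterations * (t.dram_buffers + t.kernel)
    if form == "C":
        # Assumed: data stays in on-chip buffers; only the kernel repeats.
        return t.host_dram + t.dram_buffers + iterations * t.kernel
    raise ValueError(f"unknown form: {form}")


if __name__ == "__main__":
    times = StageTimes(host_dram=4.0, dram_buffers=2.0, kernel=1.0)
    for form in ("A", "B", "C"):
        print(form, total_time(form, times, iterations=10))
```

Under these assumptions the forms differ only in which terms are multiplied by the iteration count, which is why categorizing a variant first keeps the estimate cheap.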
27. Pre-requisite: Models
Of Abstraction
1. Platform model
2. Memory hierarchy model
3. Execution model
4. Design-space model
5. Memory execution model
6. Data access pattern model
• Contiguous access
• (Fixed) Strided access
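The two access patterns can be illustrated with a short Python sketch that generates the element addresses each one touches. The function names and the row-major 4 x 8 array are assumptions chosen for the example. Walking along a row gives contiguous access, while walking down a column gives a fixed stride equal to the row length.

```python
from typing import List


def contiguous_addresses(base: int, count: int) -> List[int]:
    """Addresses base, base+1, ..., base+count-1."""
    return [base + i for i in range(count)]


def strided_addresses(base: int, count: int, stride: int) -> List[int]:
    """Addresses base, base+stride, base+2*stride, ... with a fixed stride."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [base + i * stride for i in range(count)]


# Hypothetical row-major array with ROWS rows and COLS columns.
ROWS, COLS = 4, 8
row_1 = contiguous_addresses(base=1 * COLS, count=COLS)      # all of row 1
col_2 = strided_addresses(base=2, count=ROWS, stride=COLS)   # all of column 2

if __name__ == "__main__":
    print(row_1)
    print(col_2)
```

A stride of 1 reduces to the contiguous case, so a single strided description could cover both patterns.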
28. The Back-end
Approach
• Use (or design) an IR that can capture all these models
• We ended up using LLVM and modifying it to fit our
purpose, effectively creating a custom IR we call the
“TyTra-IR”.
• Develop a cost-model that can evaluate the variants
expressed in the IR
30. The TyTra IR
• Strongly and statically typed - Largely based on the LLVM-IR
• All computations expressed in SSA (Static Single
Assignment) form
• Keywords pipe, par, seq and comb to indicate type of
parallelism, and nested functions of these types used to
build architectural configurations
Manage-IR
• Memory objects
• Streams
• Offset streams
Compute-IR
• Streaming datapath model
• SSA instructions
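The idea of nesting pipe, par, seq and comb functions to build a configuration can be modelled with a toy Python tree. This is not TyTra-IR syntax; `Func`, `count_kind`, `render` and `make_lane` are hypothetical names, and reading "par of pipes" as replicated pipeline lanes is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

KINDS = ("pipe", "par", "seq", "comb")


@dataclass
class Func:
    """A function tagged with one of the four kinds, optionally calling others."""
    name: str
    kind: str
    body: List["Func"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind}")


def count_kind(f: Func, kind: str) -> int:
    """Count functions of the given kind in the whole tree rooted at f."""
    own = 1 if f.kind == kind else 0
    return own + sum(count_kind(child, kind) for child in f.body)


def render(f: Func, depth: int = 0) -> str:
    """Indented text view of the nesting."""
    lines = ["  " * depth + f"{f.kind} {f.name}"]
    lines += [render(child, depth + 1) for child in f.body]
    return "\n".join(lines)


def make_lane(i: int) -> Func:
    """One pipeline lane containing a single combinatorial block."""
    return Func(f"lane{i}", "pipe", [Func(f"stage{i}", "comb")])


# Two candidate configurations of the same kernel.
single = make_lane(0)
four = Func("top", "par", [make_lane(i) for i in range(4)])

if __name__ == "__main__":
    print(render(four))
```

In this toy model, the number of `pipe` nodes under a `par` node is what distinguishes one design-variant from another.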
34. The Cost-model Use-case
A set of standardized experiments feed target-specific empirical data to the cost
model, and the rest comes from the IR description.
35. Resource Estimates - Example
[Figure panels: Integer Division; Integer Multiplication]
Light-weight cost expressions associated with every legal SSA instruction in the
TyTra-IR
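A minimal Python sketch of what per-instruction cost expressions could look like follows. The opcodes, the single abstract "resource units" figure and every coefficient are placeholders invented for illustration; they are not measured values and not the actual TyTra expressions. In practice the expressions would be derived from target-specific experiments.

```python
from typing import Callable, Dict, Iterable, Tuple

# Placeholder cost expressions: opcode -> function of operand bit-width.
# The shapes (linear, quadratic) and coefficients are illustrative only.
COST: Dict[str, Callable[[int], int]] = {
    "add": lambda width: width,
    "mul": lambda width: 2 * width,
    "udiv": lambda width: width * width,
}


def estimate_resources(instructions: Iterable[Tuple[str, int]]) -> int:
    """Sum the per-instruction estimates for a list of (opcode, bit-width) pairs."""
    total = 0
    for opcode, width in instructions:
        if opcode not in COST:
            raise ValueError(f"no cost expression for opcode: {opcode}")
        total += COST[opcode](width)
    return total


if __name__ == "__main__":
    datapath = [("add", 32), ("mul", 32), ("udiv", 32)]
    print(estimate_resources(datapath))
```

Because each estimate is a table lookup plus a small arithmetic expression, costing a whole variant takes a single pass over its instructions.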
37. Performance Estimate
Effective Work-Instance Throughput (EWIT)
o Work-Instance = Executing the kernel over the entire index-space
Key Determinants
o Memory execution model
o Sustained memory bandwidth for the target architecture and design-variant
• Data-access pattern
o Design configuration of the FPGA
o Operating frequency of the FPGA
o Compute-bound or IO-bound?
Performance model is trickier, especially calculating estimates of sustained
memory bandwidth.
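The determinants listed above can be combined in a simple roofline-style Python sketch. This is an assumed simplification, not the TyTra performance model: compute and memory traffic are assumed to overlap fully, each lane is assumed to process one index-space item per clock cycle, and the function name `ewit` and all of its parameters are hypothetical.

```python
from typing import Tuple


def ewit(index_space: int, bytes_per_item: int, lanes: int, freq_hz: float,
         pipeline_depth: int, sustained_bw: float) -> Tuple[float, str]:
    """Return (work-instances per second, 'compute' or 'io') for one variant.

    sustained_bw is the sustained memory bandwidth in bytes per second for
    the variant's data-access pattern.
    """
    if min(index_space, bytes_per_item, lanes) < 1 or pipeline_depth < 0:
        raise ValueError("sizes must be positive and pipeline depth non-negative")
    if freq_hz <= 0 or sustained_bw <= 0:
        raise ValueError("frequency and bandwidth must be positive")
    # Cycles to fill the pipeline plus one item per lane per cycle.
    compute_s = (pipeline_depth + index_space / lanes) / freq_hz
    # Time to stream the whole index-space through memory once.
    memory_s = index_space * bytes_per_item / sustained_bw
    bound = "compute" if compute_s >= memory_s else "io"
    return 1.0 / max(compute_s, memory_s), bound


if __name__ == "__main__":
    for lanes in (1, 4):
        print(lanes, ewit(1_000_000, 8, lanes, 200e6, 0, 4e9))
```

In this sketch, adding lanes raises throughput only until the variant becomes IO-bound, which is why the sustained bandwidth estimate matters so much.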
55. The Route To Automated Design Space
Exploration On FPGAs For HPC
Applications
The larger aim is to create a turn-key compiler for:
Legacy scientific code → Heterogeneous HPC Platform
o Current focus is on FPGAs, and on using a Functional
Language design entry
Our main contributions are:
o Type transformations to create design-variants,
o New Intermediate Language, and
o FPGA Cost model
Our FPGA Cost Model
o Works on the TyTra-IR, is light-weight, accurate (enough), and
allows us to evaluate design-variants
Using safe transformations on a functional language paradigm and a light-weight
cost-model brings us closer to a turn-key HPC compiler for legacy code
56. The woods are lovely, dark and deep,
But I have promises to keep,
And lines to code before I sleep,
And lines to code before I sleep.
Acknowledgement
We wish to acknowledge support
by EPSRC through grant EP/L00058X/1.