A HIGH-LEVEL PROGRAMMING APPROACH
FOR
USING FPGAS IN HPC
USING
FUNCTIONAL DESCRIPTION,
VECTOR TYPE-TRANSFORMATIONS AND
COST-MODELLING
S WAQAR NABI & WIM VANDERBAUWHEDE
www.tytra.org.uk
School of Informatics, University of Edinburgh, 25 Feb 2016
Using Safe Transformations and a
Cost-model For HPC On FPGAs
• The TyTra project context
• Our approach, blue-sky target, down-to-earth target, where
we are now, how we are different
• Key contributions
• (1) Type transformations to create design-variants, (2) a
new Intermediate Language, and (3) an FPGA Cost model
• The cost model
• Performance and resource-usage estimates, some results
Using safe transformations and an associated light-weight cost-model opens the
route to a fully automated design-space exploration flow
THE CONTEXT
Our approach, blue-sky target, down-to-earth target, where we are now,
how we are different
Blue Sky Target
[Diagram: Legacy Scientific Code, Heterogeneous HPC Target Description, Cost Model, Optimized HPC solution!]
The goal that keeps us motivated!
(The pragmatic target is somewhat more modest…)
A performance portable code-base that builds on a purely software programming
paradigm.
The Cunning Plan…
1. Functional programming paradigm and
(auto) generate correct-by-construction
program-variants through vector-transformations
• which translate to design-variants on the
FPGA.
2. Create an Intermediate Language:
• captures design-space
• light-weight cost-model
• target for front-end compiler
3. Create a fast and accurate cost-model
that can estimate the performance and
resource-utilization for each variant.
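A minimal sketch of the idea in item 1, written in Python rather than in the project's functional language. It assumes a kernel applied element-wise to a vector; reshaping the vector into lanes yields a variant that computes the same result by construction. The names `reshape_to`, `variant_flat` and `variant_lanes` are hypothetical and do not come from the TyTra tools.

```python
from typing import Callable, List


def reshape_to(lanes: int, xs: List[int]) -> List[List[int]]:
    """Split a flat vector into `lanes` equal, contiguous sub-vectors."""
    if lanes < 1 or len(xs) % lanes != 0:
        raise ValueError("vector length must be divisible by the lane count")
    size = len(xs) // lanes
    return [xs[i * size:(i + 1) * size] for i in range(lanes)]


def flatten(xss: List[List[int]]) -> List[int]:
    """Inverse of reshape_to: concatenate the sub-vectors in order."""
    return [x for xs in xss for x in xs]


def variant_flat(f: Callable[[int], int], xs: List[int]) -> List[int]:
    """Baseline variant: one map over the whole vector (a single pipeline)."""
    return [f(x) for x in xs]


def variant_lanes(f: Callable[[int], int], xs: List[int], lanes: int) -> List[int]:
    """Transformed variant: an outer map over lanes, an inner map per lane.

    On an FPGA the outer map could become replicated lanes; here both maps
    simply run sequentially, which is enough to check equivalence.
    """
    return flatten([[f(x) for x in lane] for lane in reshape_to(lanes, xs)])


def kernel(x: int) -> int:
    """Hypothetical stand-in for a scientific kernel body."""
    return 2 * x + 1
```

Every lane count that divides the vector length gives a different design-variant with identical output, which is what makes the transformation safe to apply automatically.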
And You May Very Well Ask…
The jury is still out…
Where We Are Now
Legacy Fortran Scientific Code
Working with small but real scientific code
VECTOR
TYPE TRANSFORMATIONS
Wim’s slides
IR AND COST MODEL
(1) A custom Intermediate Language, and (2) a fast and accurate Cost
Model
Pre-requisite: Models
Of Abstraction
1. Platform model
2. Memory hierarchy model
3. Execution model
4. Design-space and cost-space model
5. Memory execution model
6. Data access pattern model
(More or less) based
on OpenCL standard
Platform And Memory Model
Design Space
Performance Estimate
Dependence On Memory Execution Model
[Figure: activity versus time. Activities: Host → Device-DRAM, Device-DRAM → Device-Buffers, Device-Buffers → Offset-Buffers, Kernel Pipeline Execution]
Work-Instance Iterations, Form A
[Figure: activity-versus-time chart of the transfer stages and kernel pipeline execution, labelled "All iterations"]
Work-Instance Iterations, Form B
[Figure: activity-versus-time chart of the transfer stages and kernel pipeline execution, with labels "First Iteration only", "Last Iteration only" and "All other iterations"]
Work-Instance Iterations, Form C
[Figure: activity-versus-time chart of the transfer stages and kernel pipeline execution, with labels "First Iteration only", "Last Iteration only" and "All other iterations"]
Once a design-variant is categorized, performance can be estimated accordingly
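A rough sketch of how the form could drive the estimate. The mapping of stages to iterations below is an assumption, since the figures did not survive extraction: Form A repeats every transfer in every iteration, Form B moves data between host and device DRAM only once, and Form C additionally moves data between device DRAM and the on-chip buffers only once. Stages are assumed to run back to back with no overlap. `StageTimes` and `total_time` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    host_dram: float     # seconds for the host <-> device-DRAM transfers
    dram_buffers: float  # seconds for the device-DRAM <-> buffer transfers
    kernel: float        # seconds of kernel pipeline execution per work-instance


def total_time(form: str, t: StageTimes, iterations: int) -> float:
    """Estimate the run time of `iterations` work-instances for a given form.

    Assumed (not taken from the slides):
      A: all stages repeat in every iteration
      B: host transfers happen once, the rest repeats
      C: host and DRAM-to-buffer transfers happen once, only the kernel repeats
    """
    if iterations < 1:
        raise ValueError("need at least one work-instance iteration")
    if form == "A":
        return iterations * (t.host_dram + t.dram_buffers + t.kernel)
    if form == "B":
        return t.host_dram + iterations * (t.dram_buffers + t.kernel)
    if form == "C":
        return t.host_dram + t.dram_buffers + iterations * t.kernel
    raise ValueError(f"unknown memory execution form: {form!r}")
```

With these assumptions, the more data stays close to the kernel between iterations, the more the transfer cost is amortized.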
Data access pattern model
1. Contiguous access
2. (Fixed) Strided access
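The two access patterns can be illustrated with a small address generator. This is only an illustration of the terms; the function names are hypothetical and the example matrix layout (row-major) is an assumption.

```python
from typing import List


def contiguous(base: int, count: int) -> List[int]:
    """Addresses of `count` consecutive elements starting at `base`."""
    return [base + i for i in range(count)]


def strided(base: int, count: int, stride: int) -> List[int]:
    """Addresses of `count` elements separated by a fixed `stride`."""
    if stride < 1:
        raise ValueError("stride must be a positive element count")
    return [base + i * stride for i in range(count)]


# Example: a 4x4 matrix stored row-major in elements 0..15.
ROW_1 = contiguous(base=4, count=4)            # walking along row 1
COLUMN_1 = strided(base=1, count=4, stride=4)  # walking down column 1
```

A contiguous walk is the special case of a strided walk with stride 1, which is one reason the two patterns can share a single parameterized description.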
The Back-end
Approach
• Use (or design) an IR that can capture all these models
• We ended up using LLVM and modifying it to fit our
purpose, effectively creating a custom IR we call the
“TyTra-IR”.
• Develop a cost-model that can evaluate the variants
expressed in the IR
The IR
The TyTra-IR
• Strongly and statically typed; largely based on the LLVM-IR
• All computations expressed in SSA (Static Single Assignment) form
• Keywords pipe, par, seq and comb to indicate type of
parallelism, and nested functions of these types used to
build architectural configurations
Manage-IR
• Memory objects
• Streams
• Offset streams
Compute-IR
• Streaming datapath model
• SSA instructions
TyTra-IR Syntax
A Typical TyTra-IR
Configuration Tree
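The original figure is not recoverable, so the following is only a sketch of what a configuration tree built from nested pipe, par, seq and comb functions might look like as a data structure. It is Python, not TyTra-IR syntax, and the node names and the four-lane example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

KINDS = ("pipe", "par", "seq", "comb")  # the four keywords named on the slide


@dataclass
class Node:
    kind: str
    name: str
    children: List["Node"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown function kind: {self.kind!r}")


def render(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented text lines, one node per line."""
    lines = ["  " * depth + f"{node.kind} {node.name}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def count_kind(node: Node, kind: str) -> int:
    """Count the nodes of a given kind in the tree."""
    own = 1 if node.kind == kind else 0
    return own + sum(count_kind(child, kind) for child in node.children)


# Hypothetical configuration: four parallel lanes, each a pipeline that
# contains one combinatorial block.
EXAMPLE = Node("par", "top", [
    Node("pipe", f"lane{i}", [Node("comb", f"alu{i}")]) for i in range(4)
])
```

Calling `render(EXAMPLE)` gives one line per node, indented by nesting depth, which is a convenient way to eyeball the architecture a variant describes.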
The Cost-model
The Cost-model Use-case
A set of standardized experiments feeds target-specific empirical data to the cost
model, and the rest comes from the IR description.
Resource Estimates - Example
Integer Division
Integer Multiplication
Light-weight cost expressions associated with every legal SSA instruction in the
TyTra-IR
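A sketch of how per-instruction cost expressions could be organized, assuming each opcode maps to a function of the operand bit-width. The coefficients below are placeholders chosen for illustration, not measured values from the slides or from any device; in the approach described above they would come from the standardized experiments. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Resources = Dict[str, int]


def _ceil_div(a: int, b: int) -> int:
    return -(-a // b)


# PLACEHOLDER cost expressions, keyed by opcode. Real expressions would be
# fitted to empirical data for a specific target device.
COST_EXPRESSIONS: Dict[str, Callable[[int], Resources]] = {
    "add": lambda width: {"luts": width, "dsps": 0},
    "mul": lambda width: {"luts": 0, "dsps": _ceil_div(width, 18) ** 2},
    "udiv": lambda width: {"luts": width * width, "dsps": 0},
}


def estimate(instructions: List[Tuple[str, int]]) -> Resources:
    """Sum the cost expressions over a list of (opcode, bit-width) pairs."""
    total: Resources = {"luts": 0, "dsps": 0}
    for opcode, width in instructions:
        if opcode not in COST_EXPRESSIONS:
            raise KeyError(f"no cost expression for opcode {opcode!r}")
        for resource, amount in COST_EXPRESSIONS[opcode](width).items():
            total[resource] += amount
    return total
```

Because each expression is a closed-form function, estimating a whole datapath is a single pass over its instructions, which is what keeps such a model light-weight.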
Performance Estimate
• Effective Work-Instance Throughput (EWIT)
  o Work-Instance = Executing the kernel over the entire index-space
• Key Determinants
  o Memory execution model
  o Sustained memory bandwidth for the target architecture and design-variant
    • Data-access pattern
  o Design configuration of the FPGA
  o Operating frequency of the FPGA
  o Compute-bound or IO-bound?
The performance model is trickier, especially calculating estimates of sustained
memory bandwidth.
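A simplified sketch of the compute-bound versus IO-bound question, assuming a fully pipelined kernel that accepts one item per lane per cycle and a single sustained-bandwidth figure for the access pattern in use. This is not the expression used in the talk, which is not recoverable from the transcript; every parameter name here is hypothetical.

```python
import math
from typing import Tuple


def cycles_per_work_instance(n_items: int, pipeline_depth: int, lanes: int) -> int:
    """Cycles to fill the pipeline once and then stream all items through."""
    return pipeline_depth + math.ceil(n_items / lanes)


def ewit(n_items: int, bytes_per_item: int, pipeline_depth: int, lanes: int,
         freq_hz: float, sustained_bw: float) -> Tuple[float, str]:
    """Return (work-instances per second, limiting factor).

    sustained_bw is in bytes per second and is assumed to already reflect
    the data-access pattern of the design-variant.
    """
    compute_s = cycles_per_work_instance(n_items, pipeline_depth, lanes) / freq_hz
    io_s = (n_items * bytes_per_item) / sustained_bw
    bound = "compute" if compute_s >= io_s else "io"
    return 1.0 / max(compute_s, io_s), bound
```

The slower of the two stages sets the throughput, so adding lanes only helps while the variant stays compute-bound.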
Effect of Access Pattern with Different Array
Sizes
Effect of using Vector-Access Optimizations with
Different Array Sizes
Performance Estimates
Parameters that Make up the Expression
Performance Estimates
The Expressions
Performance Estimates
Experimental Results (Type C)
Estimated vs actual cost and throughput
(CPWI = cycles per work instance)
Does The TyTra Approach Work?
How Fast Is The Cost Model?
[Bar chart: time taken to generate estimate (sec). Xilinx SDAccel tools: 70; TyTra: 0.3. 200x faster]
Design-space Exploration?
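With fast estimates available, exploration can in principle be a simple filter-and-rank over the variants. The sketch below assumes each variant already has resource and throughput estimates attached; the record fields, budgets and selection rule are hypothetical and are not taken from the TyTra flow.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VariantEstimate:
    name: str
    luts: int    # estimated LUT usage
    dsps: int    # estimated DSP usage
    ewit: float  # estimated work-instances per second


def pick_best(estimates: List[VariantEstimate], lut_budget: int,
              dsp_budget: int) -> Optional[VariantEstimate]:
    """Return the highest-throughput variant that fits the device, if any."""
    feasible = [e for e in estimates
                if e.luts <= lut_budget and e.dsps <= dsp_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda e: e.ewit)


# Hypothetical estimates for three variants of one kernel.
CANDIDATES = [
    VariantEstimate("1-lane", luts=10_000, dsps=8, ewit=100.0),
    VariantEstimate("4-lane", luts=40_000, dsps=32, ewit=380.0),
    VariantEstimate("16-lane", luts=160_000, dsps=128, ewit=900.0),
]
```

Here `pick_best(CANDIDATES, lut_budget=100_000, dsp_budget=64)` selects the 4-lane variant, because the 16-lane variant exceeds both budgets.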
CONCLUSION
The Route To Automated Design Space
Exploration On FPGAs For HPC
Applications
• The larger aim is to create a turn-key compiler for:
Legacy scientific code → Heterogeneous HPC Platform
  o Current focus is on FPGAs, and on using a Functional
Language design entry
• Our main contributions are:
  o Type transformations to create design-variants,
  o New Intermediate Language, and
  o FPGA Cost model
• Our FPGA Cost Model
  o Works on the TyTra-IR, is light-weight, accurate (enough), and
allows us to evaluate design-variants
Using safe transformations on a functional language paradigm and a light-weight
cost-model brings us closer to a turn-key HPC compiler for legacy code
The woods are lovely, dark and deep,
But I have promises to keep,
And lines to code before I sleep,
And lines to code before I sleep.
Acknowledgement
We wish to acknowledge support
by EPSRC through grant EP/L00058X/1.
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
 
fundamentals of drawing and isometric and orthographic projection
fundamentals of drawing and isometric and orthographic projectionfundamentals of drawing and isometric and orthographic projection
fundamentals of drawing and isometric and orthographic projection
 
Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdf
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
 
The Ultimate Guide to External Floating Roofs for Oil Storage Tanks.docx
The Ultimate Guide to External Floating Roofs for Oil Storage Tanks.docxThe Ultimate Guide to External Floating Roofs for Oil Storage Tanks.docx
The Ultimate Guide to External Floating Roofs for Oil Storage Tanks.docx
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
shape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptxshape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptx
 
Top 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering ScientistTop 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering Scientist
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 

A High-Level Programming Approach for using FPGAs in HPC using Functional Description, Vector Type-Transformations and Cost-Modelling

  • 1. A HIGH-LEVEL PROGRAMMING APPROACH FOR USING FPGAS IN HPC USING FUNCTIONAL DESCRIPTION, VECTOR TYPE-TRANSFORMATIONS AND COST-MODELLING S WAQAR NABI & WIM VANDERBAUWHEDE www.tytra.org.uk School of Informatics, University of Edinburgh, 25 Feb 2016
  • 2. Using Safe Transformations and a Cost-model For HPC On FPGAs • The TyTra project context • Our approach, blue-sky target, down-to-earth target, where we are now, how we are different • Key contributions • (1) Type transformations to create design-variants, (2) a new Intermediate Language, and (3) an FPGA Cost model • The cost model • Performance and resource-usage estimates, some results Using safe transformations and an associated light-weight cost-model opens the route to a fully automated design-space exploration flow
  • 3. THE CONTEXT Our approach, blue-sky target, down-to-earth target, where we are now, how we are different
  • 5. Blue Sky Target Cost Model Legacy Scientific Code Heterogeneous HPC Target Description Optimized HPC solution! The goal that keeps us motivated! ( The pragmatic target is somewhat more modest…)
  • 6. A performance portable code-base that builds on a purely software programming paradigm. The Cunning Plan…
  • 7. The Cunning Plan… 1. Use a functional programming paradigm and (auto-)generate correct-by-construction program-variants through vector-transformations • which translate to design-variants on the FPGA. 2. Create an Intermediate Language: • captures design-space • light-weight cost-model • target for front-end compiler 3. Create a fast and accurate cost-model that can estimate the performance and resource-utilization for each variant. A performance portable code-base that builds on a purely software programming paradigm.
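The idea behind step 1 can be illustrated with a minimal Python sketch. It is not the TyTra front-end: the names `baseline`, `reshape_to` and `lane_variant` are hypothetical, and plain lists stand in for sized vector types. The sketch reshapes a flat vector into several lanes, maps the same function over each lane, and flattens the result, so every variant computes the same values as the baseline while exposing a different amount of parallelism.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def baseline(f: Callable[[A], B], xs: Sequence[A]) -> List[B]:
    """Single-lane variant: apply f to every element in order."""
    return [f(x) for x in xs]


def reshape_to(lanes: int, xs: Sequence[A]) -> List[List[A]]:
    """Split a flat vector into `lanes` contiguous sub-vectors of equal size."""
    if lanes <= 0 or len(xs) % lanes != 0:
        raise ValueError("lanes must be positive and divide the vector length")
    size = len(xs) // lanes
    return [list(xs[i * size:(i + 1) * size]) for i in range(lanes)]


def lane_variant(lanes: int, f: Callable[[A], B], xs: Sequence[A]) -> List[B]:
    """Multi-lane variant: map f over each lane, then flatten.

    Each inner map is independent of the others, which is what would
    correspond to a replicated (parallel) lane on the device.
    """
    mapped = [[f(x) for x in lane] for lane in reshape_to(lanes, xs)]
    return [y for lane in mapped for y in lane]
```

For any lane count that divides the vector length, `lane_variant(k, f, xs)` returns the same list as `baseline(f, xs)`; only the structure of the computation changes, and that structure is what a cost model would then evaluate.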
  • 9. And You May Very Well Ask… The jury is still out…
  • 10. Where We Are Now Working with small but real scientific code
  • 11. Where We Are Now Legacy Fortran Scientific Code Working with small but real scientific code
  • 14. IR AND COST MODEL (1) A custom Intermediate Language, and (2) a fast and accurate Cost Model
  • 15. Pre-requisite: Models Of Abstraction 1. Platform model 2. Memory hierarchy model 3. Execution model 4. Design-space and cost-space model 5. Memory execution model 6. Data access pattern model
  • 16. Pre-requisite: Models Of Abstraction 1. Platform model 2. Memory hierarchy model 3. Execution model 4. Design-space model 5. Memory execution model 6. Data access pattern model (More or less) based on OpenCL standard
  • 21. Performance Estimate Dependence On Memory Execution Model [Timeline figure, time vs. activity. Activities: Host → Device-DRAM; Device-DRAM → Device-Buffers; Device-Buffers → Offset-Buffers; Kernel Pipeline Execution]
  • 23. Performance Estimate Dependence On Memory Execution Model [Timeline figure, Work-Instance Iterations, Form A. Activities: Host → Device-DRAM; Device-DRAM → Device-Buffers; Device-Buffers → Offset-Buffers; Kernel Pipeline Execution. Label: All iterations]
  • 24. Performance Estimate Dependence On Memory Execution Model [Timeline figure, Work-Instance Iterations, Form B. Same activities. Labels: First iteration only; Last iteration only; All other iterations]
  • 25. Performance Estimate Dependence On Memory Execution Model [Timeline figure, Work-Instance Iterations, Form C. Same activities. Labels: First iteration only; Last iteration only; All other iterations] Once a design-variant is categorized, performance can be estimated accordingly
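How the form of a design-variant changes its time estimate can be sketched as below. This is an illustration under stated assumptions, not the authors' expression: the transcript does not record which activity runs in which iterations, so the sketch assumes that in Form A every transfer repeats in every iteration, in Form B the host/device-DRAM transfer happens only around the first and last iteration, and in Form C the DRAM/buffer transfer is hoisted out as well. It also assumes the activities do not overlap. `ActivityTimes` and `total_time` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityTimes:
    """Seconds per occurrence; each transfer time covers data in plus data out."""
    host_dram: float      # Host <-> Device-DRAM
    dram_buffers: float   # Device-DRAM <-> Device-Buffers
    kernel: float         # Offset-buffer fill plus kernel pipeline execution


def total_time(form: str, t: ActivityTimes, iterations: int) -> float:
    """Total time for `iterations` work-instances (assumed mapping, no overlap)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    if form == "A":
        # Assumption: every activity repeats in every iteration.
        return iterations * (t.host_dram + t.dram_buffers + t.kernel)
    if form == "B":
        # Assumption: host transfer only at the first/last iteration.
        return t.host_dram + iterations * (t.dram_buffers + t.kernel)
    if form == "C":
        # Assumption: data stays in on-chip buffers between iterations.
        return t.host_dram + t.dram_buffers + iterations * t.kernel
    raise ValueError(f"unknown form: {form!r}")
```

With made-up times of 4 s, 2 s and 1 s for the three activities and 10 iterations, the three forms differ by more than a factor of four, which is why a variant has to be categorized before its performance is estimated.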
  • 27. Pre-requisite: Models Of Abstraction 1. Platform model 2. Memory hierarchy model 3. Execution model 4. Design-space model 5. Memory execution model 6. Data access pattern model: (1) contiguous access, (2) (fixed) strided access
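The two access patterns named on this slide can be made concrete with a short sketch. The helper names are hypothetical, and a row-major 2D array is assumed purely as an example: walking along a row touches consecutive addresses, while walking down a column touches addresses separated by a fixed stride equal to the row length.

```python
from typing import List


def contiguous_indices(start: int, count: int) -> List[int]:
    """Linear indices for `count` consecutive elements starting at `start`."""
    return list(range(start, start + count))


def strided_indices(start: int, stride: int, count: int) -> List[int]:
    """Linear indices for `count` elements separated by a fixed `stride`."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [start + i * stride for i in range(count)]


def row_indices(row: int, cols: int) -> List[int]:
    """Row of a row-major array: contiguous access."""
    return contiguous_indices(row * cols, cols)


def column_indices(col: int, rows: int, cols: int) -> List[int]:
    """Column of a row-major array: fixed-stride access with stride = cols."""
    return strided_indices(col, cols, rows)
```

For a 3 x 4 array, `row_indices(1, 4)` gives `[4, 5, 6, 7]` and `column_indices(1, 3, 4)` gives `[1, 5, 9]`.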
  • 28. The Back-end Approach • Use (or design) an IR that can capture all these models • We ended up using LLVM and modifying it to fit our purpose, effectively creating a custom IR we call the “TyTra-IR”. • Develop a cost-model that can evaluate the variants expressed in the IR
  • 30. The TyTra IR • Strongly and statically typed, largely based on the LLVM-IR • All computations expressed in SSA (Static Single Assignment) form • Keywords pipe, par, seq and comb indicate the type of parallelism, and nested functions of these types are used to build architectural configurations. Manage-IR: • Memory objects • Streams • Offset streams. Compute-IR: • Streaming datapath model • SSA instructions
  • 34. The Cost-model Use-case A set of standardized experiments feeds target-specific empirical data to the cost model, and the rest comes from the IR description.
  • 35. Resource Estimates - Example [Plots: Integer Division; Integer Multiplication] Light-weight cost expressions are associated with every legal SSA instruction in the TyTra-IR
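A per-instruction cost table of this kind might be sketched as follows. Everything numeric here is hypothetical: the expressions and coefficients are placeholders chosen for readability, not the fitted expressions from the slides, and the opcode names only loosely follow LLVM conventions. The sketch shows the mechanism only: each opcode maps to a cheap function of operand width, and a kernel's estimate is the sum over its instructions.

```python
import math
from typing import Callable, Dict, List, Tuple

Resources = Dict[str, int]

# HYPOTHETICAL cost expressions keyed by opcode; each maps an operand
# bit-width to a resource estimate. Real expressions would be fitted to
# empirical data for a specific target device.
COST_EXPRESSIONS: Dict[str, Callable[[int], Resources]] = {
    "add": lambda w: {"luts": w, "dsps": 0},
    "mul": lambda w: {"luts": 0, "dsps": math.ceil(w / 18) ** 2},
    "udiv": lambda w: {"luts": w * w, "dsps": 0},
}


def estimate_resources(instructions: List[Tuple[str, int]]) -> Resources:
    """Sum the per-instruction estimates for a list of (opcode, width) pairs."""
    total: Resources = {"luts": 0, "dsps": 0}
    for opcode, width in instructions:
        if opcode not in COST_EXPRESSIONS:
            raise KeyError(f"no cost expression for opcode {opcode!r}")
        for resource, amount in COST_EXPRESSIONS[opcode](width).items():
            total[resource] += amount
    return total
```

Because each expression is a closed-form function, evaluating a whole design-variant costs only a pass over its instruction list, with no synthesis run involved.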
  • 37. Performance Estimate • Effective Work-Instance Throughput (EWIT): a work-instance is one execution of the kernel over the entire index-space • Key determinants: the memory execution model; the sustained memory bandwidth for the target architecture and design-variant (which depends on the data-access pattern); the design configuration of the FPGA; the operating frequency of the FPGA; and whether the design is compute-bound or IO-bound. The performance model is trickier, especially calculating estimates of sustained memory bandwidth.
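The compute-bound versus IO-bound distinction can be sketched with a simple bottleneck estimate. This is a simplification offered as an assumption, not the expression used by the TyTra cost model: it takes the slower of the compute time and the memory time as the time per work-instance, and all parameter names are hypothetical.

```python
import math
from typing import Tuple


def estimate_ewit(
    items: int,              # size of the index-space (elements per work-instance)
    bytes_per_item: int,     # bytes moved to or from memory per element
    lanes: int,              # parallel lanes in the design configuration
    pipeline_depth: int,     # cycles needed to fill the pipeline
    freq_hz: float,          # operating frequency of the FPGA
    sustained_bw: float,     # sustained memory bandwidth in bytes per second
) -> Tuple[float, str]:
    """Return (work-instances per second, limiting factor)."""
    if items <= 0 or bytes_per_item <= 0 or lanes <= 0 or pipeline_depth < 0:
        raise ValueError("sizes must be positive (pipeline_depth may be zero)")
    if freq_hz <= 0 or sustained_bw <= 0:
        raise ValueError("frequency and bandwidth must be positive")
    cycles = math.ceil(items / lanes) + pipeline_depth
    compute_time = cycles / freq_hz
    io_time = items * bytes_per_item / sustained_bw
    bound = "compute" if compute_time >= io_time else "io"
    return 1.0 / max(compute_time, io_time), bound
```

In this sketch, adding lanes stops helping once `io_time` exceeds `compute_time`, which is why the sustained bandwidth for a given access pattern matters so much to the estimate.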
  • 42. Effect of Access Pattern with Different Array Sizes
  • 43. Effect of using Vector-Access Optimizations with Different Array Sizes
  • 44. Performance Estimates Parameters that Make up the Expression
  • 50. Performance Estimates Experimental Results (Type C) Estimated vs actual cost and throughput (CPWI = cycles per work instance)
  • 51. Does The TyTra Approach Work?
  • 52. How Fast Is The Cost Model? [Bar chart: time taken to generate an estimate (sec). Xilinx SDAccel tools: 70; TyTra: 0.3] 200x faster
  • 55. The Route To Automated Design Space Exploration On FPGAs For HPC Applications • The larger aim is to create a turn-key compiler for: Legacy scientific code → Heterogeneous HPC Platform o Current focus is on FPGAs, and on using a Functional Language design entry • Our main contributions are: o Type transformations to create design-variants, o New Intermediate Language, and o FPGA Cost model • Our FPGA Cost Model o Works on the TyTra-IR, is light-weight, accurate (enough), and allows us to evaluate design-variants. Using safe transformations on a functional language paradigm and a light-weight cost-model brings us closer to a turn-key HPC compiler for legacy code
  • 56. The woods are lovely, dark and deep, But I have promises to keep, And lines to code before I sleep, And lines to code before I sleep. Acknowledgement: We wish to acknowledge support by EPSRC through grant EP/L00058X/1.