SlideShare a Scribd company logo
1 of 41
Download to read offline
1 / 41
The CAOS framework
Democratize the acceleration of compute intensive
applications on FPGA
Marco Rabozzi: marco.rabozzi@polimi.it
Marco Domenico Santambrogio: marco.santambrogio@polimi.it
May 29th, 2018
1600 Plymouth, Mountain View, CA 94043, USA
2 / 41
Context Definition
3 / 41
Context Definition
• Next-generation Cloud and HPC applications require a great
amount of computing power
– Bioinformatics
– Deep learning
– Finance
4 / 41
Context Definition
• Next-generation Cloud and HPC applications require a great
amount of computing power
– Bioinformatics
– Deep learning
– Finance
• CPUs and GPUs do not always match the applications closely
– Performance requirements not met
– Energy inefficient
exec.
time
energy consumption
5 / 41
Opportunity
6 / 41
Opportunity
• Application-specific Integrated Circuit (ASIC)
+ Well suited for specific algorithms
+ Suitable for high production volumes
– High risk investment
– Fixed architecture
7 / 41
Opportunity
• Application-specific Integrated Circuit (ASIC)
+ Well suited for specific algorithms
+ Suitable for high production volumes
– High risk investment
– Fixed architecture
• Field Programmable Gate-Array (FPGA)
+ Performance / power efficient solution
+ Flexible reconfigurable & runtime adaptable architecture
– Complex and hard to program
– Large HW / SW co-design solution space
8 / 41
CAOS framework objectives
1. guide the application designer in the
implementation of efficient hardware-software
solutions for high performance FPGA-based systems
2. promote open research by allowing external
researchers to easily integrate and benchmark their
algorithms
9 / 41
The proposed CAOS framework
______
__________
___
________
______
__________
___
________
CAOSFlowManager
Backend
Runtime generation – function synthesis –
floorplanning – map. & scheduling – bit. gen.
Functions Optimization
HW res. estimation – static code analysis –
performance estimation – code opt. / DSE
Frontend
IR generation – profiling – templates
applicability check – HW/SW partitioning
<system
>
…
</system
>
System
description
Profiling
datasets
______
__________
___
________
Application code
(C, C++)
1110010110
1010101111
1111010101
0101010101
0101010101
0111111
FPGAs
bitstreams
______
__________
___
________
System
runtime
ArchitecturalTemplates
10 / 41
The proposed CAOS framework
______
__________
___
________
______
__________
___
________
CAOSFlowManager
Backend
Runtime generation – function synthesis –
floorplanning – map. & scheduling – bit. gen.
Functions Optimization
HW res. estimation – static code analysis –
performance estimation – code opt. / DSE
Frontend
IR generation – profiling – templates
applicability check – HW/SW partitioning
<system
>
…
</system
>
System
description
Profiling
datasets
______
__________
___
________
Application code
(C, C++)
1110010110
1010101111
1111010101
0101010101
0101010101
0111111
FPGAs
bitstreams
______
__________
___
________
System
runtime
ArchitecturalTemplates
SST
Output Generation
Master /
Slave
Dataflow MaxCompiler
11 / 41
CAOS Infrastructure
CAOS Flow Manager
Module A Module B Module C Module D Module E
Frontend Flow Function Optimization
Flow
Backend
Flow
Web Application
…
…
JSON +
File archives
REST
interface
12 / 41
CAOS Infrastructure
CAOS Flow Manager
Module A Module B Module C Module D Module E
Frontend Flow Function Optimization
Flow
Backend
Flow
Web Application
…
…
JSON +
File archives
REST
interface
F1
13 / 41
Architectural templates
• SST (Single Streaming Timestep)
– Scientific applications with a regular execution flow in which
a N-dimensional data structure is updated over time
– Example: Heat flow simulation, Gauss-Seidel linear systems solver
• Dataflow
– Applications with a static compute graph where the same
computation is repeated for multiple input items
– Example: Asian Option Pricing, Image processing filters
• Master / Slave
– More general applications with a regular execution flow whose
working set can be effectively tiled
– Example: N-Body Physics Simulation
14 / 41
Architectural templates
• SST (Single Streaming Timestep)
– Scientific applications with a regular execution flow in which
a N-dimensional data structure is updated over time
– Example: Heat flow simulation, Gauss-Seidel linear systems solver
• Dataflow
– Applications with a static compute graph where the same
computation is repeated for multiple input items
– Example: Asian Option Pricing, Image processing filters
• Master / Slave
– More general applications with a regular execution flow whose
working set can be effectively tiled
– Example: N-Body Physics Simulation
15 / 41
SST background
• Iterative Stencil Loop (ISL) Algorithm
[1] Natale, G., Stramondo, G., Bressana, P., Cattaneo, R., Sciuto, D., & Santambrogio, M. D.. A polyhedral model-based framework for dataflow
implementation on FPGA devices of Iterative Stencil Loops. In Computer-Aided Design (ICCAD), 2016 International Conference on (pp. 1-8). IEEE.
Tim
e
Steps
16 / 41
SST background
• Iterative Stencil Loop (ISL) Algorithm
[1] Natale, G., Stramondo, G., Bressana, P., Cattaneo, R., Sciuto, D., & Santambrogio, M. D.. A polyhedral model-based framework for dataflow
implementation on FPGA devices of Iterative Stencil Loops. In Computer-Aided Design (ICCAD), 2016 International Conference on (pp. 1-8). IEEE.
Tim
e
Steps
Single
Tim
e
Step
• FPGA-based Stencil Time-Step (SST) Architecture [1]
17 / 41
Design Space Exploration in CAOS
• Problem:
– Identify an optimal implementation of
an SST-based design on a target FPGA
# SSTClock frequency
18 / 41
Design Space Exploration in CAOS
• Problem:
– Identify an optimal implementation of
an SST-based design on a target FPGA
# SSTClock frequency
Floorplan
• Proposed Approach - leverage floorplanning to:
– Determine the maximum number of SST that can be instantiated
– Optimize SSTs placement to ease timing closure
19 / 41
CAOS experimental result
With floorplan
# SSTs
90 (+2.3%)
Frequency
228 MHz (+10.7%)
Design time
~25 hours (16x less)
Without floorplan
# SSTs
88
Frequency
206 MHz
Design time
~400 hours
Algorithm: Jacobi2D
Target FPGA: Xilinx Virtex xc7vx485
Rabozzi, M., Natale, G., Festa, B., Miele, A., & Santambrogio, M. D. (2017, September). Optimizing streaming stencil time-step designs via FPGA
floorplanning. In Field Programmable Logic and Applications (FPL), 2017 27th International Conference on (pp. 1-4).
20 / 41
Architectural templates
• SST (Single Streaming Timestep)
– Scientific applications with a regular execution flow in which
a N-dimensional data structure is updated over time
– Example: Heat flow simulation, Gauss-Seidel linear systems solver
• Dataflow
– Applications with a static compute graph where the same
computation is repeated for multiple input items
– Example: Asian Option Pricing, Image processing filters
• Master / Slave
– More general applications with a regular execution flow whose
working set can be effectively tiled
– Example: N-Body Physics Simulation
21 / 41
void foo(type_1* in_1, type_2* in_2, scalar_type_1* v1, type_1* out_1){
for(int i = offs; i < I_SIZE; i++){
S1: …statements…
for(int j = …; j < 15; j++){
S2: …statements…
}
S3: …statements…
for(int j = … ){
S4: …statements…
}
}
for(int i = offs_3; … ){
S5: …statements…
}
}
Supported code & dataflow IR
OUTERSTREAMINGLOOPS
22 / 41
void foo(type_1* in_1, type_2* in_2, scalar_type_1* v1, type_1* out_1){
for(int i = offs; i < I_SIZE; i++){
S1: …statements…
for(int j = …; j < 15; j++){
S2: …statements…
}
S3: …statements…
for(int j = … ){
S4: …statements…
}
}
for(int i = offs_3; … ){
S5: …statements…
}
}
Supported code & dataflow IR
NESTED
LOOP
NODES
LOOP CARRIED
DEPENDENCY ARC
OUTERSTREAMINGLOOPS
23 / 41
Design optimizations
REROLLING:
• Nested loop are unrolled by default
• Rerolling can be applied only to loops without carried dependencies
Resources driven design
Throughput driven design
Unoptimized design
Throughput
HW resources
VECTORIZATION
REROLLING
24 / 41
Which optimization to apply?
25 / 41
Which optimization to apply?
We need:
• Resource estimation model
• Performance estimation model
26 / 41
Which optimization to apply?
We need:
• Resource estimation model
• Performance estimation model
Tailored for each
optimization
27 / 41
Asian Option Pricing
28 / 41
CAOS experimental results
Rerol.
Factor
Speedup
DSP/LUT
Balance
30 15.83 1.0
15 31.22 1.0
10 45.95 0.1
8 56.91 1.0
6 74.35 0.1
5 88.10 0.1
Total Resources
DSP BRAM FF LUT
39.06 37.65 21.86 37.15
55.08 42.09 24.47 40.71
29.30 41.50 31.55 52.32
87.11 54.30 29.60 47.73
41.02 59.13 39.51 64.31
46.88 63.96 43.31 70.09
• System implemented with MaxCompiler on a Galava MAX4
• Baseline software implementation executed on an Intel(R) Core(TM) i7-
6700 CPU at 3.40GHz, compiled with gcc 4.4.7 and -O3 optimization
Peverelli F., Rabozzi M., Del Sozzo E., & Santambrogio, M.D. (2018, May). OXiGen: A tool for automatic acceleration of C functions into dataflow
FPGA-based kernels. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2018 IEEE International.
29 / 41
CAOS vs Hand-tuned implementation
Execution time
CAOS [1] Hand-tuned [2]
1.78s 1.25s
Testbench parameters:
• Averaging points window: 30
• Asian Options in portfolio: 10000
• Market scenarios: 5000
[1] Peverelli F., Rabozzi M., Del Sozzo E., & Santambrogio, M.D. (2018, May). OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA-
based kernels. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2018 IEEE International.
[2] Nestorov, A. M., Reggiani, E., Palikareva, H., Burovskiy, P., Becker, T., & Santambrogio, M. D. (2017, May). A Scalable Dataflow Implementation of Curran's
Approximation Algorithm. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International (pp. 150-157). IEEE.
30 / 41
CAOS vs Hand-tuned implementation
About a day of work vs several weeks!
Execution time
CAOS [1] Hand-tuned [2]
1.78s 1.25s
Testbench parameters:
• Averaging points window: 30
• Asian Options in portfolio: 10000
• Market scenarios: 5000
[1] Peverelli F., Rabozzi M., Del Sozzo E., & Santambrogio, M.D. (2018, May). OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA-
based kernels. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2018 IEEE International.
[2] Nestorov, A. M., Reggiani, E., Palikareva, H., Burovskiy, P., Becker, T., & Santambrogio, M. D. (2017, May). A Scalable Dataflow Implementation of Curran's
Approximation Algorithm. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International (pp. 150-157). IEEE.
31 / 41
Architectural templates
• SST (Single Streaming Timestep)
– Scientific applications with a regular execution flow in which
a N-dimensional data structure is updated over time
– Example: Heat flow simulation, Gauss-Seidel linear systems solver
• Dataflow
– Applications with a static compute graph where the same
computation is repeated for multiple input items
– Example: Asian Option Pricing, Image processing filters
• Master / Slave
– More general applications with a regular execution flow whose
working set can be effectively tiled
– Example: N-Body Physics Simulation
32 / 41
Master / Slave background
• Support a larger class of source codes, ideally, close to
the same set of codes supported by High Level Synthesis
tools (e.g. Vivado HLS)
Local FPGA
memory (BRAM)
DDR
• Tiled computation:
– Application’s memory transferred in
blocks from DDR to local FPGA
memory
– Computation performed on a block
of data at a time
33 / 41
Master / Slave architectural template
• Current Implementation
– Relies on Vivado HLS for resource and performance estimation
– Explore potential optimizations by testing HLS directives
(e.g. #pragma HLS UNROLL, #pragma HLS PIPELINE, …)
– Quite accurate but slow (HLS synthesis time may be high)
• On going work:
– Leverage and extend the classic roofline model for CPU to
predict performance and potential optimizations
– Less accurate but faster, allowing to explore a larger
optimization space
34 / 41
Roofline model for FPGA
35 / 41
Roofline model for FPGA
Instructions count from
static code analysis
36 / 41
Roofline model for FPGA
HW operators for ideal
performance
(within resource constraints)
37 / 41
Roofline model for FPGA
Estimated HW
resources breakdown
38 / 41
Roofline model for FPGA
Peak performance
(Ops / sec)
with fully pipelined
operators
@ target frequency
39 / 41
Roofline model for FPGA
Operational intensity
derived from static code
analysis
40 / 41
Conclusion & ongoing work
• We presented CAOS:
– A CAD framework that simplifies the acceleration of
applications on FPGA-based systems
– A unifying framework to stimulate research on FPGA-based
systems
• On going work:
– Performance estimation via Roofline Model
– Enhance CAOS functions optimization process
41 / 41
The CAOS framework
Democratize the acceleration of compute intensive
applications on FPGAThank you!
Marco Rabozzi: marco.rabozzi@polimi.it
Marco Domenico Santambrogio: marco.santambrogio@polimi.it

More Related Content

What's hot

Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit Ganesan Narayanasamy
 
Hardware simulation for exponential blind equal throughput algorithm using sy...
Hardware simulation for exponential blind equal throughput algorithm using sy...Hardware simulation for exponential blind equal throughput algorithm using sy...
Hardware simulation for exponential blind equal throughput algorithm using sy...IJECEIAES
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...Edge AI and Vision Alliance
 
Fpga implementation of multilayer feed forward neural network architecture us...
Fpga implementation of multilayer feed forward neural network architecture us...Fpga implementation of multilayer feed forward neural network architecture us...
Fpga implementation of multilayer feed forward neural network architecture us...Ece Rljit
 
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Arun Joseph
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentationjesujoseph
 
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...AMD Developer Central
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platforma3labdsp
 
Improvements in space radiation-tolerant FPGA implementation of land surface ...
Improvements in space radiation-tolerant FPGA implementation of land surface ...Improvements in space radiation-tolerant FPGA implementation of land surface ...
Improvements in space radiation-tolerant FPGA implementation of land surface ...IJECEIAES
 
VLSI-Physical Design- Tool Terminalogy
VLSI-Physical Design- Tool TerminalogyVLSI-Physical Design- Tool Terminalogy
VLSI-Physical Design- Tool TerminalogyMurali Rai
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveFPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveIJORCS
 

What's hot (20)

Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
Hardware simulation for exponential blind equal throughput algorithm using sy...
Hardware simulation for exponential blind equal throughput algorithm using sy...Hardware simulation for exponential blind equal throughput algorithm using sy...
Hardware simulation for exponential blind equal throughput algorithm using sy...
 
Inputs of physical design
Inputs of physical designInputs of physical design
Inputs of physical design
 
slides
slidesslides
slides
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
 
Fpga implementation of multilayer feed forward neural network architecture us...
Fpga implementation of multilayer feed forward neural network architecture us...Fpga implementation of multilayer feed forward neural network architecture us...
Fpga implementation of multilayer feed forward neural network architecture us...
 
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
 
Dl2 computing gpu
Dl2 computing gpuDl2 computing gpu
Dl2 computing gpu
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentation
 
Physical design
Physical design Physical design
Physical design
 
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platform
 
Improvements in space radiation-tolerant FPGA implementation of land surface ...
Improvements in space radiation-tolerant FPGA implementation of land surface ...Improvements in space radiation-tolerant FPGA implementation of land surface ...
Improvements in space radiation-tolerant FPGA implementation of land surface ...
 
VLSI-Physical Design- Tool Terminalogy
VLSI-Physical Design- Tool TerminalogyVLSI-Physical Design- Tool Terminalogy
VLSI-Physical Design- Tool Terminalogy
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
3D-DRESD R4R
3D-DRESD R4R3D-DRESD R4R
3D-DRESD R4R
 
Real-Time API
Real-Time APIReal-Time API
Real-Time API
 
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveFPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
 
Phd Defense 2007
Phd Defense 2007Phd Defense 2007
Phd Defense 2007
 
Gene's law
Gene's lawGene's law
Gene's law
 

Similar to The CAOS framework: democratize the acceleration of compute intensive applications on FPGA

byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
 
From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...
From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...
From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...NECST Lab @ Politecnico di Milano
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA accelerationMarco77328
 
Varun Gatne - Resume - Final
Varun Gatne - Resume - FinalVarun Gatne - Resume - Final
Varun Gatne - Resume - FinalVarun Gatne
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...NECST Lab @ Politecnico di Milano
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...ijceronline
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...IRJET Journal
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCFacultad de Informática UCM
 
System on Chip Design and Modelling Dr. David J Greaves
System on Chip Design and Modelling   Dr. David J GreavesSystem on Chip Design and Modelling   Dr. David J Greaves
System on Chip Design and Modelling Dr. David J GreavesSatya Harish
 
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...NECST Lab @ Politecnico di Milano
 

Similar to The CAOS framework: democratize the acceleration of compute intensive applications on FPGA (20)

byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA Solutions
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP Library
 
From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...
From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...
From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing S...
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA acceleration
 
Varun Gatne - Resume - Final
Varun Gatne - Resume - FinalVarun Gatne - Resume - Final
Varun Gatne - Resume - Final
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
System on Chip Design and Modelling Dr. David J Greaves
System on Chip Design and Modelling   Dr. David J GreavesSystem on Chip Design and Modelling   Dr. David J Greaves
System on Chip Design and Modelling Dr. David J Greaves
 
Shantanu's Resume
Shantanu's ResumeShantanu's Resume
Shantanu's Resume
 
Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016
 
computer architecture.
computer architecture.computer architecture.
computer architecture.
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Ch1
Ch1Ch1
Ch1
 
Ch1
Ch1Ch1
Ch1
 
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
 

More from NECST Lab @ Politecnico di Milano

Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingNECST Lab @ Politecnico di Milano
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...NECST Lab @ Politecnico di Milano
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
 EMPhASIS - An EMbedded Public Attention Stress Identification System EMPhASIS - An EMbedded Public Attention Stress Identification System
EMPhASIS - An EMbedded Public Attention Stress Identification SystemNECST Lab @ Politecnico di Milano
 
Maeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingMaeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingNECST Lab @ Politecnico di Milano
 

More from NECST Lab @ Politecnico di Milano (20)

Mesticheria Team - WiiReflex
Mesticheria Team - WiiReflexMesticheria Team - WiiReflex
Mesticheria Team - WiiReflex
 
Punto e virgola Team - Stressometro
Punto e virgola Team - StressometroPunto e virgola Team - Stressometro
Punto e virgola Team - Stressometro
 
BitIt Team - Stay.straight
BitIt Team - Stay.straight BitIt Team - Stay.straight
BitIt Team - Stay.straight
 
BabYodini Team - Talking Gloves
BabYodini Team - Talking GlovesBabYodini Team - Talking Gloves
BabYodini Team - Talking Gloves
 
printf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTonprintf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTon
 
BlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking PlatformBlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking Platform
 
#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome
 
Flipflops Team - Wave U
Flipflops Team - Wave UFlipflops Team - Wave U
Flipflops Team - Wave U
 
Bug(atta) Team - Little Brother
Bug(atta) Team - Little BrotherBug(atta) Team - Little Brother
Bug(atta) Team - Little Brother
 
#NECSTCamp: come partecipare
#NECSTCamp: come partecipare#NECSTCamp: come partecipare
#NECSTCamp: come partecipare
 
NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1
 
NECSTLab101 2020.2021
NECSTLab101 2020.2021NECSTLab101 2020.2021
NECSTLab101 2020.2021
 
TreeHouse, nourish your community
TreeHouse, nourish your communityTreeHouse, nourish your community
TreeHouse, nourish your community
 
TiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architectureTiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architecture
 
Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposing
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
 EMPhASIS - An EMbedded Public Attention Stress Identification System EMPhASIS - An EMbedded Public Attention Stress Identification System
EMPhASIS - An EMbedded Public Attention Stress Identification System
 
Luns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural networkLuns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural network
 
BlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAsBlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAs
 
Maeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingMaeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matching
 

Recently uploaded

main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 

Recently uploaded (20)

main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 

The CAOS framework: democratize the acceleration of compute intensive applications on FPGA

  • 1. 1 / 41 The CAOS framework Democratize the acceleration of compute intensive applications on FPGA Marco Rabozzi: marco.rabozzi@polimi.it Marco Domenico Santambrogio: marco.santambrogio@polimi.it May 29th, 2018 1600 Plymouth, Mountain View, CA 94043, USA
  • 2. 2 / 41 Context Definition
  • 3. 3 / 41 Context Definition • Next-generation Cloud and HPC applications require a great amount of computing power – Bioinformatics – Deep learning – Finance
  • 4. 4 / 41 Context Definition • Next-generation Cloud and HPC applications require a great amount of computing power – Bioinformatics – Deep learning – Finance • CPUs and GPUs do not always match the applications closely – Performance requirements not met – Energy inefficient exec. time energy consumption
  • 6. 6 / 41 Opportunity • Application-specific Integrated Circuit (ASIC) + Well suited for specific algorithms + Suitable for high production volumes – High risk investment – Fixed architecture
  • 7. 7 / 41 Opportunity • Application-specific Integrated Circuit (ASIC) + Well suited for specific algorithms + Suitable for high production volumes – High risk investment – Fixed architecture • Field Programmable Gate-Array (FPGA) + Performance / power efficient solution + Flexible reconfigurable & runtime adaptable architecture – Complex and hard to program – Large HW / SW co-design solution space
  • 8. 8 / 41 CAOS framework objectives 1. guide the application designer in the implementation of efficient hardware-software solutions for high performance FPGA-based systems 2. promote open research by allowing external researchers to easily integrate and benchmark their algorithms
  • 9. 9 / 41 The proposed CAOS framework ______ __________ ___ ________ ______ __________ ___ ________ CAOSFlowManager Backend Runtime generation – function synthesis – floorplanning – map. & scheduling – bit. gen. Functions Optimization HW res. estimation – static code analysis – performance estimation – code opt. / DSE Frontend IR generation – profiling – templates applicability check – HW/SW partitioning <system > … </system > System description Profiling datasets ______ __________ ___ ________ Application code (C, C++) 1110010110 1010101111 1111010101 0101010101 0101010101 0111111 FPGAs bitstreams ______ __________ ___ ________ System runtime ArchitecturalTemplates
  • 10. 10 / 41 The proposed CAOS framework ______ __________ ___ ________ ______ __________ ___ ________ CAOSFlowManager Backend Runtime generation – function synthesis – floorplanning – map. & scheduling – bit. gen. Functions Optimization HW res. estimation – static code analysis – performance estimation – code opt. / DSE Frontend IR generation – profiling – templates applicability check – HW/SW partitioning <system > … </system > System description Profiling datasets ______ __________ ___ ________ Application code (C, C++) 1110010110 1010101111 1111010101 0101010101 0101010101 0111111 FPGAs bitstreams ______ __________ ___ ________ System runtime ArchitecturalTemplates SST Output Generation Master / Slave Dataflow MaxCompiler
  • 11. 11 / 41 CAOS Infrastructure CAOS Flow Manager Module A Module B Module C Module D Module E Frontend Flow Function Optimization Flow Backend Flow Web Application … … JSON + File archives REST interface
  • 12. 12 / 41 CAOS Infrastructure CAOS Flow Manager Module A Module B Module C Module D Module E Frontend Flow Function Optimization Flow Backend Flow Web Application … … JSON + File archives REST interface F1
  • 13. 13 / 41 Architectural templates • SST (Single Streaming Timestep) – Scientific applications with a regular execution flow in which a N-dimensional data structure is updated over time – Example: Heat flow simulation, Gauss-Seidel linear systems solver • Dataflow – Applications with a static compute graph where the same computation is repeated for multiple input items – Example: Asian Option Pricing, Image processing filters • Master / Slave – More general applications with a regular execution flow whose working set can be effectively tiled – Example: N-Body Physics Simulation
  • 14. 14 / 41 Architectural templates • SST (Single Streaming Timestep) – Scientific applications with a regular execution flow in which a N-dimensional data structure is updated over time – Example: Heat flow simulation, Gauss-Seidel linear systems solver • Dataflow – Applications with a static compute graph where the same computation is repeated for multiple input items – Example: Asian Option Pricing, Image processing filters • Master / Slave – More general applications with a regular execution flow whose working set can be effectively tiled – Example: N-Body Physics Simulation
  • 15. 15 / 41 SST background • Iterative Stencil Loop (ISL) Algorithm [1] Natale, G., Stramondo, G., Bressana, P., Cattaneo, R., Sciuto, D., & Santambrogio, M. D.. A polyhedral model-based framework for dataflow implementation on FPGA devices of Iterative Stencil Loops. In Computer-Aided Design (ICCAD), 2016 International Conference on (pp. 1-8). IEEE. Tim e Steps
  • 16. 16 / 41 SST background • Iterative Stencil Loop (ISL) Algorithm [1] Natale, G., Stramondo, G., Bressana, P., Cattaneo, R., Sciuto, D., & Santambrogio, M. D.. A polyhedral model-based framework for dataflow implementation on FPGA devices of Iterative Stencil Loops. In Computer-Aided Design (ICCAD), 2016 International Conference on (pp. 1-8). IEEE. Tim e Steps Single Tim e Step • FPGA-based Stencil Time-Step (SST) Architecture [1]
  • 17. 17 / 41 Design Space Exploration in CAOS • Problem: – Identify an optimal implementation of an SST-based design on a target FPGA # SSTClock frequency
  • 18. 18 / 41 Design Space Exploration in CAOS • Problem: – Identify an optimal implementation of an SST-based design on a target FPGA # SSTClock frequency Floorplan • Proposed Approach - leverage floorplanning to: – Determine the maximum number of SST that can be instantiated – Optimize SSTs placement to ease timing closure
  • 19. 19 / 41 CAOS experimental result With floorplan # SSTs 90 (+2.3%) Frequency 228 MHz (+10.7%) Design time ~25 hours (16x less) Without floorplan # SSTs 88 Frequency 206 MHz Design time ~400 hours Algorithm: Jacobi2D Target FPGA: Xilinx Virtex xc7vx485 Rabozzi, M., Natale, G., Festa, B., Miele, A., & Santambrogio, M. D. (2017, September). Optimizing streaming stencil time-step designs via FPGA floorplanning. In Field Programmable Logic and Applications (FPL), 2017 27th International Conference on (pp. 1-4).
  • 20. 20 / 41 Architectural templates • SST (Single Streaming Timestep) – Scientific applications with a regular execution flow in which a N-dimensional data structure is updated over time – Example: Heat flow simulation, Gauss-Seidel linear systems solver • Dataflow – Applications with a static compute graph where the same computation is repeated for multiple input items – Example: Asian Option Pricing, Image processing filters • Master / Slave – More general applications with a regular execution flow whose working set can be effectively tiled – Example: N-Body Physics Simulation
  • 21. 21 / 41 void foo(type_1* in_1, type_2* in_2, scalar_type_1* v1, type_1* out_1){ for(int i = offs; i < I_SIZE; i++){ S1: …statements… for(int j = …; j < 15; j++){ S2: …statements… } S3: …statements… for(int j = … ){ S4: …statements… } } for(int i = offs_3; … ){ S5: …statements… } } Supported code & dataflow IR OUTERSTREAMINGLOOPS
  • 22. 22 / 41 void foo(type_1* in_1, type_2* in_2, scalar_type_1* v1, type_1* out_1){ for(int i = offs; i < I_SIZE; i++){ S1: …statements… for(int j = …; j < 15; j++){ S2: …statements… } S3: …statements… for(int j = … ){ S4: …statements… } } for(int i = offs_3; … ){ S5: …statements… } } Supported code & dataflow IR NESTED LOOP NODES LOOP CARRIED DEPENDENCY ARC OUTERSTREAMINGLOOPS
  • 23. 23 / 41 Design optimizations REROLLING: • Nested loop are unrolled by default • Rerolling can be applied only to loops without carried dependencies Resources driven design Throughput driven design Unoptimized design Throughput HW resources VECTORIZATION REROLLING
  • 24. 24 / 41 Which optimization to apply?
  • 25. 25 / 41 Which optimization to apply? We need: • Resource estimation model • Performance estimation model
  • 26. 26 / 41 Which optimization to apply? We need: • Resource estimation model • Performance estimation model Tailored for each optimization
  • 27. 27 / 41 Asian Option Pricing
  • 28. 28 / 41 CAOS experimental results Rerol. Factor Speedup DSP/LUT Balance 30 15.83 1.0 15 31.22 1.0 10 45.95 0.1 8 56.91 1.0 6 74.35 0.1 5 88.10 0.1 Total Resources DSP BRAM FF LUT 39.06 37.65 21.86 37.15 55.08 42.09 24.47 40.71 29.30 41.50 31.55 52.32 87.11 54.30 29.60 47.73 41.02 59.13 39.51 64.31 46.88 63.96 43.31 70.09 • System implemented with MaxCompiler on a Galava MAX4 • Baseline software implementation executed on an Intel(R) Core(TM) i7- 6700 CPU at 3.40GHz, compiled with gcc 4.4.7 and -O3 optimization Peverelli F., Rabozzi M., Del Sozzo E., & Santambrogio, M.D. (2018, May). OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA-based kernels. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2018 IEEE International.
  • 29. 29 / 41 CAOS vs Hand-tuned implementation Execution time CAOS [1] Hand-tuned [2] 1.78s 1.25s Testbench parameters: • Averaging points window: 30 • Asian Options in portfolio: 10000 • Market scenarios: 5000 [1] Peverelli F., Rabozzi M., Del Sozzo E., & Santambrogio, M.D. (2018, May). OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA- based kernels. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2018 IEEE International. [2] Nestorov, A. M., Reggiani, E., Palikareva, H., Burovskiy, P., Becker, T., & Santambrogio, M. D. (2017, May). A Scalable Dataflow Implementation of Curran's Approximation Algorithm. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International (pp. 150-157). IEEE.
  • 30. 30 / 41 CAOS vs Hand-tuned implementation About a day of work vs several weeks! Execution time CAOS [1] Hand-tuned [2] 1.78s 1.25s Testbench parameters: • Averaging points window: 30 • Asian Options in portfolio: 10000 • Market scenarios: 5000 [1] Peverelli F., Rabozzi M., Del Sozzo E., & Santambrogio, M.D. (2018, May). OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA- based kernels. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2018 IEEE International. [2] Nestorov, A. M., Reggiani, E., Palikareva, H., Burovskiy, P., Becker, T., & Santambrogio, M. D. (2017, May). A Scalable Dataflow Implementation of Curran's Approximation Algorithm. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International (pp. 150-157). IEEE.
  • 31. 31 / 41 Architectural templates • SST (Single Streaming Timestep) – Scientific applications with a regular execution flow in which a N-dimensional data structure is updated over time – Example: Heat flow simulation, Gauss-Seidel linear systems solver • Dataflow – Applications with a static compute graph where the same computation is repeated for multiple input items – Example: Asian Option Pricing, Image processing filters • Master / Slave – More general applications with a regular execution flow whose working set can be effectively tiled – Example: N-Body Physics Simulation
  • 32. 32 / 41 Master / Slave background • Support a larger class of source codes, ideally, close to the same set of codes supported by High Level Synthesis tools (e.g. Vivado HLS) Local FPGA memory (BRAM) DDR • Tiled computation: – Application’s memory transferred in blocks from DDR to local FPGA memory – Computation performed on a block of data at a time
  • 33. 33 / 41 Master / Slave architectural template • Current Implementation – Relies on Vivado HLS for resource and performance estimation – Explore potential optimizations by testing HLS directives (e.g. #pragma HLS UNROLL, #pragma HLS PIPELINE, …) – Quite accurate but slow (HLS synthesis time may be high) • On going work: – Leverage and extend the classic roofline model for CPU to predict performance and potential optimizations – Less accurate but faster, allowing to explore a larger optimization space
  • 34. 34 / 41 Roofline model for FPGA
  • 35. 35 / 41 Roofline model for FPGA Instructions count from static code analysis
  • 36. 36 / 41 Roofline model for FPGA HW operators for ideal performance (within resource constraints)
  • 37. 37 / 41 Roofline model for FPGA Estimated HW resources breakdown
  • 38. 38 / 41 Roofline model for FPGA Peak performance (Ops / sec) with fully pipelined operators @ target frequency
  • 39. 39 / 41 Roofline model for FPGA Operational intensity derived from static code analysis
  • 40. 40 / 41 Conclusion & ongoing work • We presented CAOS: – A CAD framework that simplifies the acceleration of applications on FPGA-based systems – A unifying framework to stimulate research on FPGA-based systems • On going work: – Performance estimation via Roofline Model – Enhance CAOS functions optimization process
  • 41. 41 / 41 The CAOS framework Democratize the acceleration of compute intensive applications on FPGAThank you! Marco Rabozzi: marco.rabozzi@polimi.it Marco Domenico Santambrogio: marco.santambrogio@polimi.it