OXiGen: Automated FPGA design flow from C applications to dataflow kernels - pitch version

DIPARTIMENTO DI ELETTRONICA,
INFORMAZIONE E BIOINGEGNERIA
OXiGen
From C to FPGA dataflow kernels
Francesco Peverelli: francesco1.peverelli@mail.polimi.it
Marco Rabozzi: marco.rabozzi@polimi.it
Emanuele Del Sozzo: emanuele.delsozzo@polimi.it
Marco Domenico Santambrogio: marco.santambrogio@polimi.it
May 17th - 30th 2019
NGCX, San Francisco

FPGA design methods
[Figure: performance versus design time for x86, GPU, DSP, FPGA with HLS, and FPGA with RTL. Each platform is plotted twice, as a first working version and as an optimized version, against the software project design time limit.]

FPGA design methods
[Figure: performance versus design time for x86, GPU, DSP, FPGA with HLS, and FPGA with RTL (first working and optimized versions, against the software project design time limit), now with "FPGA with OXiGen" added.]

The dataflow model
[Figure: a set of dataflow cores connected to memory, with data streaming into and out of the cores.]

Contributions
PRODUCTIVITY
AUTOMATED TRANSLATION FROM C TO DATAFLOW
DESIGN-SPACE EXPLORATION
AUTOMATED TESTING
PERFORMANCE

Target architecture
• FPGA resources are logically divided into two portions:
– Manager: responsible for the communication with the host
– Kernel: performs the actual dataflow computation
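The manager/kernel split can be pictured with a small software emulation. This is only a sketch under assumptions: `manager_run` and `kernel_step` are hypothetical names, the arithmetic in the kernel is a placeholder, and on real hardware the manager would move data over streams rather than through a C loop.

```c
#include <stddef.h>

/* Hypothetical kernel: produces one output element per input element,
   standing in for the dataflow computation on the FPGA. */
static float kernel_step(float x)
{
    return 2.0f * x + 1.0f; /* placeholder arithmetic */
}

/* Hypothetical manager: moves data between host buffers and the kernel.
   Here the transfer is emulated with a plain loop. */
void manager_run(const float *host_in, float *host_out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        host_out[i] = kernel_step(host_in[i]);
    }
}
```

The point of the separation is that the kernel only sees a stream of elements, while everything host-related stays in the manager.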

OXiGen overview
[Diagram of the toolchain]
Flows: frontend flow, function optimization flow, backend flow
Components: LLVM, function analysis, dataflow translator, DSE module, backend translator, backend synthesis tool
Artifacts: LLVM IR, DFG IR, TECHNOLOGY LIBRARY, OPT. CONFIG., SYNTHESIS-READY CODE, FPGA BITSTREAM

OXiGen code example
void f(int in_1, float* in_2, …, float* out_n){
    for( int i = 0; i < N; i++ ) {
        for( int j … ) {
            … statements …
            for( int k … ) { … }
        }
        … statements …
        int a[N][M] = { … };
        float s = 0;
        for( int j = 0; j < M; j++ ) {
            … statements …
            s += a[i][j] * … ;
        }
    }
}
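For readers who want something compilable, the schematic shape above (an outer loop over stream elements, a local array, and an inner accumulation loop) might be instantiated as follows. This is an illustrative sketch, not taken from the slides: `weighted_sum`, the sizes `N` and `M`, and the coefficient values are all hypothetical, and whether OXiGen accepts exactly this function is an assumption.

```c
#define N 4 /* stream length (hypothetical) */
#define M 3 /* inner loop trip count (hypothetical) */

/* One output per input element: each input is multiplied by a row of
   constant coefficients and the products are accumulated. */
void weighted_sum(const float *in, float *out)
{
    /* Constant table, playing the role of a[N][M] in the schematic. */
    static const int a[N][M] = {
        {1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}
    };
    for (int i = 0; i < N; i++) {     /* outer loop: stream elements */
        float s = 0.0f;
        for (int j = 0; j < M; j++) { /* inner loop: accumulation */
            s += (float)a[i][j] * in[i];
        }
        out[i] = s;
    }
}
```

Calling it with the input stream {1, 2, 3, 4} yields {6, 30, 72, 132}, since each output is the input times the sum of its coefficient row.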

Optimization strategies
• VECTORIZATION
• REROLLING
• CYCLIC DFG
• DATA INTERLEAVING
REROLLING:
• Nested loops are unrolled by default
[Figure: throughput versus HW resources, comparing the unoptimized design with a resources-driven design and a throughput-driven design.]
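The trade-off behind rerolling can be sketched with a toy cost function. The model is an assumption made for illustration, not something stated on the slide: rerolling a loop of `trip` iterations by a factor `r` is taken to keep ceil(trip / r) copies of the loop body in hardware and to spend `r` cycles per stream element. The names `reroll_cost` and `reroll` are hypothetical.

```c
/* Assumed cost of one rerolling choice (illustrative only). */
typedef struct {
    int body_copies;        /* hardware instances of the loop body */
    int cycles_per_element; /* cycles spent on one stream element */
} reroll_cost;

/* trip: loop trip count; r: rerolling factor (r >= 1).
   r == 1 corresponds to the fully unrolled default. */
reroll_cost reroll(int trip, int r)
{
    reroll_cost c;
    c.body_copies = (trip + r - 1) / r; /* ceiling division */
    c.cycles_per_element = r;
    return c;
}
```

Under this model, a loop of 30 iterations rerolled by 4 needs 8 body copies instead of 30, at the price of 4 cycles per element instead of 1.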

Design space exploration
Free variables, ∀ optimization:
• $\theta$: implementation-specific
• $v_i$: optimization-specific
[Diagram: Dataflow IR, Technology Library, resource model, performance model, and Mixed Integer Linear Programming (MILP) model.]
$\hat{\theta}, \hat{v}$: optimal values
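To make the idea of exploring free variables under a resource constraint concrete, here is a minimal sketch that swaps the MILP formulation for a plain exhaustive search. Everything in it is an assumption: the two free variables (a rerolling factor and a vectorization factor), the toy resource formula, the throughput measure `vec / reroll`, and the names `design_point`, `resources`, and `explore`.

```c
/* One candidate configuration (hypothetical free variables). */
typedef struct { int reroll; int vec; } design_point;

/* Toy resource model (assumption): each remaining copy of the loop
   body costs unit_cost, replicated once per vector lane. */
static int resources(int trip, int unit_cost, design_point p)
{
    int copies = (trip + p.reroll - 1) / p.reroll; /* ceiling division */
    return unit_cost * copies * p.vec;
}

/* Exhaustive search standing in for a MILP solver: returns the feasible
   point with the highest throughput vec / reroll, or {0, 0} if no point
   fits the budget. Ties keep the first point found. */
design_point explore(int trip, int unit_cost, int budget, int max_vec)
{
    design_point best = {0, 0};
    for (int r = 1; r <= trip; r++) {
        for (int v = 1; v <= max_vec; v++) {
            design_point p = {r, v};
            if (resources(trip, unit_cost, p) > budget)
                continue;
            /* p.vec / p.reroll > best.vec / best.reroll, cross-multiplied
               to stay in integer arithmetic */
            if (best.reroll == 0 || p.vec * best.reroll > best.vec * p.reroll)
                best = p;
        }
    }
    return best;
}
```

A solver-based formulation scales to far larger spaces than this enumeration; the sketch only shows what "optimal values of the free variables" means for a tiny example.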

Resource model
Takes into account:
• BRAM use estimation
• Memory partitioning
• Technological implementation (DSP Push)
• Rerolling factor
• Vectorization factor
• …
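As an illustration of how BRAM use and memory partitioning interact, the following toy estimate rounds each memory bank up to whole blocks. It is a sketch under stated assumptions, not the model used by the tool: the function name `bram_blocks` is hypothetical, the even split across banks is assumed, and the block capacity is a caller-supplied parameter (20 Kbit, i.e. 20480 bits, is used in the example purely for illustration).

```c
/* Toy BRAM estimate (assumption): an array of `elements` words, each
   `bits` wide, is split evenly over `partitions` banks, and every bank
   is rounded up to whole memory blocks of `block_bits` capacity. */
int bram_blocks(int elements, int bits, int partitions, int block_bits)
{
    int per_bank = (elements + partitions - 1) / partitions;
    int bank_bits = per_bank * bits;
    int blocks_per_bank = (bank_bits + block_bits - 1) / block_bits;
    return partitions * blocks_per_bank;
}
```

With 1024 words of 32 bits and 20480-bit blocks, one bank needs 2 blocks, while four banks need 4 blocks in total: partitioning buys parallel access at the cost of rounding losses in each bank.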

Performance model
Takes into account:
• Operator latency (pipeline Push)
• Upper bound on synthesis frequency
• Rerolling factor
• Vectorization factor
• …
[Diagram labels: Dataflow IR, Technology Library]
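A rough feel for how these factors could combine into a runtime estimate is given by the sketch below. The formula is an assumption for illustration only: pipeline fill latency plus one pass per group of `vec` elements, with each group occupying the pipeline for `reroll` cycles. The names `estimate_cycles` and `estimate_seconds`, and the choice of MHz for the frequency parameter, are hypothetical.

```c
/* Toy cycle count (assumption): fill latency plus the cycles needed to
   push n elements through, `vec` elements per group and `reroll` cycles
   per group. */
long long estimate_cycles(long long n, int reroll, int vec, int latency)
{
    long long groups = (n + vec - 1) / vec; /* ceiling division */
    return latency + groups * reroll;
}

/* Execution time in seconds at a clock frequency given in MHz. */
double estimate_seconds(long long cycles, double freq_mhz)
{
    return (double)cycles / (freq_mhz * 1e6);
}
```

In this model a higher vectorization factor shortens the run, a higher rerolling factor lengthens it, and the achievable frequency scales the result, which is why an upper bound on synthesis frequency matters to the estimate.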

Experimental evaluations
EXPERIMENTAL SETUP
• MaxCompiler
• Galava MAX4 board
• Stratix V FPGA
APPLICATIONS
• Asian option pricing 30 avg. points
• Asian option pricing 780 avg. points
• Quantum Monte Carlo

Results
Algorithm | Reroll. factor | Cyclic dataflow | Data interl. | DSP push | Pipel. push | Freq. | Speedup w.r.t. SOA | Speedup w.r.t. CPU
AOP 30    | 4              | yes             | yes          | 0.1      | 0.3         | 210   | 1.34x w.r.t. [1]   | 118.4x
AOP 30    | 4              | yes             | yes          | 0.1      | 0.3         | 210   | 1.23x w.r.t. [2]   | 118.4x
AOP 780   | 98             | yes             | yes          | 0.1      | 0.3         | 215   | 0.5x w.r.t. [2]    | 101.6x
VMC       | 128            | yes             | yes          | 0.1      | 0.3         | 210   | 0.93x w.r.t. [3]   | 26x
The CPU baseline is a single-threaded implementation compiled with gcc 4.4.7 and -O3 optimization, run on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
[1] F. Peverelli, M. Rabozzi, E. Del Sozzo, and M. D. Santambrogio, “OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA-based kernels,” in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2018, pp. 91–98.
[2] A. M. Nestorov, E. Reggiani, H. Palikareva, P. Burovskiy, T. Becker, and M. D. Santambrogio, “A scalable dataflow implementation of Curran’s approximation algorithm,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2017, pp. 150–157.
[3] S. Cardamone, J. R. Kimmitt, H. G. Burton, and A. J. Thom, “Field programmable gate arrays and quantum Monte Carlo: Power efficient co-processing for scalable high-performance computing,” arXiv preprint arXiv:1808.02402, 2018.

Thank you!
https://www.slideshare.net/necstlab https://necst.it/
