1
DIPARTIMENTO DI ELETTRONICA,
INFORMAZIONE E BIOINGEGNERIA
OXiGen
From C to FPGA dataflow kernels
Francesco Peverelli: francesco1.peverelli@mail.polimi.it
Marco Rabozzi: marco.rabozzi@polimi.it
Emanuele Del Sozzo: emanuele.delsozzo@polimi.it
Marco Domenico Santambrogio: marco.santambrogio@polimi.it
May 17th - 30th 2019
NGCX, San Francisco
2
Image property of
FPGA design methods
[Figure: Performance vs. Design Time. Platforms plotted: x86, GPU, DSP, FPGA with HLS, FPGA with RTL. Markers: first working version, optimized version. Annotation: software project design time limit.]
3
Image property of
FPGA design methods
[Figure: Performance vs. Design Time. Platforms plotted: x86, GPU, DSP, FPGA with HLS, FPGA with RTL, FPGA with OXiGen. Markers: first working version, optimized version. Annotation: software project design time limit.]
4
The dataflow model
[Figure: a set of dataflow cores exchanging Data with Memory.]
5
Contributions
• PRODUCTIVITY
• AUTOMATED TRANSLATION FROM C TO DATAFLOW
• DESIGN-SPACE EXPLORATION
• AUTOMATED TESTING
• PERFORMANCE
6
Target architecture
• FPGA resources are logically divided into two portions:
– Manager: responsible for the communication with the host
– Kernel: performs the actual dataflow computation
7
OXiGen overview
[Figure: tool flow in three stages: frontend flow, function optimization flow, backend flow. Components: LLVM, function analysis, dataflow translator, DSE module, backend translator, backend synthesis tool. Artifacts: LLVM IR, DFG IR, TECHNOLOGY LIBRARY, OPT. CONFIG., SYNTHESIS-READY CODE, FPGA BITSTREAM.]
8
void f(int in_1, float* in_2, …, float* out_n) {
    for (int i = 0; i < N; i++) {
        /* nested loops of arbitrary depth */
        for (int j … ) {
            … statements …
            for (int k … ) { … }
        }
        … statements …
        /* local array and scalar accumulator */
        int a[N][M] = { … };
        float s = 0;
        /* accumulation over j */
        for (int j = 0; j < M; j++) {
            … statements …
            s += a[i][j] * … ;
        }
    }
}
OXiGen code example
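A concrete, compilable instance of the pattern above may help make the schematic easier to read. This is only a sketch: the function name `weighted_sums`, the sizes `N` and `M`, and the coefficient table are hypothetical and do not come from the slides. It keeps the same shape: an outer loop over `i`, a local table `a`, and an inner loop that accumulates into `s`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { N = 4, M = 3 };

/* For each i, accumulate a[i][j] * in[i] over j and store the sum in out[i]. */
void weighted_sums(const float *in, float *out) {
    /* Hypothetical constant coefficient table. */
    static const int a[N][M] = {
        {1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}
    };

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < N; i++) {
        float s = 0.0f;                      /* accumulator, reset per i */
        for (size_t j = 0; j < M; j++) {
            s += (float)a[i][j] * in[i];     /* inner accumulation over j */
        }
        out[i] = s;
    }
}
```

Calling `weighted_sums` with a 4-element input array fills a 4-element output array, one accumulated value per outer iteration.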
9
Optimization strategies
REROLLING:
• Nested loops are unrolled by default
[Figure: Throughput vs. HW resources. Design points: unoptimized design, resources-driven design, throughput-driven design. Optimizations shown: VECTORIZATION, REROLLING, CYCLIC DFG, DATA INTERLEAVING.]
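The trade-off behind rerolling can be sketched in plain C. The sketch assumes that rerolling by a factor R means keeping only M/R multiply-add lanes and reusing them for R ticks, so that resources shrink while the time per result grows. That reading, the function names, and the size `M` are assumptions for illustration, not the tool's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inner-loop trip count. */
enum { M = 8 };

/* Unrolled view: conceptually all M multiply-adds exist side by side,
   so one result is produced per tick. */
float dot_unrolled(const float a[M], const float b[M]) {
    float s = 0.0f;
    for (size_t j = 0; j < M; j++) {
        s += a[j] * b[j];
    }
    return s;
}

/* Rerolled view: only M / reroll lanes exist and are reused for
   `reroll` ticks. The tick count is reported through *ticks. */
float dot_rerolled(const float a[M], const float b[M],
                   size_t reroll, size_t *ticks) {
    assert(ticks != NULL);
    assert(reroll > 0 && M % reroll == 0);

    size_t lanes = M / reroll;   /* operators instantiated at once */
    float s = 0.0f;
    *ticks = 0;
    for (size_t t = 0; t < reroll; t++) {        /* one pass per tick */
        for (size_t l = 0; l < lanes; l++) {     /* lanes active this tick */
            s += a[t * lanes + l] * b[t * lanes + l];
        }
        (*ticks)++;
    }
    return s;
}
```

Both functions visit the elements in the same order and return the same sum. What changes in the rerolled version is the number of lanes (fewer) and the number of ticks per result (more).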
10
Design space exploration
Inputs: Dataflow IR, Technology Library
Free variables:
• θ: implementation-specific
• v_i: optimization-specific, ∀ optimization
Resource model, Performance model → Mixed Integer Linear Programming (MILP) model → θ̂, v̂: optimal values
11
Design space exploration
Resource model → Mixed Integer Linear Programming (MILP) model → θ̂, v̂: optimal values
Takes into account:
• BRAM use estimation
• Memory partitioning
• Technological implementation (DSP Push)
• Rerolling factor
• Vectorization factor
• …
12
Design space exploration
Inputs: Dataflow IR, Technology Library
Performance model → Mixed Integer Linear Programming (MILP) model → θ̂, v̂: optimal values
Takes into account:
• Operator latency (pipeline Push)
• Upper bound on synthesis frequency
• Rerolling factor
• Vectorization factor
• …
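The idea of choosing a rerolling factor and a vectorization factor under a resource budget can be illustrated with a toy search. This sketch replaces the MILP formulation with exhaustive enumeration, and its resource and throughput formulas are invented for illustration; none of the names (`explore`, `DsePoint`, `lane_overhead`) or the cost model come from the slides.

```c
#include <assert.h>

/* Result of the toy exploration. */
typedef struct {
    int reroll;          /* chosen rerolling factor */
    int vec;             /* chosen vectorization factor */
    double throughput;   /* toy throughput estimate */
    int feasible;        /* 0 if no configuration fits the budget */
} DsePoint;

/* Toy resource model (assumption): each vector lane needs
   ceil(base_ops / reroll) operators plus a fixed overhead. */
static int toy_resources(int base_ops, int lane_overhead, int reroll, int vec) {
    int ops_per_lane = (base_ops + reroll - 1) / reroll;
    return vec * (ops_per_lane + lane_overhead);
}

/* Toy performance model (assumption): throughput scales with the number
   of lanes and inversely with the rerolling factor. */
static double toy_throughput(double freq, int reroll, int vec) {
    return freq * (double)vec / (double)reroll;
}

/* Enumerate all (reroll, vec) pairs and keep the feasible one with the
   highest toy throughput. This is a stand-in for solving a MILP. */
DsePoint explore(int base_ops, int lane_overhead, int budget, double freq,
                 int max_reroll, int max_vec) {
    DsePoint best = {0, 0, 0.0, 0};
    assert(base_ops > 0 && max_reroll > 0 && max_vec > 0);

    for (int r = 1; r <= max_reroll; r++) {
        for (int v = 1; v <= max_vec; v++) {
            if (toy_resources(base_ops, lane_overhead, r, v) > budget) {
                continue;   /* violates the resource constraint */
            }
            double t = toy_throughput(freq, r, v);
            if (!best.feasible || t > best.throughput) {
                best.reroll = r;
                best.vec = v;
                best.throughput = t;
                best.feasible = 1;
            }
        }
    }
    return best;
}
```

Enumeration is only practical for tiny search spaces like this one. A MILP model, as named on the slides, expresses the same constraints and objective for a solver instead of looping over every candidate.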
13
Experimental evaluations
EXPERIMENTAL SETUP
• MaxCompiler
• Galava MAX4 board
• Stratix V FPGA
APPLICATIONS
• Asian option pricing 30 avg. points
• Asian option pricing 780 avg. points
• Quantum Monte Carlo
14
Results
Algorithm  Reroll. factor  Cyclic dataflow  Data interl.  DSP push  Pipel. push  Freq.  Speedup w.r.t. SOA  Speedup w.r.t. CPU
AOP 30     4               yes              yes           0.1       0.3          210    1.34x w.r.t. [1]    118.4x
AOP 30     4               yes              yes           0.1       0.3          210    1.23x w.r.t. [2]    118.4x
AOP 780    98              yes              yes           0.1       0.3          215    0.5x w.r.t. [2]     101.6x
VMC        128             yes              yes           0.1       0.3          210    0.93x w.r.t. [3]    26x
The CPU baseline is a single-threaded implementation compiled with gcc 4.4.7 and -O3 optimization, run on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
[1] F. Peverelli, M. Rabozzi, E. Del Sozzo, and M. D. Santambrogio, “OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA-based kernels,” in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2018, pp. 91–98.
[2] A. M. Nestorov, E. Reggiani, H. Palikareva, P. Burovskiy, T. Becker, and M. D. Santambrogio, “A scalable dataflow implementation of Curran’s approximation algorithm,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2017, pp. 150–157.
[3] S. Cardamone, J. R. Kimmitt, H. G. Burton, and A. J. Thom, “Field programmable gate arrays and quantum Monte Carlo: Power efficient co-processing for scalable high-performance computing,” arXiv preprint arXiv:1808.02402, 2018.
15
Thank you!
https://www.slideshare.net/necstlab https://necst.it/

OXiGen: Automated FPGA design flow from C applications to dataflow kernels - pitch version
