3. DNN-HW
FPGA
- Verilog HDL/VHDL
- C/C++ (High Level Synthesis)
  - C/C++ Tensorflow
- DNN
  - HLS (Vivado HLS, Intel HLS):
DNN-HW
4.
- DNN
  - Tensorflow
    - HW
- HLS Veriloggen
  https://github.com/PyHDI/veriloggen
- HLS C++/C HDL
- Veriloggen.Thread, Veriloggen.Stream
5. NNgen
- DNN IP
- Input: Tensorflow-like model definition
- Output: RTL + IP
  - Veriloggen Object
  - Verilog HDL
  - IP-XACT
[Figure: NNgen compilation flow]
Model Definition: layer0 = ng.conv2d(a0, w0, ...)
NNgen
  Scheduler: Graph Optimization, Task Scheduling
  Allocator: RAM Assignment, Stream-Op Assignment
  Pipeline Synthesis: Building Stream-Op via Veriloggen.Stream API
  Control Synthesis: Building FSM via Veriloggen.Thread API
  Code Synthesis: RTL and IP-XACT generation via Veriloggen/IPgen
Pyverilog: Verilog HDL AST Abstraction
IPgen: RTL to IP-XACT
Veriloggen
  Veriloggen.Thread: Procedural HLS: Python Source Code -> AST -> FSM
  Veriloggen.Stream: Dataflow HLS: Dataflow Definition -> Scheduled Pipeline
  Veriloggen.Core: Verilog HDL Abstraction and Meta-Programming API
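The first two stages of the flow (Scheduler, Allocator) can be pictured with a small toy in plain Python. This is only a sketch under stated assumptions: the graph, the names `DEPS`, `KIND`, `schedule` and `assign`, and both assignment policies are hypothetical illustrations, not NNgen's actual data structures or algorithms.

```python
from graphlib import TopologicalSorter

# Hypothetical model graph: each op lists the ops whose outputs it consumes.
DEPS = {
    "input": set(),
    "conv0": {"input"},
    "pool0": {"conv0"},
    "conv1": {"pool0"},
    "matmul0": {"conv1"},
}
# Hypothetical operator kinds; None marks a node that needs no stream op.
KIND = {"input": None, "conv0": "conv2d", "pool0": "max_pool",
        "conv1": "conv2d", "matmul0": "matmul"}


def schedule(deps):
    """Stand-in for task scheduling: producers are ordered before consumers."""
    return list(TopologicalSorter(deps).static_order())


def assign(order, kind):
    """Stand-in for RAM and stream-op assignment.

    Ops of the same kind share one stream op, and every op gets its own
    output RAM. Both policies are simplifying assumptions.
    """
    stream_of, ram_of = {}, {}
    for index, name in enumerate(order):
        ram_of[name] = f"ram_{index}"
        if kind[name] is not None:
            stream_of[name] = f"stream_{kind[name]}"
    return stream_of, ram_of


order = schedule(DEPS)
stream_of, ram_of = assign(order, KIND)
print(order)
print(stream_of)
```

In this toy, conv0 and conv1 end up on the same stream op, which mirrors the idea of a shared pool of computing units driven by a schedule.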
13. NNgen-DNN
[Figure: architecture of the generated NNgen IP-core (IP-XACT)]
- CPU and DRAM, attached through AXI4Interconnect
- AXI4 Master I/F, AXI4 Slave I/F, Config Register
- DMAController, DMAInterconnect
- RAM Pool: BRAM (Width: 16x4-bit), BRAM (Width: 16x4-bit), BRAM; MemoryInterconnect
- Substream Pool: Mul (12 shown), Acc (4 shown), AddTree (4 shown); SubstreamInterconnect
- Computing Unit Pool, each unit with ThreadArg and Stream:
  - conv2d 3x3, Parallel: 3x3x4x4
  - max_pool 2x2, Parallel: 4
  - matmul, Parallel: 4x4
- Main Thread
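The "Parallel" labels and the AddTree units in the diagram can be made concrete with a little arithmetic. The sketch below assumes that a label such as 3x3x4x4 means the product of the unrolling factors, i.e. the number of operations issued per cycle, and that AddTree is a balanced pairwise adder tree. Both readings are assumptions, and the names `PARALLEL`, `lanes` and `add_tree` are illustrative only.

```python
from math import prod

# Parallel factors as written in the diagram.
PARALLEL = {
    "conv2d 3x3": (3, 3, 4, 4),
    "max_pool 2x2": (4,),
    "matmul": (4, 4),
}


def lanes(factors):
    """Operations per cycle, assuming the factors multiply."""
    return prod(factors)


def add_tree(values):
    """Sum values pairwise, level by level, like a balanced adder tree.

    Returns the sum and the number of adder levels (the tree depth).
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


for name, factors in PARALLEL.items():
    print(name, lanes(factors))

# Reducing the 9 products of one 3x3 window takes 4 adder levels.
total, depth = add_tree(range(9))
print(total, depth)
```

Under these assumptions the conv2d unit would need 144 multiplications per cycle, which is why the multipliers are drawn as a shared pool rather than owned by a single unit.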
14. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated; callout: OP]
15. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated; callout: OP]
16. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated; callout: NoC]
17. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated; callout: RAM]
18. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated; callout: RAM, NoC]
19. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated; callout: AXI4-Master + DMA]
20. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated; callout: FSM]
21. NNgen-DNN
[Figure: architecture diagram of slide 13 repeated]
35.
- FSM NNgen
- RTL + IP Veriloggen
- FSM Python
[Figure: NNgen compilation flow, repeated from slide 5]
36. Veriloggen: Python RTL
Design Generator by Python
    from veriloggen import *
    # Build the module object and its ports
    m = Module('blinkled')
    clk = m.Input('CLK')
    led = m.Output('LED', 8)
    count = m.Reg('count', 32)
    # LED shows the upper 8 bits of the counter
    m.Assign( led(count[31:24]) )
    # Increment the counter on every rising clock edge
    m.Always(Posedge(clk))(
        count( count + 1 ) )
    # Convert the Veriloggen object to Verilog source text
    hdl = m.to_verilog()
    print(hdl)
[Figure: Veriloggen Object for module blinkled, holding CLK, RST, LED, count, assign, always]
[Figure: Verilog AST: module blinkled with input CLK, input RST]
to_verilog(): Veriloggen Object -> Verilog AST Generator -> Verilog AST -> Verilog Code Generator -> Verilog Source Code (Run on Python Interpreter)
Verilog Source Code
    module blinkled (
      input CLK,
      output [7:0] LED
    );
      reg [31:0] count;
      assign LED = count[31:24];
      always @(posedge CLK) begin
        count <= count + 1;
      end
    endmodule
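To see what the generated blinkled module does, its behavior can be mimicked in plain Python without any HDL tools. This is a behavioral sketch only: `simulate_blinkled` is a hypothetical helper, not part of Veriloggen, and it assumes the 32-bit counter wraps on overflow as a Verilog `reg [31:0]` would.

```python
def simulate_blinkled(cycles, count=0):
    """Advance the 32-bit counter by `cycles` clock edges.

    Returns the final counter value and the 8-bit LED value, which is
    count[31:24] as in the generated Verilog.
    """
    mask = (1 << 32) - 1
    for _ in range(cycles):
        count = (count + 1) & mask  # count <= count + 1, wrapping at 32 bits
    led = (count >> 24) & 0xFF      # assign LED = count[31:24]
    return count, led


# LED stays at 0 until the counter reaches 2**24.
print(simulate_blinkled(1, count=(1 << 24) - 1))
# The counter wraps back to 0 after 2**32 - 1.
print(simulate_blinkled(1, count=(1 << 32) - 1))
```

The LED value changes once every 2**24 clock cycles, which is what makes the blinking visible at typical FPGA clock rates.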
43.
- Veriloggen HW
  - LLVM Veriloggen
  - NNgen Veriloggen.Thread Veriloggen.Stream API
- Veriloggen.Stream API
  - c = a + b, z = x * y
- Veriloggen.Thread API FSM
  - NNgen FSM
  - RAM
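The split between a dataflow definition (c = a + b, z = x * y) and an FSM that drives it from RAM can be imitated in plain Python. This sketch does not use the Veriloggen.Stream or Veriloggen.Thread APIs. The state names, the list-based RAMs and the functions `dataflow` and `run_control` are hypothetical stand-ins chosen for illustration.

```python
def dataflow(a, b, x, y):
    """Dataflow part: the two expressions from the slide, no control logic."""
    c = a + b
    z = x * y
    return c, z


def run_control(ram_a, ram_b, ram_x, ram_y):
    """Control part: a small FSM that feeds RAM contents to the dataflow.

    Lists stand in for RAMs; one loop iteration stands in for one state
    transition. Returns the two output RAMs and the visited states.
    """
    state, addr = "INIT", 0
    ram_c, ram_z, trace = [], [], []
    while state != "DONE":
        trace.append(state)
        if state == "INIT":
            addr = 0
            state = "RUN" if ram_a else "DONE"
        elif state == "RUN":
            c, z = dataflow(ram_a[addr], ram_b[addr], ram_x[addr], ram_y[addr])
            ram_c.append(c)
            ram_z.append(z)
            addr += 1
            if addr == len(ram_a):
                state = "DONE"
    return ram_c, ram_z, trace


print(run_control([1, 2], [10, 20], [3, 4], [5, 6]))
```

Keeping the arithmetic free of control decisions is what lets a dataflow description be pipelined, while the sequencing stays in a separate state machine.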
44.
- NNgen: DNN-HW
  - 8bit
  - ONNX, TVM
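The slide only says "8bit". Assuming this refers to 8-bit integer arithmetic for weights and activations, a common way to get there is symmetric linear quantization, sketched below in plain Python. The scheme and the function names `quantize_int8` and `dequantize` are illustrative assumptions, not a description of what NNgen implements.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [v * scale for v in q]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print(restored)
```

With this scheme the rounding error per value is at most half of the scale, so the largest magnitude in the tensor determines the precision of all the others.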
[Figure: NNgen compilation flow, repeated from slide 5]