Development of an Extensible High-Level Synthesis Compiler for Deep Neural Networks

Development of an Extensible High-Level Synthesis Compiler for Deep Neural Networks (September 17, 2018, IEICE Technical Committee on Reconfigurable Systems (RECONF), at LINE Fukuoka)

  1. September 17, 2018, 13:25-13:50, IEICE-RECONF @ LINE Fukuoka
  2. FPGA DNN
       • FPGA: HW
       • CPU/GPU
       • DNN
     [Figure: FPGA fabric drawn as a grid of LB, CB and SB blocks surrounded by IOBs]
  3. DNN-HW FPGA
       • Verilog HDL/VHDL
       • C/C++ (High Level Synthesis)
           • C/C++, TensorFlow
       • DNN
           • HLS (Vivado HLS, Intel HLS)
  4. • DNN
         • TensorFlow
             • HW
     • HLS Veriloggen: https://github.com/PyHDI/veriloggen
     • HLS, C++/C, HDL
     • Veriloggen.Thread, Veriloggen.Stream
  5. NNgen
       • DNN IP
       • TensorFlow
       • RTL + IP
           • Veriloggen Object
           • Verilog HDL
           • IP-XACT
     [Figure: NNgen compilation flow]
       Model Definition: layer0 = ng.conv2d(a0, w0, ...)
       NNgen
         Scheduler: Graph Optimization, Task Scheduling
         Allocator: RAM Assignment, Stream-Op Assignment
         Pipeline Synthesis: Building Stream-Op via Veriloggen.Stream API
         Control Synthesis: Building FSM via Veriloggen.Thread API
         Code Synthesis: RTL and IP-XACT generation via Veriloggen/IPgen
       Pyverilog: Verilog HDL AST Abstraction
       IPgen: RTL to IP-XACT
       Veriloggen
         Veriloggen.Thread: Procedural HLS: Python Source Code -> AST -> FSM
         Veriloggen.Stream: Dataflow HLS: Dataflow Definition -> Scheduled Pipeline
         Veriloggen.Core: Verilog HDL Abstraction and Meta-Programming API
  6. NNgen DNN-HW: NNgen, TensorFlow
  7. placeholder: DNN
  8. placeholder: DNN
     conv2d: 2D convolution (w/ ReLU); DNN RAM HW
  9. placeholder: DNN
     conv2d
     max_pool: DNN RAM HW
  10. placeholder: DNN
      conv2d
      max_pool
      reshape: numpy.reshape, tf.reshape
  11. placeholder: DNN
      conv2d
      max_pool
      reshape: 4D → 1D
      matmul: DNN RAM HW
  12. max_pool
      reshape: 4D → 1D
      matmul: DNN
      DNN, Veriloggen ≒ RTL, IP-XACT
  13. NNgen-DNN
      [Figure: block diagram of the generated hardware]
        CPU, DRAM, AXI4 Interconnect
        NNgen IP-core (IP-XACT)
          Main Thread
          Computing Unit Pool: conv2d 3x3 (Parallel: 3x3x4x4), max_pool 2x2 (Parallel: 4), matmul (Parallel: 4x4); each unit has a ThreadArg and a Stream
          Substream Pool: Mul x12, Acc x4, AddTree x4; Substream Interconnect
          RAM Pool: BRAMs (Width: 16x4-bit); Memory Interconnect
          DMA Interconnect, DMA Controller, AXI4 Master I/F, AXI4 Slave I/F, Config Register
  14. NNgen-DNN (same block diagram as slide 13): OP
  15. NNgen-DNN (same block diagram as slide 13): OP
  16. NNgen-DNN (same block diagram as slide 13): NoC
  17. NNgen-DNN (same block diagram as slide 13): RAM
  18. NNgen-DNN (same block diagram as slide 13): RAM, NoC
  19. NNgen-DNN (same block diagram as slide 13): AXI4-Master + DMA
  20. NNgen-DNN (same block diagram as slide 13): FSM, FSM, FSM
  21. NNgen-DNN (same block diagram as slide 13)
  22. conv2d
        Act (rank: 4): [Bat][Height][Width][Ch]
        Weight (rank: 4): [OutCh][Kh][Kw][InCh]
      [Figure: activation tensor with axes Batch, Height, Width, (Input) Channel; weight tensor with axes Output Channel, Kernel-H, Kernel-W, Input Channel]
  23. conv2d
      [Figure: conv2d datapath built from Substreams, shown for Pixel 0]
        Act (1,1) to Act (3,3) BRAMs, 16x4-bit each, behind a 2-stage Xbar
        Weight BRAM 16x4-bit x4 (InCh 0 to InCh 3, OutCh 0 to OutCh 3)
        Mul x16, AddTree (4x3x3 input) x4, Acc x4, rshift
        Bias BRAM 16x4-bit with Add x4; Scale BRAM 16x4-bit with Mul x4; rshift
        ReLU x4, Out BRAM 16x4-bit
  24. conv2d (same datapath diagram as slide 23): RAM
  25. conv2d (same datapath diagram as slide 23): Substream Pool
  26. conv2d (same datapath diagram as slide 23): Substream Pool
  27. conv2d (same datapath diagram as slide 23): Substream Pool
  28. conv2d (same datapath diagram as slide 23): Batch Normalization
  29. conv2d (same datapath diagram as slide 23)
  30. conv2d (same datapath diagram as slide 23)
  31. IP
  32.
  33. RAM, FSM
  34. AXI4-slave
  35. • FSM, NNgen
      • RTL + IP, Veriloggen
      • FSM, Python
      [Figure: NNgen compilation flow, the same diagram as on slide 5]
  36. Veriloggen: Python RTL
      Design Generator by Python (run on the Python interpreter):

          from veriloggen import *

          m = Module('blinkled')
          clk = m.Input('CLK')
          led = m.Output('LED', 8)
          count = m.Reg('count', 32)

          # Drive the LEDs from the top byte of the counter.
          m.Assign( led(count[31:24]) )

          # Increment the counter on every rising clock edge.
          m.Always(Posedge(clk))(
              count( count + 1 )
          )

          hdl = m.to_verilog()
          print(hdl)

      [Figure: the Veriloggen Object (module blinkled with CLK, RST, LED, count, assign, always) is turned by to_verilog() into a Verilog AST (Verilog AST Generator), and the Verilog Code Generator emits the Verilog source code]

      Verilog Source Code:

          module blinkled
          (
            input CLK,
            output [7:0] LED
          );
            reg [31:0] count;
            assign LED = count[31:24];
            always @(posedge CLK) begin
              count <= count + 1;
            end
          endmodule
  37. Veriloggen
      [Figure: Veriloggen layers] Veriloggen.Core (RTL); Thread (Python-to-FSM) with RAM; Stream (Computing Unit) with RAM; Stream Control; Thread Bus + DMA (AXI4 Master/Slave); AXI4 Interconnect; DRAM, CPU; RTL Control, Intrinsic, DMA Control, DMA Burst Transfer
  38. Thread: Python-to-FSM (LED)
        • Module I/O, Verilog (CLK, RST, LED)
        • Thread, Python, I/O
        • Thread, FSM
  39. Stream
      [Figure: ram_a, ram_b, ram_c, "+", ACC]
        • source/sink
        • run/join
        • RAM
  40. Substream: Stream
        • MAC Stream
  41. [Figure: ram_a, ram_b, ram_c, "+", ACC, with a Mult and Add substream]
  42. DMA
        • I/F: RAM: RAM; AXIM: AXI4 IF; AXIS: AXI4 IF
        • Thread DMA (Async)
            • dma_read: read (Burst Read)
            • dma_write: write (Burst Write)
  43. • Veriloggen, HW
          • LLVM, Veriloggen
          • NNgen, Veriloggen.Thread, Veriloggen.Stream API
      • Veriloggen.Stream API
          • c = a + b, z = x * y
      • Veriloggen.Thread API, FSM
          • NNgen, FSM
          • RAM
  44. • NNgen: DNN-HW
      • 8bit
      • ONNX, TVM
      [Figure: NNgen compilation flow, the same diagram as on slide 5]
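Slides 7 to 12 build a model step by step as placeholder, conv2d, max_pool, reshape (4D to 1D) and matmul. The sketch below only tracks how tensor shapes could flow through such a chain. It does not use the NNgen API: every function name, the stride-1 "SAME" padding, the `[Out][In]` matmul weight layout and all concrete sizes are assumptions made for illustration.

```python
from math import prod


def conv2d_shape(act, weight):
    """Act is [Bat][Height][Width][Ch], weight is [OutCh][Kh][Kw][InCh].

    Assumes stride 1 and "SAME" padding, so height and width are kept.
    """
    bat, height, width, ch = act
    out_ch, _kh, _kw, in_ch = weight
    assert ch == in_ch, "input channels of act and weight must match"
    return (bat, height, width, out_ch)


def max_pool_shape(act, ksize=2, stride=2):
    """Pooling without padding over the height and width axes."""
    bat, height, width, ch = act
    return (bat, (height - ksize) // stride + 1, (width - ksize) // stride + 1, ch)


def reshape_shape(act, new_shape):
    """Like numpy.reshape: a single -1 entry is inferred from the element count."""
    total = prod(act)
    known = prod(d for d in new_shape if d != -1)
    out = tuple(total // known if d == -1 else d for d in new_shape)
    assert prod(out) == total, "reshape must keep the number of elements"
    return out


def matmul_shape(act, weight):
    """Act is [Bat][In]; weight is assumed to be stored as [Out][In]."""
    bat, n_in = act
    n_out, w_in = weight
    assert n_in == w_in, "inner dimensions must match"
    return (bat, n_out)


# Hypothetical sizes: a 32x32 RGB image, 16 conv filters, 10 output classes.
a0 = (1, 32, 32, 3)                      # placeholder
a1 = conv2d_shape(a0, (16, 3, 3, 3))     # conv2d 3x3
a2 = max_pool_shape(a1)                  # max_pool 2x2
a3 = reshape_shape(a2, (1, -1))          # reshape: 4D to flat features per batch
a4 = matmul_shape(a3, (10, a3[1]))       # matmul
print(a1, a2, a3, a4)
```

Tracking shapes this way is useful because the RAM sizes and the degree of parallelism named on the architecture slides depend on them.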
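Slide 22 gives the conv2d data layouts: activations as `[Bat][Height][Width][Ch]` and weights as `[OutCh][Kh][Kw][InCh]`. A plain-Python reference loop nest over exactly these index orders may make the layout concrete. This is a minimal sketch that assumes stride 1 and no padding; the function name `conv2d_ref` is hypothetical and is not part of NNgen.

```python
def conv2d_ref(act, weight):
    """Reference 2D convolution on nested lists (stride 1, no padding).

    act[b][y][x][c]      : [Bat][Height][Width][Ch]
    weight[o][ky][kx][c] : [OutCh][Kh][Kw][InCh]
    returns out[b][y][x][o]
    """
    bat, height, width = len(act), len(act[0]), len(act[0][0])
    in_ch = len(act[0][0][0])
    out_ch, kh, kw = len(weight), len(weight[0]), len(weight[0][0])
    assert len(weight[0][0][0]) == in_ch, "InCh must match Ch"

    out_h, out_w = height - kh + 1, width - kw + 1
    out = [[[[0] * out_ch for _ in range(out_w)]
            for _ in range(out_h)]
           for _ in range(bat)]

    for b in range(bat):
        for y in range(out_h):
            for x in range(out_w):
                for o in range(out_ch):
                    acc = 0
                    # The innermost index is the input channel in both
                    # tensors, so channel data is contiguous for act and weight.
                    for ky in range(kh):
                        for kx in range(kw):
                            for c in range(in_ch):
                                acc += act[b][y + ky][x + kx][c] * weight[o][ky][kx][c]
                    out[b][y][x][o] = acc
    return out


# A 1x3x3x1 activation holding 1..9 and a single all-ones 3x3 kernel.
act = [[[[3 * r + col + 1] for col in range(3)] for r in range(3)]]
weight = [[[[1] for _ in range(3)] for _ in range(3)]]
print(conv2d_ref(act, weight))
```

Because the channel index is innermost in both layouts, a hardware datapath can read several channels of one pixel or one kernel tap from a single wide RAM word, which fits the "16x4-bit" BRAM labels on the datapath slides.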
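The conv2d datapath on slides 23 to 30 lists Mul units, AddTrees, accumulators (Acc), Bias and Scale RAMs, rshift stages and ReLU. One way to read that list is as an integer pipeline for a single output value, sketched below. The stage order (multiply, add tree, accumulate, add bias, multiply by scale, shift right, ReLU) and the single shift after scaling are assumptions; the slides show several rshift stages whose exact positions cannot be recovered from the transcript. All names are hypothetical.

```python
def add_tree(values):
    """Reduce a list by pairwise additions, as a balanced adder tree would."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # odd element passes through to the next level
        vals = nxt
    return vals[0] if vals else 0


def conv2d_output(act_groups, weight_groups, bias, scale, rshift):
    """Compute one output value from integer inputs.

    act_groups / weight_groups: one flat list per step, each holding the
    activations and weights that are multiplied in parallel in that step.
    """
    acc = 0
    for acts, weights in zip(act_groups, weight_groups):
        products = [a * w for a, w in zip(acts, weights)]  # Mul units
        acc += add_tree(products)                          # AddTree, then Acc
    biased = acc + bias           # Add with the value from the Bias RAM
    scaled = biased * scale       # Mul with the value from the Scale RAM
    shifted = scaled >> rshift    # arithmetic right shift back to a narrow range
    return max(shifted, 0)        # ReLU


out = conv2d_output(
    act_groups=[[1, 2, 3, 4], [1, 1, 1, 1]],
    weight_groups=[[1, 1, 1, 1], [2, -2, 2, -2]],
    bias=6, scale=3, rshift=2,
)
print(out)
```

Splitting the work into groups mirrors the idea that only a limited number of multipliers exist, so a large kernel window is consumed over several steps while the accumulator holds the partial sum.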
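Slide 38 describes Veriloggen.Thread as Python-to-FSM, and the flow diagram names the path "Python Source Code -> AST -> FSM". To illustrate the idea only, the sketch below hand-translates a small sequential loop (`for i in range(times): led += 1`) into an explicit state machine that advances one state per clock. The state names and the class are hypothetical and are not what Veriloggen generates.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()    # initialise loop variable and output
    CHECK = auto()   # evaluate the loop condition
    BODY = auto()    # execute the loop body
    DONE = auto()    # procedure finished


class BlinkFsm:
    """Hand-written FSM equivalent of: led = 0; for i in range(times): led += 1"""

    def __init__(self, times):
        self.times = times
        self.state = State.INIT
        self.i = 0
        self.led = 0

    def clock(self):
        """Advance by one clock cycle, i.e. take exactly one state transition."""
        if self.state is State.INIT:
            self.i = 0
            self.led = 0
            self.state = State.CHECK
        elif self.state is State.CHECK:
            self.state = State.BODY if self.i < self.times else State.DONE
        elif self.state is State.BODY:
            self.led += 1
            self.i += 1
            self.state = State.CHECK
        # In DONE the machine simply stays where it is.


def run(fsm, max_cycles=1000):
    """Clock the FSM until it reaches DONE and return the cycle count."""
    cycles = 0
    while fsm.state is not State.DONE and cycles < max_cycles:
        fsm.clock()
        cycles += 1
    return cycles


fsm = BlinkFsm(times=3)
cycles = run(fsm)
print(fsm.led, cycles)
```

Each Python statement becomes one or more states, and control flow becomes state transitions, which is why a procedural description can be turned into synthesizable control logic.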
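Slides 39 to 41 show a Stream that reads from `ram_a` and `ram_b`, combines the values, accumulates them (ACC) and writes to `ram_c`, with a multiply-add (MAC) variant built as a Substream. A small behavioral model of that dataflow is sketched below. It is not the Veriloggen.Stream API: `run_stream` and its parameters are hypothetical, and it assumes that only the final accumulated value is written to the sink.

```python
import operator


def run_stream(ram_a, ram_b, ram_c, size, op, dst_addr=0):
    """Model of one stream execution.

    Two sources read `size` words from ram_a and ram_b, `op` combines each
    pair, an accumulator sums the results, and the sink stores the final
    sum at ram_c[dst_addr].
    """
    acc = 0
    for i in range(size):
        acc += op(ram_a[i], ram_b[i])
    ram_c[dst_addr] = acc
    return acc


ram_a = [1, 2, 3, 4]
ram_b = [10, 20, 30, 40]
ram_c = [0, 0]

# Add-then-accumulate, as in the "+" and ACC figure.
run_stream(ram_a, ram_b, ram_c, size=4, op=operator.add, dst_addr=0)

# Multiply-accumulate, as in the MAC variant.
run_stream(ram_a, ram_b, ram_c, size=4, op=operator.mul, dst_addr=1)

print(ram_c)
```

Passing the operator in as a parameter loosely mirrors the Substream idea: the same accumulate structure is reused while the arithmetic unit in front of it is swapped.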
