CNN Dataflow Implementation on FPGAs
Marco Bacis, Giuseppe Natale, Emanuele Del Sozzo, Marco Domenico Santambrogio
Politecnico di Milano
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
marco.bacis@mail.polimi.it
QUID @ San Francisco
Tuesday, 6th June 2017
Introduction 2
Issues 3
Challenges:
● Huge set of weights and data
● Memory-bound computation
● Need for a design that scales in memory and resources without sacrificing performance
Our Solution 4
Methodology for CNN acceleration on FPGA with Iterative Stencil Loops:
● Exploitation of the dataflow pattern of CNN operations
● Independent modules with a parametric level of parallelism
● Streaming + dataflow computational paradigm with efficient memory access
Iterative Stencil Loops 5
● Spatial dependencies (see the stencil sketch below)
● Memory bound
● Enable efficient solutions in terms of performance and power
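To make the ISL connection concrete, here is a minimal sketch (illustrative C++, single feature map, sizes chosen to match the first CIFAR-10 layer; not the authors' code): every output pixel depends only on a small spatial neighborhood of the input, which is exactly the stencil pattern.

    constexpr int H = 32, W = 32, K = 5;          // input and kernel sizes
    constexpr int OH = H - K + 1, OW = W - K + 1; // "valid" output size (28 x 28)

    void conv2d(const float in[H][W], const float w[K][K], float out[OH][OW]) {
        for (int r = 0; r < OH; ++r) {
            for (int c = 0; c < OW; ++c) {
                float acc = 0.0f;
                // Spatial dependency: out[r][c] reads only the K x K
                // window of the input anchored at (r, c).
                for (int i = 0; i < K; ++i)
                    for (int j = 0; j < K; ++j)
                        acc += in[r + i][c + j] * w[i][j];
                out[r][c] = acc;
            }
        }
    }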
Streaming Stencil Timestep 6
● Independent modules communicating over FIFOs
● Concurrent memory access and optimal full buffering (line-buffer sketch below)
● Scalable without increasing external memory use
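A sketch of the full-buffering idea behind an SST, assuming the same "valid" convolution and sizes as the previous sketch (hypothetical code, single feature map; FIFO I/O abstracted as flat arrays): instead of holding the whole frame on chip, only the last K-1 rows plus the current K x K window are buffered, producing one output per streamed-in pixel.

    constexpr int H = 32, W = 32, K = 5;

    float apply_kernel(const float win[K][K], const float w[K][K]) {
        float acc = 0.0f;
        for (int i = 0; i < K; ++i)
            for (int j = 0; j < K; ++j)
                acc += win[i][j] * w[i][j];
        return acc;
    }

    // On-chip state is (K-1)*W + K*K words instead of the full H*W frame:
    // this is the full buffering the slide refers to.
    void conv2d_stream(const float *in, const float w[K][K], float *out) {
        float lines[K - 1][W] = {}; // last K-1 rows (line buffer)
        float window[K][K] = {};    // sliding K x K window
        int o = 0;
        for (int r = 0; r < H; ++r) {
            for (int c = 0; c < W; ++c) {
                float px = in[r * W + c]; // one pixel per "cycle" from the FIFO

                // Shift the window left and build its new rightmost column
                // from the buffered rows plus the incoming pixel.
                for (int i = 0; i < K; ++i)
                    for (int j = 0; j < K - 1; ++j)
                        window[i][j] = window[i][j + 1];
                for (int i = 0; i < K - 1; ++i)
                    window[i][K - 1] = lines[i][c];
                window[K - 1][K - 1] = px;

                // Rotate the line buffer at this column.
                for (int i = 0; i < K - 2; ++i)
                    lines[i][c] = lines[i + 1][c];
                lines[K - 2][c] = px;

                // Emit one output once the window is fully populated.
                if (r >= K - 1 && c >= K - 1)
                    out[o++] = apply_kernel(window, w);
            }
        }
    }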
Proposed Approach 7
Implementation 8
1. Convolution Module Structure
2. Fully Connected Module Structure
3. Network Design
Convolution Module Structure 9
(Diagram: the convolution module built around a Streaming Stencil Timestep, SST)
Convolution Module - Parameters 10
• Input/Output Height
• Input/Output Width
• Number of Input Feature Maps
• Number of Output Feature Maps
Convolution Module - Parameters 11
• Kernel Height
• Kernel Width
• Number of Input Ports (# input FMs received per cycle)
• Number of Output Ports (# output FMs sent per cycle)
A compact sketch of the full parameter set follows below.
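Taken together, the module's compile-time configuration might be summarized as in this sketch (hypothetical names, not the authors' interface; the port counts in the example are arbitrary):

    // Parameter set of one convolution module, mirroring the two slides
    // above; all values are fixed at synthesis time.
    struct ConvParams {
        int in_height, in_width;   // input feature-map size
        int out_height, out_width; // output feature-map size
        int in_fms, out_fms;       // number of input / output feature maps
        int k_height, k_width;     // kernel size
        int in_ports;              // # input FMs received per cycle
        int out_ports;             // # output FMs sent per cycle
    };

    // Example: the first CIFAR-10 layer from the evaluation section
    // (port counts are assumed, not from the deck).
    constexpr ConvParams conv1 = {32, 32, 28, 28, 3, 12, 5, 5, 1, 1};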
Implementation 12
1. Convolution Module Structure
2. Fully Connected Module Structure
3. Network Design
Fully Connected Module Structure 13
● Treated as a 1x1 convolution
● “Compressed” streaming approach
● 1 input port, 1 output port
● Low-latency floating-point accumulation
  ○ Issue for pipelining: the running sum is a loop-carried dependency
  ○ Solved with multiple accumulators + loop unrolling (sketch below)
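The pipelining issue is that a single floating-point accumulator carries a dependency as long as the adder latency. A minimal sketch of the multiple-accumulator workaround (illustrative, not the authors' code; ACC is an assumed value):

    constexpr int ACC = 8; // independent partial sums; chosen to cover
                           // the FP adder pipeline depth

    float dot(const float *x, const float *w, int n) {
        float partial[ACC] = {};
        // Consecutive iterations update different partial sums, so a
        // pipelined (or unrolled) loop never stalls waiting on the
        // previous iteration's floating-point add.
        for (int i = 0; i < n; ++i)
            partial[i % ACC] += x[i] * w[i];
        // Final reduction of the ACC partial sums.
        float sum = 0.0f;
        for (int j = 0; j < ACC; ++j)
            sum += partial[j];
        return sum;
    }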
Implementation 14
1. Convolution Module Structure
2. Fully Connected Module Structure
3. Network Design
Network Design 15
● Convolutional Module
  ○ Memory structure based on I/O ports
  ○ Single- vs multi-channel memory cores
● Pooling Module
  ○ Independent from channel
  ○ One module for each previous output port
● Fully-Connected Module -> single pipelined core (see the wiring sketch below)
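A rough illustration of the wiring only (std::queue stands in for the on-chip FIFO channels; stage bodies are identity placeholders, and on the FPGA all stages run concurrently rather than back to back as they do here):

    #include <queue>

    using Fifo = std::queue<float>; // stands in for a hardware FIFO channel

    // Identity placeholder for a layer module; the real cores are the
    // convolution, pooling and fully-connected modules described above.
    void stage(Fifo &in, Fifo &out) {
        while (!in.empty()) { out.push(in.front()); in.pop(); }
    }

    // Top level for a USPS-like chain (Conv -> Pool -> Conv -> Lin):
    // independent modules connected only through FIFOs.
    void network_top(Fifo &input, Fifo &output) {
        Fifo conv1_out, pool1_out, conv2_out;
        stage(input, conv1_out);     // Conv 1
        stage(conv1_out, pool1_out); // Pool 1
        stage(pool1_out, conv2_out); // Conv 2
        stage(conv2_out, output);    // Lin 1
    }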
Experimental Evaluation 16
● Two evaluation designs
  ○ CIFAR-10 network: Conv -> Pool -> Conv -> Pool -> Lin -> Lin
  ○ USPS network: Conv -> Pool -> Conv -> Lin
● Different design choices as a proof of concept of the methodology
● Tested on a Xilinx VC707 board
CIFAR-10 Network 17
Conv 1: 5 x 5 kernel, 3 in FMs -> 12 out FMs, 32 x 32 input
Pool 1: 2 x 2 kernel, 12 in FMs -> 12 out FMs, 28 x 28 input
Conv 2: 5 x 5 kernel, 12 in FMs -> 36 out FMs, 14 x 14 input
Pool 2: 2 x 2 kernel, 36 in FMs -> 36 out FMs, 10 x 10 input
Lin 1: 900 inputs -> 36 outputs
Lin 2: 36 inputs -> 10 outputs
USPS Network 18
Conv 1: 5 x 5 kernel, 1 in FM -> 6 out FMs, 16 x 16 input
Pool 1: 2 x 2 kernel, 6 in FMs -> 6 out FMs, 12 x 12 input
Conv 2: 5 x 5 kernel, 6 in FMs -> 16 out FMs, 6 x 6 input
Lin 1: 64 inputs -> 10 outputs
Experimental Results 19
Performance improvements with increased batch size
Experimental Results 20

Performance and Power Efficiency Results:
              Dataset   GFLOPS  GFLOPS/W  Images/s
Test Case 1   USPS       5.2     0.25      172414
Test Case 2   CIFAR-10  28.4     1.19        7809
MSR Work [1]  CIFAR-10     -        -        2318

FPGA Resource Usage:
              Flip-Flops  LUTs    BRAM    DSP Slices
Test Case 1     41.10%    50.86%   3.50%   55.04%
Test Case 2     61.77%    71.24%  22.82%   74.32%

[1] K. Ovtcharov et al., “Accelerating deep convolutional neural networks using specialized hardware”, Microsoft Research Whitepaper, 2015
Discussion 21
● Performance improvement over large batches
● High-level pipelining between layers
● Improved memory bandwidth utilization
● High scalability given limited resources
Conclusions & Future Work 22
● Modular and scalable methodology to accelerate CNNs on FPGAs using a dataflow approach
● Tunable level of parallelism allows for further studies and (automated) Design Space Exploration
● Future work: different fixed-precision data types and a scalable multi-FPGA approach
Questions? 23
Marco Bacis
marco.bacis@mail.polimi.it

References
M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio, “A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA”, IPDPS Workshops (RAW), May 2017
M. Bacis, G. Natale, and M. D. Santambrogio, “On how to design dataflow FPGA-based accelerators for Convolutional Neural Networks”, ISVLSI Conference, July 2017 – To Appear
CNN Acceleration 24
GPU: high performance, general architecture, high power consumption, fast development
ASIC: best performance, custom architecture, low power consumption, long/expensive design
FPGA: high performance, reconfigurable architecture, low power consumption, fast design prototyping
Related Works 25
● Previously based on
  ○ Matrix of Processing Elements [1]
  ○ Loop Tiling + Roofline Model [1,2]
  ○ Parallel fixed convolvers [3]
● Issues
  ○ Communication overhead
  ○ Suboptimal exploitation of on-chip memory
  ○ Execution in time (control flow) vs space (dataflow)

[1] C. Zhang et al., “Optimizing FPGA-based accelerator design for deep convolutional neural networks”, ISFPGA 2015
[2] M. Peemen et al., “Memory-centric accelerator design for convolutional neural networks”, ICCD 2013
[3] M. Sankaradas et al., “A massively parallel coprocessor for convolutional neural networks”, ASAP 2009
Convolutional Neural Networks 26
● State-of-the-art image recognition/classification component
● Chain of multiple layers that extract, transform and merge features from the data
● Unfortunately, they are compute intensive and require specific optimizations
