SlideShare a Scribd company logo
1 of 25
Politecnico di Milano
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
marco.bacis@mail.polimi.it
Marco Bacis, Giuseppe Natale, Emanuele Del Sozzo,
Marco Domenico Santambrogio
CNN Dataflow implementation on
FPGAs
Samsung Research
Friday, 9th June 2017
Introduction 2
Issues
3
Challenges
Huge set of weights and data
Memory bounded computation
Need to have a scalable design in terms of
memory and resources
without losing in performance
+
=
● Exploitation of the dataflow pattern of CNN operations
● Independent modules with parametric level of parallelism
● Streaming + Dataflow computational paradigm with
efficient memory access
Our Solution 4
Methodology for CNN acceleration on FPGA with
5
Iterative Stencil Loops
Spatial dependencies
Memory bound
Enable efficient solutions in term of
performance and power
6
● Independent modules communicating over FIFOs
● Concurrent memory access and optimal full buffering
● Scalable without increasing external memory use
Streaming StencilTimestep
7
Proposed Approach
Implementation 8
1. Convolution Module Structure
2. Fully Connected Module Structure
3. Network Design
Convolution Module Structure 9
SST
Convolution Module - Parameters 10
• Input/Output Height
• Input/OutputWidth
• Number of Input Feature Maps
• Number of Output Feature Maps
Convolution Module - Parameters 11
• Kernel Height
• KernelWidth
• Number of Input Ports
• Number of Output Ports
# Input FMs received per cycle
# Output FMs sent per cycle
Implementation 12
1. Convolution Module Structure
2. Fully Connected Module Structure
3. Network Design
Fully Connected Module Structure 13
● Treated as a 1x1 convolution
● “Compressed” streaming approach
● 1 input port, 1 output port
● Low latency Floating point accumulation
● Issue for pipelining
● Multiple accumulators + Loop Unrolling
Implementation 14
1. Convolution Module Structure
2. Fully Connected Module Structure
3. Network Design
Network Design 15
● Convolutional Module
● Memory structure based on I/O ports
● Single vs Multi channel memory cores
● Pooling Module
● Independent from channel
● One module for each previous output port
● Fully-Connected Module -> single pipelined core
Experimental Evaluation 16
● Two evaluation designs
● CIFAR-10 network
Conv -> Pool -> Conv -> Pool -> Lin ->Lin
● USPS network
Conv -> Pool -> Conv -> Lin
● Different design choices as a proof-of-concept of the
methodology
● Tested on a XilinxVC707 board
CIFAR-10 Network 17
5 x 5
3 in FMs
12 out FMs
32 x 32
Conv 1
2 x 2
12 in FMs
12 out FMs
28 x 28
Pool 1
5 x 5
12 in FMs
36 out FMs
14 x 14
Conv 2
2 x 2
36 in FMs
36 out FMs
10 x 10
Pool 2
900 in
36 out
Lin 1
36 in
10 out
Lin 2
USPS Network 18
5 x 5
1 in FMs
6 out FMs
16 x 16
Conv 1
2 x 2
6 in FMs
6 out FMs
12 x 12
Pool 1
5 x 5
6 in FMs
16 out FMs
6 x 6
Conv 2
64 in
10 out
Lin 1
Experimental Results 19
Performance improvements with increased batch size
Experimental Results 20
Dataset GFLOPS GFLOPS/W Images/s
Test Case 1 USPS 5.2 0.25 172414
Test Case 2 CIFAR-10 28.4 1.19 7809
MSR Work [1] CIFAR-10 - - 2318
Flips Flops LUTs BRAM DSP Slices
Test Case 1 41.10% 50.86% 3.50% 55.04%
Test Case 2 61.77% 71.24% 22.82% 74.32%
Performances and Power Efficiency Results
FPGA Resources Usage
[1] K. Ovtcharov et al., “Accelerating deep convolutional neural network using specialized hardware”, Microsoft Research
Whitepaper, 2015
Conclusions 21
● Modular and scalar methodology to accelerate CNNs on
FPGAs using a dataflow approach
● Performance improvement over large batches
● High level pipeline between layers
● Improved memory bandwidth utilization
● High scalability given limited resources
FutureWorks 22
Multi-FPGA / Split layers approach
Automatic DSE / CADTool
Different precision / data type
23
Questions?
Marco Bacis
M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio
“A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA”
IPDPS Workshops (RAW), May 2017
M. Bacis, G. Natale, and M. D. Santambrogio
“On how to design dataflow FPGA-based accelerators for Convolutional Neural Networks”
ISVLSI Conference, July 2017 – To Appear
References
marco.bacis@mail.polimi.it
RelatedWorks● Previously based on
○ Matrix of Processing Elements [1]
○ LoopTiling + Roofline Model [1,2]
○ Parallel fixed convolvers [3]
[1] C.Zhang et al., “Optimizing fpga-based accelerator design for deep convolutional neural networks”, ISFPGA 2015
[2] M.Peemen et al., “Memory-centric accelerator design for convolutional neural networks”, ICCD 2013
[3] M.Sankaradas et al., “A massively parallel coprocessor for convolutional neural networks”, ASAP 2009
24
RelatedWorks
● Issues
○ Communication overhead
○ Suboptimal exploitation of on-chip memory
○ Execution in time (control flow) vs space (dataflow)
Convolutional Neural Networks 25
● State-of-art image recognition/classification component
● Chain of multiple layers to extract, transform and merge
features from the data
● Unfortunately, they are compute intensive and require
specific optimizations

More Related Content

Similar to CNN Dataflow Implementation on FPGAs

Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and SupercomputingAccumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo Summit
 
Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power
Deepak Shankar
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
inside-BigData.com
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
eSAT Journals
 
Exploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design spaceExploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design space
jsvetter
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
NECST Lab @ Politecnico di Milano
 

Similar to CNN Dataflow Implementation on FPGAs (20)

CNN Dataflow Implementation on FPGAs
CNN Dataflow Implementation on FPGAsCNN Dataflow Implementation on FPGAs
CNN Dataflow Implementation on FPGAs
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and SupercomputingAccumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
 
Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software research
 
Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
2017 09-ohkawa-MCSoC2017-presen
2017 09-ohkawa-MCSoC2017-presen2017 09-ohkawa-MCSoC2017-presen
2017 09-ohkawa-MCSoC2017-presen
 
Network on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A surveyNetwork on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A survey
 
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
 
Implementing AI: Hardware Challenges
Implementing AI: Hardware ChallengesImplementing AI: Hardware Challenges
Implementing AI: Hardware Challenges
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
SoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based NetworkingSoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based Networking
 
Trends and challenges in IP based SOC design
Trends and challenges in IP based SOC designTrends and challenges in IP based SOC design
Trends and challenges in IP based SOC design
 
数据中心网络研究:机遇与挑战
数据中心网络研究:机遇与挑战数据中心网络研究:机遇与挑战
数据中心网络研究:机遇与挑战
 
Exploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design spaceExploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design space
 
Shantanu's Resume
Shantanu's ResumeShantanu's Resume
Shantanu's Resume
 
Resume_Aney N Khatavkar
Resume_Aney N KhatavkarResume_Aney N Khatavkar
Resume_Aney N Khatavkar
 
Resume_Aney N Khatavkar
Resume_Aney N KhatavkarResume_Aney N Khatavkar
Resume_Aney N Khatavkar
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
 
Performance analysis and implementation of modified sdm based noc for mpsoc o...
Performance analysis and implementation of modified sdm based noc for mpsoc o...Performance analysis and implementation of modified sdm based noc for mpsoc o...
Performance analysis and implementation of modified sdm based noc for mpsoc o...
 

More from NECST Lab @ Politecnico di Milano

Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposing
NECST Lab @ Politecnico di Milano
 

More from NECST Lab @ Politecnico di Milano (20)

Mesticheria Team - WiiReflex
Mesticheria Team - WiiReflexMesticheria Team - WiiReflex
Mesticheria Team - WiiReflex
 
Punto e virgola Team - Stressometro
Punto e virgola Team - StressometroPunto e virgola Team - Stressometro
Punto e virgola Team - Stressometro
 
BitIt Team - Stay.straight
BitIt Team - Stay.straight BitIt Team - Stay.straight
BitIt Team - Stay.straight
 
BabYodini Team - Talking Gloves
BabYodini Team - Talking GlovesBabYodini Team - Talking Gloves
BabYodini Team - Talking Gloves
 
printf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTonprintf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTon
 
BlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking PlatformBlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking Platform
 
#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome
 
Flipflops Team - Wave U
Flipflops Team - Wave UFlipflops Team - Wave U
Flipflops Team - Wave U
 
Bug(atta) Team - Little Brother
Bug(atta) Team - Little BrotherBug(atta) Team - Little Brother
Bug(atta) Team - Little Brother
 
#NECSTCamp: come partecipare
#NECSTCamp: come partecipare#NECSTCamp: come partecipare
#NECSTCamp: come partecipare
 
NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1
 
NECSTLab101 2020.2021
NECSTLab101 2020.2021NECSTLab101 2020.2021
NECSTLab101 2020.2021
 
TreeHouse, nourish your community
TreeHouse, nourish your communityTreeHouse, nourish your community
TreeHouse, nourish your community
 
TiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architectureTiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architecture
 
Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposing
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
 EMPhASIS - An EMbedded Public Attention Stress Identification System EMPhASIS - An EMbedded Public Attention Stress Identification System
EMPhASIS - An EMbedded Public Attention Stress Identification System
 
Luns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural networkLuns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural network
 
BlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAsBlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAs
 
Maeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingMaeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matching
 

Recently uploaded

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 

Recently uploaded (20)

UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 

CNN Dataflow Implementation on FPGAs

  • 1. Politecnico di Milano Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) marco.bacis@mail.polimi.it Marco Bacis, Giuseppe Natale, Emanuele Del Sozzo, Marco Domenico Santambrogio CNN Dataflow implementation on FPGAs Samsung Research Friday, 9th June 2017
  • 3. Issues 3 Challenges Huge set of weights and data Memory bounded computation Need to have a scalable design in terms of memory and resources without losing in performance + =
  • 4. ● Exploitation of the dataflow pattern of CNN operations ● Independent modules with parametric level of parallelism ● Streaming + Dataflow computational paradigm with efficient memory access Our Solution 4 Methodology for CNN acceleration on FPGA with
  • 5. 5 Iterative Stencil Loops Spatial dependencies Memory bound Enable efficient solutions in term of performance and power
  • 6. 6 ● Independent modules communicating over FIFOs ● Concurrent memory access and optimal full buffering ● Scalable without increasing external memory use Streaming StencilTimestep
  • 8. Implementation 8 1. Convolution Module Structure 2. Fully Connected Module Structure 3. Network Design
  • 10. Convolution Module - Parameters 10 • Input/Output Height • Input/OutputWidth • Number of Input Feature Maps • Number of Output Feature Maps
  • 11. Convolution Module - Parameters 11 • Kernel Height • KernelWidth • Number of Input Ports • Number of Output Ports # Input FMs received per cycle # Output FMs sent per cycle
  • 12. Implementation 12 1. Convolution Module Structure 2. Fully Connected Module Structure 3. Network Design
  • 13. Fully Connected Module Structure 13 ● Treated as a 1x1 convolution ● “Compressed” streaming approach ● 1 input port, 1 output port ● Low latency Floating point accumulation ● Issue for pipelining ● Multiple accumulators + Loop Unrolling
  • 14. Implementation 14 1. Convolution Module Structure 2. Fully Connected Module Structure 3. Network Design
  • 15. Network Design 15 ● Convolutional Module ● Memory structure based on I/O ports ● Single vs Multi channel memory cores ● Pooling Module ● Independent from channel ● One module for each previous output port ● Fully-Connected Module -> single pipelined core
  • 16. Experimental Evaluation 16 ● Two evaluation designs ● CIFAR-10 network Conv -> Pool -> Conv -> Pool -> Lin ->Lin ● USPS network Conv -> Pool -> Conv -> Lin ● Different design choices as a proof-of-concept of the methodology ● Tested on a XilinxVC707 board
  • 17. CIFAR-10 Network 17 5 x 5 3 in FMs 12 out FMs 32 x 32 Conv 1 2 x 2 12 in FMs 12 out FMs 28 x 28 Pool 1 5 x 5 12 in FMs 36 out FMs 14 x 14 Conv 2 2 x 2 36 in FMs 36 out FMs 10 x 10 Pool 2 900 in 36 out Lin 1 36 in 10 out Lin 2
  • 18. USPS Network 18 5 x 5 1 in FMs 6 out FMs 16 x 16 Conv 1 2 x 2 6 in FMs 6 out FMs 12 x 12 Pool 1 5 x 5 6 in FMs 16 out FMs 6 x 6 Conv 2 64 in 10 out Lin 1
  • 19. Experimental Results 19 Performance improvements with increased batch size
  • 20. Experimental Results 20 Dataset GFLOPS GFLOPS/W Images/s Test Case 1 USPS 5.2 0.25 172414 Test Case 2 CIFAR-10 28.4 1.19 7809 MSR Work [1] CIFAR-10 - - 2318 Flips Flops LUTs BRAM DSP Slices Test Case 1 41.10% 50.86% 3.50% 55.04% Test Case 2 61.77% 71.24% 22.82% 74.32% Performances and Power Efficiency Results FPGA Resources Usage [1] K. Ovtcharov et al., “Accelerating deep convolutional neural network using specialized hardware”, Microsoft Research Whitepaper, 2015
  • 21. Conclusions 21 ● Modular and scalar methodology to accelerate CNNs on FPGAs using a dataflow approach ● Performance improvement over large batches ● High level pipeline between layers ● Improved memory bandwidth utilization ● High scalability given limited resources
  • 22. FutureWorks 22 Multi-FPGA / Split layers approach Automatic DSE / CADTool Different precision / data type
  • 23. 23 Questions? Marco Bacis M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio “A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA” IPDPS Workshops (RAW), May 2017 M. Bacis, G. Natale, and M. D. Santambrogio “On how to design dataflow FPGA-based accelerators for Convolutional Neural Networks” ISVLSI Conference, July 2017 – To Appear References marco.bacis@mail.polimi.it
  • 24. RelatedWorks● Previously based on ○ Matrix of Processing Elements [1] ○ LoopTiling + Roofline Model [1,2] ○ Parallel fixed convolvers [3] [1] C.Zhang et al., “Optimizing fpga-based accelerator design for deep convolutional neural networks”, ISFPGA 2015 [2] M.Peemen et al., “Memory-centric accelerator design for convolutional neural networks”, ICCD 2013 [3] M.Sankaradas et al., “A massively parallel coprocessor for convolutional neural networks”, ASAP 2009 24 RelatedWorks ● Issues ○ Communication overhead ○ Suboptimal exploitation of on-chip memory ○ Execution in time (control flow) vs space (dataflow)
  • 25. Convolutional Neural Networks 25 ● State-of-art image recognition/classification component ● Chain of multiple layers to extract, transform and merge features from the data ● Unfortunately, they are compute intensive and require specific optimizations