CNN Dataflow implementation on FPGAs

NECST Lab @ Politecnico di Milano

NGC17 Talk @ Xilinx - June 8, 2017

CNN Dataflow implementation on FPGAs
Marco Bacis, Giuseppe Natale, Emanuele Del Sozzo, Marco Domenico Santambrogio
Politecnico di Milano
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
marco.bacis@mail.polimi.it
Xilinx HQ @ San Jose
Thursday, 8th June 2017

Issues
Challenges
• Huge set of weights and data
• Memory-bound computation
• Need for a design that scales in memory and resources without losing performance

Our Approach
Iterative Stencil Loops
Streaming Stencil Timestep (SST)
Spatial dependencies
Memory bound
CNN Dataflow Acceleration
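The link between stencil loops and convolution can be illustrated with a small software model. The sketch below is not the SST architecture from the slides. It only shows, under stated assumptions, how a streaming design can consume a feature map one pixel at a time in raster order while buffering just the last (K - 1) rows plus K pixels. The names `stream_windows` and `conv2d_stream` are hypothetical, and stride 1 with no padding is assumed.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence, Tuple


def stream_windows(
    pixels: Iterable[float], width: int, k: int
) -> Iterator[Tuple[int, int, List[List[float]]]]:
    """Yield (row, col, window) for each valid k x k window of a raster-order stream."""
    # Keep only the most recent (k - 1) full rows plus k pixels.
    buf = deque(maxlen=(k - 1) * width + k)
    for idx, pixel in enumerate(pixels):
        buf.append(pixel)
        row, col = divmod(idx, width)
        if row >= k - 1 and col >= k - 1:
            # buf[0] is now the top-left pixel of the window ending at (row, col).
            window = [[buf[i * width + j] for j in range(k)] for i in range(k)]
            yield row - k + 1, col - k + 1, window


def conv2d_stream(
    image: Sequence[Sequence[float]], kernel: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Valid, stride-1 2D convolution (no kernel flip, as is usual in CNNs)."""
    height, width = len(image), len(image[0])
    k = len(kernel)  # square kernel assumed
    out = [[0.0] * (width - k + 1) for _ in range(height - k + 1)]
    flat = (p for image_row in image for p in image_row)
    for r, c, win in stream_windows(flat, width, k):
        out[r][c] = sum(win[i][j] * kernel[i][j] for i in range(k) for j in range(k))
    return out
```

The point of the sketch is the buffer size: it depends on the kernel size and the row width, not on the image height, which is what makes a streaming implementation attractive for memory-bound kernels.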

Convolution Module - Parameters
• Kernel Height
• Kernel Width
• Number of Input Ports (# input FMs received per cycle)
• Number of Output Ports (# output FMs sent per cycle)
Input Feature Maps
Output Feature Maps
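The four module parameters can be captured in a small configuration record. This is a minimal sketch, assuming that a layer with more feature maps (FMs) than ports is processed in port-sized groups; the slides do not state the scheduling, so `port_groups` is a rough illustrative model, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from math import ceil
from typing import Tuple


@dataclass(frozen=True)
class ConvModuleParams:
    """Design-time parameters of one convolution module (names are illustrative)."""
    kernel_height: int
    kernel_width: int
    input_ports: int   # input FMs received per cycle
    output_ports: int  # output FMs sent per cycle

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value}")

    def port_groups(self, in_fms: int, out_fms: int) -> Tuple[int, int]:
        """Number of port-sized groups the input and output FMs split into.

        Assumption: FMs that do not fit the ports in one go are handled
        group by group; the real module may schedule them differently.
        """
        return ceil(in_fms / self.input_ports), ceil(out_fms / self.output_ports)
```

For example, a 5 x 5 module with a hypothetical 4 input ports and 6 output ports would split a layer with 12 input FMs and 36 output FMs into 3 input groups and 6 output groups. Fewer ports save resources at the cost of more groups, which is the memory/resource versus performance trade-off the parameters expose.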

Network Design & Choices
● Convolutional Module
  ● Memory structure based on I/O ports
  ● Single vs Multi channel memory cores
● Pooling Module
  ● Independent from channel
  ● One module for each previous output port
● Fully-Connected Module
  ● Treated as 1x1 convolution
  ● Fast and pipelined set of accumulators
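One common way to read "treated as 1x1 convolution" is that each flattened input value becomes a 1 x 1 feature map, so a 1x1 convolution over those maps computes the same weighted sums as a fully-connected layer. The slides do not detail the exact mapping used in the hardware, so the sketch below is only a functional check of that equivalence; the function names are hypothetical.

```python
from typing import List, Sequence


def fully_connected(x: Sequence[float], weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """y[o] = sum_i weights[o][i] * x[i] + bias[o]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def conv1x1(fms: Sequence[Sequence[Sequence[float]]],
            weights: Sequence[Sequence[float]],
            bias: Sequence[float]) -> List[List[List[float]]]:
    """1x1 convolution: every output pixel is a weighted sum across input FMs."""
    height, width = len(fms[0]), len(fms[0][0])
    return [
        [
            [sum(w * fm[r][c] for w, fm in zip(row, fms)) + b for c in range(width)]
            for r in range(height)
        ]
        for row, b in zip(weights, bias)
    ]


def fc_as_conv1x1(x: Sequence[float], weights: Sequence[Sequence[float]],
                  bias: Sequence[float]) -> List[float]:
    """Run a fully-connected layer through conv1x1 by using 1 x 1 feature maps."""
    fms = [[[v]] for v in x]           # one 1 x 1 feature map per input value
    out = conv1x1(fms, weights, bias)  # one 1 x 1 feature map per output value
    return [fm[0][0] for fm in out]
```

Under this reading, the accumulation across input feature maps is the only arithmetic left, which is consistent with the slide's mention of a pipelined set of accumulators.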

Experimental Evaluation
Network with 32 x 32 input and 3 input FMs (the shape of CIFAR-10 images)
Layer    Kernel   In        Out       Input size
Conv 1   5 x 5    3 FMs     12 FMs    32 x 32
Pool 1   2 x 2    12 FMs    12 FMs    28 x 28
Conv 2   5 x 5    12 FMs    36 FMs    14 x 14
Pool 2   2 x 2    36 FMs    36 FMs    10 x 10
Lin 1    -        900       36        -
Lin 2    -        36        10        -
Network with 16 x 16 input and 1 input FM (the shape of USPS images)
Layer    Kernel   In        Out       Input size
Conv 1   5 x 5    1 FM      6 FMs     16 x 16
Pool 1   2 x 2    6 FMs     6 FMs     12 x 12
Conv 2   5 x 5    6 FMs     16 FMs    6 x 6
Lin 1    -        64        10        -
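The layer sizes on this slide can be cross-checked with simple shape arithmetic. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling, which is what the listed sizes imply (32 to 28 after a 5 x 5 convolution, 28 to 14 after 2 x 2 pooling). The helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def conv_out(size: int, kernel: int) -> int:
    """Output side length of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_out(size: int, kernel: int) -> int:
    """Output side length of non-overlapping pooling."""
    return size // kernel


def trace_sizes(input_size: int, layers: Sequence[Tuple[str, int]]) -> List[int]:
    """Return the feature-map side length before each layer and after the last one."""
    sizes = [input_size]
    for kind, kernel in layers:
        step = conv_out if kind == "conv" else pool_out
        sizes.append(step(sizes[-1], kernel))
    return sizes


# 32 x 32 network: Conv 1, Pool 1, Conv 2, Pool 2
big = trace_sizes(32, [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 2)])
# 16 x 16 network: Conv 1, Pool 1, Conv 2
small = trace_sizes(16, [("conv", 5), ("pool", 2), ("conv", 5)])

# Inputs to the first linear layer: output FMs x final height x final width
big_lin_inputs = 36 * big[-1] ** 2
small_lin_inputs = 16 * small[-1] ** 2
```

With these assumptions the larger network ends at 5 x 5 maps, giving 36 x 5 x 5 = 900 inputs to Lin 1, and the smaller one ends at 2 x 2 maps, giving 16 x 2 x 2 = 64 inputs, matching the slide.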

Experimental Results
Performance and Power Efficiency Results
               Dataset    GFLOPS   GFLOPS/W   Images/s
Test Case 1    USPS       5.2      0.25       172414
Test Case 2    CIFAR-10   28.4     1.19       7809
MSR Work [1]   CIFAR-10   -        -          2318
FPGA Resources Usage
               Flip-Flops   LUTs     BRAM     DSP Slices
Test Case 1    41.10%       50.86%   3.50%    55.04%
Test Case 2    61.77%       71.24%   22.82%   74.32%
[1] K. Ovtcharov et al., “Accelerating deep convolutional neural networks using specialized hardware”, Microsoft Research Whitepaper, 2015
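A few derived figures help when reading the results table. The sketch below uses only the numbers reported on the slide. It assumes that GFLOPS/W was obtained by dividing the listed GFLOPS by the measured power, so the implied power is approximate and limited by the rounding in the table. The throughput ratio against [1] compares reported numbers only, since the two works run on different hardware.

```python
def implied_power_w(gflops: float, gflops_per_watt: float) -> float:
    """Power implied by the table, assuming GFLOPS/W = GFLOPS / power."""
    return gflops / gflops_per_watt


def flops_per_image(gflops: float, images_per_s: float) -> float:
    """Average floating-point operations spent per image."""
    return gflops * 1e9 / images_per_s


# Values copied from the results table
usps = {"gflops": 5.2, "gflops_per_watt": 0.25, "images_per_s": 172414}
cifar = {"gflops": 28.4, "gflops_per_watt": 1.19, "images_per_s": 7809}
msr_images_per_s = 2318

usps_power = implied_power_w(usps["gflops"], usps["gflops_per_watt"])
cifar_power = implied_power_w(cifar["gflops"], cifar["gflops_per_watt"])
throughput_ratio = cifar["images_per_s"] / msr_images_per_s

print(f"Implied power: USPS {usps_power:.1f} W, CIFAR-10 {cifar_power:.1f} W")
print(f"CIFAR-10 throughput vs [1]: {throughput_ratio:.2f}x")
print(f"FLOPs per CIFAR-10 image: {flops_per_image(cifar['gflops'], cifar['images_per_s']):.3g}")
```

Under these assumptions both test cases draw a little over 20 W, and the CIFAR-10 design processes roughly 3.4 times as many images per second as the figure reported in [1].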

Experimental Results
[Figure: performance improvement over large batches]

Conclusions
● Modular and scalable methodology to accelerate CNNs on FPGAs using a dataflow approach
● Performance improvement over large batches
● High-level pipeline between layers
● Improved memory bandwidth utilization
● High scalability given limited resources

Future Work
● Multi-FPGA / Split layers approach
● Automatic DSE / CAD Tool
● Different precision / data type

Questions?
Marco Bacis
marco.bacis@mail.polimi.it
References
M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio, “A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA”, IPDPS Workshops (RAW), May 2017
M. Bacis, G. Natale, and M. D. Santambrogio, “On how to design dataflow FPGA-based accelerators for Convolutional Neural Networks”, ISVLSI Conference, July 2017 – To Appear


