Fast Algorithms for Quantized Convolutional Neural Networks

•Download as PPTX, PDF•

0 likes•48 views

This document proposes using Number Theoretic Transforms (NTTs) to accelerate quantized convolutional neural networks. NTTs, which are discrete Fourier transforms over finite fields, allow convolutional operations to be computed using fast Fourier transform-like algorithms by exploiting the circular convolution property. The approach was tested on a Raspberry Pi using Fermat number transforms to perform quantized convolutions more efficiently than a naive implementation. Future work includes optimizing for smaller kernel sizes and implementing on FPGAs.

Engineering

1
Fast Algorithms for Quantized
Convolutional Neural Networks
Alessandro Pappalardo alessandro1.pappalardo@mail.polimi.it
NECSTLab, Politecnico di Milano
Oracle
06/07/2017

2
Introduction
Convolutional neural
networks are at the
forefront of big data
processing.
Embedded devices and
smartphones are at the
heart of big data.
Image: Dumoulin, Vincent, and Francesco Visin. "A guide to convolution
arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).

3
Problem
Convolutional neural networks are both
memory 💽 and computational 🖥🖥 intensive.
❤️

4
💽 ➡️ Quantized Convnets
• Precision of convolution reduced to b bits.
• Arrays of unsigned integers in [0, 2b - 1].
• Quantization scheme: map minimum to 0, maximum to 2b –
1.
• Memory savings.

5
🖥🖥 ➡️ Question
Can we can take advantage of the pre-
determined finiteness of the possible
values assumed by the convolution
operands to gain computational savings
at an algorithmic level

6
Solution
• Number Theoretic Transforms (NTTs) are DFTs defined
on a finite field.
• NTTs hold the Circular Convolution Property (CCP) and
can be computed through FFT-like fast algorithms.
Reason in terms of finite fields.

7
Supported parameters
FNT Kernel Min Block Max Block binput,max bkernel,max
F3 3x3 8x8 16x16 3 2
F3 5x5 8x8 16x16 2 2
F3 7x7,9x9 16x16 16x16 2 1
F4 3x3 8x8 32x32 6 6
F4 5x5 8x8 32x32 6 5
F4 7x7 16x16 32x32 5 5
F4 9x9 16x16 32x32 5 4
F4 11x11 32x32 32x32 5 4
RNS(F3,F4) 3x3,5x5 8x8 16x16 8 8
RNS(F3,F4) 7x7,9x9 16x16 16x16 8 8

8
Theoretical costs comparison
Kernel
Naïve Int
Ops/Output Block
Valid
Output
NTT Modulo Ops/Output
Add Mul Add Mul Mul by const Shift
3x3 9 9
8x8 6x6 21.34 1.78 1 7.12
16x16 14x14 20.89 1.31 1 7.84
32x32 30x30 22.76 1.14 1 9.10
5x5 25 25
8x8 4x4 48 4 1 16
16x16 12x12 28.44 1.78 1 10.66
32x32 28x28 26.12 1.31 1 10.45
7x7 49 49
16x16 10x10 40.96 2.56 1 15.36
32x32 26x26 30.29 1.51 1 12.12
9x9 81 81
16x16 8x8 64 4 1 24
32x32 24x24 35.56 1.78 1 14.22
11x11 121 121 32x32 22x22 42.31 2.12 1 16.93

9
Theoretical costs comparison
Kernel
Naïve Int
Ops/Output Block
Valid
Output
NTT Modulo Ops/Output
Add Mul Add Mul Mul by const Shift
3x3 9 9
8x8 6x6 21.34 1.78 1 7.12
16x16 14x14 20.89 1.31 1 7.84
32x32 30x30 22.76 1.14 1 9.10
5x5 25 25
8x8 4x4 48 4 1 16
16x16 12x12 28.44 1.78 1 10.66
32x32 28x28 26.12 1.31 1 10.45
7x7 49 49
16x16 10x10 40.96 2.56 1 15.36
32x32 26x26 30.29 1.51 1 12.12
9x9 81 81
16x16 8x8 64 4 1 24
32x32 24x24 35.56 1.78 1 14.22
11x11 121 121 32x32 22x22 42.31 2.12 1 16.93

10
Benchmarks setting
• Qconv: a C99 compliant and portable NTT
implementation against a naïve convolution
implementation.
• Platform of choice: Raspberry Pi Zero.

12
Future works
• Implement SIMD subroutines for forward and inverse
transforms.
• Map element-wise products to finite field matrix
multiplication in FFLAS.
• FPGA hardware implementation.
• Optimize for 3x3 filters.

14
Fermat Number Transforms
FNTs are NTTs mod 𝒑 = 𝟐 𝟐 𝒕
+ 𝟏.
1. Support FFT algorithms.
2. For a length N up to 2 𝑡+1, the forward and inverse
transforms requires only modular adds and shifts.
3. Reduction mod a Fermat prime p:
logical AND, right unsigned shift, modular subtraction

15
Methodology
FNT m 𝐹4 = 65537 with
blocks:
• 8x8
• 16x16
• 32x32
RNS of 𝐹3 and 𝐹4 to increase the maximum output bitwidth to:
𝑙𝑜𝑔2 𝐹3 ∙ 𝐹4 ≅ 24 bits
Overlap-and-save algorithm
FNT mod 𝐹3 = 257 with blocks:
• 8x8
• 16x16

16
Optimizations
• Forward DIF FFT, inverse DIT transforms FFT.
• Precomputed power-of-two twiddle factors .
• Switch the order of the two inner-most loop of the FFT.
• Avoid useless transpositions.
• Normalize the non discarded output only.

Similar to Fast Algorithms for Quantized Convolutional Neural Networks

Ann model and its applicationmilan107

Deep Neural Networks for Computer VisionAlex Conway

Minimum Complexity Decoupling Networks for Arbitrary Coupled LoadsDing Nie

AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE

Deep Learning Initiative @ NECSTLabNECST Lab @ Politecnico di Milano

[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...KAIST

Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...Gurbinder Gill

Enabling Power-Efficient AI Through QuantizationQualcomm Research

Protein Secondary Structure Prediction using Deep Learning methodsChrysoula Kosma

Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya

10 Abundant-Data ComputingRCCSRENKEI

Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya

Hand Written Digit Classificationijtsrd

Novi sad ai event 3-2018Jovan Stojanovic

1.pptxSwatiMahale4

Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya

Neural Networks and Deep Learning for PhysicistsHéloïse Nonne

Improving Hardware Efficiency for DNN ApplicationsChester Chen

Pres Tesi LM-2016+transcript_engDaniele Ciriello

Similar to Fast Algorithms for Quantized Convolutional Neural Networks (20)

Ann model and its application

Deep Neural Networks for Computer Vision

Minimum Complexity Decoupling Networks for Arbitrary Coupled Loads

AI optimizing HPC simulations (presentation from 6th EULAG Workshop)

Deep Learning Initiative @ NECSTLab

[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...

Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...

Enabling Power-Efficient AI Through Quantization

Protein Secondary Structure Prediction using Deep Learning methods

Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020

10 Abundant-Data Computing

Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020

Hand Written Digit Classification

Novi sad ai event 3-2018

1.pptx

Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)

Neural Networks and Deep Learning for Physicists

Improving Hardware Efficiency for DNN Applications

Pres Tesi LM-2016+transcript_eng

Recently uploaded

Courier management system project report.pdfKamal Acharya

RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdfKamal Acharya

Explosives Industry manufacturing process.pdf884710SadaqatAli

KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationDr. Radhey Shyam

Toll tax management system project report..pdfKamal Acharya

fundamentals of drawing and isometric and orthographic projectionjeevanprasad8

Fruit shop management system project report.pdfKamal Acharya

fluid mechanics gate notes . gate all pyqs answerapareshmondalnita

Architectural Portfolio Sean Lockwoodseandesed

The Benefits and Techniques of Trenchless Pipe Repair.pdfPipe Restoration Solutions

Democratizing Fuzzing at Scale by Abhishek Aryaabh.arya

RS Khurmi Machine Design Clutch and Brake Exercise Numerical SolutionsAtif Razi

Scaling in conventional MOSFET for constant electric field and constant voltageRCC Institute of Information Technology

KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and ClusteringDr. Radhey Shyam

Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton

Peek implant persentation - Copy (1).pdfAyahmorsy

Introduction to Machine Learning Unit-5 Notes for II-II Mechanical EngineeringC Sai Kiran

ASME IX(9) 2007 Full Version .pdfAhmedHussein950959

Natalia Rutkowska - BIM School Course in Krakówbim.edu.pl

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult

Recently uploaded (20)

Courier management system project report.pdf

RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf

Explosives Industry manufacturing process.pdf

KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization

Toll tax management system project report..pdf

fundamentals of drawing and isometric and orthographic projection

Fruit shop management system project report.pdf

fluid mechanics gate notes . gate all pyqs answer

Architectural Portfolio Sean Lockwood

The Benefits and Techniques of Trenchless Pipe Repair.pdf

Democratizing Fuzzing at Scale by Abhishek Arya

RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions

Scaling in conventional MOSFET for constant electric field and constant voltage

KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering

Immunizing Image Classifiers Against Localized Adversary Attacks

Peek implant persentation - Copy (1).pdf

Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering

ASME IX(9) 2007 Full Version .pdf

Natalia Rutkowska - BIM School Course in Kraków

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx

Fast Algorithms for Quantized Convolutional Neural Networks

1. 1 Fast Algorithms for Quantized Convolutional Neural Networks Alessandro Pappalardo alessandro1.pappalardo@mail.polimi.it NECSTLab, Politecnico di Milano Oracle 06/07/2017

2. 2 Introduction Convolutional neural networks are at the forefront of big data processing. Embedded devices and smartphones are at the heart of big data. Image: Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).

3. 3 Problem Convolutional neural networks are both memory 💽 and computational 🖥🖥 intensive. ❤️

4. 4 💽 ➡️ Quantized Convnets • Precision of convolution reduced to b bits. • Arrays of unsigned integers in [0, 2b - 1]. • Quantization scheme: map minimum to 0, maximum to 2b – 1. • Memory savings.

5. 5 🖥🖥 ➡️ Question Can we can take advantage of the pre- determined finiteness of the possible values assumed by the convolution operands to gain computational savings at an algorithmic level

6. 6 Solution • Number Theoretic Transforms (NTTs) are DFTs defined on a finite field. • NTTs hold the Circular Convolution Property (CCP) and can be computed through FFT-like fast algorithms. Reason in terms of finite fields.

7. 7 Supported parameters FNT Kernel Min Block Max Block binput,max bkernel,max F3 3x3 8x8 16x16 3 2 F3 5x5 8x8 16x16 2 2 F3 7x7,9x9 16x16 16x16 2 1 F4 3x3 8x8 32x32 6 6 F4 5x5 8x8 32x32 6 5 F4 7x7 16x16 32x32 5 5 F4 9x9 16x16 32x32 5 4 F4 11x11 32x32 32x32 5 4 RNS(F3,F4) 3x3,5x5 8x8 16x16 8 8 RNS(F3,F4) 7x7,9x9 16x16 16x16 8 8

8. 8 Theoretical costs comparison Kernel Naïve Int Ops/Output Block Valid Output NTT Modulo Ops/Output Add Mul Add Mul Mul by const Shift 3x3 9 9 8x8 6x6 21.34 1.78 1 7.12 16x16 14x14 20.89 1.31 1 7.84 32x32 30x30 22.76 1.14 1 9.10 5x5 25 25 8x8 4x4 48 4 1 16 16x16 12x12 28.44 1.78 1 10.66 32x32 28x28 26.12 1.31 1 10.45 7x7 49 49 16x16 10x10 40.96 2.56 1 15.36 32x32 26x26 30.29 1.51 1 12.12 9x9 81 81 16x16 8x8 64 4 1 24 32x32 24x24 35.56 1.78 1 14.22 11x11 121 121 32x32 22x22 42.31 2.12 1 16.93

9. 9 Theoretical costs comparison Kernel Naïve Int Ops/Output Block Valid Output NTT Modulo Ops/Output Add Mul Add Mul Mul by const Shift 3x3 9 9 8x8 6x6 21.34 1.78 1 7.12 16x16 14x14 20.89 1.31 1 7.84 32x32 30x30 22.76 1.14 1 9.10 5x5 25 25 8x8 4x4 48 4 1 16 16x16 12x12 28.44 1.78 1 10.66 32x32 28x28 26.12 1.31 1 10.45 7x7 49 49 16x16 10x10 40.96 2.56 1 15.36 32x32 26x26 30.29 1.51 1 12.12 9x9 81 81 16x16 8x8 64 4 1 24 32x32 24x24 35.56 1.78 1 14.22 11x11 121 121 32x32 22x22 42.31 2.12 1 16.93

10. 10 Benchmarks setting • Qconv: a C99 compliant and portable NTT implementation against a naïve convolution implementation. • Platform of choice: Raspberry Pi Zero.

11. 11 ResultsInput size 224x224x1

12. 12 Future works • Implement SIMD subroutines for forward and inverse transforms. • Map element-wise products to finite field matrix multiplication in FFLAS. • FPGA hardware implementation. • Optimize for 3x3 filters.

13. Question Time

14. 14 Fermat Number Transforms FNTs are NTTs mod 𝒑 = 𝟐 𝟐 𝒕 + 𝟏. 1. Support FFT algorithms. 2. For a length N up to 2 𝑡+1, the forward and inverse transforms requires only modular adds and shifts. 3. Reduction mod a Fermat prime p: logical AND, right unsigned shift, modular subtraction

15. 15 Methodology FNT m 𝐹4 = 65537 with blocks: • 8x8 • 16x16 • 32x32 RNS of 𝐹3 and 𝐹4 to increase the maximum output bitwidth to: 𝑙𝑜𝑔2 𝐹3 ∙ 𝐹4 ≅ 24 bits Overlap-and-save algorithm FNT mod 𝐹3 = 257 with blocks: • 8x8 • 16x16

16. 16 Optimizations • Forward DIF FFT, inverse DIT transforms FFT. • Precomputed power-of-two twiddle factors . • Switch the order of the two inner-most loop of the FFT. • Avoid useless transpositions. • Normalize the non discarded output only.

Fast Algorithms for Quantized Convolutional Neural Networks

Recommended

Recommended

More Related Content

Similar to Fast Algorithms for Quantized Convolutional Neural Networks

Similar to Fast Algorithms for Quantized Convolutional Neural Networks (20)

More from NECST Lab @ Politecnico di Milano

More from NECST Lab @ Politecnico di Milano (20)

Recently uploaded

Recently uploaded (20)

Fast Algorithms for Quantized Convolutional Neural Networks