Fast Algorithms for Quantized Convolutional Neural Networks

NECST Lab @ Politecnico di Milano

NGC17 Talk @ Berkeley Lab - June 5, 2017

1
Fast Algorithms for Quantized
Convolutional Neural Networks
Alessandro Pappalardo alessandro1.pappalardo@mail.polimi.it
NECSTLab, Politecnico di Milano
Lawrence Berkeley National Laboratory
06/05/2017

2
Introduction
Convolutional neural
networks are at the forefront
of big data processing.
Embedded devices and
smartphones are at the
heart of big data.
Image: Dumoulin, Vincent, and Francesco Visin. "A guide to convolution
arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).

3
Problem
Convolutional neural networks are both
memory 💽 and computational 🖥️ intensive.
❤️

4
💽 ➡️ Quantized Convnets
• Precision of convolution reduced to b bits.
• Arrays of unsigned integers in $[0, 2^b - 1]$.
• Quantization scheme: map minimum to 0, maximum to $2^b - 1$.
• Memory savings.
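
The min-max scheme above can be sketched as follows. This is a minimal illustration, assuming a linear mapping with round-to-nearest; the slides do not specify the rounding mode or how the offset and scale are stored, and the name `quantize_minmax` is hypothetical.

```python
def quantize_minmax(values, b):
    """Map real values linearly onto unsigned integer codes in [0, 2**b - 1].

    Assumption: linear mapping with round-to-nearest (not stated in the slides).
    Returns the codes plus the offset and scale needed to dequantize.
    """
    lo, hi = min(values), max(values)
    levels = (1 << b) - 1              # largest code, 2^b - 1
    if hi == lo:                       # constant input: every value maps to code 0
        return [0] * len(values), lo, 1.0
    codes = [round((v - lo) * levels / (hi - lo)) for v in values]
    scale = (hi - lo) / levels         # real-valued distance between adjacent codes
    return codes, lo, scale


codes, offset, scale = quantize_minmax([0.0, 0.2, 0.7, 1.0], b=2)
print(codes)                           # minimum maps to 0, maximum to 2^b - 1 = 3
```

An approximate value is recovered as `offset + code * scale`; storing `b`-bit codes instead of 32-bit floats is where the memory saving comes from.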

5
🖥️ ➡️ Question
Can we take advantage of the predetermined,
finite set of values that the convolution operands
can assume to gain computational savings at an
algorithmic level?

6
Solution
• Number Theoretic Transforms (NTTs) are DFTs defined on a finite field.
• NTTs satisfy the Circular Convolution Property (CCP) and can be computed through FFT-like fast algorithms.
Reason in terms of finite fields.
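
To make the Circular Convolution Property concrete, here is a small sketch over the field of integers mod 257. It deliberately uses a naive O(N^2) transform rather than a fast algorithm; the length 8, the root of unity 4, and the function names are choices made for this illustration, not taken from the slides.

```python
P = 257   # Fermat prime 2^8 + 1
N = 8     # transform length
W = 4     # primitive 8th root of unity mod 257, since 4^4 = 256 = -1 (mod 257)


def ntt(x, root):
    """Naive O(N^2) number theoretic transform: X[k] = sum_n x[n] * root^(n*k) mod P."""
    return [sum(x[n] * pow(root, n * k, P) for n in range(N)) % P for k in range(N)]


def intt(X):
    """Inverse transform: same sum with the inverse root, scaled by N^-1 mod P."""
    inv_n = pow(N, P - 2, P)      # modular inverse via Fermat's little theorem
    inv_w = pow(W, P - 2, P)
    return [(inv_n * v) % P for v in ntt(X, inv_w)]


def circular_convolution(a, b):
    """Direct circular convolution mod P, used only as a reference."""
    return [sum(a[j] * b[(i - j) % N] for j in range(N)) % P for i in range(N)]


a = [1, 2, 3, 0, 0, 0, 0, 0]
b = [3, 1, 0, 2, 1, 0, 0, 0]

# CCP: an element-wise product in the transform domain is a circular convolution.
via_ntt = intt([(x * y) % P for x, y in zip(ntt(a, W), ntt(b, W))])
assert via_ntt == circular_convolution(a, b)
```

The result is exact as long as every true output value is smaller than the modulus, which is why the operand bitwidths have to be bounded.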

7
Supported parameters
FNT          Kernel     Min Block   Max Block   b_input,max   b_kernel,max
F3           3x3        8x8         16x16       3             2
F3           5x5        8x8         16x16       2             2
F3           7x7, 9x9   16x16       16x16       2             1
F4           3x3        8x8         32x32       6             6
F4           5x5        8x8         32x32       6             5
F4           7x7        16x16       32x32       5             5
F4           9x9        16x16       32x32       5             4
F4           11x11      32x32       32x32       5             4
RNS(F3,F4)   3x3, 5x5   8x8         16x16       8             8
RNS(F3,F4)   7x7, 9x9   16x16       16x16       8             8
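
The slides do not state how the bitwidth limits were derived. One reading that is consistent with every row is a no-overflow bound: the largest possible output of a K x K window, K^2 (2^b_input - 1)(2^b_kernel - 1), must stay below the modulus. The sketch below checks that assumed criterion against the table; the function name `fits` is hypothetical.

```python
F3, F4 = 257, 65537


def fits(kernel, b_input, b_kernel, modulus):
    """Assumed criterion: the worst-case output of a kernel x kernel window
    must be smaller than the modulus, so that no wrap-around occurs."""
    worst = kernel * kernel * ((1 << b_input) - 1) * ((1 << b_kernel) - 1)
    return worst < modulus


# (modulus, largest kernel side in the row, b_input,max, b_kernel,max)
rows = [
    (F3, 3, 3, 2), (F3, 5, 2, 2), (F3, 9, 2, 1),
    (F4, 3, 6, 6), (F4, 5, 6, 5), (F4, 7, 5, 5), (F4, 9, 5, 4), (F4, 11, 5, 4),
    (F3 * F4, 5, 8, 8), (F3 * F4, 9, 8, 8),
]
assert all(fits(k, bi, bk, m) for m, k, bi, bk in rows)

# One more bit on the input side already overflows F4 for a 3x3 kernel.
assert not fits(3, 7, 6, F4)
```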

8
Theoretical costs comparison
Kernel   Naïve Int Ops/Output   Block   Valid Output   NTT Modulo Ops/Output
         Add    Mul                                    Add     Mul    Mul by const   Shift
3x3      9      9               8x8     6x6            21.34   1.78   1              7.12
3x3      9      9               16x16   14x14          20.89   1.31   1              7.84
3x3      9      9               32x32   30x30          22.76   1.14   1              9.10
5x5      25     25              8x8     4x4            48      4      1              16
5x5      25     25              16x16   12x12          28.44   1.78   1              10.66
5x5      25     25              32x32   28x28          26.12   1.31   1              10.45
7x7      49     49              16x16   10x10          40.96   2.56   1              15.36
7x7      49     49              32x32   26x26          30.29   1.51   1              12.12
9x9      81     81              16x16   8x8            64      4      1              24
9x9      81     81              32x32   24x24          35.56   1.78   1              14.22
11x11    121    121             32x32   22x22          42.31   2.12   1              16.93
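
The NTT columns can be reproduced, up to rounding in the last digit, by simple closed forms. These formulas are inferred from the numbers in the table and are not stated on the slides; a plausible reading is one forward and one inverse 2D transform per N x N block, with the kernel transform precomputed and only the (N - K + 1)^2 valid outputs kept. The function name is hypothetical.

```python
def ntt_ops_per_output(block, kernel):
    """Operation counts per valid output for an N x N block and a K x K kernel.

    Inferred formulas (an assumption, fitted to the table):
      adds   = 4 * N^2 * log2(N)       / valid
      muls   =     N^2                 / valid
      shifts = 2 * N^2 * (log2(N) - 1) / valid
    """
    valid = (block - kernel + 1) ** 2        # outputs that survive per block
    stages = block.bit_length() - 1          # log2(N) for a power-of-two N
    adds = 4 * block**2 * stages / valid
    muls = block**2 / valid
    shifts = 2 * block**2 * (stages - 1) / valid
    return adds, muls, shifts


for block, kernel in [(8, 5), (16, 9), (32, 11)]:
    adds, muls, shifts = ntt_ops_per_output(block, kernel)
    print(f"{kernel}x{kernel} on {block}x{block}: "
          f"add={adds:.2f} mul={muls:.2f} shift={shifts:.2f}")
```

Under these formulas the general multiplications per output drop from K^2 in the naïve case to N^2 / (N - K + 1)^2, which is where the table shows the largest gap.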

10
Benchmarks setting
• Qconv: a C99-compliant and portable NTT implementation, benchmarked against a naïve convolution implementation.
• Platform of choice: Raspberry Pi Zero.

12
Future work
• Implement SIMD subroutines for forward and inverse transforms.
• Map element-wise products to finite field matrix multiplication in FFLAS.
• FPGA hardware implementation.
• Binary segmentation for 3x3 filters.

14
Fermat Number Transforms
FNTs are NTTs mod $p = 2^{2^t} + 1$.
1. Support FFT algorithms.
2. For a length N up to $2^{t+1}$, the forward and inverse
transforms require only modular adds and shifts.
3. Reduction mod a Fermat prime p:
logical AND, unsigned right shift, modular subtraction.
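
A minimal sketch of points 2 and 3 for the Fermat prime 65537 (t = 4). It relies on the identity 2^16 = -1 (mod 65537), so a value split into high and low 16-bit halves reduces to `lo - hi`. The function names and the 32-bit input bound are assumptions chosen to mirror unsigned 32-bit arithmetic; Python integers themselves do not enforce that bound.

```python
F4 = (1 << 16) + 1           # 65537 = 2^(2^4) + 1


def reduce_f4(x):
    """Reduce 0 <= x < 2^32 modulo F4 without a division."""
    lo = x & 0xFFFF          # logical AND keeps the low 16 bits
    hi = x >> 16             # right shift extracts the high 16 bits
    r = lo - hi              # x = hi * 2^16 + lo, and 2^16 = -1 (mod F4)
    return r + F4 if r < 0 else r    # modular subtraction: wrap negatives


def mul_pow2(x, k):
    """Multiply a residue 0 <= x <= 2^16 by the twiddle factor 2^k, 0 <= k < 16.

    2 has multiplicative order 32 modulo F4, so for transform lengths up to 32
    every twiddle factor is a power of two and the product is a plain shift.
    """
    return reduce_f4(x << k)


assert reduce_f4(65537) == 0
assert reduce_f4(65536) == 65536          # 2^16 is the residue -1
assert mul_pow2(65536, 1) == 65535        # (-1) * 2 = -2 (mod F4)
```

Twiddle exponents from 16 to 31 are not handled here; since 2^16 = -1, they amount to a shift by `k - 16` followed by a negation.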

15
Methodology
FNT mod $F_3 = 257$ with blocks:
• 8x8
• 16x16
FNT mod $F_4 = 65537$ with blocks:
• 8x8
• 16x16
• 32x32
RNS of $F_3$ and $F_4$ to increase the maximum output bitwidth to:
$\log_2(F_3 \cdot F_4) \approx 24$ bits
Overlap-and-save algorithm
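
The residue number system (RNS) step can be illustrated with the Chinese Remainder Theorem: a convolution output computed independently mod 257 and mod 65537 determines a unique value below their product, 16843009, which is slightly more than 2^24. The slides do not say how the residues are recombined; the sketch below assumes the standard two-modulus CRT reconstruction, and `crt_combine` is a hypothetical name.

```python
F3, F4 = 257, 65537
M = F3 * F4                       # 16843009, just above 2^24 = 16777216


def crt_combine(r3, r4):
    """Recover the unique x in [0, F3 * F4) with x = r3 (mod F3) and x = r4 (mod F4)."""
    inv_f3 = pow(F3, F4 - 2, F4)          # inverse of F3 modulo the prime F4
    k = ((r4 - r3) * inv_f3) % F4         # solve r3 + F3 * k = r4 (mod F4)
    return r3 + F3 * k


# Worst-case output of a 9x9 kernel with 8-bit inputs and 8-bit weights.
x = 81 * 255 * 255
assert x < M
assert crt_combine(x % F3, x % F4) == x
```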

16
Optimizations
• Forward transform as a DIF FFT, inverse transform as a DIT FFT.
• Precomputed power-of-two twiddle factors.
• Switch the order of the two innermost loops of the FFT.
• Avoid useless transpositions.
• Normalize only the non-discarded output.

11
Results
Input size: 224x224x1

13
Question Time
