CNN Quantization
Performance evaluation
Emanuele Ghelfi, Emiliano Gagliardi
June 18, 2017
Politecnico di Milano
Contents
1 Purposes
2 Quantization
3 Our work
Purposes
Problem
Accuracy/inference speed trade-off
For real-world applications, a convolutional neural network (CNN) model can take more than 100 MB of space and can be too computationally expensive
How can embedded devices such as smartphones be given the power of neural networks?
Is floating-point precision really needed?
No: deep neural networks tend to cope well with noise in their inputs
Training, however, still needs floating-point precision, since it is an iteration of small incremental adjustments of the weights
Purposes
Solution
The solution is quantization.
Deep networks can be trained with floating-point precision; a quantization algorithm can then be applied to obtain smaller models and to speed up the inference phase by reducing memory requirements:
Fixed-point compute units are typically faster and consume far less
hardware resources and power than floating-point engines
Low-precision data representation reduces the memory footprint,
enabling larger models to fit within the given memory capacity
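As a rough back-of-the-envelope example, storing weights as 8-bit values instead of 32-bit floats cuts the weight storage by about 4x: a 100 MB float model shrinks to roughly 25 MB, plus a small per-layer overhead for the min/max values.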
Purposes
Project purpose
Using a machine learning framework with support for convolutional neural
networks:
Define different kinds of networks
Train
Quantize
Evaluate the original and the quantized models
Compare them in terms of model size, cache misses, and inference time
Quantization
What is quantization?
While developing this project we examined two different approaches:
TensorFlow
Caffe Ristretto
Quantization
TensorFlow quantization
Unsupervised approach
Get a trained network
Obtain, for each layer, the minimum and the maximum of the weight values
Represent the weights as values linearly distributed between the minimum and the maximum, with 8-bit precision
The operations have to be reimplemented for the 8-bit format
The resulting data structure is composed of an array containing the quantized values plus the two floats min and max (see the sketch below)
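A minimal NumPy sketch of this min/max linear quantization (illustrative only: the function names are ours, and this is not the actual TensorFlow implementation):

```python
import numpy as np

def quantize_8bit(weights):
    """Map float weights to 8-bit codes plus the two floats (min, max)."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    return codes, w_min, w_max

def dequantize_8bit(codes, w_min, w_max):
    """Recover approximate float weights from the 8-bit representation."""
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    return codes.astype(np.float32) * scale + w_min
```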
Quantization
Caffe Ristretto quantization
Supervised approach
Get a trained network
Three different methods:
Dynamic fixed point: a modified fixed-point format (sketched below)
Minifloat: bit-width reduced floating point numbers
Power of two: layers with power-of-two parameters don't need any multipliers when implemented in hardware
Evaluate the performance of the network during quantization in order
to keep the accuracy higher than a given threshold
Support for training of quantized networks (fine-tuning)
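As an illustration of the first method, a rough NumPy sketch of dynamic fixed-point quantization, where the fractional length is chosen per layer from the largest weight magnitude (our simplification, not Ristretto's actual range analysis):

```python
import numpy as np

def dynamic_fixed_point(weights, bit_width=8):
    """Round weights to a dynamic fixed-point grid with the given bit width."""
    max_abs = float(np.max(np.abs(weights)))
    if max_abs == 0.0:
        return weights
    int_bits = int(np.ceil(np.log2(max_abs))) + 1   # integer part plus sign bit
    frac_bits = bit_width - int_bits                # fractional length for this layer
    step = 2.0 ** (-frac_bits)
    q_min = -(2 ** (bit_width - 1)) * step
    q_max = (2 ** (bit_width - 1) - 1) * step
    return np.clip(np.round(weights / step) * step, q_min, q_max)
```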
Quantization
Caffe Ristretto quantization
Our work
Caffe Ristretto
The results obtained with Ristretto on a simple network for the MNIST dataset are not so satisfying...
network          accuracy   model size (MB)   time (ms)   LL_d misses (10^6)   L1_d misses (10^6)
Original         0.9903     1.7               29.2        32.098               277.189
Dynamic f. p.    0.9829     1.7               126.41      42.077               303.209
Minifloat        0.9916     1.7               29.5        37.149               282.396
Power of two     0.9899     1.7               61.1        35.774               280.819
Linux running on a MacBook Pro; cachegrind used for cache statistics. Intel i5 2.9 GHz, 3 MB L3 cache, 16 GB RAM.
The quantized values are still stored in float size after the quantization
The quantized layer implementations work with float variables (illustrated below):
they perform the computation on low-precision values stored in float variables
and quantize the results, which are again stored in float variables
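To make this concrete, a toy version of what a Ristretto-style quantized layer does (a dense layer standing in for a convolution; fake_quantize is a stand-in for any of the rounding schemes above):

```python
import numpy as np

def fake_quantize(t, bit_width=8):
    """Round a tensor to a low-precision grid, but keep it as float."""
    max_abs = float(np.max(np.abs(t)))
    if max_abs == 0.0:
        return t
    step = max_abs / (2 ** (bit_width - 1) - 1)
    return np.round(t / step) * step

def quantized_dense(x, w):
    # every tensor stays float32, so neither the model size
    # nor the arithmetic cost goes down
    x_q = fake_quantize(x).astype(np.float32)
    w_q = fake_quantize(w).astype(np.float32)
    return fake_quantize(x_q @ w_q).astype(np.float32)
```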
Our work
TensorFlow
Quantization is better supported
The quantized model is stored with low-precision weights
Some low-precision operations are already implemented (an example call to the graph transform tool is sketched below)
We tried different network topologies to see how quantization affects different architectures
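For reference, one way to apply the TensorFlow (1.x) graph transform tool from Python; the model path, node names, and transform list below are placeholders, not necessarily the exact ones used in the project:

```python
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

# load a frozen graph (placeholder file name)
graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# quantize weights and operations to 8 bits
quantized_def = TransformGraph(
    graph_def,
    ["input"],                               # placeholder input node names
    ["output"],                              # placeholder output node names
    ["quantize_weights", "quantize_nodes"],
)

with tf.gfile.GFile("quantized_model.pb", "wb") as f:
    f.write(quantized_def.SerializeToString())
```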
Our work
How we used the TensorFlow quantization tool
We used Python (with a bit of OO, since we needed a way to reuse the same scripts with different networks)
An abstract class defines the pattern of the network that the main
script can handle
The core methods of the pattern are
prepare: load the data and build the computational graph and the
training step of the network
train: iterate the train step
The main script takes an instance as input and:
calls prepare and train
quantizes the obtained network
evaluates the accuracy
evaluates cache performance using linux-perf
plots the data
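A skeleton of that pattern (the method names prepare and train come from the description above; the class name and everything else are our guesses):

```python
from abc import ABC, abstractmethod

class NetworkPattern(ABC):
    """Interface that every network passed to the main script must implement."""

    @abstractmethod
    def prepare(self):
        """Load the data and build the computational graph and the training step."""

    @abstractmethod
    def train(self):
        """Iterate the training step."""

def run(net):
    net.prepare()
    net.train()
    # the main script then quantizes the trained graph, evaluates accuracy,
    # measures cache behaviour with linux-perf, and plots the results
```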
Our work
Topologies on MNIST
[Figure with the five tested topologies: (a) Big, (b) More convolutional, (c) More FC, (d) TF Example, (e) Only FC]
Our work
Some data on MNIST - accuracy
Our work
Some data on MNIST - models size
Our work
Some data on MNIST - data cache misses
Our work
Some data on MNIST - inference time
Our work
Some data on CIFAR10 - accuracy
Our work
Some data on CIFAR10 - models size
Our work
Some data on CIFAR10 - data cache misses
Our work
Some data on CIFAR10 - inference time
Our work
Why is the inference time worse?
We see an improvement only in the size of the model, and consequently in the data cache misses
Inference time and last-level cache misses are worse in the quantized networks
From the TensorFlow GitHub page:
Only a subset of ops are supported, and on many platforms the quantized
code may actually be slower than the float equivalents, but this is a way of
increasing performance substantially when all the circumstances are right.
Our work
Original net - TensorFlow benchmark tool
Original network on MNIST dataset:
Our work
Quantized net - TensorFlow benchmark tool
Quantized network on MNIST dataset:
Our work
References
TensorFlow quantization: https://www.tensorflow.org/performance/quantization
Graph transform tool: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
Ristretto: http://lepsucd.com/?page_id=621
Github repository of the project:
https://github.com/EmilianoGagliardiEmanueleGhelfi/CNN-compression-performance
Deep Learning with Limited Numerical Precision (Suyog Gupta, Ankur
Agrawal, Kailash Gopalakrishnan)