1. DO DEEP NETS REALLY NEED TO BE DEEP?
Meoni Marco – UNIPI – March 7th 2016
Paper by Lei Jimmy Ba (University of Toronto) and Rich Caruana (Microsoft Research)
PhD course in Deep Learning
2. NNs
• SNN: a single hidden layer between inputs and outputs
• DNN: three hidden layers
• CNN: three hidden layers on top of convolutional/max-pooling layers (architectures sketched below)
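A hedged, illustrative PyTorch sketch of the three architectures; the sizes D, H, C and the activation choices are placeholders, not settings from the paper:

```python
import torch.nn as nn

D, H, C = 784, 1200, 10                    # inputs, hidden units, outputs (placeholders)

snn = nn.Sequential(nn.Linear(D, H), nn.ReLU(), nn.Linear(H, C))   # one hidden layer

dnn = nn.Sequential(                                                # three hidden layers
    nn.Linear(D, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, C),
)

cnn = nn.Sequential(                                                # conv/max-pool front-end
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),                                                   # 32 x 14 x 14 feature map
    nn.Linear(32 * 14 * 14, H), nn.ReLU(),                          # three hidden layers on top
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, C),
)
```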
3. Introduction
• DNNs excel over SNNs
• e.g. accuracy when trained on 1M labeled points: 91% vs 86%
• What is the source of the improvement of DNNs over SNNs?
• Do deep nets simply have more parameters?
• Can deep nets learn more complex functions?
• Does convolution give an extra edge?
4. Contribution
• It is possible to train an SNN that mimics the function of a DNN
• via a model compression method
• The mimic SNN can be as accurate as the DNN, even though an SNN trained
directly on the original labeled data never reaches that accuracy
• Is it really necessary to be deep?
• If an SNN can mimic a DNN, the function the DNN learns is not inherently deep
• The success of depth appears to be related to the learning process
5. Model Compression
[Diagram: 1. build a complex model (DNN, CNN, …, or an ensemble) on the labeled data;
2. train a simple SNN to mimic the complex function, using its scores as targets;
3. apply the SNN]
• Compress large ensembles into smaller, faster models
• Train the small model on the function learned by the larger model, not on the original labels
6. Model Compression (Bucilă, Caruana & Niculescu-Mizil 2006)
• Train a smaller model to mimic a larger, smarter model
• train the smart model any way you want:
• DNN, CNN, or ensemble of CNNs
• pass a large amount of unlabeled data through the model and collect its predictions
(this captures the function learned by the smart model)
• train the “small” model to mimic the large model on this newly labeled data (see the sketch below)
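A minimal sketch of this recipe, assuming a trained PyTorch `teacher` module that returns pre-softmax logits, an untrained shallow `student`, and an iterable `unlabeled_loader` that yields input batches; the function name and hyper-parameters are illustrative, not taken from the paper:

```python
import torch

def compress(teacher, student, unlabeled_loader, epochs=10, lr=1e-3):
    """Train `student` to mimic `teacher` on unlabeled data (model compression)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for x in unlabeled_loader:              # large unlabeled transfer set
            with torch.no_grad():
                z = teacher(x)                  # teacher logits = soft targets
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(student(x), z)  # regress on the logits
            loss.backward()
            opt.step()
    return student
```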
7. Logits
• Model compression
• train mimic SNNs using data labeled by DNNs
• DNN trained with softmax output and cross-entropy
• SNN trained to regress on the logits (the log probability values before the
softmax activation); see the loss below
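A hedged reconstruction of the mimic objective: the student parameters (hidden weights W and output weights β) are fit by L2 regression on the teacher's logits z^(t) over a transfer set of T examples, with g(x; W, β) = β f(Wx) the student's pre-softmax output:

```latex
% L2 regression on the teacher logits z^{(t)}; f is the hidden-layer non-linearity
\mathcal{L}(W, \beta) \;=\; \frac{1}{2T} \sum_{t}
  \left\lVert g\!\left(x^{(t)}; W, \beta\right) - z^{(t)} \right\rVert_2^2,
\qquad
g\!\left(x^{(t)}; W, \beta\right) = \beta\, f\!\left(W x^{(t)}\right)
```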
9. Speed-up Mimic Learning
• An SNN with the same number of parameters learns slowly (weeks on a GPU)
• Add a bottleneck linear layer
• k linear hidden units between the input and the non-linear hidden layer
• factorize the weight matrix W ∈ ℝ^{H×D} into the product of two low-rank matrices
U ∈ ℝ^{H×k} and V ∈ ℝ^{k×D} (see the sketch below)
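A hedged PyTorch sketch of the bottleneck: two consecutive linear layers with no non-linearity in between implement the low-rank factorization W ≈ U V; the sizes D, H, k, C below are placeholders, not the paper's settings:

```python
import torch.nn as nn

D, H, k, C = 784, 8000, 256, 10          # inputs, hidden units, bottleneck size, logits

mimic_snn = nn.Sequential(
    nn.Linear(D, k, bias=False),         # V ∈ R^{k×D}: linear bottleneck layer
    nn.Linear(k, H),                     # U ∈ R^{H×k}: together U·V ≈ W ∈ R^{H×D}
    nn.Sigmoid(),                        # single non-linear hidden layer
    nn.Linear(H, C),                     # β: linear output producing the logits
)
```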
10. Cost Function with Linear Layer
• O(k(H+D)) parameters and memory instead of O(HD)
• Factorizing the weights between the input and hidden layers is new and
improves convergence speed during training (cost function below)
• Previous work factorized only the last, output layer
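Reconstructed (with hedging) from the factorization above, the slide's cost function replaces W with the product U V in the mimic loss:

```latex
% W \approx U V, with U \in \mathbb{R}^{H \times k} and V \in \mathbb{R}^{k \times D};
% V plays the role of the linear bottleneck layer
\mathcal{L}(U, V, \beta) \;=\; \frac{1}{2T} \sum_{t}
  \left\lVert \beta\, f\!\left(U V x^{(t)}\right) - z^{(t)} \right\rVert_2^2
```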
17. Discussion
• Why can mimic models be more accurate than models trained on the original labels?
• If the original labels contain errors, the teacher may eliminate them,
making learning easier for the student
• The teacher may resolve complex regions of the input space
• Learning from the teacher's probabilities is easier than learning from 0/1 hard labels
• Every target the student sees is a prediction the teacher made from the input,
so it always has a “reason”; the teacher instead may encounter labels it cannot explain
18. Representational Power
“We see little evidence that shallow models have limited capacity
or representational power.
Instead, the main limitation appears to be the learning and
regularization procedures used to train the shallow models”