1. DO DEEP NETS REALLY NEED TO BE DEEP?
Meoni Marco – UNIPI – March 7th 2016
Paper by Lei Jimmy Ba (University of Toronto) and Rich Caruana (Microsoft Research)
PhD course in Deep Learning
2. NNs
• SNN: a single hidden layer between inputs and outputs
• DNN: three hidden layers
• CNN: three hidden layers on top of convolutional/max-pooling layers (architectures sketched below)
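A hedged, illustrative PyTorch sketch of the three architectures; the sizes D, H, C and the activation choices are placeholders, not settings from the paper:

```python
import torch.nn as nn

D, H, C = 784, 1200, 10                    # inputs, hidden units, outputs (placeholders)

snn = nn.Sequential(nn.Linear(D, H), nn.ReLU(), nn.Linear(H, C))   # one hidden layer

dnn = nn.Sequential(                                                # three hidden layers
    nn.Linear(D, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, C),
)

cnn = nn.Sequential(                                                # conv/max-pool front-end
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),                                                   # 32 x 14 x 14 feature map
    nn.Linear(32 * 14 * 14, H), nn.ReLU(),                          # three hidden layers on top
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, C),
)
```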
3. Introduction
• DNNs excel over SNNs
• e.g. accuracy when trained on 1M labeled points: 91% vs 86%
• What is the source of the improvement of DNNs over SNNs?
• Do deep nets simply have more parameters?
• Can deep nets learn more complex functions?
• Does convolution give an extra edge?
4. Contribution
• It is possible to train an SNN that mimics the function of a DNN
• via a model compression method
• The mimic SNN can be as accurate as the DNN, even though an SNN trained
directly on the original labeled data never reaches that accuracy
• Is it really necessary to be deep?
• If an SNN can mimic a DNN, the function the DNN learns is not inherently deep
• The success of depth appears to be related to the learning process
5. Model Compression
[Diagram: 1. build a complex model (DNN, CNN, …, or an ensemble) on the labeled data;
2. train a simple SNN to mimic the complex function, using its scores as targets;
3. apply the SNN]
• Compress large ensembles into smaller, faster models
• Train the small model on the function learned by the larger model, not on the original labels
6. Model Compression (Bucilă, Caruana & Niculescu-Mizil 2006)
• Train a smaller model to mimic a larger, smarter model
• train the smart model any way you want:
• DNN, CNN, or ensemble of CNNs
• pass a large amount of unlabeled data through the model and collect its predictions
(this captures the function learned by the smart model)
• train the “small” model to mimic the large model on this newly labeled data (see the sketch below)
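A minimal sketch of this recipe, assuming a trained PyTorch `teacher` module that returns pre-softmax logits, an untrained shallow `student`, and an iterable `unlabeled_loader` that yields input batches; the function name and hyper-parameters are illustrative, not taken from the paper:

```python
import torch

def compress(teacher, student, unlabeled_loader, epochs=10, lr=1e-3):
    """Train `student` to mimic `teacher` on unlabeled data (model compression)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for x in unlabeled_loader:              # large unlabeled transfer set
            with torch.no_grad():
                z = teacher(x)                  # teacher logits = soft targets
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(student(x), z)  # regress on the logits
            loss.backward()
            opt.step()
    return student
```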
7. Logits
• Model compression
• train mimic SNNs using data labeled by DNNs
• DNN trained with softmax output and cross-entropy
• SNN trained to regress on the logits (the log probability values before the
softmax activation); see the loss below
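A hedged reconstruction of the mimic objective: the student parameters (hidden weights W and output weights β) are fit by L2 regression on the teacher's logits z^(t) over a transfer set of T examples, with g(x; W, β) = β f(Wx) the student's pre-softmax output:

```latex
% L2 regression on the teacher logits z^{(t)}; f is the hidden-layer non-linearity
\mathcal{L}(W, \beta) \;=\; \frac{1}{2T} \sum_{t}
  \left\lVert g\!\left(x^{(t)}; W, \beta\right) - z^{(t)} \right\rVert_2^2,
\qquad
g\!\left(x^{(t)}; W, \beta\right) = \beta\, f\!\left(W x^{(t)}\right)
```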
9. Speed-up Mimic Learning
• An SNN with the same number of parameters learns slowly (weeks on a GPU)
• Add a bottleneck linear layer
• k linear hidden units between the input and the non-linear hidden layer
• factorize the weight matrix W ∈ ℝ^{H×D} into the product of two low-rank matrices
U ∈ ℝ^{H×k} and V ∈ ℝ^{k×D} (see the sketch below)
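A hedged PyTorch sketch of the bottleneck: two consecutive linear layers with no non-linearity in between implement the low-rank factorization W ≈ U V; the sizes D, H, k, C below are placeholders, not the paper's settings:

```python
import torch.nn as nn

D, H, k, C = 784, 8000, 256, 10          # inputs, hidden units, bottleneck size, logits

mimic_snn = nn.Sequential(
    nn.Linear(D, k, bias=False),         # V ∈ R^{k×D}: linear bottleneck layer
    nn.Linear(k, H),                     # U ∈ R^{H×k}: together U·V ≈ W ∈ R^{H×D}
    nn.Sigmoid(),                        # single non-linear hidden layer
    nn.Linear(H, C),                     # β: linear output producing the logits
)
```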
10. Cost Function with Linear Layer
• O(k(H+D)) parameters and memory instead of O(HD)
• Factorizing the weights between the input and hidden layers is new and
improves convergence speed during training (cost function below)
• Previous work factorized only the last, output layer
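Reconstructed (with hedging) from the factorization above, the slide's cost function replaces W with the product U V in the mimic loss:

```latex
% W \approx U V, with U \in \mathbb{R}^{H \times k} and V \in \mathbb{R}^{k \times D};
% V plays the role of the linear bottleneck layer
\mathcal{L}(U, V, \beta) \;=\; \frac{1}{2T} \sum_{t}
  \left\lVert \beta\, f\!\left(U V x^{(t)}\right) - z^{(t)} \right\rVert_2^2
```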
17. Discussion
• Why can mimic models be more accurate than models trained on the original labels?
• If the original labels contain errors, the teacher may eliminate them,
making learning easier for the student
• The teacher may resolve complex regions of the input space
• Learning from the teacher's probabilities is easier than learning from 0/1 hard labels
• Every target the student sees is a prediction the teacher made from the input,
so it always has a “reason”; the teacher instead may encounter labels it cannot explain
18. Representational Power
“We see little evidence that shallow models have limited capacity
or representational power.
Instead, the main limitation appears to be the learning and
regularization procedures used to train the shallow models”