[Icml2019]LIT: Learned Intermediate Representation Training for Model Compression

LeapMind ICML2019 Reading Session
LIT: Learned Intermediate Representation Training for Model Compression
Reader: Joel Nicholls, DL Researcher, LeapMind

LeapMind Inc. © 2019
Paper Info
Published:
Animesh Koratana, Daniel Kang, Peter Bailis, Matei Zaharia;
Proceedings of the 36th International Conference on Machine
Learning, PMLR 97:3509-3518, 2019.
→ Diagrams, equations, and figures from the paper
have been used in these slides to explain LIT.
ICML 2019: authors’ slides.
NIPS 2018 Workshop CDNNRIA open review.

What is it?
● Similar to knowledge distillation, LIT uses a teacher network to improve the accuracy
of a student network.
● At test time, the teacher is cut away from the student, so inference needs no more
computation than the same student trained from scratch.
● (I think) the theory behind methods like this is still not well established (Lopez-Paz
et al. [4] is one of the few works on theory). But if it gets results, then great.
Training needs more computation, because the teacher is also used.
But arguably, it is more important to compress/accelerate inference.
For that reason, LIT is a useful technique for deep learning.

Related works - Knowledge distillation
Knowledge distillation loss [1] is a combination of two losses.
1. The usual cross-entropy loss between the student outputs and the ground truth.
2. Another cross-entropy loss between the (trained) teacher outputs and the student outputs.
Image by Ujjwal U. of Intel
https://software.intel.com/en-us/articles/knowledge-distillation-with-keras

Related works - Knowledge distillation
Knowledge distillation introduces two new hyperparameters.
The authors of LIT (Koratana et al.) say that both must be tuned to get good results.
1. Tau (temperature), which softens the targets.
2. Alpha, the weighting between the two losses.
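
As a rough illustration of how tau and alpha enter the knowledge distillation loss, here is a minimal pure-Python sketch. The function names (softmax, cross_entropy, kd_loss) and the toy logits are made up for this example. Which term alpha multiplies is an assumption (conventions differ between papers), and the tau-squared scaling suggested by Hinton et al. [1] is left out to keep it short.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax of logits / tau; a larger tau gives softer probabilities."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """H(target, predicted) = -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)


def kd_loss(student_logits, teacher_logits, label, tau, alpha):
    """Weighted sum of the two cross-entropy terms.

    Assumed convention: alpha weights the ground-truth term and
    (1 - alpha) weights the softened teacher term.
    """
    one_hot = [1.0 if k == label else 0.0 for k in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, tau),
                         softmax(student_logits, tau))
    return alpha * hard + (1.0 - alpha) * soft


# Toy logits for a 3-class problem (illustrative values only).
student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
loss = kd_loss(student, teacher, label=0, tau=4.0, alpha=0.5)
```

With alpha = 1 the teacher term vanishes and the loss reduces to ordinary training on the labels. Raising tau flattens both distributions, so the teacher's small probabilities for the wrong classes carry more weight.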

Related works - others
Born again neural networks [3]
They did experiments using distillation on
same-size student and teacher.
FitNets [2]
A kind of hint training: it distills one of the
intermediate feature maps.
- The most similar work to LIT.
Image by Romero et al. from the paper
“FitNets: Hints for Thin Deep Nets”, published
as a conference paper at ICLR 2015 [2]

Top-down view of LIT
● Combines knowledge distillation (KD) loss AND intermediate representation (IR) loss.
● IR loss is an L2 loss between intermediate feature maps of the teacher and student (these must be the same size).
● In the training forward pass, each student block receives the output of the preceding teacher block as its input.
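
The IR loss and the "teacher output as student input" wiring can be sketched with toy blocks. This is only an illustration under assumptions: the names l2_loss and ir_loss are invented here, feature maps are flat lists of floats, and the lambda "blocks" stand in for groups of real layers.

```python
def l2_loss(a, b):
    """Squared L2 distance between two same-size feature maps (flattened)."""
    assert len(a) == len(b), "IR loss needs same-size feature maps"
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ir_loss(x, teacher_blocks, student_blocks):
    """Sum of L2 losses, one per paired teacher/student block.

    Student block i is fed the teacher's features from just before
    teacher block i, and its output is compared with teacher block i's output.
    """
    total = 0.0
    teacher_features = x
    for teacher_block, student_block in zip(teacher_blocks, student_blocks):
        teacher_out = teacher_block(teacher_features)
        student_out = student_block(teacher_features)  # teacher features as input
        total += l2_loss(student_out, teacher_out)
        teacher_features = teacher_out  # next pair starts from the teacher's output
    return total


# Hypothetical toy "blocks": element-wise functions standing in for layer groups.
teacher_blocks = [lambda f: [2.0 * v for v in f], lambda f: [v + 1.0 for v in f]]
student_blocks = [lambda f: [1.5 * v for v in f], lambda f: [v + 1.0 for v in f]]

loss = ir_loss([1.0, 2.0], teacher_blocks, student_blocks)
```

Because every student block starts from teacher features, an error in an early student block does not propagate into the later blocks during training. Each block is trained against its own teacher target.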

LIT loss equation
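
The equation on this slide was an image taken from the paper. Going only by the bullets in this deck (beta weighs the KD loss against the IR loss, the IR loss is an L2 loss between same-size feature maps, and each student block is fed the teacher's features), the combined loss plausibly has the form below. The notation is an assumption made for this sketch: S_i is the i-th student block, t_i is the teacher's feature map after its i-th block, and which term carries beta versus (1 - beta) is also assumed.

```latex
\mathcal{L}_{\mathrm{LIT}}
  = \beta \, \mathcal{L}_{\mathrm{KD}} + (1 - \beta) \, \mathcal{L}_{\mathrm{IR}},
\qquad
\mathcal{L}_{\mathrm{IR}}
  = \sum_{i} \left\lVert S_i(t_{i-1}) - t_i \right\rVert_2^2
```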

More details for LIT
● The student has fewer layers than the teacher, but the same width (channels).
● They put the IR loss before downsampling points (but not at every downsample).
● A new hyperparameter, beta, is the weighting between the KD and IR losses.
● After the main training, they fine-tune with the KD loss only.
● They tune tau, alpha, and beta separately for each architecture type and dataset pair.

Experiments
Now moving to experiments!
In the experiments, LIT is compared with:
knowledge distillation [1], FitNets [2], born again neural networks [3], and training from scratch.
● They say their method is better than all of these.
● They also do some ablation studies, which is good, but I won’t cover them here.

Experiments

How much improvement is that?
Keeping to just one example: Resnet-20 for CIFAR100 classification.
(reading roughly from the graph)

                          From scratch   KD                  LIT
Test error (%)            30.6           28.1                27.39
Improvement from scratch  0              30.6 - 28.1 = 2.5   30.6 - 27.39 = 3.21
Relative improvement      0              2.5/30.6 = 8.2%     3.21/30.6 = 10.5%
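
The arithmetic in the table can be checked with a few lines of Python. The helper name relative_improvement is made up for this example, and the error values are the ones read roughly from the graph above.

```python
def relative_improvement(baseline_error, new_error):
    """Error reduction as a percentage of the baseline error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Test errors (%) for Resnet-20 on CIFAR100, read roughly from the graph.
from_scratch, kd, lit = 30.6, 28.1, 27.39

kd_gain = relative_improvement(from_scratch, kd)
lit_gain = relative_improvement(from_scratch, lit)

print(f"KD:  {kd_gain:.1f}%")   # KD:  8.2%
print(f"LIT: {lit_gain:.1f}%")  # LIT: 10.5%
```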

Additional experiments
Nice performance on sentiment analysis.
It can be combined with pruning.
It can also be used for GANs.

The main points, from my overall impression
● They compare with training from scratch. Some other distillation/pruning papers
don’t do that, but it is very important for seeing how large the improvement is.
● Performs a bit better than knowledge distillation, in terms of relative
improvement (8.2% → 10.5% for Resnet-20 CIFAR100).
● Can compress GANs, which other distillation methods can’t do.
● Needs same-size intermediate feature maps at the points where the student
and teacher are linked. For this reason, it is best suited to student/teacher
pairs with the same width (channels) and different depth (layers).

References
[1] Hinton et al. “Distilling the Knowledge in a Neural Network”
https://arxiv.org/abs/1503.02531
[2] Romero et al. “Fitnets: Hints for thin deep nets” (ICLR 2015)
[3] Furlanello et al. “Born again neural networks” (ICML 2018)
http://proceedings.mlr.press/v80/furlanello18a.html
[4] Lopez-Paz et al. “Unifying distillation and privileged information” (ICLR 2016)
