23 July 2016
Distilling dark knowledge from neural networks
Alex Korbonits, Data Scientist
2
Join our team!
About Remitly and Me
3
Introduction
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
4
"A Survey of Modern Questions and Challenges in Feature Extraction"
• Two categories of learning algorithms
reviewed:
• Supervised
• Unsupervised
• Two categories of feature extraction
reviewed:
• Coupled
• Uncoupled
Feature extraction and learners
5
"A Survey of Modern Questions and Challenges in Feature Extraction"
• Unsupervised Uncoupled feature extraction:
• PCA, IsoMap, Maximum Variance Unfolding
• Supervised Uncoupled feature extraction (i.e., feature
selection based on correlation with a label):
• MTFS (Argyriou et al., 2008)
• Supervised Coupled feature extraction:
• Neural Network (particularly with > 1 hidden layer)
• NO such thing as “unsupervised coupled”: without a classifier being
trained, there is nothing for the feature extraction to couple to.
Examples of corresponding algorithms
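As an illustration of the first category, here is a minimal PCA sketch in plain NumPy. It is unsupervised and uncoupled in the sense above: no labels are used and no downstream classifier is involved; the data and dimensions are made up for illustration.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components. Unsupervised and
    uncoupled: no labels and no downstream classifier are involved."""
    Xc = X - X.mean(axis=0)                           # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False) # singular vectors = components
    return Xc @ Vt[:k].T                              # scores in the top-k subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

Because SVD orders singular values in decreasing order, the first returned column always carries at least as much variance as the second.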
6
"A Survey of Modern Questions and Challenges in Feature Extraction"
• Supervised Coupled methods tend to significantly
outperform others (but not always!) (Gonen, 2014)
• Pros: better feature extraction
• Cons: hard to interpret, complex, scalability is evolving
• No Free Lunch Theorem (Wolpert &amp; Macready, 1997)
• Deep learning not a silver bullet!
• “We have dubbed the associated results NFL theorems
because they demonstrate that if an algorithm performs
well on a certain class of problems then it necessarily
pays for that with degraded performance on the set of all
remaining problems”
Takeaways
7
Moving right along
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
8
Banking and Lending
• Credit card issuers required to give reasons for denial of
credit
• Anti-discriminatory regulations
• Consumer protection regulations
• Credit card issuers sacrifice predictive power to comply
• THIS IS A GOOD THING
• Restricts model complexity to interpretable model
classes
• E.g., logistic regression, single decision tree, etc.
Credit Card Applications
9
• Decisions where interpretability is essential:
• Whether or not to obtain a biopsy
• Whether or not to surgically operate
• Whether or not to try an experimental new drug
• Interpretability isn’t just good for decisions:
• Good for auditing prior decisions
• Good for building intuition and expertise
Medicine and healthcare
10
• Interpretability is a business imperative
• Helps identify who/what/where/why/when/how
• Suggests paths to change business
products/processes/services to reduce churn
Customer Churn Prediction
11
Are we there yet?
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
12
Ribeiro et al., 2016
• Separating prediction from interpretation:
• Use any black-box model you want for
prediction
• Use an interpretable model to explain black-box predictions
• “Our explanations empower users in various
scenarios that require trust: deciding if one
should trust a prediction, choosing between
models, improving an untrustworthy classifier,
and detecting why a classifier should not be
trusted.”
Model-Agnostic Interpretability of Machine Learning
13
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
• Local Interpretable Model-Agnostic Explanations
• Basic algorithm intuition:
• Train black-box model, get test set predictions
• Train interpretable model on those predictions
• LIME explains (locally) which features contributed
most to the given prediction.
• Important properties of any explanatory model:
• Interpretability
• Local fidelity
• Model-agnostic
LIME
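The algorithm intuition above can be sketched in a few lines: perturb the input, query the black box, and fit a proximity-weighted linear surrogate. The perturbation scale, kernel width, and least-squares fit below are illustrative choices, not the reference LIME implementation.

```python
import numpy as np

def explain_locally(predict_fn, x, n_samples=500, width=0.75, seed=0):
    """Toy LIME-style explanation: perturb x, query the black box, and fit
    a proximity-weighted linear surrogate. Returns per-feature coefficients
    (local importances)."""
    rng = np.random.default_rng(seed)
    Xp = x + rng.normal(scale=0.3, size=(n_samples, x.size))  # local samples
    y = predict_fn(Xp)                                        # black-box outputs
    d = np.linalg.norm(Xp - x, axis=1)
    w = np.exp(-(d ** 2) / width ** 2)                        # proximity kernel
    A = np.c_[np.ones(n_samples), Xp]                         # intercept + features
    beta = np.linalg.solve((A.T * w) @ A, A.T @ (w * y))      # weighted least squares
    return beta[1:]                                           # drop the intercept

# A black box that, near x, depends only on feature 0:
coefs = explain_locally(lambda X: 3.0 * X[:, 0], np.array([1.0, 2.0]))
```

Since this toy black box really is linear in feature 0, the surrogate recovers a weight near 3.0 for it and near 0.0 for the irrelevant feature, which is exactly the "local fidelity" property above.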
14
Model-Agnostic Interpretability of Machine Learning
• CAVEAT (couldn’t have said it better myself):
• “In some domains, exact explanations may be
required (e.g. for legal or ethical reasons), and
using a black-box may be unacceptable (or
even illegal). Interpretable models may also be
more desirable when interpretability is much
more important than accuracy, or when
interpretable models trained on a small number
of carefully engineered features are as
accurate as black-box models.”
• E.g., if you DON’T have “big data” or desire to
make other tradeoffs, LIME isn’t what you want.
CAVEATS to agnosticism
15
ERMAGHERD DEEP LEARNING
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
16
Rosenblatt, 1957
• Rosenblatt, 1957, Cornell
Aeronautics Laboratory, funded by
the Office of Naval Research
• Linear classifier. Designed for
image recognition.
• Inputs x and weights w linearly
combined to achieve some sort of
output y.
• Can’t solve XOR (counterexample
to everything).
Perceptron
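A minimal sketch of Rosenblatt's learning rule, showing both halves of the slide: it converges on a linearly separable problem (AND) but can never perfectly fit XOR. The data and epoch count are illustrative.

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Rosenblatt's rule: on each mistake, nudge the weights toward the
    example. A purely linear classifier, so only separable problems converge."""
    Xb = np.c_[np.ones(len(X)), X]        # prepend a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if xi @ w > 0 else 0
            w += (yi - pred) * xi         # update only on mistakes
    return w

def accuracy(w, X, y):
    Xb = np.c_[np.ones(len(X)), X]
    return float(np.mean((Xb @ w > 0).astype(int) == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])
and_acc = accuracy(train_perceptron(X, y_and), X, y_and)  # 1.0: AND is separable
xor_acc = accuracy(train_perceptron(X, y_xor), X, y_xor)  # < 1.0: XOR is not
```

No linear decision boundary gets all four XOR points right, so the XOR accuracy is capped at 3/4 no matter how long you train.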
17
Cybenko, 1989
• With one hidden layer, a multilayer perceptron – which
can now figure out XOR – is capable of arbitrary
function approximation. (Cybenko, 1989)
• Riesz Representation theorem. Math nerds unite!
• Supervised, semi-supervised, unsupervised, and
reinforcement learning applications.
• Flexible architectural components – layer types,
connection types, regularization techniques – allow for
empirical tinkering. Think of playing with Lego®.
Enter the multilayer perceptron
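To see concretely why one hidden layer suffices for XOR, here is a tiny multilayer perceptron with hand-picked (not learned) weights, implementing XOR(a, b) = OR(a, b) AND NOT AND(a, b):

```python
def step(z):
    """Threshold activation unit."""
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """XOR with one hidden layer of two units and fixed weights."""
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1 fires on OR
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2 fires on AND
    return step(h_or - h_and - 0.5)  # output: OR but not AND

print([xor_mlp(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

The hidden layer re-represents the inputs so that the final unit's problem becomes linearly separable, which is the essence of the universal approximation result.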
18
Rumelhart et al., 1985
Backpropagation
19
Sounds like Defense against the Dark Arts, amiright?
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
20
Burning down the house!
• A trained classifier is simply a labelling function.
• More mathematically, it’s just a mapping of vectors to
vectors, inputs and outputs.
• We can just take those outputs as inputs to another
function!
• Typically, the output layer of a neural network is represented
by a “softmax layer” that computes a probability q_i for each
class from its logit z_i: q_i = exp(z_i / T) / Σ_j exp(z_j / T).
• T here is the temperature.
• Note: using a higher value for T produces a softer probability
distribution over the classes.
Dark Knowledge
22
Machine learning moonshine?
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
23
Machine learning moonshine?
• Distilled learning is model compression.
• There are many different procedures for distillation.
• The simplest way to transfer this knowledge:
• Use the cumbersome model’s output predictions
as the ground truth labels for the distilled model.
Distilled Learning
Jabir ibn Hayyan described distillation using an alembic in the 8th century.
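A minimal sketch of that simplest transfer: a logistic-regression "student" fitted to a hypothetical teacher's soft outputs by gradient descent. The teacher here is a made-up stand-in for a cumbersome network, and the learning rate and step count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill(X, teacher_probs, lr=0.5, steps=2000):
    """Fit a logistic-regression 'student' to a teacher's soft outputs by
    gradient descent on cross-entropy. With soft targets the gradient has
    the same form as with hard labels; only the labels change."""
    Xb = np.c_[np.ones(len(X)), X]            # prepend a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - teacher_probs) / len(X)
    return w

# Hypothetical "cumbersome" teacher standing in for a big network:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
teacher_probs = sigmoid(X @ np.array([2.0, -1.0, 0.5]))
w = distill(X, teacher_probs)
student_probs = sigmoid(np.c_[np.ones(len(X)), X] @ w)
```

After training, the student's probabilities track the teacher's closely: the soft targets carry per-example confidence that 0/1 labels would throw away.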
24
It’s not magic, it’s just math
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
25
MY GPUs ARE MELTING…
• Researchers used deep learning to improve the
statistical power on benchmark problems involving:
• Higgs bosons
• Higgs boson decay modes
• Supersymmetric particles
• Results of distilled learning:
• Improved shallow classifiers on all three tasks
High Energy Physics
26
… FASTER THAN THE WICKED WITCH OF THE WEST
• Researchers used deep learning to get rich feature
representations.
• The purpose of these interpretable models was phenotype
discovery.
• They extracted dark knowledge with a number of different neural
network architectures.
• Feedforward
• Stacked de-noising autoencoder
• LSTM (long short term memory)
• They then distilled dark knowledge into interpretable models.
• Distillation improved the interpretable models considerably!
Healthcare
27
• Distilled learning doesn’t just apply to feed-forward neural nets: it’s also useful for sequence learning.
• Transfer knowledge from teacher to student network. Multiple ways to do it!
• Model compression improves speed by an order of magnitude while sacrificing only 0.2 BLEU (bilingual evaluation understudy)
Neural Machine Translation
28
Making machine learning moonshine at Remitly
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
29
Machine learning moonshine?
• Comparison (using small toy models/data):
• Logistic regression
• Logistic regression with distilled labels
• Distilled knowledge improves our results
• Even for a very shallow black-box model trained with very few
iterations.
Fraud Classification
30
Citing our sources
Bibliography
Storcheus, Dmitry, Afshin Rostamizadeh, and Sanjiv Kumar. "A Survey of Modern Questions and Challenges in Feature Extraction."
In Proceedings of The 1st International Workshop on “Feature Extraction: Modern Questions and Challenges”, NIPS, pp. 1-18.
2015.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any
Classifier." arXiv preprint arXiv:1602.04938 (2016).
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Model-Agnostic Interpretability of Machine Learning." arXiv preprint
arXiv:1606.05386 (2016).
Freitas, Alex A. Comprehensible classification models: A position paper. SIGKDD Explor. Newsl., 15(1):1–10, March 2014.
ISSN 1931-0145.
G. Hinton, O. Vinyals, and J. Dean. Dark knowledge. Presented as the keynote in BayLearn, 2014.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Venkatesan, Ragav, and Baoxin Li. "Diving deeper into mentee networks." arXiv preprint arXiv:1604.08220 (2016).
Wolpert, David H., and William G. Macready. "No free lunch theorems for optimization." IEEE transactions on evolutionary
computation 1, no. 1 (1997): 67-82.
31
Citing our sources
Bibliography
Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2, no. 4
(1989): 303-314.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Technical
Report ICS-8506. Institute for Cognitive Science, University of California, San Diego, 1985.
Sadowski, Peter, Julian Collado, Daniel Whiteson, and Pierre Baldi. "Deep Learning, Dark Knowledge, and Dark Matter." In NIPS
2014 Workshop on High-energy Physics and Machine Learning, pp. 81-87. 2014.
Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and
K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc.,
2014.
Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." In Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.
Che, Zhengping, Sanjay Purushotham, Robinder Khemani, and Yan Liu. "Distilling Knowledge from Deep Networks with Applications
to Healthcare Domain." arXiv preprint arXiv:1512.03542 (2015).
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural networks 2, no.
5 (1989): 359-366.
32
Citing our sources
Bibliography
Tang, Zhiyuan, Dong Wang, and Zhiyong Zhang. "Recurrent neural network training with dark knowledge transfer." In 2016 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5900-5904. IEEE, 2016.
Romano, Nathanael, and Robin Schucker. "Distilling Knowledge to Specialist Networks for Clustered Classification."
Papamakarios, George. "Distilling Model Knowledge." arXiv preprint arXiv:1510.02437 (2015).
Chen, Tianqi, Ian Goodfellow, and Jonathon Shlens. "Net2net: Accelerating learning via knowledge transfer." arXiv preprint
arXiv:1511.05641 (2015).
Kim, Yoon, and Alexander M. Rush. "Sequence-Level Knowledge Distillation." arXiv preprint arXiv:1606.07947 (2016).
33
What we talked about
• Feature extraction methods
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Summary
34
Remitly’s Data Science team uses ML for a variety of purposes.
ML applications are core to our business – therefore our business must be core to our ML applications.
Machine learning at Remitly
www.remitly.com/careers
We’re hiring!
alex@remitly.com

More Related Content

Similar to Distilling dark knowledge from neural networks

Deep learning summary
Deep learning summaryDeep learning summary
Deep learning summaryankit_ppt
 
Deep learning with keras
Deep learning with kerasDeep learning with keras
Deep learning with kerasMOHITKUMAR1379
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised LearningFEG
 
Machine learning
Machine learningMachine learning
Machine learninghplap
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneDrAhmedZoha
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...Egyptian Engineers Association
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needGibDevs
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDatabricks
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesTuri, Inc.
 
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning SolutionsJesus Rodriguez
 
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j
 
Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}FellowBuddy.com
 
Unit one ppt of deeep learning which includes Ann cnn
Unit one ppt of  deeep learning which includes Ann cnnUnit one ppt of  deeep learning which includes Ann cnn
Unit one ppt of deeep learning which includes Ann cnnkartikaursang53
 

Similar to Distilling dark knowledge from neural networks (20)

Deep learning summary
Deep learning summaryDeep learning summary
Deep learning summary
 
Deep learning with keras
Deep learning with kerasDeep learning with keras
Deep learning with keras
 
Deep learning internals
Deep learning internalsDeep learning internals
Deep learning internals
 
ExplainableAI.pptx
ExplainableAI.pptxExplainableAI.pptx
ExplainableAI.pptx
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised Learning
 
Fuzzy expert system
Fuzzy expert systemFuzzy expert system
Fuzzy expert system
 
Machine learning
Machine learningMachine learning
Machine learning
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart Phone
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle Grove
 
Lecture_8.ppt
Lecture_8.pptLecture_8.ppt
Lecture_8.ppt
 
Sh ch01
Sh ch01Sh ch01
Sh ch01
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep Features
 
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
 
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
 
Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}
 
Unit one ppt of deeep learning which includes Ann cnn
Unit one ppt of  deeep learning which includes Ann cnnUnit one ppt of  deeep learning which includes Ann cnn
Unit one ppt of deeep learning which includes Ann cnn
 

Recently uploaded

Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfrohankumarsinghrore1
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...Scintica Instrumentation
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxrohankumarsinghrore1
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfSumit Kumar yadav
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.Silpa
 

Recently uploaded (20)

Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 

Distilling dark knowledge from neural networks

  • 1. 23 July 2016 Distilling dark knowledge from neural networksAlex Korbonits, Data Scientist
  • 2. 2 Join our team! About Remitly and Me
  • 3. 3 Introduction • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 4. 4 "A Survey of Modern Questions and Challenges in Feature Extraction" • Two categories of learning algorithms reviewed: • Supervised • Unsupervised • Two categories of feature extraction reviewed: • Coupled • Uncoupled Feature extraction and learners
  • 5. 5 "A Survey of Modern Questions and Challenges in Feature Extraction" • Unsupervised Uncoupled feature extraction: • PCA, IsoMap, Maximum Variance Unfolding • Supervised Uncoupled feature extraction (i.e., feature selection based on correlation with a label): • MTFS (Argyriou et al., 2008) • Supervised Coupled feature extraction: • Neural Network (particularly with > 1 hidden layer) • NO such thing as “unsupervised coupled” since the feature extraction is not coupled to training a classifier. Examples of corresponding algorithms
  • 6. 6 "A Survey of Modern Questions and Challenges in Feature Extraction" • Supervised Coupled methods tend to significantly outperform others (but not always!) (Gonen, 2014) • Pros: better feature extraction • Cons: hard to interpret, complex, scalability is evolving • No Free Lunch Theorem (Wolpert, 1997) • Deep learning not a silver bullet! • “We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems” Takeaways
  • 7. 7 Moving right along • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 8. 8 Banking and Lending • Credit card issuers required to give reasons for denial of credit • Anti-discriminatory regulations • Consumer protection regulations • Credit card issuers sacrifice predictive power to comply • THIS IS A GOOD THING • Restricts model complexity to interpretable model classes • E.g., logistic regression, single decision tree, etc. Credit Card Applications
  • 9. 9 Banking and Lending • Decisions where interpretability is essential: • Whether or not to obtain a biopsy • Whether or not to surgically operate • Whether or not to try an experimental new drug • Interpretability isn’t just good for decisions: • Good for auditing prior decisions • Good for building intuition and expertise Medicine and healthcare
  • 10. 10 Customer Churn Prediction • Interpretability is a business imperative • Helps identify who/what/where/why/when/how • Suggests paths to change business products/processes/services to reduce churn Customer Churn Prediction
  • 11. 11 Are we there yet? • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 12. 12 Ribeiro et al., 2016 • Separating prediction from interpretation: • Use any black-box model you want for prediction • Use an interpretable model to explain black-box predictions • “Our explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and detecting why a classifier should not be trusted.” Model-Agnostic Interpretability of Machine Learning
  • 13. 13 "Why Should I Trust You?": Explaining the Predictions of Any Classifier • Local Interpretable Model-Agnostic Explanations • Basic algorithm intuition: • Train black-box model, get test set predictions • Train interpretable model on those predictions • LIME explains (locally) which features contributed most to the given prediction. • Important properties of any explanatory model: • Interpretability • Local fidelity • Model-agnostic LIME
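The LIME intuition above can be sketched in a few lines. This is a stripped-down, hypothetical local surrogate in the spirit of LIME, not the actual `lime` package: perturb around one instance, weight samples by proximity, and fit a weighted linear explainer. The black-box function and all numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):                       # stand-in black-box classifier score
    return 1 / (1 + np.exp(-(np.sin(3 * X[:, 0]) + 0.5 * X[:, 1])))

x0 = np.array([0.2, -0.4])              # the instance whose prediction we explain

Z = x0 + rng.normal(scale=0.1, size=(500, 2))       # local perturbations
y = black_box(Z)                                    # query the black box
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.05)   # RBF proximity weights

A = np.hstack([Z, np.ones((len(Z), 1))])            # design matrix + intercept
coef = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
print("local feature weights:", coef[:2])           # which features mattered here
```

The linear weights are only locally faithful: move `x0` and the explanation changes, which is exactly the point of LIME's locality.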
  • 14. 14 Model-Agnostic Interpretability of Machine Learning • CAVEAT (couldn’t have said it better myself): • “In some domains, exact explanations may be required (e.g. for legal or ethical reasons), and using a black-box may be unacceptable (or even illegal). Interpretable models may also be more desirable when interpretability is much more important than accuracy, or when interpretable models trained on a small number of carefully engineered features are as accurate as black-box models.” • E.g., if you DON’T have “big data” or desire to make other tradeoffs, LIME isn’t what you want. CAVEATS to agnosticism
  • 15. 15 ERMAGHERD DEEP LEARNING • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 16. 16 Rosenblatt, 1957 • Rosenblatt, 1957, Cornell Aeronautical Laboratory, funded by the Office of Naval Research • Linear classifier. Designed for image recognition. • Inputs x and weights w linearly combined to achieve some sort of output y. • Can’t solve XOR (counterexample to everything). Perceptron
  • 17. 17 Cybenko, 1989 • With one hidden layer, a multilayer perceptron – which can now figure out XOR – is capable of arbitrary function approximation. (Cybenko, 1989) • Riesz Representation theorem. Math nerds unite! • Supervised, semi-supervised, unsupervised, and reinforcement learning applications. • Flexible architectural components – layer types, connection types, regularization techniques – allow for empirical tinkering. Think of playing with Lego®. Enter the multilayer perceptron
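The claim that one hidden layer unlocks XOR can be checked directly. A minimal numpy sketch (not code from the talk) training a one-hidden-layer MLP on the four XOR points with full-batch gradient descent:

```python
import numpy as np

# One hidden layer is enough to learn XOR -- which no single-layer perceptron can.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer, 8 units
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(10000):                  # full-batch gradient descent
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    d2 = p - y                          # logistic-loss gradient at the output
    d1 = (d2 @ W2.T) * h * (1 - h)      # backpropagated through the hidden layer
    W2 -= 0.5 * h.T @ d2; b2 -= 0.5 * d2.sum(0)
    W1 -= 0.5 * X.T @ d1; b1 -= 0.5 * d1.sum(0)

p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print((p > 0.5).astype(int).ravel())    # predicted class for each XOR input
```

Drop the hidden layer and no amount of training recovers XOR: the decision boundary stays linear.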
  • 18. 18 Rumelhart et al., 1985 Backpropagation
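Backpropagation is the chain rule applied layer by layer. An illustrative sanity check (a made-up single-unit example, not from the deck): derive the gradient by hand for one sigmoid unit with squared-error loss, then confirm it against a finite difference.

```python
import numpy as np

# Backprop is just the chain rule. Check a hand-derived gradient numerically.
sigmoid = lambda z: 1 / (1 + np.exp(-z))
x, t, w = 1.5, 1.0, 0.3                 # input, target, weight

p = sigmoid(w * x)                      # forward pass
dE_dw = (p - t) * p * (1 - p) * x       # chain rule: dE/dp * dp/dz * dz/dw

E = lambda w: 0.5 * (sigmoid(w * x) - t) ** 2
eps = 1e-6
numeric = (E(w + eps) - E(w - eps)) / (2 * eps)
print(dE_dw, numeric)                   # the two gradients agree
```

The same per-connection update, applied recursively from the output back to the inputs, is all that backprop does in a full network.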
  • 19. 19 Sounds like Defense against the Dark Arts, amiright? • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 20. 20 Burning down the house! • A trained classifier is simply a labelling function. • More mathematically, it’s just a mapping of vectors to vectors, inputs and outputs. • We can just take those outputs as inputs to another function! • Typically, the output layer of a neural network is represented by a “softmax layer” that computes a probability q_i for each class from its logit z_i: q_i = exp(z_i / T) / Σ_j exp(z_j / T). • T here is the temperature. • Note: using a higher value for T produces a softer probability distribution over the classes. Dark Knowledge
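The temperature effect described above is easy to see in code. A minimal sketch with hypothetical logits for four classes (the dog/cat/cow/car numbers are invented for illustration):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T   # divide logits by the temperature
    z -= z.max()                              # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for classes (dog, cat, cow, car).
logits = [9.0, 5.0, 1.0, -3.0]
print(softmax(logits, T=1.0))   # nearly all mass on "dog"
print(softmax(logits, T=5.0))   # softer: relative class similarities emerge
```

Raising T never reorders the classes; it only redistributes mass so the near-zero probabilities (the dark knowledge) become visible to a student model.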
  • 21.
  • 22. 22 Machine learning moonshine? • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 23. 23 Machine learning moonshine? • Distilled learning is model compression. • There are many different procedures for distillation. • The simplest way to transfer this knowledge: • Use the cumbersome model’s output predictions as the ground-truth labels for the distilled model. Distilled Learning Jabir ibn Hayyan described distillation using an alembic in the 8th century.
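The simplest transfer described above can be sketched end to end. This is a hypothetical toy, assuming a stand-in "cumbersome" teacher (here just a fixed random linear scorer) whose softened predictions become the training targets for a small student:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4.0                                        # temperature

def softmax(Z, T=1.0):
    Z = Z / T
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

X = rng.normal(size=(2000, 20))
W_teacher = rng.normal(size=(20, 3))           # stand-in teacher (fixed scorer)
soft_targets = softmax(X @ W_teacher, T)       # softened teacher predictions

W_student = np.zeros((20, 3))                  # small linear-softmax student
for _ in range(1000):
    P = softmax(X @ W_student, T)
    grad = X.T @ (P - soft_targets) / len(X)   # cross-entropy gradient
    W_student -= 1.0 * grad                    # (the 1/T factor is folded into the step size)

agree = (softmax(X @ W_student, T).argmax(1) == soft_targets.argmax(1)).mean()
print(f"student/teacher agreement: {agree:.2%}")
```

Real distillation replaces the toy teacher with a trained deep network and the student with whatever small or interpretable model you need; the mechanics are the same.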
  • 24. 24 It’s not magic, it’s just math • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 25. 25 MY GPUs ARE MELTING… • Researchers used deep learning to improve the statistical power on benchmark problems involving: • Higgs bosons • Higgs boson decay modes • Supersymmetric particles • Results of distilled learning: • Improved shallow classifiers on all three tasks High Energy Physics
  • 26. 26 … FASTER THAN THE WICKED WITCH OF THE WEST • Researchers used deep learning to get rich feature representations. • The purpose of the interpretable models was phenotype discovery. • They extracted dark knowledge with a number of different neural network architectures: • Feedforward • Stacked de-noising autoencoder • LSTM (long short-term memory) • They then distilled dark knowledge into interpretable models. • Distillation markedly improved the interpretable models! Healthcare
  • 27. 27 • Distilled learning doesn’t just apply to feed-forward neural nets: it’s also useful for sequence learning. • Transfer knowledge from teacher to student network. Multiple ways to do it! • Model compression improves speed by an order of magnitude while only sacrificing 0.2 BLEU (bilingual evaluation understudy) Neural Machine Translation
  • 28. 28 Making machine learning moonshine at Remitly • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 29. 29 Machine learning moonshine? • Comparison (using small toy models/data): • Logistic regression • Logistic regression with distilled labels • Distilled knowledge improves our results • Even for a very shallow black-box model with very few iterations. Fraud Classification
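The flavor of that comparison can be reproduced on synthetic data. This is a toy sketch only — not Remitly's models or data — and the "teacher" here is an oracle stand-in that supplies soft probabilities in place of noisy 0/1 fraud labels:

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

true_w = np.array([2.0, -1.5, 1.0, 0.5, -0.5])      # invented "fraud" weights
X = rng.normal(size=(5000, 5))
true_p = sigmoid(X @ true_w)                        # true fraud probability
y_hard = (rng.random(5000) < true_p).astype(float)  # noisy hard labels

def fit_logreg(X, targets, iters=500, lr=0.5):
    # Gradient-descent logistic regression; soft targets need no code change,
    # because the cross-entropy gradient is simply (prediction - target).
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * X.T @ (sigmoid(X @ w) - targets) / len(X)
    return w

w_hard = fit_logreg(X, y_hard)    # trained on noisy hard labels
w_soft = fit_logreg(X, true_p)    # trained on the teacher's soft predictions

print(np.linalg.norm(w_hard - true_w), np.linalg.norm(w_soft - true_w))
```

The soft-target fit lands closer to the true weights because each soft label carries the teacher's full probability estimate rather than a single noisy coin flip.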
  • 30. 30 Citing our sources Bibliography Storcheus, Dmitry, Afshin Rostamizadeh, and Sanjiv Kumar. "A Survey of Modern Questions and Challenges in Feature Extraction." In Proceedings of the 1st International Workshop on "Feature Extraction: Modern Questions and Challenges," NIPS, pp. 1-18. 2015. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." arXiv preprint arXiv:1602.04938 (2016). Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Model-Agnostic Interpretability of Machine Learning." arXiv preprint arXiv:1606.05386 (2016). Freitas, Alex A. "Comprehensible classification models: A position paper." SIGKDD Explor. Newsl. 15, no. 1 (2014): 1-10. ISSN 1931-0145. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Dark knowledge." Keynote at BayLearn, 2014. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network." arXiv preprint arXiv:1503.02531 (2015). Venkatesan, Ragav, and Baoxin Li. "Diving deeper into mentee networks." arXiv preprint arXiv:1604.08220 (2016). Wolpert, David H., and William G. Macready. "No free lunch theorems for optimization." IEEE Transactions on Evolutionary Computation 1, no. 1 (1997): 67-82.
  • 31. 31 Citing our sources Bibliography Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2, no. 4 (1989): 303-314. Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning Internal Representations by Error Propagation." Technical Report ICS-8506, University of California San Diego, Institute for Cognitive Science, 1985. Sadowski, Peter, Julian Collado, Daniel Whiteson, and Pierre Baldi. "Deep Learning, Dark Knowledge, and Dark Matter." In NIPS 2014 Workshop on High-Energy Physics and Machine Learning, pp. 81-87. 2014. Ba, Jimmy, and Rich Caruana. "Do deep nets really need to be deep?" In Advances in Neural Information Processing Systems 27, pp. 2654-2662. Curran Associates, Inc., 2014. Bucilua, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535-541. ACM, 2006. Che, Zhengping, Sanjay Purushotham, Robinder Khemani, and Yan Liu. "Distilling Knowledge from Deep Networks with Applications to Healthcare Domain." arXiv preprint arXiv:1512.03542 (2015). Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2, no. 5 (1989): 359-366.
  • 32. 32 Citing our sources Bibliography Tang, Zhiyuan, Dong Wang, and Zhiyong Zhang. "Recurrent neural network training with dark knowledge transfer." In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5900-5904. IEEE, 2016. Romano, Nathanael, and Robin Schucker. "Distilling Knowledge to Specialist Networks for Clustered Classification." Papamakarios, George. "Distilling Model Knowledge." arXiv preprint arXiv:1510.02437 (2015). Chen, Tianqi, Ian Goodfellow, and Jonathon Shlens. "Net2Net: Accelerating learning via knowledge transfer." arXiv preprint arXiv:1511.05641 (2015). Kim, Yoon, and Alexander M. Rush. "Sequence-Level Knowledge Distillation." arXiv preprint arXiv:1606.07947 (2016).
  • 33. 33 What we talked about • Feature extraction methods • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Summary
  • 34. 34 Remitly’s Data Science team uses ML for a variety of purposes. ML applications are core to our business – therefore our business must be core to our ML applications. Machine learning at Remitly

Editor's Notes

  1. Hi Everyone My name is Alex Korbonits, and I am a data scientist at Remitly This talk is broadly about taking advantage of the predictive power of black-box models to improve the accuracy of interpretable models.
  2. Before we dive in, here’s a little bit about Remitly and me. Remitly was founded in 2011 to forever change the way people send money to their loved ones. Worldwide, remittances represent over 600 billion dollars annually, roughly 4x the amount of foreign aid. We’re now the largest independent digital remittance company in the U.S. We’re sending nearly 2 billion dollars annually and growing quickly. Our CEO, Matt Oppenheimer, was just named one of Ernst and Young’s 2016 Entrepreneurs of the Year. I'm Remitly's first data scientist, and our team is growing. Right now my principal focus is FRAUD CLASSIFICATION. Previously, I was a data scientist at a startup called Nuiku, focusing on NLP.
  3. Here’s a cliff notes version of the agenda. We’re going to motivate the use of black box models to improve interpretable ones by showing that they tend to have superior performance across a wide range of problems. I’ll digress slightly by suggesting that we don’t even need interpretable models. We can interpret predictions of any model by modeling predictions. Then I will review neural nets Talk about how to use their distributed feature representations to improve subsequently trained models I’ll go into some results and applications to industry Talk about using this on my own data for fraud classification. And throughout I will emphasize why this matters
  4. A great survey paper came out at NIPS last year, at the 1st International Workshop on "Feature Extraction", called, "A Survey of Modern Questions and Challenges in Feature Extraction". It's by a few Google researchers and I'm looking forward to seeing more of their work. What I really enjoyed about this paper was the conceptual framework it used to see feature extraction and learning as function composition (a feature extraction step followed by a learning step), and looking into methods where the feature extraction step was or was not coupled with the learning step. You should definitely read it!
  5. OKAY, so the conceptual framework is great, but what are some real life examples? A classic example of an unsupervised uncoupled method would be PCA. PCA finds the directions of maximum variance (equivalently, it minimizes reconstruction error), whereas IsoMap preserves geodesic distances along a manifold. If you care about preserving angles, then you’ll want to use what’s called Maximum Variance Unfolding. It’s called unsupervised uncoupled because the loss function you’re minimizing for feature extraction is independent of training a classifier (uncoupled) and your loss function doesn’t know about labels (unsupervised). What’s an example of Supervised Uncoupled feature extraction? This is where the feature extraction has knowledge of the labels. One such example is called MTFS (which stands for multi-task feature selection). This algorithm picks features that correlate with labels, more or less. But the classifier is not jointly learned with the feature extraction. Supervised coupled methods simultaneously perform feature extraction and learning. Classic example here would be neural networks with at least one hidden layer (i.e. no perceptrons). I.e., you are jointly learning a feature representation as well as a labeling function. Last, there's no such thing as "unsupervised coupled," since coupling means jointly learning features and a classifier, and that joint training requires supervision.
  6. Another reason this paper was fantastic is that it discussed in detail how so-called "supervised coupled" methods tend to significantly outperform other methods. The exploration of this conceptual framework for thinking about the composition of feature extraction functions and classifier training is just a lead-in to what we're really going to talk about today. Using the superior performance of supervised-coupled methods on many learning tasks, we are going to improve the performance of simpler models. (caveat: No Free Lunch Theorem still holds here!)
  7. On to talking about industrial settings requiring interpretability!
  8. One classic example of an industrial machine learning application that requires the use of interpretable models for the purpose of consumer protection and antidiscrimination is in credit card applications. Put simply, the interpretability requirements here are such that if a prospective credit card holder -- an applicant -- applies for a credit card and is subsequently denied, then the credit card company is obliged to provide a set of reasons why the applicant was denied credit. This restriction puts an onus on credit card companies to use models that can output, for any given prediction, the features and/or splits of said features that led to the prediction. E.g., this pretty much restricts us to logistic regression or single decision tree learners. Remember, sacrificing predictive power here IS A GOOD THING. We need to protect consumers and adhere to non-discriminatory practices. This talk is about keeping interpretability intact and increasing predictive power.
  9. Another classic example of an industrial machine learning application that needs interpretable models to help make sound decisions – as one part of the overall decision process – is in medicine. Whether it’s decisions relating to whether or not to take a biopsy, try a surgical procedure, or use a particular type of medicine, these decisions are extremely important and making mistakes is typically very costly (even making the right decision is costly!). Remember, sacrificing predictive power here IS A GOOD THING. We need to protect patients and make sound ethical and scientific and medical decisions. Interpretability also helps go through previous decisions to see what worked/what didn’t as well as build intuition and expertise around a particular problem area.
  10. As a last motivating example, while customer churn prediction may have fewer (if any) regulatory/ethical requirements and standards compared to consumer lending and medicine, interpretability is KEY because it aids in directly data-driving business decisions. Hashtag actionable insights? Again, sacrificing predictive power here for interpretable results gives businesses additional decision-making power while offering direct insights into explaining customer behavior AND in which directions to innovate the business. Which direction to innovate the business in… sounds kind of like gradient descent, doesn’t it???
  11. Speaking of gradient descent, we’re going to do a whole lot more of that as we get through these slides… In this section, we make a case for model-agnostic interpretability, as opposed to just using interpretable models
  12. Before we go on to the meat of the talk, I want to introduce a fairly promising idea that has been wonderfully espoused in a couple of papers this year. Three researchers from the UW wrote two papers this spring that you should definitely read. Their position paper titled “model-agnostic interpretability of machine learning” was part of the 2016 ICML Workshop on Human Interpretability in Machine Learning. Their second paper introduces a specific framework within which to interpret predictions, called LIME, which I’ll go into on the next slide. The main takeaway is that you should consider separating the model selection process for prediction from the model selection process for interpretation. Use ANY black box model you want to learn the best labelling function possible from your data. Then use an interpretable model to help explain the predictions that your black-box classifier makes. This is a pretty powerful idea!
  13. Let’s talk about the main contribution of these two papers, called LIME, or, Local Interpretable Model-Agnostic Explanations. Here’s a quick rundown of how to explain the predictions of any classifier (or regressor). First, train your black-box model on a data set, and make predictions on the test set. Then, USE LIME to train an explanatory model on top of the predictions of your black-box model. There’s even a pip-installable Python package for LIME. PLUS, I’m using it at work to aid in fraud classification. It’s great for debugging type I and type II errors, and even for suggesting further feature engineering. What are some important properties of LIME-like models? First, interpretability. Second, we want these models to have local fidelity. The predictions that the explanatory model makes should be as faithful as possible to the predictions that the black-box model makes. A stricter standard would be global fidelity, but right now it’s an open challenge to give interpretable explanations with this property. Next, your explanatory model should be model-agnostic. Model selection for the explanatory model should not depend on the original model making predictions. So what’s this math on the slide? F is the model you’re explaining. g is your explanation model. PI sub X is a proximity measure from an instance Z to a point X, to define locality. L of f, g, and PI sub X is a measure of how unfaithful G is in approximating F at X, or rather, in the area around X defined by Pi sub X. OMEGA of G is a measure of complexity of your explainer G. The idea here is to minimize, for all possible explanatory models G in a class of potentially interpretable models, capital G, the unfaithfulness of G + the complexity of G. This formulation can be used with different explanation families G, fidelity functions L, and complexity measures Ω.
  14. This sounds all well and good in theory, but how does it work in practice? That depends on your domain. If your domain has legal or ethical restrictions that prevent you from diving deep into the most complex black-box models, LIME may not be sufficient for explaining predictions. Even if you don’t have legal or ethical restrictions, you still may have tradeoffs that are so important that you care more about exact interpretability rather than the best predictive power. When I was first preparing this talk, I hadn’t come across LIME or the idea of model-agnostic interpretation. I think it’s pretty cool. Given that the point of this talk was to use black-box models to improve the accuracy of interpretable models, I thought it would provide some excellent (yet not too orthogonal) counterpoint to discuss separating the interpretation process from the model selection process. Now we’re going to go into the heart of the talk where we demonstrate how to improve interpretable models WITH black box predictions!
  15. Back to gradient descent… now for a total crash course in neural networks...
  16. Here’s a little bit of SELECTED history of neural networks. In 1957 Rosenblatt created the perceptron learning algorithm. It was used for image recognition – and at the time it was state-of-the-art, much like convolutional neural networks are today for many computer vision tasks. It’s basically LOGISTIC REGRESSION. Problem is that, being linear, it can’t solve XOR. This is a problem. For that reason, among others such as the instability of the learning mechanism, Minsky and Papert in their 1969 book Perceptrons put neural networks to rest for one of the longest AI winters we’ve ever seen… brrrrrr, is it freezing in here or is it just me?
  17. Greater than or equal to 1 hidden layer in a multi-layer perceptron gives us INFINITE POWER. And the ability to solve XOR. Actually the Riesz representation theorem helps us show that even a multi-layer perceptron with a single hidden layer is capable of approximating any function arbitrarily closely. Depth gives us an exponential advantage w.r.t. this problem, which means we don’t need as many neurons per layer, essentially. That’s nice for making modeling tractable. How do we train such a model?
  18. BACKPROP Who remembers their first quarter of calculus? All we’re going to do is take a derivative. This diagram is a representation of the chain rule. A simple learning algorithm that takes some total output error E defined by some loss function. For example, a typical loss function for a multi-class classification task is log loss. E is a function of all of its inputs. I.e., all of the incoming connections to the output unit of a neural network. I.e., a function that outputs a class membership prediction and whose prediction is checked against a ground truth/label. We then show: A simple derivation of the change in error as a function of each connection weight w_ij. This gives a formula for updating each w_ij, in the entire network.
  19. And that’s a crash course in Neural Networks. We don’t even have time to go into all of the recent advances in neural networks, needless to say, I think we all know they’re taking the machine learning world by storm right now. How many do you think are running on your phone? Probably more than you think. Let’s go into DARK KNOWLEDGE
  20. What’s Dark Knowledge? Sounds like something Voldemort would be good at. This term, I believe, is due to Geoff Hinton. Essentially, start with a neural network classifier with a softmax layer. Inherent in the softmax layer is a parameter T that we call temperature. WLOG, increasing the temperature here “softens” the probability distribution across the classes. As we can see on the left, we increase the temperature T as we move down the screen. We can see the relative probabilities change. Our model isn’t so certain about whether or not it’s looking at a dog. Note that the important thing here is the RELATIVE probabilities. Indeed, a DOG is much more like a CAT than a cow or a car. And much more like a COW than a CAR. As we soften the distribution we can see that the model knows quite a bit about the relationships between different concepts, whether those concepts (like classes) are semantically well-understood, or not (e.g., internal representations at different layers of a neural network, some of which COULD be considered interpretable, e.g., the famous CAT NEURON learned unsupervised from stills of YouTube videos back in 2011). So what’s dark knowledge useful for? It’s best to transfer it to OTHER MODELS. I’ll let David Byrne talk about what happens when you turn the temperature up…
  21. Watch out you might get what you're after Cool babies strange but not a stranger I'm an ordinary guy Burning down the house Hold tight wait till the party's over Hold tight We're in for nasty weather There has got to be a way Burning down the house
  22. Now that you know what dark knowledge is and how to obtain it, let’s talk about transferring it to other models. This process is called DISTILLATION or DISTILLED LEARNING. Distilling the knowledge, typically, from a large, cumbersome model to a smaller distilled one. Sometimes this relationship is known as teacher/student models or mentor/mentee models.
  23. So now we’ve got our dark knowledge from burnin’ down the house. What do we do with it? How do we transfer it? While Hinton et al.’s paper outlines a general framework, there’s a really simple way to do this. Here’s the gradient equation from their paper. This derivative is the cross-entropy gradient between your cumbersome model (probability p_i and logit v_i) and distilled model (probability q_i and logit z_i). Let me distill this derivative. In essence, the simplest way to do distilled learning: USE THE OUTPUT PREDICTIONS OF THE SOFTMAX OF YOUR LARGER BLACK-BOX MODEL AS THE LABELS/GROUND TRUTHS FOR YOUR SMALLER MODEL. EARLIER IN TODAY’S TALKS, we saw Algorithmia putting deep learning into production. We also saw Stitchfix embedding word vectors with LDA2VEC. Ken from Algorithmia mentioned distilled learning when he talked about model compression. Using this technique, he showed that with a much smaller network (a 50x reduction in the number of parameters), only a tiny amount of performance is sacrificed. This generally is the case because the distributed feature representations in a much larger network are often quite redundant across all of the parameters. I realize this whole model compression idea sounds like Pied Piper’s “Middle Out” algorithm, but I promise it’s real and it has NOTHING to do with the origin story of middle out. If you haven’t seen that episode of Silicon Valley, it’s worth watching.
  24. Does this even work? Let’s look at a couple of recent examples.
  25. One really neat application of distilled learning is in high energy physics. Researchers from UC Irvine used deep neural networks to take a look at a few problems involving: Higgs bosons Higgs boson decay modes And Supersymmetric particles. AWESOME. THIS IS SO FREAKING COOL. Alright, so they did this and saw fantastic performance. They also took a look at training on smaller networks, and saw predictably worse performance. They then used the ideas from Hinton’s paper to distill some of the dark knowledge from their larger model. They just took the outputs from the larger model as labels for the subsequent model. Indeed, they saw performance gains from this, and are actively looking into future applications. How cool is that?
  26. Another really neat application of distilled learning in practice is in healthcare. This paper inspired this talk, actually. What these researchers did was apply distilled learning to electronic healthcare records data and assess the relative performance of their models with and without the distillation process. What’s also cool is that they went into using different neural network architectures for looking at this process and compared/contrasted all of their results of transferring their underlying learned representations. They saw a significant bump in performance when using these methods.
  27. Sequence-Level Knowledge Distillation is a great paper that came out recently applying the idea of transferring dark knowledge via distillation to sequence learners. Specifically, the authors explored this for neural machine translation. Besides, it’s 2016, convolutional neural nets are so last year. I expect to see more applications of distilled learning to recurrent neural networks this year. The authors trained teacher networks and distilled into student networks, drastically improving the speed of translation via the smaller student networks, while hardly sacrificing any accuracy, which for machine translation tasks is known as BLEU, which I literally just learned about. They actually tried multiple methods of distillation. Word-level, sequence-level distillation, and sequence-level interpolation. Be sure to check out their paper for more details
  28. Wow, so that’s freaking AWESOME. Now I’m going to show you that we’re seeing some results as it applies to fraud classification at Remitly.
  29. So I got really excited by this and HAD to try it (caveat: on a small, toy dataset -- this doesn’t necessarily reflect reality) In fact, I obtained these results 10 minutes ago, so they’re hot off the press. Essentially, I took a note from the high energy physics folks and decided to simply use the predictions from a black-box model as the new label set for my logistic regression learner. In the future I intend to look at varying temperature and even experiment with the gradients between the two networks as in Hinton’s paper. For now, the results are clear. You should try it out!
  30. We motivated distilled learning by looking at the superior performance of supervised-coupled methods. We went over some industrial machine learning settings requiring interpretability. We digressed and questioned the need to make predictions with an interpretable model by showing you can interpret the predictions of any model with an explainer model We crashed through neural nets and dark knowledge, We talked about distilled learning and how to transfer dark knowledge into subsequent models We saw some freaking AWESOME applications in the real world We saw real results from trying this at Remitly. #represent LIME and distilled learning give us a whole new world in terms of improving interpretable models and interpreting the predictions of ANY model. Go forth and conquer!
  31. What does machine learning at Remitly look like? Understanding: Fraud classification Anomaly detection Customer behavior Market forces
  32. We're hiring! Email me at alex@remitly.com. That’s all, folks! THANKS