23 July 2016
Distilling dark knowledge from neural networks
Alex Korbonits, Data Scientist
2
Join our team!
About Remitly and Me
3
Introduction
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
4
"A Survey of Modern Questions and Challenges in Feature Extraction"
• Two categories of learning algorithms
reviewed:
• Supervised
• Unsupervised
• Two categories of feature extraction
reviewed:
• Coupled
• Uncoupled
Feature extraction and learners
5
"A Survey of Modern Questions and Challenges in Feature Extraction"
• Unsupervised Uncoupled feature extraction:
• PCA, IsoMap, Maximum Variance Unfolding
• Supervised Uncoupled feature extraction (i.e., feature
selection based on correlation with a label):
• MTFS (Argyriou et al., 2008)
• Supervised Coupled feature extraction:
• Neural Network (particularly with > 1 hidden layer)
• NO such thing as “unsupervised coupled”: without a classifier being
trained, there is nothing for the feature extraction to couple to.
Examples of corresponding algorithms
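As an illustration of the first category, here is a minimal PCA sketch in plain NumPy. It is unsupervised and uncoupled in the sense above: no labels are used and no downstream classifier is involved; the data and dimensions are made up for illustration.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components. Unsupervised and
    uncoupled: no labels and no downstream classifier are involved."""
    Xc = X - X.mean(axis=0)                           # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False) # singular vectors = components
    return Xc @ Vt[:k].T                              # scores in the top-k subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

Because SVD orders singular values in decreasing order, the first returned column always carries at least as much variance as the second.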
6
"A Survey of Modern Questions and Challenges in Feature Extraction"
• Supervised Coupled methods tend to significantly
outperform others (but not always!) (Gonen, 2014)
• Pros: better feature extraction
• Cons: hard to interpret, complex, scalability is evolving
• No Free Lunch Theorem (Wolpert &amp; Macready, 1997)
• Deep learning not a silver bullet!
• “We have dubbed the associated results NFL theorems
because they demonstrate that if an algorithm performs
well on a certain class of problems then it necessarily
pays for that with degraded performance on the set of all
remaining problems”
Takeaways
7
Moving right along
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
8
Banking and Lending
• Credit card issuers required to give reasons for denial of
credit
• Anti-discriminatory regulations
• Consumer protection regulations
• Credit card issuers sacrifice predictive power to comply
• THIS IS A GOOD THING
• Restricts model complexity to interpretable model
classes
• E.g., logistic regression, single decision tree, etc.
Credit Card Applications
9
• Decisions where interpretability is essential:
• Whether or not to obtain a biopsy
• Whether or not to surgically operate
• Whether or not to try an experimental new drug
• Interpretability isn’t just good for decisions:
• Good for auditing prior decisions
• Good for building intuition and expertise
Medicine and healthcare
10
• Interpretability is a business imperative
• Helps identify who/what/where/why/when/how
• Suggests paths to change business
products/processes/services to reduce churn
Customer Churn Prediction
11
Are we there yet?
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
12
Ribeiro et al., 2016
• Separating prediction from interpretation:
• Use any black-box model you want for
prediction
• Use an interpretable model to explain black-box predictions
• “Our explanations empower users in various
scenarios that require trust: deciding if one
should trust a prediction, choosing between
models, improving an untrustworthy classifier,
and detecting why a classifier should not be
trusted.”
Model-Agnostic Interpretability of Machine Learning
13
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
• Local Interpretable Model-Agnostic Explanations
• Basic algorithm intuition:
• Train black-box model, get test set predictions
• Train interpretable model on those predictions
• LIME explains (locally) which features contributed
most to the given prediction.
• Important properties of any explanatory model:
• Interpretability
• Local fidelity
• Model-agnostic
LIME
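The algorithm intuition above can be sketched in a few lines: perturb the input, query the black box, and fit a proximity-weighted linear surrogate. The perturbation scale, kernel width, and least-squares fit below are illustrative choices, not the reference LIME implementation.

```python
import numpy as np

def explain_locally(predict_fn, x, n_samples=500, width=0.75, seed=0):
    """Toy LIME-style explanation: perturb x, query the black box, and fit
    a proximity-weighted linear surrogate. Returns per-feature coefficients
    (local importances)."""
    rng = np.random.default_rng(seed)
    Xp = x + rng.normal(scale=0.3, size=(n_samples, x.size))  # local samples
    y = predict_fn(Xp)                                        # black-box outputs
    d = np.linalg.norm(Xp - x, axis=1)
    w = np.exp(-(d ** 2) / width ** 2)                        # proximity kernel
    A = np.c_[np.ones(n_samples), Xp]                         # intercept + features
    beta = np.linalg.solve((A.T * w) @ A, A.T @ (w * y))      # weighted least squares
    return beta[1:]                                           # drop the intercept

# A black box that, near x, depends only on feature 0:
coefs = explain_locally(lambda X: 3.0 * X[:, 0], np.array([1.0, 2.0]))
```

Since this toy black box really is linear in feature 0, the surrogate recovers a weight near 3.0 for it and near 0.0 for the irrelevant feature, which is exactly the "local fidelity" property above.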
14
Model-Agnostic Interpretability of Machine Learning
• CAVEAT (couldn’t have said it better myself):
• “In some domains, exact explanations may be
required (e.g. for legal or ethical reasons), and
using a black-box may be unacceptable (or
even illegal). Interpretable models may also be
more desirable when interpretability is much
more important than accuracy, or when
interpretable models trained on a small number
of carefully engineered features are as
accurate as black-box models.”
• E.g., if you DON’T have “big data” or desire to
make other tradeoffs, LIME isn’t what you want.
CAVEATS to agnosticism
15
ERMAGHERD DEEP LEARNING
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
16
Rosenblatt, 1957
• Rosenblatt, 1957, Cornell
Aeronautics Laboratory, funded by
the Office of Naval Research
• Linear classifier. Designed for
image recognition.
• Inputs x and weights w linearly
combined to achieve some sort of
output y.
• Can’t solve XOR (counterexample
to everything).
Perceptron
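A minimal sketch of Rosenblatt's learning rule, showing both halves of the slide: it converges on a linearly separable problem (AND) but can never perfectly fit XOR. The data and epoch count are illustrative.

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Rosenblatt's rule: on each mistake, nudge the weights toward the
    example. A purely linear classifier, so only separable problems converge."""
    Xb = np.c_[np.ones(len(X)), X]        # prepend a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if xi @ w > 0 else 0
            w += (yi - pred) * xi         # update only on mistakes
    return w

def accuracy(w, X, y):
    Xb = np.c_[np.ones(len(X)), X]
    return float(np.mean((Xb @ w > 0).astype(int) == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])
and_acc = accuracy(train_perceptron(X, y_and), X, y_and)  # 1.0: AND is separable
xor_acc = accuracy(train_perceptron(X, y_xor), X, y_xor)  # < 1.0: XOR is not
```

No linear decision boundary gets all four XOR points right, so the XOR accuracy is capped at 3/4 no matter how long you train.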
17
Cybenko, 1989
• With one hidden layer, a multilayer perceptron – which
can now figure out XOR – is capable of arbitrary
function approximation. (Cybenko, 1989)
• Riesz Representation theorem. Math nerds unite!
• Supervised, semi-supervised, unsupervised, and
reinforcement learning applications.
• Flexible architectural components – layer types,
connection types, regularization techniques – allow for
empirical tinkering. Think of playing with Lego®.
Enter the multilayer perceptron
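To see concretely why one hidden layer suffices for XOR, here is a tiny multilayer perceptron with hand-picked (not learned) weights, implementing XOR(a, b) = OR(a, b) AND NOT AND(a, b):

```python
def step(z):
    """Threshold activation unit."""
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """XOR with one hidden layer of two units and fixed weights."""
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1 fires on OR
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2 fires on AND
    return step(h_or - h_and - 0.5)  # output: OR but not AND

print([xor_mlp(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

The hidden layer re-represents the inputs so that the final unit's problem becomes linearly separable, which is the essence of the universal approximation result.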
18
Rumelhart et al., 1985
Backpropagation
19
Sounds like Defense against the Dark Arts, amiright?
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
20
Burning down the house!
• A trained classifier is simply a labelling function.
• More mathematically, it’s just a mapping of vectors to
vectors, inputs and outputs.
• We can just take those outputs as inputs to another
function!
• Typically, the output layer of a neural network is represented
by a “softmax layer” that computes a probability q_i for each
class from its logit z_i: q_i = exp(z_i / T) / Σ_j exp(z_j / T).
• T here is the temperature.
• Note: using a higher value for T produces a softer probability
distribution over the classes.
Dark Knowledge
22
Machine learning moonshine?
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
23
Machine learning moonshine?
• Distilled learning is model compression.
• There are many different procedures for distillation.
• The simplest way to transfer this knowledge:
• Use the cumbersome model’s output predictions
as the ground truth labels for the distilled model.
Distilled Learning
Jabir ibn Hayyan described distillation using an alembic in the 8th century.
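A minimal sketch of that simplest transfer: a logistic-regression "student" fitted to a hypothetical teacher's soft outputs by gradient descent. The teacher here is a made-up stand-in for a cumbersome network, and the learning rate and step count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill(X, teacher_probs, lr=0.5, steps=2000):
    """Fit a logistic-regression 'student' to a teacher's soft outputs by
    gradient descent on cross-entropy. With soft targets the gradient has
    the same form as with hard labels; only the labels change."""
    Xb = np.c_[np.ones(len(X)), X]            # prepend a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - teacher_probs) / len(X)
    return w

# Hypothetical "cumbersome" teacher standing in for a big network:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
teacher_probs = sigmoid(X @ np.array([2.0, -1.0, 0.5]))
w = distill(X, teacher_probs)
student_probs = sigmoid(np.c_[np.ones(len(X)), X] @ w)
```

After training, the student's probabilities track the teacher's closely: the soft targets carry per-example confidence that 0/1 labels would throw away.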
24
It’s not magic, it’s just math
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
25
MY GPUs ARE MELTING…
• Researchers used deep learning to improve the
statistical power on benchmark problems involving:
• Higgs bosons
• Higgs boson decay modes
• Supersymmetric particles
• Results of distilled learning:
• Improved shallow classifiers on all three tasks
High Energy Physics
26
… FASTER THAN THE WICKED WITCH OF THE WEST
• Researchers used deep learning to get rich feature
representations.
• The purpose of these interpretable models was phenotype
discovery.
• They extracted dark knowledge with a number of different neural
network architectures.
• Feedforward
• Stacked de-noising autoencoder
• LSTM (long short term memory)
• They then distilled dark knowledge into interpretable models.
• Distillation improved the interpretable models considerably!
Healthcare
27
• Distilled learning doesn’t just apply to feed-forward neural nets: it’s also useful for sequence learning.
• Transfer knowledge from teacher to student network. Multiple ways to do it!
• Model compression improves speed by an order of magnitude while sacrificing only 0.2 BLEU (bilingual evaluation understudy)
Neural Machine Translation
28
Making machine learning moonshine at Remitly
• Feature extraction and classifier training
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Agenda
29
Machine learning moonshine?
• Comparison (using small toy models/data):
• Logistic regression
• Logistic regression with distilled labels
• Distilled knowledge improves our results
• Even for a very shallow black-box model trained with very few
iterations.
Fraud Classification
30
Citing our sources
Bibliography
Storcheus, Dmitry, Afshin Rostamizadeh, and Sanjiv Kumar. "A Survey of Modern Questions and Challenges in Feature Extraction."
In Proceedings of The 1st International Workshop on “Feature Extraction: Modern Questions and Challenges”, NIPS, pp. 1-18.
2015.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any
Classifier." arXiv preprint arXiv:1602.04938 (2016).
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Model-Agnostic Interpretability of Machine Learning." arXiv preprint
arXiv:1606.05386 (2016).
Freitas, Alex A. Comprehensible classification models: A position paper. SIGKDD Explor. Newsl., 15(1):1–10, March 2014.
ISSN 1931-0145.
G. Hinton, O. Vinyals, and J. Dean. Dark knowledge. Presented as the keynote in BayLearn, 2014.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Venkatesan, Ragav, and Baoxin Li. "Diving deeper into mentee networks." arXiv preprint arXiv:1604.08220 (2016).
Wolpert, David H., and William G. Macready. "No free lunch theorems for optimization." IEEE transactions on evolutionary
computation 1, no. 1 (1997): 67-82.
31
Citing our sources
Bibliography
Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2, no. 4
(1989): 303-314.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Technical
Report ICS-8506. Institute for Cognitive Science, University of California, San Diego, 1985.
Sadowski, Peter, Julian Collado, Daniel Whiteson, and Pierre Baldi. "Deep Learning, Dark Knowledge, and Dark Matter." In NIPS
2014 Workshop on High-energy Physics and Machine Learning, pp. 81-87. 2014.
Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and
K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc.,
2014.
Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." In Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.
Che, Zhengping, Sanjay Purushotham, Robinder Khemani, and Yan Liu. "Distilling Knowledge from Deep Networks with Applications
to Healthcare Domain." arXiv preprint arXiv:1512.03542 (2015).
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural networks 2, no.
5 (1989): 359-366.
32
Citing our sources
Bibliography
Tang, Zhiyuan, Dong Wang, and Zhiyong Zhang. "Recurrent neural network training with dark knowledge transfer." In 2016 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5900-5904. IEEE, 2016.
Romano, Nathanael, and Robin Schucker. "Distilling Knowledge to Specialist Networks for Clustered Classification."
Papamakarios, George. "Distilling Model Knowledge." arXiv preprint arXiv:1510.02437 (2015).
Chen, Tianqi, Ian Goodfellow, and Jonathon Shlens. "Net2net: Accelerating learning via knowledge transfer." arXiv preprint
arXiv:1511.05641 (2015).
Kim, Yoon, and Alexander M. Rush. "Sequence-Level Knowledge Distillation." arXiv preprint arXiv:1606.07947 (2016).
33
What we talked about
• Feature extraction methods
• Industrial settings requiring interpretability
• Model-Agnostic Interpretability of Machine Learning
• Neural networks
• Dark knowledge
• Distillation
• Applying distillation in the real world
• Applying distillation to fraud detection
Summary
34
Remitly’s Data Science team uses ML for a variety of purposes.
ML applications are core to our business – therefore our business must be core to our ML applications.
Machine learning at Remitly
www.remitly.com/careers
We’re hiring!
alex@remitly.com

More Related Content

Similar to Distilling dark knowledge from neural networks

Deep learning summary
Deep learning summaryDeep learning summary
Deep learning summaryankit_ppt
 
Deep learning with keras
Deep learning with kerasDeep learning with keras
Deep learning with kerasMOHITKUMAR1379
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised LearningFEG
 
Machine learning
Machine learningMachine learning
Machine learninghplap
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneDrAhmedZoha
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...Egyptian Engineers Association
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needGibDevs
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDatabricks
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesTuri, Inc.
 
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning SolutionsJesus Rodriguez
 
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j
 
Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}FellowBuddy.com
 
Unit one ppt of deeep learning which includes Ann cnn
Unit one ppt of  deeep learning which includes Ann cnnUnit one ppt of  deeep learning which includes Ann cnn
Unit one ppt of deeep learning which includes Ann cnnkartikaursang53
 

Similar to Distilling dark knowledge from neural networks (20)

Deep learning summary
Deep learning summaryDeep learning summary
Deep learning summary
 
Deep learning with keras
Deep learning with kerasDeep learning with keras
Deep learning with keras
 
Deep learning internals
Deep learning internalsDeep learning internals
Deep learning internals
 
ExplainableAI.pptx
ExplainableAI.pptxExplainableAI.pptx
ExplainableAI.pptx
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised Learning
 
Fuzzy expert system
Fuzzy expert systemFuzzy expert system
Fuzzy expert system
 
Machine learning
Machine learningMachine learning
Machine learning
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart Phone
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle Grove
 
Lecture_8.ppt
Lecture_8.pptLecture_8.ppt
Lecture_8.ppt
 
Sh ch01
Sh ch01Sh ch01
Sh ch01
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep Features
 
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
10 Things I Wish I Dad Known Before Scaling Deep Learning Solutions
 
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
 
Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}
 
Unit one ppt of deeep learning which includes Ann cnn
Unit one ppt of  deeep learning which includes Ann cnnUnit one ppt of  deeep learning which includes Ann cnn
Unit one ppt of deeep learning which includes Ann cnn
 

Recently uploaded

Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfrohankumarsinghrore1
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...Scintica Instrumentation
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxrohankumarsinghrore1
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfSumit Kumar yadav
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.Silpa
 

Recently uploaded (20)

Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 

Distilling dark knowledge from neural networks

  • 1. 23 July 2016 Distilling dark knowledge from neural networksAlex Korbonits, Data Scientist
  • 2. 2 Join our team! About Remitly and Me
  • 3. 3 Introduction • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 4. 4 "A Survey of Modern Questions and Challenges in Feature Extraction" • Two categories of learning algorithms reviewed: • Supervised • Unsupervised • Two categories of feature extraction reviewed: • Coupled • Uncoupled Feature extraction and learners
  • 5. 5 "A Survey of Modern Questions and Challenges in Feature Extraction" • Unsupervised Uncoupled feature extraction: • PCA, IsoMap, Maximum Variance Unfolding • Supervised Uncoupled feature extraction (i.e., feature selection based on correlation with a label): • MTFS (Argyriou et al., 2008) • Supervised Coupled feature extraction: • Neural Network (particularly with > 1 hidden layer) • NO such thing as “unsupervised coupled” since the feature extraction is not coupled to training a classifier. Examples of corresponding algorithms
  • 6. 6 "A Survey of Modern Questions and Challenges in Feature Extraction" • Supervised Coupled methods tend to significantly outperform others (but not always!) (Gonen, 2014) • Pros: better feature extraction • Cons: hard to interpret, complex, scalability is evolving • No Free Lunch Theorem (Wolpert, 1997) • Deep learning not a silver bullet! • “We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems” Takeaways
  • 7. 7 Moving right along • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 8. 8 Banking and Lending • Credit card issuers required to give reasons for denial of credit • Anti-discriminatory regulations • Consumer protection regulations • Credit card issuers sacrifice predictive power to comply • THIS IS A GOOD THING • Restricts model complexity to interpretable model classes • E.g., logistic regression, single decision tree, etc. Credit Card Applications
  • 9. 9 Banking and Lending • Decisions where interpretability is essential: • Whether or not to obtain a biopsy • Whether or not to surgically operate • Whether or not to try an experimental new drug • Interpretability isn’t just good for decisions: • Good for auditing prior decisions • Good for building intuition and expertise Medicine and healthcare
  • 10. 10 Customer Churn Prediction • Interpretability is a business imperative • Helps identify who/what/where/why/when/how • Suggests paths to change business products/processes/services to reduce churn Customer Churn Prediction
  • 11. 11 Are we there yet? • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 12. 12 Ribeiro et al., 2016 • Separating prediction from interpretation: • Use any black-box model you want for prediction • Use an interpretable model to explain black-box predictions • “Our explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and detecting why a classifier should not be trusted.” Model-Agnostic Interpretability of Machine Learning
  • 13. 13 "Why Should I Trust You?": Explaining the Predictions of Any Classifier • Local Interpretable Model-Agnostic Explanations • Basic algorithm intuition: • Train black-box model, get test set predictions • Train interpretable model on those predictions • LIME explains (locally) which features contributed most to the given prediction. • Important properties of any explanatory model: • Interpretability • Local fidelity • Model-agnostic LIME
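The LIME intuition above can be sketched in a few lines. This is a stripped-down, hypothetical local surrogate in the spirit of LIME, not the actual `lime` package: perturb around one instance, weight samples by proximity, and fit a weighted linear explainer. The black-box function and all numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):                       # stand-in black-box classifier score
    return 1 / (1 + np.exp(-(np.sin(3 * X[:, 0]) + 0.5 * X[:, 1])))

x0 = np.array([0.2, -0.4])              # the instance whose prediction we explain

Z = x0 + rng.normal(scale=0.1, size=(500, 2))       # local perturbations
y = black_box(Z)                                    # query the black box
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.05)   # RBF proximity weights

A = np.hstack([Z, np.ones((len(Z), 1))])            # design matrix + intercept
coef = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
print("local feature weights:", coef[:2])           # which features mattered here
```

The linear weights are only locally faithful: move `x0` and the explanation changes, which is exactly the point of LIME's locality.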
  • 14. 14 Model-Agnostic Interpretability of Machine Learning • CAVEAT (couldn’t have said it better myself): • “In some domains, exact explanations may be required (e.g. for legal or ethical reasons), and using a black-box may be unacceptable (or even illegal). Interpretable models may also be more desirable when interpretability is much more important than accuracy, or when interpretable models trained on a small number of carefully engineered features are as accurate as black-box models.” • E.g., if you DON’T have “big data” or desire to make other tradeoffs, LIME isn’t what you want. CAVEATS to agnosticism
  • 15. 15 ERMAGHERD DEEP LEARNING • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 16. 16 Rosenblatt, 1957 • Rosenblatt, 1957, Cornell Aeronautical Laboratory, funded by the Office of Naval Research • Linear classifier. Designed for image recognition. • Inputs x and weights w linearly combined to achieve some sort of output y. • Can’t solve XOR (counterexample to everything). Perceptron
  • 17. 17 Cybenko, 1989 • With one hidden layer, a multilayer perceptron – which can now figure out XOR – is capable of arbitrary function approximation. (Cybenko, 1989) • Riesz Representation theorem. Math nerds unite! • Supervised, semi-supervised, unsupervised, and reinforcement learning applications. • Flexible architectural components – layer types, connection types, regularization techniques – allow for empirical tinkering. Think of playing with Lego®. Enter the multilayer perceptron
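The claim that one hidden layer unlocks XOR can be checked directly. A minimal numpy sketch (not code from the talk) training a one-hidden-layer MLP on the four XOR points with full-batch gradient descent:

```python
import numpy as np

# One hidden layer is enough to learn XOR -- which no single-layer perceptron can.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer, 8 units
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(10000):                  # full-batch gradient descent
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    d2 = p - y                          # logistic-loss gradient at the output
    d1 = (d2 @ W2.T) * h * (1 - h)      # backpropagated through the hidden layer
    W2 -= 0.5 * h.T @ d2; b2 -= 0.5 * d2.sum(0)
    W1 -= 0.5 * X.T @ d1; b1 -= 0.5 * d1.sum(0)

p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print((p > 0.5).astype(int).ravel())    # predicted class for each XOR input
```

Drop the hidden layer and no amount of training recovers XOR: the decision boundary stays linear.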
  • 18. 18 Rumelhart et al., 1985 Backpropagation
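Backpropagation is the chain rule applied layer by layer. An illustrative sanity check (a made-up single-unit example, not from the deck): derive the gradient by hand for one sigmoid unit with squared-error loss, then confirm it against a finite difference.

```python
import numpy as np

# Backprop is just the chain rule. Check a hand-derived gradient numerically.
sigmoid = lambda z: 1 / (1 + np.exp(-z))
x, t, w = 1.5, 1.0, 0.3                 # input, target, weight

p = sigmoid(w * x)                      # forward pass
dE_dw = (p - t) * p * (1 - p) * x       # chain rule: dE/dp * dp/dz * dz/dw

E = lambda w: 0.5 * (sigmoid(w * x) - t) ** 2
eps = 1e-6
numeric = (E(w + eps) - E(w - eps)) / (2 * eps)
print(dE_dw, numeric)                   # the two gradients agree
```

The same per-connection update, applied recursively from the output back to the inputs, is all that backprop does in a full network.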
  • 19. 19 Sounds like Defense against the Dark Arts, amiright? • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 20. 20 Burning down the house! • A trained classifier is simply a labelling function. • More mathematically, it’s just a mapping of vectors to vectors, inputs and outputs. • We can just take those outputs as inputs to another function! • Typically, the output layer of a neural network is represented by a “softmax layer” that computes a probability q_i for each class from its logit z_i: q_i = exp(z_i / T) / Σ_j exp(z_j / T). • T here is the temperature. • Note: using a higher value for T produces a softer probability distribution over the classes. Dark Knowledge
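The temperature effect described above is easy to see in code. A minimal sketch with hypothetical logits for four classes (the dog/cat/cow/car numbers are invented for illustration):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T   # divide logits by the temperature
    z -= z.max()                              # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for classes (dog, cat, cow, car).
logits = [9.0, 5.0, 1.0, -3.0]
print(softmax(logits, T=1.0))   # nearly all mass on "dog"
print(softmax(logits, T=5.0))   # softer: relative class similarities emerge
```

Raising T never reorders the classes; it only redistributes mass so the near-zero probabilities (the dark knowledge) become visible to a student model.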
  • 21.
  • 22. 22 Machine learning moonshine? • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 23. 23 Machine learning moonshine? • Distilled learning is model compression. • There are many different procedures for distillation. • The simplest way to transfer this knowledge: • Use the cumbersome model’s output predictions as the ground-truth labels for the distilled model. Distilled Learning Jabir ibn Hayyan described distillation using an alembic in the 8th century.
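The simplest transfer described above can be sketched end to end. This is a hypothetical toy, assuming a stand-in "cumbersome" teacher (here just a fixed random linear scorer) whose softened predictions become the training targets for a small student:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4.0                                        # temperature

def softmax(Z, T=1.0):
    Z = Z / T
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

X = rng.normal(size=(2000, 20))
W_teacher = rng.normal(size=(20, 3))           # stand-in teacher (fixed scorer)
soft_targets = softmax(X @ W_teacher, T)       # softened teacher predictions

W_student = np.zeros((20, 3))                  # small linear-softmax student
for _ in range(1000):
    P = softmax(X @ W_student, T)
    grad = X.T @ (P - soft_targets) / len(X)   # cross-entropy gradient
    W_student -= 1.0 * grad                    # (the 1/T factor is folded into the step size)

agree = (softmax(X @ W_student, T).argmax(1) == soft_targets.argmax(1)).mean()
print(f"student/teacher agreement: {agree:.2%}")
```

Real distillation replaces the toy teacher with a trained deep network and the student with whatever small or interpretable model you need; the mechanics are the same.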
  • 24. 24 It’s not magic, it’s just math • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 25. 25 MY GPUs ARE MELTING… • Researchers used deep learning to improve the statistical power on benchmark problems involving: • Higgs bosons • Higgs boson decay modes • Supersymmetric particles • Results of distilled learning: • Improved shallow classifiers on all three tasks High Energy Physics
  • 26. 26 … FASTER THAN THE WICKED WITCH OF THE WEST • Researchers used deep learning to get rich feature representations. • The purpose of the interpretable models was phenotype discovery. • They extracted dark knowledge with a number of different neural network architectures: • Feedforward • Stacked de-noising autoencoder • LSTM (long short-term memory) • They then distilled dark knowledge into interpretable models. • Distillation markedly improved the interpretable models! Healthcare
  • 27. 27 • Distilled learning doesn’t just apply to feed-forward neural nets: it’s also useful for sequence learning. • Transfer knowledge from teacher to student network. Multiple ways to do it! • Model compression improves speed by an order of magnitude while only sacrificing 0.2 BLEU (bilingual evaluation understudy) Neural Machine Translation
  • 28. 28 Making machine learning moonshine at Remitly • Feature extraction and classifier training • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Agenda
  • 29. 29 Machine learning moonshine? • Comparison (using small toy models/data): • Logistic regression • Logistic regression with distilled labels • Distilled knowledge improves our results • Even for a very shallow black-box model with very few iterations. Fraud Classification
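The flavor of that comparison can be reproduced on synthetic data. This is a toy sketch only — not Remitly's models or data — and the "teacher" here is an oracle stand-in that supplies soft probabilities in place of noisy 0/1 fraud labels:

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

true_w = np.array([2.0, -1.5, 1.0, 0.5, -0.5])      # invented "fraud" weights
X = rng.normal(size=(5000, 5))
true_p = sigmoid(X @ true_w)                        # true fraud probability
y_hard = (rng.random(5000) < true_p).astype(float)  # noisy hard labels

def fit_logreg(X, targets, iters=500, lr=0.5):
    # Gradient-descent logistic regression; soft targets need no code change,
    # because the cross-entropy gradient is simply (prediction - target).
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * X.T @ (sigmoid(X @ w) - targets) / len(X)
    return w

w_hard = fit_logreg(X, y_hard)    # trained on noisy hard labels
w_soft = fit_logreg(X, true_p)    # trained on the teacher's soft predictions

print(np.linalg.norm(w_hard - true_w), np.linalg.norm(w_soft - true_w))
```

The soft-target fit lands closer to the true weights because each soft label carries the teacher's full probability estimate rather than a single noisy coin flip.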
  • 30. 30 Citing our sources Bibliography Storcheus, Dmitry, Afshin Rostamizadeh, and Sanjiv Kumar. "A Survey of Modern Questions and Challenges in Feature Extraction." In Proceedings of the 1st International Workshop on "Feature Extraction: Modern Questions and Challenges," NIPS, pp. 1-18. 2015. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." arXiv preprint arXiv:1602.04938 (2016). Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Model-Agnostic Interpretability of Machine Learning." arXiv preprint arXiv:1606.05386 (2016). Freitas, Alex A. "Comprehensible classification models: A position paper." SIGKDD Explor. Newsl. 15, no. 1 (2014): 1-10. ISSN 1931-0145. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Dark knowledge." Keynote at BayLearn, 2014. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network." arXiv preprint arXiv:1503.02531 (2015). Venkatesan, Ragav, and Baoxin Li. "Diving deeper into mentee networks." arXiv preprint arXiv:1604.08220 (2016). Wolpert, David H., and William G. Macready. "No free lunch theorems for optimization." IEEE Transactions on Evolutionary Computation 1, no. 1 (1997): 67-82.
  • 31. 31 Citing our sources Bibliography Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2, no. 4 (1989): 303-314. Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning Internal Representations by Error Propagation." Technical Report ICS-8506, University of California San Diego, Institute for Cognitive Science, 1985. Sadowski, Peter, Julian Collado, Daniel Whiteson, and Pierre Baldi. "Deep Learning, Dark Knowledge, and Dark Matter." In NIPS 2014 Workshop on High-Energy Physics and Machine Learning, pp. 81-87. 2014. Ba, Jimmy, and Rich Caruana. "Do deep nets really need to be deep?" In Advances in Neural Information Processing Systems 27, pp. 2654-2662. Curran Associates, Inc., 2014. Bucilua, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535-541. ACM, 2006. Che, Zhengping, Sanjay Purushotham, Robinder Khemani, and Yan Liu. "Distilling Knowledge from Deep Networks with Applications to Healthcare Domain." arXiv preprint arXiv:1512.03542 (2015). Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2, no. 5 (1989): 359-366.
  • 32. 32 Citing our sources Bibliography Tang, Zhiyuan, Dong Wang, and Zhiyong Zhang. "Recurrent neural network training with dark knowledge transfer." In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5900-5904. IEEE, 2016. Romano, Nathanael, and Robin Schucker. "Distilling Knowledge to Specialist Networks for Clustered Classification." Papamakarios, George. "Distilling Model Knowledge." arXiv preprint arXiv:1510.02437 (2015). Chen, Tianqi, Ian Goodfellow, and Jonathon Shlens. "Net2Net: Accelerating learning via knowledge transfer." arXiv preprint arXiv:1511.05641 (2015). Kim, Yoon, and Alexander M. Rush. "Sequence-Level Knowledge Distillation." arXiv preprint arXiv:1606.07947 (2016).
  • 33. 33 What we talked about • Feature extraction methods • Industrial settings requiring interpretability • Model-Agnostic Interpretability of Machine Learning • Neural networks • Dark knowledge • Distillation • Applying distillation in the real world • Applying distillation to fraud detection Summary
  • 34. 34 Remitly’s Data Science team uses ML for a variety of purposes. ML applications are core to our business – therefore our business must be core to our ML applications. Machine learning at Remitly

Editor's Notes

  1. Hi Everyone My name is Alex Korbonits, and I am a data scientist at Remitly This talk is broadly about taking advantage of the predictive power of black-box models to improve the accuracy of interpretable models.
  2. Before we dive in, here’s a little bit about Remitly and me. Remitly was founded in 2011 to forever change the way people send money to their loved ones. Worldwide, remittances represent over 600 billion dollars annually, roughly 4x the amount of foreign aid. We’re now the largest independent digital remittance company in the U.S. We’re sending nearly 2 billion dollars annually and growing quickly. Our CEO, Matt Oppenheimer, was just named one of Ernst and Young’s 2016 Entrepreneurs of the Year. I'm Remitly's first data scientist, and our team is growing. Right now my principal focus is FRAUD CLASSIFICATION. Previously, I was a data scientist at a startup called Nuiku, focusing on NLP.
  3. Here’s a cliff notes version of the agenda. We’re going to motivate the use of black box models to improve interpretable ones by showing that they tend to have superior performance across a wide range of problems. I’ll digress slightly by suggesting that we don’t even need interpretable models. We can interpret predictions of any model by modeling predictions. Then I will review neural nets Talk about how to use their distributed feature representations to improve subsequently trained models I’ll go into some results and applications to industry Talk about using this on my own data for fraud classification. And throughout I will emphasize why this matters
  4. A great survey paper came out at NIPS last year, at the 1st International Workshop on "Feature Extraction", called, "A Survey of Modern Questions and Challenges in Feature Extraction". It's by a few Google researchers and I'm looking forward to seeing more of their work. What I really enjoyed about this paper was the conceptual framework it used to see feature extraction and learning as function composition (a feature extraction step followed by a learning step), and looking into methods where the feature extraction step was or was not coupled with the learning step. You should definitely read it!
  5. OKAY, so the conceptual framework is great, but what are some real life examples? A classic example of an unsupervised uncoupled method would be PCA. PCA finds the directions of maximum variance (equivalently, it minimizes reconstruction error), whereas IsoMap preserves geodesic distances along a manifold. If you care about preserving angles, then you’ll want to use what’s called Maximum Variance Unfolding. It’s called unsupervised uncoupled because the loss function you’re minimizing for feature extraction is independent of training a classifier (uncoupled) and your loss function doesn’t know about labels (unsupervised). What’s an example of Supervised Uncoupled feature extraction? This is where the feature extraction has knowledge of the labels. One such example is called MTFS (which stands for multi-task feature selection). This algorithm picks features that correlate with labels, more or less. But the classifier is not jointly learned with the feature extraction. Supervised coupled methods simultaneously perform feature extraction and learning. Classic example here would be neural networks with at least one hidden layer (i.e. no perceptrons). I.e., you are jointly learning a feature representation as well as a labeling function. Last, there's no such thing as "unsupervised coupled," since coupling means jointly learning features and a classifier, and that joint training requires supervision.
  6. Another reason this paper was fantastic is that it discussed in detail how so-called "supervised coupled" methods tend to significantly outperform other methods. The exploration of this conceptual framework for thinking about the composition of feature extraction functions and classifier training is just a lead-in to what we're really going to talk about today. Using the superior performance of supervised-coupled methods on many learning tasks, we are going to improve the performance of simpler models. (caveat: No Free Lunch Theorem still holds here!)
  7. On to talking about industrial settings requiring interpretability!
  8. One classic example of an industrial machine learning application that requires the use of interpretable models for the purpose of consumer protection and antidiscrimination is in credit card applications. Put simply, the interpretability requirements here are such that if a prospective credit card holder -- an applicant -- applies for a credit card and is subsequently denied, then the credit card company is obliged to provide a set of reasons why the applicant was denied credit. This restriction puts an onus on credit card companies to use models that can output, for any given prediction, the features and/or splits of said features that led to the prediction. E.g., this pretty much restricts us to logistic regression or single decision tree learners. Remember, sacrificing predictive power here IS A GOOD THING. We need to protect consumers and adhere to non-discriminatory practices. This talk is about keeping interpretability intact and increasing predictive power.
  9. Another classic example of an industrial machine learning application that needs interpretable models to help make sound decisions – as one part of the overall decision process – is in medicine. Whether it’s decisions relating to whether or not to take a biopsy, try a surgical procedure, or use a particular type of medicine, these decisions are extremely important and making mistakes is typically very costly (even making the right decision is costly!). Remember, sacrificing predictive power here IS A GOOD THING. We need to protect patients and make sound ethical and scientific and medical decisions. Interpretability also helps go through previous decisions to see what worked/what didn’t as well as build intuition and expertise around a particular problem area.
  10. As a last motivating example, while customer churn prediction may have fewer (if any) regulatory/ethical requirements and standards compared to consumer lending and medicine, interpretability is KEY because it aids in directly data-driving business decisions. Hashtag actionable insights? Again, sacrificing predictive power here for interpretable results gives businesses additional decision-making power while offering direct insights into explaining customer behavior AND in which directions to innovate the business. Which direction to innovate the business in… sounds kind of like gradient descent, doesn’t it???
  11. Speaking of gradient descent, we’re going to do a whole lot more of that as we get through these slides… In this section, we make a case for model-agnostic interpretability, as opposed to just using interpretable models
  12. Before we go on to the meat of the talk, I want to introduce a fairly promising idea that has been wonderfully espoused in a couple of papers this year. Three researchers from the UW wrote two papers this spring that you should definitely read. Their position paper titled “model-agnostic interpretability of machine learning” was part of the 2016 ICML Workshop on Human Interpretability in Machine Learning. Their second paper introduces a specific framework within which to interpret predictions, called LIME, which I’ll go into on the next slide. The main takeaway is that you should consider separating the model selection process for prediction from the model selection process for interpretation. Use ANY black box model you want to learn the best labelling function possible from your data. Then use an interpretable model to help explain the predictions that your black-box classifier makes. This is a pretty powerful idea!
  13. Let’s talk about the main contribution of these two papers, called LIME, or, Local Interpretable Model-Agnostic Explanations. Here’s a quick rundown of how to explain the predictions of any classifier (or regressor). First, train your black-box model on a data set, and make predictions on the test set. Then, USE LIME to train an explanatory model on top of the predictions of your black-box model. There’s even a pip-installable Python package for LIME. PLUS, I’m using it at work to aid in fraud classification. It’s great for debugging type I and type II errors, and even for suggesting further feature engineering. What are some important properties of LIME-like models? First, interpretability. Second, we want these models to have local fidelity. The predictions that the explanatory model makes should be as faithful as possible to the predictions that the black-box model makes. A stricter standard would be global fidelity, but right now it’s an open challenge to give interpretable explanations with this property. Next, your explanatory model should be model-agnostic. Model selection for the explanatory model should not depend on the original model making predictions. So what’s this math on the slide? F is the model you’re explaining. g is your explanation model. PI sub X is a proximity measure from an instance Z to a point X, to define locality. L of f, g, and PI sub X is a measure of how unfaithful G is in approximating F at X, or rather, in the area around X defined by Pi sub X. OMEGA of G is a measure of complexity of your explainer G. The idea here is to minimize, for all possible explanatory models G in a class of potentially interpretable models, capital G, the unfaithfulness of G + the complexity of G. This formulation can be used with different explanation families G, fidelity functions L, and complexity measures Ω.
  14. This sounds all well and good in theory, but how does it work in practice? That depends on your domain. If your domain has legal or ethical restrictions that prevent you from diving deep into the most complex black-box models, LIME may not be sufficient for explaining predictions. Even if you don’t have legal or ethical restrictions, you still may have tradeoffs that are so important that you care more about exact interpretability rather than the best predictive power. When I was first preparing this talk, I hadn’t come across LIME or the idea of model-agnostic interpretation. I think it’s pretty cool. Given that the point of this talk was to use black-box models to improve the accuracy of interpretable models, I thought it would provide some excellent (yet not too orthogonal) counterpoint to discuss separating the interpretation process from the model selection process. Now we’re going to go into the heart of the talk where we demonstrate how to improve interpretable models WITH black box predictions!
  15. Back to gradient descent… now for a total crash course in neural networks...
  16. Here’s a little bit of SELECTED history of neural networks. In 1957 Rosenblatt created the perceptron learning algorithm. It was used for image recognition – and at the time it was state-of-the-art, much like convolutional neural networks are today for many computer vision tasks. It’s basically LOGISTIC REGRESSION. Problem is that, being linear, it can’t solve XOR. This is a problem. For that reason, among others such as the instability of the learning mechanism, Minsky and Papert in their 1969 book Perceptrons put neural networks to rest for one of the longest AI winters we’ve ever seen… brrrrrr, is it freezing in here or is it just me?
  17. Greater than or equal to 1 hidden layer in a multi-layer perceptron gives us INFINITE POWER. And the ability to solve XOR. Actually the Riesz representation theorem helps us show that even a multi-layer perceptron with a single hidden layer is capable of approximating any function arbitrarily closely. Depth gives us an exponential advantage w.r.t. this problem, which means we don’t need as many neurons per layer, essentially. That’s nice for making modeling tractable. How do we train such a model?
  18. BACKPROP Who remembers their first quarter of calculus? All we’re going to do is take a derivative. This diagram is a representation of the chain rule. A simple learning algorithm that takes some total output error E defined by some loss function. For example, a typical loss function for a multi-class classification task is log loss. E is a function of all of its inputs. I.e., all of the incoming connections to the output unit of a neural network. I.e., a function that outputs a class membership prediction and whose prediction is checked against a ground truth/label. We then show: A simple derivation of the change in error as a function of each connection weight w_ij. This gives a formula for updating each w_ij, in the entire network.
  19. And that’s a crash course in Neural Networks. We don’t even have time to go into all of the recent advances in neural networks, needless to say, I think we all know they’re taking the machine learning world by storm right now. How many do you think are running on your phone? Probably more than you think. Let’s go into DARK KNOWLEDGE
  20. What’s Dark Knowledge? Sounds like something Voldemort would be good at. This term, I believe, is due to Geoff Hinton. Essentially, start with a neural network classifier with a softmax layer. Inherent in the softmax layer is a parameter T that we call temperature. WLOG, increasing the temperature here “softens” the probability distribution across the classes. As we can see on the left, we increase the temperature T as we move down the screen. We can see the relative probabilities change. Our model isn’t so certain about whether or not it’s looking at a dog. Note that the important thing here is the RELATIVE probabilities. Indeed, a DOG is much more like a CAT than a cow or a car. And much more like a COW than a CAR. As we soften the distribution we can see that the model knows quite a bit about the relationships between different concepts, whether those concepts (like classes) are semantically well-understood, or not (e.g., internal representations at different layers of a neural network, some of which COULD be considered interpretable, e.g., the famous CAT NEURON learned unsupervised from stills of YouTube videos back in 2011). So what’s dark knowledge useful for? It’s best to transfer it to OTHER MODELS. I’ll let David Byrne talk about what happens when you turn the temperature up…
  21. Watch out you might get what you're after Cool babies strange but not a stranger I'm an ordinary guy Burning down the house Hold tight wait till the party's over Hold tight We're in for nasty weather There has got to be a way Burning down the house
  22. Now that you know what dark knowledge is and how to obtain it, let’s talk about transferring it to other models. This process is called DISTILLATION or DISTILLED LEARNING. Distilling the knowledge, typically, from a large, cumbersome model to a smaller distilled one. Sometimes this relationship is known as teacher/student models or mentor/mentee models.
  23. So now we’ve got our dark knowledge from burnin’ down the house. What do we do with it? How do we transfer it? While Hinton et al.’s paper outlines a general framework, there’s a really simple way to do this. Here’s the gradient equation from their paper. This derivative is the cross-entropy gradient between your cumbersome model (probability p_i and logit v_i) and distilled model (probability q_i and logit z_i). Let me distill this derivative. In essence, the simplest way to do distilled learning: USE THE OUTPUT PREDICTIONS OF THE SOFTMAX OF YOUR LARGER BLACK-BOX MODEL AS THE LABELS/GROUND TRUTHS FOR YOUR SMALLER MODEL. EARLIER IN TODAY’S TALKS, we saw Algorithmia putting deep learning into production. We also saw Stitchfix embedding word vectors with LDA2VEC. Ken from Algorithmia mentioned distilled learning when he talked about model compression. Using this technique, he showed that with a much smaller network (a 50x reduction in the number of parameters), only a tiny amount of performance is sacrificed. This generally is the case because the distributed feature representations in a much larger network are often quite redundant across all of the parameters. I realize this whole model compression idea sounds like Pied Piper’s “Middle Out” algorithm, but I promise it’s real and it has NOTHING to do with the origin story of middle out. If you haven’t seen that episode of Silicon Valley, it’s worth watching.
  24. Does this even work? Let’s look at a couple of recent examples.
  25. One really neat application of distilled learning is in high energy physics. Researchers from UC Irvine used deep neural networks to take a look at a few problems involving: Higgs bosons Higgs boson decay modes And Supersymmetric particles. AWESOME. THIS IS SO FREAKING COOL. Alright, so they did this and saw fantastic performance. They also took a look at training on smaller networks, and saw predictably worse performance. They then used the ideas from Hinton’s paper to distill some of the dark knowledge from their larger model. They just took the outputs from the larger model as labels for the subsequent model. Indeed, they saw performance gains from this, and are actively looking into future applications. How cool is that?
  26. Another really neat application of distilled learning in practice is in healthcare. This paper inspired this talk, actually. What these researchers did was apply distilled learning to electronic healthcare records data and assess the relative performance of their models with and without the distillation process. What’s also cool is that they went into using different neural network architectures for looking at this process and compared/contrasted all of their results of transferring their underlying learned representations. They saw a significant bump in performance when using these methods.
  27. Sequence-Level Knowledge Distillation is a great paper that came out recently applying the idea of transferring dark knowledge via distillation to sequence learners. Specifically, the authors explored this for neural machine translation. Besides, it’s 2016, convolutional neural nets are so last year. I expect to see more applications of distilled learning to recurrent neural networks this year. The authors trained teacher networks and distilled into student networks, drastically improving the speed of translation via the smaller student networks, while hardly sacrificing any accuracy, which for machine translation tasks is known as BLEU, which I literally just learned about. They actually tried multiple methods of distillation. Word-level, sequence-level distillation, and sequence-level interpolation. Be sure to check out their paper for more details
  28. Wow, so that’s freaking AWESOME. Now I’m going to show you that we’re seeing some results as it applies to fraud classification at Remitly.
  29. So I got really excited by this and HAD to try it (caveat: on a small, toy dataset -- this doesn’t necessarily reflect reality) In fact, I obtained these results 10 minutes ago, so they’re hot off the press. Essentially, I took a note from the high energy physics folks and decided to simply use the predictions from a black-box model as the new label set for my logistic regression learner. In the future I intend to look at varying temperature and even experiment with the gradients between the two networks as in Hinton’s paper. For now, the results are clear. You should try it out!
  30. We motivated distilled learning by looking at the superior performance of supervised-coupled methods. We went over some industrial machine learning settings requiring interpretability. We digressed and questioned the need to make predictions with an interpretable model by showing you can interpret the predictions of any model with an explainer model We crashed through neural nets and dark knowledge, We talked about distilled learning and how to transfer dark knowledge into subsequent models We saw some freaking AWESOME applications in the real world We saw real results from trying this at Remitly. #represent LIME and distilled learning give us a whole new world in terms of improving interpretable models and interpreting the predictions of ANY model. Go forth and conquer!
  31. What does machine learning at Remitly look like? Understanding: Fraud classification Anomaly detection Customer behavior Market forces
  32. We're hiring! Email me at alex@remitly.com. That’s all, folks! THANKS