Machine Learning with Small Data
John C. Liu, Ph.D. CFA
June 18, 2019
Twitter: @drjohncliu
Disclaimer
THE INFORMATION SET FORTH HEREIN HAS BEEN OBTAINED OR DERIVED FROM SOURCES GENERALLY
AVAILABLE TO THE PUBLIC AND BELIEVED BY THE AUTHOR TO BE RELIABLE, BUT THE AUTHOR DOES NOT MAKE
ANY REPRESENTATION OR WARRANTY, EXPRESS OR IMPLIED, AS TO ITS ACCURACY OR COMPLETENESS. THE
INFORMATION IS FOR EDUCATIONAL PURPOSES ONLY AND IS NOT INTENDED TO BE USED AS THE BASIS OF ANY
BUSINESS OR INVESTMENT DECISIONS BY ANY PERSON OR ENTITY. ALL OF THE INFORMATION CONTAINED IN
THE PRESENTATION IS SUBJECT TO FURTHER MODIFICATION AND ANY AND ALL FORECASTS, PROJECTIONS OR
FORWARD-LOOKING STATEMENTS CONTAINED HEREIN SHALL NOT BE RELIED UPON AS FACTS NOR RELIED
UPON AS ANY REPRESENTATION OF FUTURE RESULTS WHICH MAY MATERIALLY VARY FROM SUCH
PROJECTIONS AND FORECASTS.
Roadmap
• Introduction
• Big Data Revolution
• What about Small Data?
• Dealing with Reality
– Semantic/Contextualized Representations
– Experimental Design
– Adversarial Data Generation
• Conclusion
Big Data
Source: Bernard Marr & Co.
Deep Learning
Source: NVIDIA
Data is the New Oil
Source: James Corbett
More Data = Better Models
Source: Andrew Ng
What’s Wrong With this Picture
Train Set?
Source: The Simpsons
Data Annotation is Expensive
Source: Jia, Yangqing. (2014). Learning Semantic Image Representations at a Large Scale.
Annotator (Dis)Agreement?
Source: Stephen Yip & Chintan Parmar
Annotation = Bottleneck
Source: physionet.org
• 14 million images
• 20,000 categories
• 25 Human Years to annotate!
Source: Li Fei-Fei. (2010). ImageNet: Crowdsourcing, benchmarking & other cool things
Reality = Small Annotated Data
Source: NASA/JPL/UCSD/JSC
Ways to Deal with Small Data
• AWS Mechanical Turk (e.g., ImageNet)
• CrowdFlower/Figure8/Appen
• Hire SMEs
• Data Augmentation/Synthetic Generation (SMOTE)
Synthetic Minority Oversampling (SMOTE): a nearest-neighbor algorithm
Source: Bart Baesens
Anything Else?
Photograph: Andrea Shea
Not All Data is Created Equal
https://pypi.org/project/imbalanced-learn/
Source: Rishabh Misra
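A minimal sketch of SMOTE with the imbalanced-learn package linked above; the synthetic make_classification dataset is a stand-in assumption for a real small, imbalanced dataset.

# SMOTE sketch: oversample the minority class by interpolating between
# a minority sample and its nearest minority-class neighbors.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))                      # roughly 180 vs. 20

smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print("after:", Counter(y_res))                   # classes now balanced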
Training a Cat/Dog Classifier
• Which training samples are more useful?
Photograph: American Kennel Club
Photograph: Atchoumfan
Photograph: Sujoy Roychowdhury
Oncology Text Classifier
Which training samples are more useful?
1. Left medial foot and ankle pain and swelling. Plantar
metatarsal pain for 5 weeks. No known trauma.
2. Dorsal right medial upper back pain for 10 weeks. Right
parotid mass.
3. History pancreatic cancer. Status post aortic
chemotherapy and Whipple procedure
Points Near Decision Boundary
Maximum Entropy
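One common way to act on this is uncertainty sampling: score the unlabeled pool by predictive entropy and send the highest-entropy points, i.e. the ones closest to the current decision boundary, to annotators first. A minimal sketch, assuming an sklearn-style model with predict_proba; the seed/pool variable names are placeholders.

# Uncertainty sampling sketch: pick the unlabeled points the current model
# is least sure about (maximum predictive entropy) for the next labeling round.
import numpy as np
from sklearn.linear_model import LogisticRegression

def most_uncertain(model, X_pool, k=10):
    proba = model.predict_proba(X_pool)                       # shape (n_pool, n_classes)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # high = near the boundary
    return np.argsort(entropy)[::-1][:k]                      # indices to label next

# Hypothetical usage with a small labeled seed set and a large unlabeled pool:
# model = LogisticRegression().fit(X_seed, y_seed)
# to_label = most_uncertain(model, X_pool, k=20)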
Machine Learning with Small Data
What Data Scientists Should Care Most About
Kid Saw This in a Toy Store
Tiger
Photograph: Nat & Jules Brown
At the Zoo a Few Weeks Later
Tiger
Photograph: Skip O’Rourke
Inductive Transfer Learning
• Learning new tasks using knowledge learned from other
tasks
Source: Dipanjan Sarkar
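A minimal sketch of this idea in PyTorch/torchvision; the ImageNet-pretrained ResNet-18 backbone and the two-class cat-vs-dog head are illustrative assumptions, not the only choice.

# Transfer learning sketch: freeze an ImageNet-pretrained backbone and
# train only a small new classification head on the small labeled dataset.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(pretrained=True)           # features learned on ~1.2M images
for param in backbone.parameters():
    param.requires_grad = False                        # keep the learned features fixed
backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # new head: cat vs. dog
# Only backbone.fc.parameters() need to be optimized during fine-tuning.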
Semantic Image Representations
Source: Jia, Yangqing. (2014). Learning Semantic Image Representations at a Large Scale.
Word Embeddings
Corpus → Docs → Sentences → Words → Vectors
Word embeddings encode semantic relationships learned from the corpus.
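A small sketch of learning such embeddings with gensim's Word2Vec; the toy sentences and hyperparameters are assumptions for illustration (gensim 4.x API).

# Word2Vec sketch: learn vectors whose geometry reflects co-occurrence
# ("semantic") relationships in the training corpus.
from gensim.models import Word2Vec

sentences = [["the", "tiger", "is", "a", "big", "cat"],
             ["the", "lion", "is", "a", "big", "cat"],
             ["dogs", "and", "cats", "are", "pets"]]           # toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("tiger", topn=3))                  # nearby words in vector space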
Word2Vec Context too Narrow
Source: Kamath, Uday, Liu, John & Whitaker, James. (2019). Deep Learning for NLP and Speech Recognition.
Neural Language Model
Source: Kamath, Uday, Liu, John & Whitaker, James. (2019). Deep Learning for NLP and Speech Recognition.
ELMo
Embeddings from Language Models
– Bidirectional Language Models (forward & backward)
– Using LSTMs
– Concatenate hidden layers
Source: Karan Purohit
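ELMo itself is trained as a large bidirectional language model; as a rough illustration of the "bidirectional LSTM, concatenate hidden states" idea only (not the actual ELMo implementation; all sizes below are made up), a PyTorch sketch:

# Contextual-embedding sketch: a bidirectional LSTM over token embeddings,
# whose forward and backward hidden states are concatenated per token.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 1000, 64, 128             # made-up sizes
embed = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (1, 7))            # one sentence of 7 token ids
context_vectors, _ = bilstm(embed(tokens))               # shape (1, 7, 2 * hidden)
# Unlike a static Word2Vec vector, each token's representation now depends
# on its entire sentence context.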
Concept Embeddings
• RDF2Vec
Source: Kamath, Uday, Liu, John & Whitaker, James. (2019). Deep Learning for NLP and Speech Recognition.
Did We Solve the Tiger Problem?
• Generalize with only a single label? (One-Shot Learning)
• If I described a lion, would you recognize one if you never
ever saw one? (Zero-Shot Learning)
• Did the chicken come before the egg, or vice versa?
(Causality)
THE WORLD IS NOT RANDOM
INHERENT STRUCTURE EXISTS
Source: NASA/JPL/UCSD/JSC
Not Random
• Each CIFAR-10 image = 32×32 pixels × 3 channels × 256 intensity levels
• Number of possible permutations = 786432!
Source: Krizhevsky, Alex. (2009). Learning Multiple Layers of Features from Tiny Images.
Not a Possible Permutation
Source: Goodfellow, Ian. (2016). Generative Adversarial Networks.
How many Laws of Physics are
sufficient to describe motion?
Photograph: Richard Johnston
Bayesian Networks: Factorizing the Joint PDF
Source: Sato, Renato and Sato, Graziela. (2015). Probabilistic graphic models applied to identification of diseases.
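For concreteness, the standard factorization a Bayesian network encodes: P(X1, ..., Xn) = ∏i P(Xi | Parents(Xi)). For a simple chain A → B → C this gives P(A, B, C) = P(A) · P(B | A) · P(C | B), which needs far fewer parameters than the unrestricted joint distribution.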
Adversarial Data Generation
Source: Mino, Ajkel & Spanakis, Gerasimos. (2018). LoGAN: Generating Logos with a Generative Adversarial Neural Network Conditioned on color.
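A minimal GAN training loop in PyTorch; this is a toy sketch over 2-D points, not the LoGAN model cited above, and the architectures, data, and hyperparameters are placeholder assumptions.

# GAN sketch: a generator learns to produce samples the discriminator
# cannot distinguish from (a stand-in for) real data.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))                # noise -> fake point
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # point -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(64, 2) * 0.5 + 2.0            # stand-in for a small real dataset

for step in range(200):
    # Discriminator step: real points labeled 1, generated points labeled 0.
    fake = G(torch.randn(64, 8)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator call generated points real.
    g_loss = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()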
Last Word
Photograph: Gregor Schmidt
My New Book
A comprehensive resource that builds up from elementary deep learning, text, and speech principles to advanced state-of-the-art neural architectures.
On Amazon, BN, Springer
https://www.amazon.com/Deep-Learning-NLP-Speech-Recognition/dp/3030145956
Thank you.
AI/ML Solutions to Solve Business Problems
