Deep learning made doubly easy with reusable deep features
Carlos Guestrin
CEO, Dato
Amazon Professor of Machine Learning, University of Washington
Successful apps in 2015 must be intelligent
Machine learning is key to next-gen apps:
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX
(video & text)
• Personal assistants
• IoT
• Social nets
• …
Last decade: data management → Last 5 years: traditional analytics → Now: intelligent apps
The ML pipeline circa 2013
DATA → ML algorithm → "My curve is better than your curve" → write a paper
2015: Production ML pipeline
DATA → data cleaning & feature engineering → ML algorithm → offline eval & parameter search → deploy model → your web service or intelligent app
Stages: data engineering, data intelligence, deployment — using deep learning
Goal: a platform to help implement, manage, and optimize the entire pipeline
Today's talk
• Features in ML
• Neural networks
• Deep learning for computer vision
• Deep learning made easy with deep features
• Applications to text data
• Deployment in production
Features are key to machine learning
Simple example: Spam filtering
• A user walks into an email…
- Will she think it's spam?
• What's the probability the email is spam?
Input x (text of email, user info, source info) → MODEL → Output: probability of y (Yes! / No)
Feature engineering:
the painful black art of transforming raw inputs
into useful inputs for the ML algorithm
• E.g., important words, stemming text, complex transformations of inputs, …
Input x (text of email, user info, source info) → feature extraction → features Φ(x) → MODEL → Output: probability of y (Yes! / No)
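To make Φ(x) concrete, here is a minimal sketch of hand-engineered spam features; the specific features and field names (text, sender_domain) are illustrative assumptions, not from the talk.

```python
import re

def extract_features(email):
    """Hand-engineered Phi(x): turn a raw email dict into numeric features."""
    words = re.findall(r"[a-z']+", email["text"].lower())
    return {
        "contains_free":    int("free" in words),           # "important words"
        "contains_winner":  int("winner" in words),
        "num_exclamations": email["text"].count("!"),
        "frac_caps": sum(c.isupper() for c in email["text"]) / max(len(email["text"]), 1),
        "known_sender": int(email["sender_domain"] in {"example.com"}),  # source info
    }

phi = extract_features({"text": "You are a WINNER! Free prize!!",
                        "sender_domain": "spam.biz"})
print(phi)
```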
Neural networks

Learning *very* non-linear features
Linear classifiers
• The most common classifiers:
- Logistic regression
- SVMs
- …
• Decisions correspond to a hyperplane:
- a line in high-dimensional space
Decision rule: predict one class where w0 + w1 x1 + w2 x2 > 0, the other where w0 + w1 x1 + w2 x2 < 0
Graph representation of classifier:
useful for defining neural networks
Inputs x1, x2, …, xd (plus a constant 1) feed a single output node y computing
w0 + w1 x1 + w2 x2 + … + wd xd
> 0 → output 1; < 0 → output 0
What can a linear classifier represent?
x1 OR x2: y = 1 if −0.5 + x1 + x2 > 0 (bias −0.5, weights 1, 1)
x1 AND x2: y = 1 if −1.5 + x1 + x2 > 0 (bias −1.5, weights 1, 1)
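A quick sketch verifying that these weights implement OR and AND with a single threshold unit:

```python
def threshold_unit(w0, w1, w2, x1, x2):
    """Single linear unit: output 1 if w0 + w1*x1 + w2*x2 > 0, else 0."""
    return int(w0 + w1 * x1 + w2 * x2 > 0)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "OR:",  threshold_unit(-0.5, 1, 1, x1, x2),   # bias -0.5
              "AND:", threshold_unit(-1.5, 1, 1, x1, x2))   # bias -1.5
```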
What can’t a simple linear classifier represent?
XOR: the counterexample to everything
Need non-linear features
Solving the XOR problem: Adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
Hidden layer (thresholded to 0 or 1):
z1 = 1 if −0.5 + x1 − x2 > 0 (x1 AND NOT x2)
z2 = 1 if −0.5 − x1 + x2 > 0 (NOT x1 AND x2)
Output: y = 1 if −0.5 + z1 + z2 > 0 (z1 OR z2)
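And a sketch of the two-layer network above, checking that it computes XOR (weights as on the slide):

```python
def unit(w0, w1, w2, x1, x2):
    return int(w0 + w1 * x1 + w2 * x2 > 0)   # thresholded to 0 or 1

def xor_net(x1, x2):
    z1 = unit(-0.5,  1, -1, x1, x2)   # x1 AND NOT x2
    z2 = unit(-0.5, -1,  1, x1, x2)   # NOT x1 AND x2
    return unit(-0.5, 1, 1, z1, z2)   # z1 OR z2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))  # prints the XOR truth table
```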
A neural network
• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years
- Fell into "disfavor" in the 90s
• In the last few years, a big resurgence
- Impressive accuracy on several benchmark problems
- Powered by huge datasets, GPUs, & modeling/learning algorithm improvements
[Figure: inputs x1, x2 → hidden units z1, z2 → output y]
Applications to computer vision
(or the deep devil is in the deep details)
Image features
• Features = local detectors
- Combined to make a prediction
- (in reality, features are more low-level)
[Figure: "Face!" predicted from eye, eye, nose, and mouth detectors]
Many hand-created features exist…
Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH
(slide credit: Honglak Lee)
Standard image classification approach
Input → extract features (SIFT, HoG, …) → use simple classifier (e.g., logistic regression, SVMs) → Car?
Many hand-created features exist (SIFT, Spin image, HoG, RIFT, Textons, GLOH)…
… but they are very painful to design
Use a neural network to learn features
Each layer learns features at a different level of abstraction:
Deep Learning = Learning Hierarchical Representations — "it's deep if it has more than one stage of non-linear feature transformation" (Y. LeCun, M.A. Ranzato)
Low-level features → mid-level features → high-level features → trainable classifier
[Feature visualization of a convolutional net trained on ImageNet, from Zeiler & Fergus 2013]
Many tricks needed to work well…
• Different types of layers, connections, … needed for high accuracy (Krizhevsky et al. '12)
Sample performance results
• Traffic sign recognition
(GTSRB)
- 99.2% accuracy
• House number recognition
(Google)
- 94.3% accuracy
Krizhevsky et al. '12:
60M parameters, won the 2012 ImageNet competition
ImageNet 2012 competition: 1.5M images, 1,000 categories
[Figure: image retrieval with deep features — a test image and its retrieved nearest-neighbor images]
Application to scene parsing
(slide credit: Y. LeCun, M.A. Ranzato)
Semantic Labeling:
Labeling every pixel with the object it belongs to
[Farabet et al. ICML 2012, PAMI 2013]
Would help identify obstacles, targets, landing sites, dangerous areas
Would help line up depth map with edge maps
Challenges of deep learning
Deep learning score card
Pros
• Enables learning of features rather
than hand tuning
• Impressive performance gains on
- Computer vision
- Speech recognition
- Some text analysis
• Potential for much more impact
Deep learning workflow
Lots of labeled data → split: training set (80%) / validation set (20%) → learn deep neural net model → validate
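A minimal sketch of this workflow with scikit-learn stand-ins (the digits dataset and the small MLP are placeholder assumptions; a real deep net and dataset would slot in the same way):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                  # "lots of labeled data" (stand-in)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)  # 80/20

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
net.fit(X_tr, y_tr)                                   # "learn deep neural net model"
print("validation accuracy:", net.score(X_va, y_va))  # "validate"
```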
Deep learning score card (continued — same Pros as above)
Cons
• Computationally really expensive
• Requires a lot of data for high
accuracy
• Extremely hard to tune
- Choice of architecture
- Parameter types
- Hyperparameters
- Learning algorithm
- …
• Computational cost + so many choices = incredibly hard to tune
Deep features:
deep learning + transfer learning
Change image classification approach?
Input → extract features (SIFT, HoG, …) → use simple classifier (e.g., logistic regression, SVMs) → Car?
Can we learn features from data, even when we don't have enough data or time?
Transfer learning:
use data from one domain to help learn on another
• Lots of data: learn a neural net → great accuracy on cat vs. dog
• Some data: neural net as feature extractor + simple classifier → great accuracy on 101 categories
An old idea, explored for deep learning by Donahue et al. '14
What's learned in a neural net
Neural net trained for Task 1: cat vs. dog
• Last layers: very specific to Task 1 — should be ignored for other tasks
• Earlier layers: more generic — can be used as a feature extractor
Transfer learning in more detail…
Neural net trained for Task 1: cat vs. dog
• Generic early/mid layers: keep weights fixed!
• For Task 2, predicting 101 categories, learn only the end part:
use a simple classifier (e.g., logistic regression, SVMs) → Class?
Careful where you cut…
The last few layers tend to be too specific (e.g., too specific for car detection) — use the more generic low-/mid-level features instead.
[Feature visualization of a convolutional net trained on ImageNet, from Zeiler & Fergus 2013]
Transfer learning with deep features
Some labeled data → extract features with a neural net trained on a different task → split: training set (80%) / validation set (20%) → learn simple model → validate → deploy in production
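A sketch of this workflow under stated assumptions: extract_deep_features is a hypothetical stand-in for pushing images through a net trained on a different task (weights kept fixed) and reading off a generic mid-level layer's activations — random placeholders here — and the simple model is plain logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def extract_deep_features(images):
    # Hypothetical stand-in: forward each image through a pretrained net and
    # return a generic mid-level layer's activations. Random placeholders here.
    return np.random.rand(len(images), 4096)

images = ["img_%d.jpg" % i for i in range(200)]      # some labeled data (placeholder)
labels = np.random.randint(0, 101, size=200)         # 101 categories (placeholder)

X = extract_deep_features(images)                    # deep features from another task
X_tr, X_va, y_tr, y_va = train_test_split(X, labels, test_size=0.2)   # 80/20 split

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # learn simple model
print("validation accuracy:", clf.score(X_va, y_va))      # validate, then deploy
```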
How general are deep features?
Applications to text data
Simple text classification with bag of words
One "feature" per word, e.g.: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
→ use simple classifier (e.g., logistic regression, SVMs) → Class?
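A minimal bag-of-words classifier in scikit-learn (the toy corpus is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs   = ["oil prices rise in Africa", "gas and oil exports fall", "new apple pie recipe"]
labels = ["energy", "energy", "food"]

vec = CountVectorizer()                   # one "feature" per word
X = vec.fit_transform(docs)               # word-count rows like: africa=1, oil=1, ...
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["oil and gas in Africa"])))
```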
Word2Vec: neural network for finding a high-dimensional representation per word (Mikolov et al. '13)
Skip-gram model: from a word, predict nearby words in the sentence
Example: "Awesome deep learning talk at Strata" — the neural net maps each word to a 300-dim representation, which can be viewed as deep features
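Training such word vectors is a few lines with the gensim library; the parameter names below match the older gensim releases of this era (size has since been renamed vector_size), so treat the exact signature as an assumption.

```python
from gensim.models import Word2Vec

sentences = [["awesome", "deep", "learning", "talk", "at", "strata"],
             ["deep", "features", "for", "transfer", "learning"]]   # toy corpus

# sg=1 selects the skip-gram model: from a word, predict nearby words.
model = Word2Vec(sentences, size=300, window=2, min_count=1, sg=1)

vec = model["deep"]                                  # 300-dim representation of "deep"
print(vec.shape, model.most_similar("deep", topn=2))
```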
Related words are placed nearby in the high-dimensional space
[Figure: projecting the 300-dim space into 2 dims with PCA (Mikolov et al. '13)]
Text classification with word embeddings
Embed each word of the bag-of-words vector into the 300-dim space, then classify:
e.g., logistic regression or SVMs with 300 × number_of_words parameters → Class?
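One simple variant (not necessarily what the slide uses) is to average a document's word vectors into a single 300-dim feature; a sketch with placeholder vectors, where a real system would load pretrained Word2Vec embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder embeddings; a real system would load 300-dim Word2Vec vectors.
word_vectors = {w: rng.standard_normal(300)
                for w in ["oil", "gas", "africa", "apple", "pie", "recipe"]}

def embed(doc):
    """Average the 300-dim vectors of the words we have embeddings for."""
    vecs = [word_vectors[w] for w in doc.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

docs   = ["oil gas africa", "apple pie recipe"]
labels = ["energy", "food"]
clf = LogisticRegression().fit([embed(d) for d in docs], labels)
print(clf.predict([embed("gas from africa")]))
```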
Practical example
Blog corpus — closest words in the 300-dim space: Haha, Yea, Hahaha, Hahah, Lisxc, Umm, Hehe, laughingoutloud, LOL
Predicts gender of author with 79% accuracy
Deploying ML in production
DATA → ML algorithm → custom model → API → App
Deployment?
• Data scientist writes a spec; another team (data engineers, data architects, DevOps, app developers) implements it in a 'production' language
o 6-12 months
o Stale/irrelevant model/approach
o 2 teams maintaining 2 systems
ML deployment requirements
• Easy to integrate: REST API
• Scalable, fault tolerant
• Flexible: any model, any Python
[Architecture: app → load balancer → replicated API + cache nodes → GLC/Dato models running in Python]
Do-It-Yourself
• Web Service layer:
- Tornado, Flask, Keen, Django, …
• Caching layer:
- Redis, Cassandra, Memcached,
DynamoDb, MySQL, …
• Logs:
- Logback, LogStash, Splunk, Loggly, …
• Metrics:
- AWS CloudWatch, Mixpanel, Librato, …
[Same architecture assembled by hand: app → load balancer → API + cache nodes → GLC models in Python]
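As a sketch of just the web-service layer from the DIY list — a minimal Flask endpoint wrapping a pickled model (the model.pkl file and payload shape are hypothetical):

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:            # hypothetical pre-trained model file
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # e.g. {"features": [[1.0, 2.0]]}
    return jsonify(prediction=model.predict(features).tolist())

if __name__ == "__main__":
    # Caching, logging, metrics, and the load balancer are the other DIY layers.
    app.run(port=5000)
```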
… or use Dato Predictive Services
ML model → Dato Predictive Services (caching layer + predictive object server) → your web service or intelligent app
Serves predictions in a robust, scalable, incremental fashion; swap in a better ML model at any time
Serve any model: GraphLab Create, scikit-learn, Python, …
Dato Platform
• Out-of-core computation
• Tools for feature engineering
• Rich data type support
• Models built for scale
• App-oriented toolkits
• Advanced ML & extensible
• Deploy models as low-latency REST services
• Same code for distributed computation
• Elastically scale up or out with one command
• Job monitoring & model management
• Deploy existing Python code & models
• Run on AWS EC2 or Hadoop Yarn
Components: GraphLab Create (Create Engine; SFrame, SGraph, Canvas; Machine Learning Toolkits; SDK), Dato Distributed (Distributed Engine; Job Client; Job Mgmt), Dato Predictive Services (Predictive Engine; REST Client; Model Mgmt)
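Consuming a model deployed as a low-latency REST service is then an ordinary HTTP call; a sketch with the requests library (the URL, auth header, and payload shape are hypothetical, not Dato's documented wire format):

```python
import requests

# Hypothetical endpoint of a deployed model; URL, auth, and payload are assumptions.
resp = requests.post(
    "https://my-predictive-service.example.com/query/my_model",
    json={"features": [[1.0, 2.0, 3.0]]},
    headers={"x-api-key": "MY_API_KEY"},   # placeholder auth
    timeout=2.0,                           # low-latency budget
)
print(resp.json())
```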
Summary
Deep learning made easy with deep features
• Deep learning: an exciting ML development — but slow, lots of tuning, needs lots of data
• Deep features: reuse deep models for new domains — needs less data, faster training times, much simpler tuning
• Can still achieve excellent performance

Strata London - Deep Learning 05-2015

Editor's Notes
• #56 So I got started with ML by taking a class: data goes into an ML algorithm, which then generates a plot. Of course this isn't how actual applications are written, but this is often where customers are starting when approaching taking ML to production.