Deep learning made doubly easy with reusable deep features
Carlos Guestrin
CEO, Dato
Amazon Professor of Machine Learning, University of Washington
Successful apps in 2015 must be intelligent
Machine learning is key to next-gen apps:
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX
(video & text)
• Personal assistants
• IoT
• Social nets
• …
Last decade: data management → Last 5 years: traditional analytics → Now: intelligent apps
The ML pipeline circa 2013
DATA → ML algorithm → "My curve is better than your curve" → write a paper
2015: Production ML pipeline
DATA → data cleaning & feature engineering → ML algorithm → offline eval & parameter search → deploy model → your web service or intelligent app
Stages: data engineering, data intelligence, deployment — using deep learning
Goal: a platform to help implement, manage, and optimize the entire pipeline
Today's talk
• Features in ML
• Neural networks
• Deep learning for computer vision
• Deep learning made easy with deep features
• Applications to text data
• Deployment in production
Features are key to machine learning
Simple example: Spam filtering
• A user walks into an email…
- Will she think it's spam?
• What's the probability the email is spam?
Input x (text of email, user info, source info) → MODEL → Output: probability of y (Yes! / No)
Feature engineering:
the painful black art of transforming raw inputs
into useful inputs for the ML algorithm
• E.g., important words, stemming text, complex transformations of inputs, …
Input x (text of email, user info, source info) → feature extraction → features Φ(x) → MODEL → Output: probability of y (Yes! / No)
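To make Φ(x) concrete, here is a minimal sketch of hand-engineered spam features; the specific features and field names (text, sender_domain) are illustrative assumptions, not from the talk.

```python
import re

def extract_features(email):
    """Hand-engineered Phi(x): turn a raw email dict into numeric features."""
    words = re.findall(r"[a-z']+", email["text"].lower())
    return {
        "contains_free":    int("free" in words),           # "important words"
        "contains_winner":  int("winner" in words),
        "num_exclamations": email["text"].count("!"),
        "frac_caps": sum(c.isupper() for c in email["text"]) / max(len(email["text"]), 1),
        "known_sender": int(email["sender_domain"] in {"example.com"}),  # source info
    }

phi = extract_features({"text": "You are a WINNER! Free prize!!",
                        "sender_domain": "spam.biz"})
print(phi)
```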
Neural networks

Learning *very* non-linear features
Linear classifiers
• The most common classifiers:
- Logistic regression
- SVMs
- …
• Decisions correspond to a hyperplane:
- a line in high-dimensional space
Decision rule: predict one class where w0 + w1 x1 + w2 x2 > 0, the other where w0 + w1 x1 + w2 x2 < 0
Graph representation of classifier:
useful for defining neural networks
Inputs x1, x2, …, xd (plus a constant 1) feed a single output node y computing
w0 + w1 x1 + w2 x2 + … + wd xd
> 0 → output 1; < 0 → output 0
What can a linear classifier represent?
x1 OR x2: y = 1 if −0.5 + x1 + x2 > 0 (bias −0.5, weights 1, 1)
x1 AND x2: y = 1 if −1.5 + x1 + x2 > 0 (bias −1.5, weights 1, 1)
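A quick sketch verifying that these weights implement OR and AND with a single threshold unit:

```python
def threshold_unit(w0, w1, w2, x1, x2):
    """Single linear unit: output 1 if w0 + w1*x1 + w2*x2 > 0, else 0."""
    return int(w0 + w1 * x1 + w2 * x2 > 0)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "OR:",  threshold_unit(-0.5, 1, 1, x1, x2),   # bias -0.5
              "AND:", threshold_unit(-1.5, 1, 1, x1, x2))   # bias -1.5
```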
What can’t a simple linear classifier represent?
XOR: the counterexample to everything
Need non-linear features
Solving the XOR problem: Adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
Hidden layer (thresholded to 0 or 1):
z1 = 1 if −0.5 + x1 − x2 > 0 (x1 AND NOT x2)
z2 = 1 if −0.5 − x1 + x2 > 0 (NOT x1 AND x2)
Output: y = 1 if −0.5 + z1 + z2 > 0 (z1 OR z2)
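And a sketch of the two-layer network above, checking that it computes XOR (weights as on the slide):

```python
def unit(w0, w1, w2, x1, x2):
    return int(w0 + w1 * x1 + w2 * x2 > 0)   # thresholded to 0 or 1

def xor_net(x1, x2):
    z1 = unit(-0.5,  1, -1, x1, x2)   # x1 AND NOT x2
    z2 = unit(-0.5, -1,  1, x1, x2)   # NOT x1 AND x2
    return unit(-0.5, 1, 1, z1, z2)   # z1 OR z2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))  # prints the XOR truth table
```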
A neural network
• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years
- Fell into "disfavor" in the 90s
• In the last few years, a big resurgence
- Impressive accuracy on several benchmark problems
- Powered by huge datasets, GPUs, & modeling/learning algorithm improvements
[Figure: inputs x1, x2 → hidden units z1, z2 → output y]
Applications to computer vision
(or the deep devil is in the deep details)
Image features
• Features = local detectors
- Combined to make a prediction
- (in reality, features are more low-level)
[Figure: "Face!" predicted from eye, eye, nose, and mouth detectors]
Many hand-created features exist…
Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH
(slide credit: Honglak Lee)
Standard image classification approach
Input → extract features (SIFT, HoG, …) → use simple classifier (e.g., logistic regression, SVMs) → Car?
Many hand-created features exist (SIFT, Spin image, HoG, RIFT, Textons, GLOH)…
… but they are very painful to design
Use a neural network to learn features
Each layer learns features at a different level of abstraction:
Deep Learning = Learning Hierarchical Representations — "it's deep if it has more than one stage of non-linear feature transformation" (Y. LeCun, M.A. Ranzato)
Low-level features → mid-level features → high-level features → trainable classifier
[Feature visualization of a convolutional net trained on ImageNet, from Zeiler & Fergus 2013]
Many tricks needed to work well…
• Different types of layers, connections, … needed for high accuracy (Krizhevsky et al. '12)
Sample performance results
• Traffic sign recognition
(GTSRB)
- 99.2% accuracy
• House number recognition
(Google)
- 94.3% accuracy
Krizhevsky et al. '12:
60M parameters, won the 2012 ImageNet competition
ImageNet 2012 competition: 1.5M images, 1,000 categories
[Figure: image retrieval with deep features — a test image and its retrieved nearest-neighbor images]
Application to scene parsing
(slide credit: Y. LeCun, M.A. Ranzato)
Semantic Labeling:
Labeling every pixel with the object it belongs to
[Farabet et al. ICML 2012, PAMI 2013]
Would help identify obstacles, targets, landing sites, dangerous areas
Would help line up depth map with edge maps
Challenges of deep learning
Deep learning score card
Pros
• Enables learning of features rather
than hand tuning
• Impressive performance gains on
- Computer vision
- Speech recognition
- Some text analysis
• Potential for much more impact
Deep learning workflow
Lots of labeled data → split: training set (80%) / validation set (20%) → learn deep neural net model → validate
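A minimal sketch of this workflow with scikit-learn stand-ins (the digits dataset and the small MLP are placeholder assumptions; a real deep net and dataset would slot in the same way):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                  # "lots of labeled data" (stand-in)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)  # 80/20

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
net.fit(X_tr, y_tr)                                   # "learn deep neural net model"
print("validation accuracy:", net.score(X_va, y_va))  # "validate"
```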
Deep learning score card (continued — same Pros as above)
Cons
• Computationally really expensive
• Requires a lot of data for high
accuracy
• Extremely hard to tune
- Choice of architecture
- Parameter types
- Hyperparameters
- Learning algorithm
- …
• Computational cost + so many choices = incredibly hard to tune
Deep features:
deep learning + transfer learning
Change image classification approach?
Input → extract features (SIFT, HoG, …) → use simple classifier (e.g., logistic regression, SVMs) → Car?
Can we learn features from data, even when we don't have enough data or time?
Transfer learning:
use data from one domain to help learn on another
• Lots of data: learn a neural net → great accuracy on cat vs. dog
• Some data: neural net as feature extractor + simple classifier → great accuracy on 101 categories
An old idea, explored for deep learning by Donahue et al. '14
What's learned in a neural net
Neural net trained for Task 1: cat vs. dog
• Last layers: very specific to Task 1 — should be ignored for other tasks
• Earlier layers: more generic — can be used as a feature extractor
Transfer learning in more detail…
Neural net trained for Task 1: cat vs. dog
• Generic early/mid layers: keep weights fixed!
• For Task 2, predicting 101 categories, learn only the end part:
use a simple classifier (e.g., logistic regression, SVMs) → Class?
Careful where you cut…
The last few layers tend to be too specific (e.g., too specific for car detection) — use the more generic low-/mid-level features instead.
[Feature visualization of a convolutional net trained on ImageNet, from Zeiler & Fergus 2013]
Transfer learning with deep features
Some labeled data → extract features with a neural net trained on a different task → split: training set (80%) / validation set (20%) → learn simple model → validate → deploy in production
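A sketch of this workflow under stated assumptions: extract_deep_features is a hypothetical stand-in for pushing images through a net trained on a different task (weights kept fixed) and reading off a generic mid-level layer's activations — random placeholders here — and the simple model is plain logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def extract_deep_features(images):
    # Hypothetical stand-in: forward each image through a pretrained net and
    # return a generic mid-level layer's activations. Random placeholders here.
    return np.random.rand(len(images), 4096)

images = ["img_%d.jpg" % i for i in range(200)]      # some labeled data (placeholder)
labels = np.random.randint(0, 101, size=200)         # 101 categories (placeholder)

X = extract_deep_features(images)                    # deep features from another task
X_tr, X_va, y_tr, y_va = train_test_split(X, labels, test_size=0.2)   # 80/20 split

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # learn simple model
print("validation accuracy:", clf.score(X_va, y_va))      # validate, then deploy
```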
How general are deep features?
Applications to text data
Simple text classification with bag of words
One "feature" per word, e.g.: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
→ use simple classifier (e.g., logistic regression, SVMs) → Class?
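A minimal bag-of-words classifier in scikit-learn (the toy corpus is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs   = ["oil prices rise in Africa", "gas and oil exports fall", "new apple pie recipe"]
labels = ["energy", "energy", "food"]

vec = CountVectorizer()                   # one "feature" per word
X = vec.fit_transform(docs)               # word-count rows like: africa=1, oil=1, ...
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["oil and gas in Africa"])))
```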
Word2Vec: neural network for finding a high-dimensional representation per word (Mikolov et al. '13)
Skip-gram model: from a word, predict nearby words in the sentence
Example: "Awesome deep learning talk at Strata" — the neural net maps each word to a 300-dim representation, which can be viewed as deep features
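Training such word vectors is a few lines with the gensim library; the parameter names below match the older gensim releases of this era (size has since been renamed vector_size), so treat the exact signature as an assumption.

```python
from gensim.models import Word2Vec

sentences = [["awesome", "deep", "learning", "talk", "at", "strata"],
             ["deep", "features", "for", "transfer", "learning"]]   # toy corpus

# sg=1 selects the skip-gram model: from a word, predict nearby words.
model = Word2Vec(sentences, size=300, window=2, min_count=1, sg=1)

vec = model["deep"]                                  # 300-dim representation of "deep"
print(vec.shape, model.most_similar("deep", topn=2))
```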
Related words are placed nearby in the high-dimensional space
[Figure: projecting the 300-dim space into 2 dims with PCA (Mikolov et al. '13)]
Text classification with word embeddings
Embed each word of the bag-of-words vector into the 300-dim space, then classify:
e.g., logistic regression or SVMs with 300 × number_of_words parameters → Class?
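One simple variant (not necessarily what the slide uses) is to average a document's word vectors into a single 300-dim feature; a sketch with placeholder vectors, where a real system would load pretrained Word2Vec embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder embeddings; a real system would load 300-dim Word2Vec vectors.
word_vectors = {w: rng.standard_normal(300)
                for w in ["oil", "gas", "africa", "apple", "pie", "recipe"]}

def embed(doc):
    """Average the 300-dim vectors of the words we have embeddings for."""
    vecs = [word_vectors[w] for w in doc.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

docs   = ["oil gas africa", "apple pie recipe"]
labels = ["energy", "food"]
clf = LogisticRegression().fit([embed(d) for d in docs], labels)
print(clf.predict([embed("gas from africa")]))
```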
Practical example
Blog corpus — closest words in the 300-dim space: Haha, Yea, Hahaha, Hahah, Lisxc, Umm, Hehe, laughingoutloud, LOL
Predicts gender of author with 79% accuracy
Deploying ML in production
DATA → ML algorithm → custom model → API → App
Deployment?
• Data scientist writes a spec; another team (data engineers, data architects, DevOps, app developers) implements it in a 'production' language
o 6-12 months
o Stale/irrelevant model/approach
o 2 teams maintaining 2 systems
ML deployment requirements
• Easy to integrate: REST API
• Scalable, fault tolerant
• Flexible: any model, any Python
[Architecture: app → load balancer → replicated API + cache nodes → GLC/Dato models running in Python]
Do-It-Yourself
• Web Service layer:
- Tornado, Flask, Keen, Django, …
• Caching layer:
- Redis, Cassandra, Memcached,
DynamoDb, MySQL, …
• Logs:
- Logback, LogStash, Splunk, Loggly, …
• Metrics:
- AWS CloudWatch, Mixpanel, Librato, …
[Same architecture assembled by hand: app → load balancer → API + cache nodes → GLC models in Python]
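As a sketch of just the web-service layer from the DIY list — a minimal Flask endpoint wrapping a pickled model (the model.pkl file and payload shape are hypothetical):

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:            # hypothetical pre-trained model file
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # e.g. {"features": [[1.0, 2.0]]}
    return jsonify(prediction=model.predict(features).tolist())

if __name__ == "__main__":
    # Caching, logging, metrics, and the load balancer are the other DIY layers.
    app.run(port=5000)
```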
… or use Dato Predictive Services
ML model → Dato Predictive Services (caching layer + predictive object server) → your web service or intelligent app
Serves predictions in a robust, scalable, incremental fashion; swap in a better ML model at any time
Serve any model: GraphLab Create, scikit-learn, Python, …
Dato Platform
• Out-of-core computation
• Tools for feature engineering
• Rich data type support
• Models built for scale
• App-oriented toolkits
• Advanced ML & extensible
• Deploy models as low-latency REST services
• Same code for distributed computation
• Elastically scale up or out with one command
• Job monitoring & model management
• Deploy existing Python code & models
• Run on AWS EC2 or Hadoop Yarn
Components: GraphLab Create (Create Engine; SFrame, SGraph, Canvas; Machine Learning Toolkits; SDK), Dato Distributed (Distributed Engine; Job Client; Job Mgmt), Dato Predictive Services (Predictive Engine; REST Client; Model Mgmt)
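Consuming a model deployed as a low-latency REST service is then an ordinary HTTP call; a sketch with the requests library (the URL, auth header, and payload shape are hypothetical, not Dato's documented wire format):

```python
import requests

# Hypothetical endpoint of a deployed model; URL, auth, and payload are assumptions.
resp = requests.post(
    "https://my-predictive-service.example.com/query/my_model",
    json={"features": [[1.0, 2.0, 3.0]]},
    headers={"x-api-key": "MY_API_KEY"},   # placeholder auth
    timeout=2.0,                           # low-latency budget
)
print(resp.json())
```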
Summary
Deep learning made easy with deep features
• Deep learning: an exciting ML development — but slow, lots of tuning, needs lots of data
• Deep features: reuse deep models for new domains — needs less data, faster training times, much simpler tuning
• Can still achieve excellent performance

Strata London - Deep Learning 05-2015

Editor's Notes
• #56 So I got started with ML by taking a class: data goes into an ML algorithm, which then generates a plot. Of course this isn't how actual applications are written, but this is often where customers are starting when approaching taking ML to production.