Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 2014-01-08

Deep Learning
for Data Scientists

Andrew B. Gardner
agardner@momentics.com
http://linkd.in/1byADxC

www.momentics.com/deep-learning

Deep Learning in the Press…

Ng

Hinton

LeCun

Zuckerberg

Google Hires Brains that Helped
Supercharge Machine Learning.
Wired 3/2013.

Kurzweil

Facebook taps ‘Deep Learning’ Giant for New AI Lab.
Wired 12/2013.

Is “Deep Learning” A Revolutions in
Artificial Intelligence?

The Man Behind the Google Brain: Andrew
Ng and the Quest for the New AI.

New Yorker 11/2012.

Wired 5/2013.

New Techniques from Google and Ray Kurzweil Are
Taking Artificial Intelligence to Another Level.
MIT Technology Review 5/2013.

… Publication & Search Trends …
Google Scholar Citations

Google Trends

600

big data

500

data science

400
300

“deep learning” +
“neural network”

deep learning
machine learning

200
100

0

‘06

‘11

‘06

‘11

domains: computer vision, speech & audio, bioinformatics, etc.

Conferences: NIPS, ICLR, ICML, …

… Industry & Products
• Google

Microsoft Real-time English-Chinese Translation

– Android Voice
Recognition
– Maps
– Image+

•
•
•
•

SIRI
Translation
Documents
…

https://www.youtube.com/watch?v=Nu-nlQqFCKg

Microsoft Chief Research Officer Rick
Rashid, 11/2012

Deep Learning Epicenters (North America)

de Freitas (UBC)
Microsoft

Bengio (U Montreal)
Hinton (U Toronto)

Facebook
Ng (Stanford)
Google
Yahoo

LeCun (NYU)

Deep Learning: The Origin Story

Before: A Cat Detector
We want to build this….

classifier
f : X ®Y

Y ~ the labels {“cat”, “dog”}

X ~ the images

… for less than $1.0M !

Challenge: Labeled Data
Labels are expensive  Less data
Intuitively: more data is good
cat

cat
dog

unused,unlabeled

cat
dog

Challenge: Features
Features are expensive  Fewer, shallow
Intuitively: better features are good
image (pixels)
Magic feature dictionary
SIFT
HoG

B W

SIFT

binary histogram

Moments
Shape Histogram

+
++
+
+++

+

+
+
+ +
+

x=(1.3, 2.8, …)

Fang detector
Something new

Machine Learning (Before)
Building a Cat Detector 1.0

expensive
important*

Features

Detector
(Classifier)

fa
ng
of
in
ch
on,
of
of
us
on
is
is
bly

How Good is “More Data?”

speech. The memory-based learner used only
the word before and word after as features.

Labels are expensive  Less data
1.00

• More data dominates*
better techniques

.975

0.95

0.90

Test Accuracy

a
93,
In
is
fic
es
are
m
ber

• Often have lots of data

0.85

.825

0.80

Memory-Based
Winnow

0.75

Perceptron
Naïve Bayes
0.70
0.1

1

10
100
Millions of Words

1000

Learning curves for confusion set
Figure 1. Learning Curves for Confusion Set
disambiguation, e.g. {to, two, too}.
Disambiguation
We collected a 1
-billion-word training
corpus from a variety of English texts, including

• … we just don’t have
lots of labels
• What if there was a
way to use unlabeled
data?

“Scaling to Very Very Large Corpora for Natural Language Disambiguation,” Banko and Brill, 2001.

The Impact of Features
Intuitively: better features are good

• Critical to success – even more than data!
• How to create / engineer features?
– Typically shallow

• Domain-specific
• What if there was a way to automatically
learn features?

Machine Learning (What We Want)
Building a Cat Detector 2.0

bountiful
important*

Features + Detector
(Classifier)

end-to-end

AR” Building an Object Recognition System

”

“CAR”

Deep Nets Intuition
“CAR”

car
intermediate representations
CLASSIFIER

FEATURE
EXTRACTOR

label

IDEA: Use data to optimize features for the given task.

olutional DBN's for scalable unsup. learning...” ICML 2009

Lee et al. ICML 2009

12

Ranzato
2

Ranzato
13

Ranzato

Ranza

on from low
structure as
hical Another Example of Hierarchy
Learning
rchical Learning
mplexity from low
progression

ral progression from low
high level structure as
to high level structure as
natural complexity
in natural complexity

what is being
eto monitor whatisisbeing
the machine
o monitor what being
r
and guide the machine
es toto guide themachine
t and

er subspaces
tter subspaces

od lower level
llower level heads
ntation can be used for
sentation can be usedfor
ndistinct tasks for
be used
istinct tasks

s

faces

as

parts

edges

d tomachine machine
e guide the
he
subspaces Hierarchy Reusability?
faces

cars

elephants

chairs

wer level
be used forbe used for
tation can
tinct tasks

5

5

A Breakthrough
G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning
algorithm for deep belief nets,” Neural
Computation, vol. 18, pp. 1527–1554, 2006.
G. E. Hinton and R. R. Salakhutdniov, “Reducing the
dimensionality of data with neural networks,”
Science, vol. 313, no. 5786, pp. 504-507, July 2006.

before

after

Deep Belief Nets
MNIST

60K + 10K Images

Technique

Test Error

DBN pretrain

1.25

SVM

1.4

kNN

2.8-4.4

ConvNet

0.4 -> 0.23

supervised tuning
unsupervised pretraining

MNIST Sample Errors

Ciresan et al. “Deep Big Simple Neural Networks Excel on
Handwritten Digit Recognition,” 2010

Key Ideas
• Learn features from data
– Use all data

• Deep architecture
– Representation
– Computational efficiency
– Shared statistics

• Practical training
• State-of-the-art (it worked)

After: Cat Detector
unlabeled images (millions)

labeled images (few)

deep learning
network

more data

automatic (deep) features

This Is A Neuron
output

1. Sum all inputs (weighted)

y

x = w0 + w1z1 + w2 z2 + w3z3

f(x)

2. Nonlinearly transform

y = f ( x)

weights
w0 w1

w2

sigmoid

w3
tanh

1
bias

z1

z2
inputs

z3
activation function

A Neural Network
forward propagation: weighted sum inputs, produce activation, feed forward

cat

dog

Output

Hidden

13.5

weight

21

n_teeth

16

n_whiskers

Inputs
(the features)

Training
Back propagation of error.

1

0

cat

dog

total error at top

proportional
contributions going
backwards

13.5

weight

21

n_teeth

16

n_whiskers

After Training
network

layer weights

weights as a matrix

[.5, -.2, 4, .15, -1,…]

-.5

.4

0

.1

.1

.5

-1

2

[-.5, -.3, .4, 0, …]

-.3
.7

-.2

.4

we can view weight matrix as image

… plus performance evaluation & logging

Building Blocks
So many choices!
network topology

• Network Topology
– Number of layers
– Nodes per layer

• Layer Type
– Feedforward
– Restricted Boltzmann
– Autoencoder
– Recurrent
– Convolutional

layer type

neuron type

• Neuron Type
– Rectified Linear Unit

• Regularization
– Dropout

• Magic Numbers

A Deep Learning Recipe, 1.0
• Lots of data, some+
labels
• Train each RBM layer
greedily, successively
• Add an output layer
and train with labels

labels

A Few Other Important Things
• Deep Learning Recipe 2.0
– Dropout / regularization
– Rectified Linear Units

•
•
•
•

Convolutional networks
Hyperparameters
Not just neural networks
Practical Issues (GPU)

Sample Classification Results

ImageNet
V
alidation classification

Krizhevsky et al., NIPS 2012.

[Krizhevsky et al. NI PS’12

Segmentation
neuronal membranes

Ciresan et al. “DNN segment neuronal membranes...” NIPS 2012

CalTech 256 2 5 6
Caltech
Z eiler & Fergus, Vis
ualizing and Unders
tanding Convolutional Ne
tworks arXiv 1311.2901, 2013
,
7
5
7
0
6
5

6 training examples

6
0
5
5
5
0
4
5
4
0
3
5
3
0
2
5
0

1
0

2
0

3
0

4
0

5
0

6
0

Zeiler & Fergus,”Visualizing and Understanding Convolutional Networks,” arXiv 1311.2901, 2013

Application: Speech
frequencies
in window

“He can for example present significant university wide
issues to the senate.”

small time window
slide 15ms

phoneme

Spectrogram: window in time -> vector of frequences; slide; repeat

Automatic Speech
CDBNs for speech
Unlabeled TIMIT data -> convolutional DBN

Trained on unlabeled TIMIT corpus

Experimental R

• Speaker identification
TIMIT Speaker identification

Accuracy

Prior art (Reynolds, 1995)

99.7%

Convolutional DBN

100.0%

• Phone classification
TIMIT Phone classification

Accuracy

Clarkson et al. (1999)

77.6%

Gunawardana et al. (2005)

78.3%

Sung et al. (2007)

78.5%

Petrov et al. (2007)

78.6%

Sha & Saul (2006)

78.9%

Yu et al. (2009)

79.2%

Convolutional DBN

80.3%

Learned first-layer bases

Lee et al., “Unsupervised feature learning for audio classiﬁcation using convolutional deep
68
belief networks”, NIPS 2009.

A Long List of Others
• Kaggle
– Merck Molecular Activation (‘12)
– Salary Prediction (‘13)

•
•
•
•

Learning to Play Atari Games (‘13)
NLP – chunking, NER, parsing, etc.
Activity recognition from video
Recommendations

Deep Learning In A Nutshell
•
•
•
•
•
•
•
•

Architectures vs. features
Deep vs. shallow
Automatic* features
Lots of data vs. best technique
Compute- vs. human intensive
State-of-the-art
Breaks expert, domain barrier
Details & tricks can be complex
http://www.deeplearning.net/

Interested in Deep Learning?
Connect for:
• Training Workshop (interest list)
• Projects / consulting

• Collaboration
• Questions

agardner@momentics.com
http://www.momentics.com/deep-learning/

Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 2014-01-08

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (14)

Similar to Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 2014-01-08

Similar to Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 2014-01-08 (20)

Recently uploaded

Recently uploaded (20)

Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 2014-01-08

Editor's Notes