Deep Learning and its Applications to Multimedia
Kai Yu
Multimedia Department, Baidu
Background
Shallow Models Since Late 80's
- Neural Networks
- Boosting
- Support Vector Machines
- Maximum Entropy
- ...
Since 2000 – Learning with Structures
- Kernel Learning
- Transfer Learning
- Semi-supervised Learning
- Manifold Learning
- Sparse Learning
- Matrix Factorization
- Structured Input-Output Prediction
- ...
Mission Not Yet Accomplished
Mining for Structure
Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements; most of it is unlabeled: images & video, speech & audio, text & language, gene expression, relational data / social networks, product recommendation, geological data, climate change.
- Develop statistical models that can discover underlying structure, cause, or statistical correlation from data in an unsupervised or semi-supervised way.
- Multiple application domains.
Slide Courtesy: Russ Salakhutdinov
The pipeline of machine visual perception
Low-level sensing → Pre-processing → Feature extraction → Feature selection → Inference: prediction, recognition
Most machine-learning effort has gone into the final inference stage, yet the feature stages are:
• Most critical for accuracy
• Responsible for most of the computation at test time
• Most time-consuming in the development cycle
• Often hand-crafted in practice
Computer vision features
SIFT, Spin image, HoG, RIFT, GLOH
Slide Courtesy: Andrew Ng
Learning features from data
Low-level sensing → Pre-processing → Feature extraction → Feature selection → Inference: prediction, recognition
Here the hand-designed feature stages are replaced by Feature Learning, which feeds into Machine Learning for inference.
Convolutional Neural Networks
Coding → Pooling → Coding → Pooling
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
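To make the coding/pooling pipeline concrete, here is a minimal NumPy sketch of one coding stage (a learned filter plus a nonlinearity) followed by max pooling. The filter values, the tanh nonlinearity, and the pooling size are illustrative choices, not the 1989 network's exact configuration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution in the CNN sense (cross-correlation): the 'coding' step."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: the 'pooling' step."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size          # trim to a multiple of `size`
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

# One coding+pooling stage on a random "image" with a random filter.
rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))
k = rng.standard_normal((5, 5))
h = np.tanh(conv2d(x, k))      # coding: filter response + nonlinearity
p = max_pool(h, 2)             # pooling: local translation invariance
print(p.shape)                 # (12, 12)
```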
"Winter of Neural Networks" Since 90's
- Non-convex
- Need a lot of tricks to play with
- Hard to do theoretical analysis
The Paradigm of Deep Learning
Deep Learning Since 2006
Neural networks are coming back!
Race on ImageNet (Top-5 Hit Rate)
72%, 2010
74%, 2011
The Best System on ImageNet (by 2012.10)
Pipeline: local gradients (e.g., SIFT, HOG) → pooling → variants of sparse coding (LLC, super-vector) → pooling → linear classifier
- This is a moderately deep model
- The first two layers are hand-designed
Challenge to Deep Learners
Key questions:
- What if no hand-crafted features are used at all?
- What if much deeper neural networks are used?
Experiment on ImageNet 1000 classes (1.3 million high-resolution training images)
• The 2010 competition winner got 47% error for its first choice and 25% error for top 5 choices.
• The current record is 45% error for first choice; this uses methods developed by the winners of the 2011 competition.
• Our chief critic, Jitendra Malik, has said that this competition is a good test of whether deep neural networks really do work well for object recognition. -- Geoff Hinton
Answer from Geoff Hinton, 2012.10
72%, 2010
74%, 2011
85%, 2012
The Architecture
Our model:
- Max-pooling layers follow the first, second, and fifth convolutional layers.
- The number of neurons in each layer is 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000.
- 7 hidden layers, not counting max pooling.
- Early layers are convolutional; the last two layers are globally connected.
- Uses rectified linear units in every layer.
- Uses competitive normalization to suppress hidden activities.
Slide Courtesy: Geoff Hinton
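The last two bullets mention rectified linear units and competitive normalization; below is a minimal NumPy sketch of both operations. The local-response-normalization constants (n=5, k=2, alpha=1e-4, beta=0.75) follow Krizhevsky et al.'s ImageNet description and should be treated as assumed defaults.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, used in every layer."""
    return np.maximum(0.0, x)

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Competitive normalization across feature maps.
    a: activations with shape (channels, height, width).
    Each channel is divided by a term summing the squared activations of its
    n neighboring channels, suppressing uniformly large responses."""
    C = a.shape[0]
    out = np.empty_like(a)
    for c in range(C):
        lo, hi = max(0, c - n // 2), min(C, c + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[c] = a[c] / denom
    return out

acts = relu(np.random.default_rng(0).standard_normal((96, 55, 55)))
print(local_response_norm(acts).shape)  # (96, 55, 55)
```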
Revolution in Speech Recognition
Word error rates from MSR, IBM, and the Google speech group
Slide Courtesy: Geoff Hinton
Deep Learning for NLP
Natural Language Processing (Almost) from Scratch, Collobert et al., JMLR 2011
[Figure 1: window approach network. An input window over the text ("cat sat on the mat") is mapped through lookup tables LT_W1 ... LT_WK (features 1..K per word), concatenated around the word of interest, then passed through Linear, HardTanh, and Linear layers; the output size equals the number of tags.]
Deep Learning in Industry
Microsoft
- First successful deep learning models for speech recognition, by MSR in 2009
- Now deployed in MS products, e.g. Xbox
"Google Brain" Project
Led by Google Fellow Jeff Dean
- Published two papers: ICML 2012, NIPS 2012
- Company-wide large-scale deep learning infrastructure
- Big success on images, speech, NLP
Deep Learning @ Baidu
- Started working on deep learning in summer 2012
- Achieved big success in speech recognition and image recognition; both will be deployed into products in late November.
- Meanwhile, efforts are being carried out in areas like OCR, NLP, text retrieval, ranking, ...
Building Blocks of Deep Learning
CVPR 2012 Tutorial: Deep Learning for Vision
09.00am: Introduction (Rob Fergus, NYU)
10.00am: Coffee Break
10.30am: Sparse Coding (Kai Yu, Baidu)
11.30am: Neural Networks (Marc'Aurelio Ranzato, Google)
12.30pm: Lunch
01.30pm: Restricted Boltzmann Machines (Honglak Lee, Michigan)
02.30pm: Deep Boltzmann Machines (Ruslan Salakhutdinov, Toronto)
03.00pm: Coffee Break
03.30pm: Transfer Learning (Ruslan Salakhutdinov, Toronto)
04.00pm: Motion & Video (Graham Taylor, Guelph)
05.00pm: Summary / Q & A (All)
05.30pm: End
Building Block 1 – RBM
Restricted Boltzmann Machine
Stochastic binary visible variables (e.g., image pixels) are connected to stochastic binary hidden variables in a bipartite structure.
The energy of a joint configuration is defined by the model parameters; the probability of the joint configuration is given by the Boltzmann distribution, normalized by the partition function over the bipartite potential functions.
Related models: Markov random fields, Boltzmann machines, log-linear models.
Slide Courtesy: Russ Salakhutdinov
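For reference, the energy and joint distribution referenced above read, for a standard Bernoulli-Bernoulli RBM with visibles v, hiddens h, and parameters θ = {W, b, a}:

$$E(\mathbf{v},\mathbf{h};\theta) = -\mathbf{v}^\top W \mathbf{h} - \mathbf{b}^\top \mathbf{v} - \mathbf{a}^\top \mathbf{h}$$

$$P(\mathbf{v},\mathbf{h};\theta) = \frac{1}{Z(\theta)}\exp\big(-E(\mathbf{v},\mathbf{h};\theta)\big), \qquad Z(\theta) = \sum_{\mathbf{v},\mathbf{h}} \exp\big(-E(\mathbf{v},\mathbf{h};\theta)\big)$$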
Restricted Boltzmann Machine
Restricted: no interactions between hidden variables (the bipartite structure connects visibles only to hiddens).
Inferring the distribution over the hidden variables is easy: it factorizes over hidden units and is easy to compute; the same holds for the visible variables given the hiddens.
Related models: Markov random fields, Boltzmann machines, log-linear models.
Slide Courtesy: Russ Salakhutdinov
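A minimal NumPy sketch of this factorized inference; the sigmoid conditionals follow from the Bernoulli RBM energy above, and the shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Because the graph is bipartite, each hidden unit is conditionally
# independent given the visibles, and vice versa.
def p_h_given_v(v, W, a):
    """P(h_j = 1 | v) = sigmoid(a_j + sum_i v_i W_ij);  v: (V,), W: (V, H), a: (H,)."""
    return sigmoid(a + v @ W)

def p_v_given_h(h, W, b):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j);  h: (H,), b: (V,)."""
    return sigmoid(b + h @ W.T)
```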
Model Parameter Learning
Given a set of i.i.d. training examples, we want to learn the model parameters by maximizing the (penalized) log-likelihood objective. The derivative of the log-likelihood involves an expectation under the model distribution, which is intractable to compute exactly.
Approximate maximum likelihood learning:
- Contrastive Divergence (Hinton 2000)
- Pseudo-Likelihood (Besag 1977)
- MCMC-MLE estimator (Geyer 1991)
- Composite Likelihoods (Lindsay 1988; Varin 2008)
- Tempered MCMC (Salakhutdinov, NIPS 2009)
- Adaptive MCMC (Salakhutdinov, ICML 2010)
Slide Courtesy: Russ Salakhutdinov
Contrastive Divergence
Run the Markov chain for a few steps (e.g., one step) starting from the data, producing "reconstructed" data; then update the model parameters using the difference between data-driven and reconstruction-driven statistics.
Slide Courtesy: Russ Salakhutdinov
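A minimal sketch of one CD-1 update for a Bernoulli RBM. The learning rate and the use of probabilities (rather than samples) in the reconstruction phase are common conventions assumed here, not prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, a, b, lr=0.1):
    """One CD-1 update on a batch v0 of shape (batch, visible)."""
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigmoid(a + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    # One Gibbs step: reconstruct visibles, then recompute hiddens.
    pv1 = sigmoid(b + h0 @ W.T)
    ph1 = sigmoid(a + pv1 @ W)
    # Update: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    a += lr * (ph0 - ph1).mean(axis=0)
    b += lr * (v0 - pv1).mean(axis=0)
    return W, a, b
```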
RBMs for Images
Gaussian-Bernoulli RBM: real-valued Gaussian visible variables (image pixels) coupled to binary hidden variables.
Interpretation: a mixture of an exponential number of Gaussians, where the binary hidden configuration selects the component through an implicit prior.
Slide Courtesy: Russ Salakhutdinov
Layerwise Pre-training
Deep Belief Network: stack RBMs (v → h1 → h2 → h3 with weights W1, W2, W3) and train them with an efficient layer-wise pretraining algorithm.
The justification is a variational lower bound on the log-likelihood, consisting of a likelihood term and an entropy functional; greedily replacing the prior over the first hidden layer with a second-layer RBM improves the bound.
Similar arguments apply to pretraining a Deep Boltzmann Machine.
Slide Courtesy: Russ Salakhutdinov
DBNs for MNIST Classification
- After layer-by-layer unsupervised pretraining, discriminative fine-tuning by backpropagation achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%.
- Clearly unsupervised learning helps generalization. It ensures that most of the information in the weights comes from modeling the input data.
[Figure: a stack of RBMs with hidden layers of 500, 500, and 2000 units and a 10-way softmax output, shown through its pretraining, unrolling, and fine-tuning stages.]
Slide Courtesy: Russ Salakhutdinov
Deep Autoencoders for Unsupervised Feature Learning
[Figure: a deep autoencoder. A stack of RBMs is pretrained layer by layer, unrolled into a symmetric encoder-decoder network with a low-dimensional code layer in the middle, and then fine-tuned by backpropagation.]
Slide Courtesy: Russ Salakhutdinov
Image Retrieval using Binary Codes
Searching a large image database using binary codes: map images into binary codes for fast retrieval.
- Small Codes: Torralba, Fergus, Weiss, CVPR 2008
- Spectral Hashing: Y. Weiss, A. Torralba, R. Fergus, NIPS 2008
- Kulis and Darrell, NIPS 2009; Gong and Lazebnik, CVPR 2011
- Norouzi and Fleet, ICML 2011
Slide Courtesy: Russ Salakhutdinov
Building Block 2 – Autoencoder Neural Net
Autoencoder Neural Network
input → encoder → code → decoder → prediction, with the error measured against the input
- the input is higher dimensional than the code
- error: ||prediction − input||²
- training: back-propagation
Slide Courtesy: Marc'Aurelio Ranzato
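A runnable toy version of such an autoencoder, assuming tied encoder/decoder weights, a tanh code, and plain gradient descent; the dimensions and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K = 64, 16                      # input dim > code dim (a bottleneck)
W = rng.standard_normal((D, K)) * 0.1

def forward(x):
    code = np.tanh(x @ W)          # encoder
    pred = code @ W.T              # decoder (tied weights)
    return code, pred

def train_step(x, lr=0.01):
    """One back-propagation step on the squared reconstruction error."""
    global W
    code, pred = forward(x)
    err = pred - x                                  # gradient of the error w.r.t. pred
    g_dec = (code.T @ err).T                        # gradient through the decoder
    g_enc = x.T @ ((err @ W) * (1 - code ** 2))     # back-prop through tanh encoder
    W -= lr * (g_dec + g_enc)
    return float((err ** 2).mean())

x = rng.standard_normal((32, D))
for _ in range(100):
    loss = train_step(x)
print(round(loss, 4))              # reconstruction error decreases over steps
```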
Sparse Autoencoder
input → encoder → code → decoder → prediction, with a sparsity penalty on the code
- sparsity penalty: ||code||₁
- error: ||prediction − input||²
- training: back-propagation
- loss: sum of squared reconstruction error and sparsity penalty
Slide Courtesy: Marc'Aurelio Ranzato
Sparse Autoencoder
With input X and code h = Wᵀ X, the loss is

$$L(X; W) = \|W h - X\|^2 + \textstyle\sum_j |h_j|$$

Le et al., "ICA with reconstruction cost ...", NIPS 2011
Slide Courtesy: Marc'Aurelio Ranzato
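A small sketch of this loss as stated on the slide; the sparsity weight `lam` is an assumed hyperparameter (the slide's formula carries no explicit weight).

```python
import numpy as np

def sparse_ae_loss(X, W, lam=0.1):
    """Sparse auto-encoder loss: code h = W^T x, loss = ||W h - x||^2 + lam * sum|h|.
    X: (n, d) data rows, W: (d, k) encoder/dictionary weights."""
    H = X @ W                                  # codes, one row per example
    recon = H @ W.T                            # reconstructions W h
    rec_err = np.sum((recon - X) ** 2)         # squared reconstruction error
    sparsity = lam * np.sum(np.abs(H))         # L1 penalty on the codes
    return rec_err + sparsity

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))
W = rng.standard_normal((64, 25)) * 0.1
print(sparse_ae_loss(X, W))
```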
Building Block 3 – Sparse Coding
Sparse coding

$$\min_{a,\phi}\; \sum_{i=1}^{m} \Big\| x_i - \sum_{j=1}^{k} a_{i,j}\,\phi_j \Big\|^2 + \lambda \sum_{i=1}^{m} \sum_{j=1}^{k} |a_{i,j}|$$

Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection).
Training: given a set of random patches x, learn a dictionary of bases [φ1, φ2, ...].
Coding: for a data vector x, solve the LASSO to find the sparse coefficient vector a.
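The coding step is a LASSO; below is a self-contained sketch using ISTA, a generic proximal-gradient solver chosen for brevity rather than the specific solver used in the talk.

```python
import numpy as np

def ista_code(x, Phi, lam=0.1, n_iter=200):
    """Find sparse a minimizing ||x - Phi @ a||^2 + lam * ||a||_1.
    x: (d,) data vector; Phi: (d, k) dictionary of bases."""
    L = np.linalg.norm(Phi, 2) ** 2            # spectral norm squared, step-size scale
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ a - x)           # half the gradient of the quadratic term
        z = a - grad / L                       # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)  # soft-threshold
    return a

rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))
x = 0.8 * Phi[:, 3] + 0.3 * Phi[:, 42]         # an exactly sparse combination (no noise)
a = ista_code(x, Phi)
print(np.flatnonzero(np.abs(a) > 0.05))        # recovers a few active bases
```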
Sparse coding: training time
Input: images x1, x2, ..., xm (each in Rᵈ)
Learn: a dictionary of bases φ1, φ2, ..., φk (also in Rᵈ), by minimizing the objective above.
Alternating optimization:
1. Fix the dictionary φ1, φ2, ..., φk, optimize a (a standard LASSO problem)
2. Fix the activations a, optimize the dictionary φ1, φ2, ..., φk (a convex QP problem)
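A compact sketch of the alternating loop, using scikit-learn's Lasso for step 1 and, for step 2, unconstrained least squares with renormalized bases; the latter is a common simplification of the constrained QP, assumed here for brevity.

```python
import numpy as np
from sklearn.linear_model import Lasso

def learn_dictionary(X, k=64, alpha=0.1, n_outer=10, seed=0):
    """Alternating dictionary learning. X: (n, d) patches; returns Phi of shape (d, k)."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((X.shape[1], k))
    Phi /= np.linalg.norm(Phi, axis=0)                 # unit-norm bases
    for _ in range(n_outer):
        # Step 1: fix the dictionary, solve a LASSO per patch for the codes.
        # Note sklearn scales the quadratic term by 1/(2d); alpha plays the
        # role of lambda up to that scaling.
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=2000)
        A = np.stack([lasso.fit(Phi, x).coef_ for x in X])      # (n, k) codes
        # Step 2: fix the codes, solve the dictionary least squares X ~ A @ Phi^T.
        Phi = np.linalg.lstsq(A, X, rcond=None)[0].T            # (d, k)
        Phi /= np.linalg.norm(Phi, axis=0) + 1e-12
    return Phi
```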
Sparse coding: testing time
Input: a novel image patch x (in Rᵈ) and the previously learned φᵢ's
Output: representation [a1, a2, ..., ak] of the image patch, obtained by solving the same LASSO with the dictionary held fixed.
x ≈ 0.8 * φ(·) + 0.3 * φ(·) + 0.5 * φ(·)  (the bases shown on the slide are image patches)
Represent x as: a = [0, 0, ..., 0, 0.8, 0, ..., 0, 0.3, 0, ..., 0, 0.5, ...]
Sparse coding illustration
[Figure: natural image patches and the learned bases (φ1, ..., φ64), which look like edges.]
Test example: x ≈ 0.8 * φ36 + 0.3 * φ42 + 0.5 * φ63
[a1, ..., a64] = [0, 0, ..., 0, 0.8, 0, ..., 0, 0.3, 0, ..., 0, 0.5, 0]  (feature representation)
Compact & easily interpretable.
Slide credit: Andrew Ng
RBM & autoencoders
- also involve activation and reconstruction
- but have an explicit f(x)
- do not necessarily enforce sparsity on a
- but if sparsity is put on a, often get improved results [e.g. sparse RBM, Lee et al. NIPS08]
[Diagram: x → a = f(x) (encoding), a → x' = g(a) (decoding)]
Sparse coding: A broader view
Any feature mapping from x to a, i.e. a = f(x), where
- a is sparse (and often higher dimensional than x)
- f(x) is nonlinear
- there is a reconstruction x' = g(a) such that x' ≈ x
Therefore, sparse RBMs, sparse auto-encoders, and even VQ can be viewed as forms of sparse coding.
Example of sparse activations (sparse coding)
- different x's have different dimensions activated
- locally-shared sparse representation: similar x's tend to have similar non-zero dimensions
a1 = [ 0 | | 0 0 ... 0 ]   (for x1)
a2 = [ | | 0 0 0 ... 0 ]   (for x2)
a3 = [ | 0 | 0 0 ... 0 ]   (for x3)
...
am = [ 0 0 0 | | ... 0 ]   (for xm)
Example of sparse activations (sparse coding)
- another example: preserving manifold structure
- more informative in highlighting richer data structures, i.e. clusters and manifolds
a1 = [ | | 0 0 0 ... 0 ]   (for x1)
a2 = [ 0 | | 0 0 ... 0 ]   (for x2)
a3 = [ 0 0 | | 0 ... 0 ]   (for x3)
...
am = [ 0 0 0 | | ... 0 ]   (for xm)
Sparsity vs. Locality
- Intuition: similar data should get similar activated features.
- Local sparse coding:
  - data in the same neighborhood tend to have shared activated features;
  - data in different neighborhoods tend to have different features activated.
[Diagram contrasting plain sparse coding with local sparse coding.]
Sparse coding is not always local: example
Case 1: independent subspaces
- Each basis is a "direction".
- Sparsity: each datum is a linear combination of only several bases.
Case 2: data manifold (or clusters)
- Each basis is an "anchor point".
- Sparsity: each datum is a linear combination of neighboring anchors.
- Sparsity is caused by locality.
Two approaches to local sparse coding
Approach 1: coding via local anchor points (local coordinate coding)
Approach 2: coding via local subspaces (super-vector coding)
References:
- Nonlinear learning using local coordinate coding. Kai Yu, Tong Zhang, and Yihong Gong. NIPS 2009.
- Learning locality-constrained linear coding for image classification. Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang. CVPR 2010.
- Image classification using super-vector coding of local image descriptors. Xi Zhou, Kai Yu, Tong Zhang, and Thomas Huang. ECCV 2010.
- Large-scale image classification: fast feature extraction and SVM training. Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, LiangLiang Cao, Thomas Huang. CVPR 2011.
Two approaches to local sparse coding
Approach 1: coding via local anchor points (local coordinate coding)
Approach 2: coding via local subspaces (super-vector coding)
Both approaches:
- achieve sparsity by explicitly ensuring locality
- have sound theoretical justifications
- are much simpler to implement and compute
- show strong empirical success
Hierarchical Sparse Coding
Data X = [x1 · · · xn] ∈ R^{d×n} are encoded with a dictionary B ∈ R^{d×p} into codes W = (w1 w2 · · · wn) ∈ R^{p×n}; a second layer of nonnegative weights α ∈ R^q shapes a prior over the first-layer codes:

$$\min_{W,\alpha}\; L(W,\alpha) + \frac{\lambda_1}{n}\,\|W\|_1 + \gamma\,\|\alpha\|_1, \qquad \alpha \ge 0,$$

$$L(W,\alpha) = \frac{1}{n}\sum_{i=1}^{n}\Big[\tfrac{1}{2}\,\|x_i - B w_i\|^2 + \lambda_2\, w_i^\top \Omega(\alpha)\, w_i\Big], \qquad \Omega(\alpha) = \Big(\sum_{k=1}^{q} \alpha_k\,\Sigma_k\Big)^{-1}.$$

Equivalently, with the code covariance S(W) = (1/n) Σᵢ wᵢwᵢᵀ ∈ R^{p×p}, the loss can be written as L(W, α) = (1/2n)‖X − BW‖²_F plus a penalty coupling S(W) with α.
Yu, Lin, & Lafferty, CVPR 11
Hierarchical Sparse Coding on MNIST
HSC vs. CNN: HSC provides even better performance than CNN; more amazingly, HSC learns the features in an unsupervised manner!
Yu, Lin, & Lafferty, CVPR 11
Second-layer dictionary
A hidden unit in the second layer is connected to a unit group in the first layer, giving invariance to translation, rotation, and deformation.
Yu, Lin, & Lafferty, CVPR 11
Adaptive Deconvolutional Networks for Mid and High Level Feature Learning
- Hierarchical convolutional sparse coding.
- Trained with respect to the image from all layers (L1-L4).
- Pooling both spatially and amongst features.
- Learns invariant mid-level features.
Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 2011
[Figure: feature maps and selected feature groups for layers 1-4, receptive fields to scale, and sample input images with reconstructions from each layer.]
Recap of Deep Learning Tutorial
- Building blocks: RBMs, autoencoder neural nets, sparse coding
- Go deeper: layerwise feature learning
  - Layer-by-layer unsupervised training
  - Layer-by-layer supervised training
- Fine tuning via backpropagation
  - If data are big enough, direct fine tuning is enough
- Sparsity on hidden layers is often useful.
Layer-by-layer unsupervised training + fine tuning
Loss = supervised_error + unsupervised_error
Weston et al. "Deep learning via semi-supervised embedding" ICML 2008
Slide Courtesy: Marc'Aurelio Ranzato
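A sketch of such a combined objective: a supervised term and an unsupervised (reconstruction) term computed from a shared representation. The weighting `beta` and the cross-entropy/squared-error choices are assumptions for illustration, not taken from the cited paper.

```python
import numpy as np

def combined_loss(x, y_true, encode, decode, classify, beta=0.5):
    """Total loss = supervised_error + beta * unsupervised_error."""
    h = encode(x)                                           # shared representation
    supervised = -np.sum(y_true * np.log(classify(h) + 1e-12))   # cross-entropy
    unsupervised = np.sum((decode(h) - x) ** 2)                  # reconstruction
    return supervised + beta * unsupervised

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4)) * 0.1      # encoder (and tied decoder) weights
V = rng.standard_normal((4, 3)) * 0.1      # classifier weights
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
x, y = rng.standard_normal(8), np.eye(3)[0]
print(combined_loss(x, y,
                    encode=lambda x: np.tanh(x @ W),
                    decode=lambda h: h @ W.T,
                    classify=lambda h: softmax(h @ V)))
```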
Layer-by-Layer Supervised Training
- Easy to add many error terms to the loss function.
- Joint learning of related tasks yields better representations.
Example of architecture: Collobert et al. "NLP (almost) from scratch" JMLR 2011
Slide Courtesy: Marc'Aurelio Ranzato
Biological & Theoretical Justification
Why Hierarchy?
Theoretical: "...well-known depth-breadth tradeoff in circuits design [Hastad 1987]. This suggests many functions can be much more efficiently represented with deeper architectures..." [Bengio & LeCun 2007]
Biological: the visual cortex is hierarchical (Hubel-Wiesel Model) [Thorpe]
Sparse DBN: Training on face images
pixels → edges → object parts (combinations of edges) → object models
[Lee, Grosse, Ranganath & Ng, 2009]
Deep Architecture in the Brain
Retina (pixels) → Area V1 (edge detectors) → Area V2 (primitive shape detectors) → Area V4 (higher-level visual abstractions)
Sensor representation in the brain
[Roe et al., 1992; BrainPort; Welsh & Blasch, 1997]
Examples: seeing with your tongue; human echolocation (sonar).
The auditory cortex learns to see. (The same rewiring process also works for the touch/somatosensory cortex.)
Large-scale training
The Challenge
A large-scale problem has:
- lots of training samples (>10M)
- lots of classes (>10K)
- lots of input dimensions (>10K)
Why it is hard:
- the best optimizer in practice is online SGD, which is naturally sequential and hard to parallelize;
- layers cannot be trained independently and in parallel, so training is hard to distribute;
- the model can have lots of parameters that may clog the network, making it hard to distribute across machines.
Slide Courtesy: Marc'Aurelio Ranzato
A solution by model parallelism
Model parallelism: split one model across machines (1st, 2nd, 3rd machine).
Model parallelism + data parallelism: in addition, run several such model replicas on different inputs (input #1, #2, #3).
Le et al. "Building high-level features using large-scale unsupervised learning" ICML 2012
Slide Courtesy: Marc'Aurelio Ranzato
Asynchronous SGD
A central parameter server holds the model parameters; each model replica (1st, 2nd, 3rd replica) independently fetches the current parameters θ, computes a gradient ∂L/∂θ on its own data shard, and pushes it back. The server applies updates as they arrive, without synchronizing the replicas.
Slide Courtesy: Marc'Aurelio Ranzato
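A toy sketch of the pattern: worker threads play the role of model replicas, each computing gradients on its own data shard and pushing updates to a shared parameter server without synchronizing. This is single-process and thread-based purely to show the mechanics, not the distributed system of the ICML 2012 paper.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the shared parameters and applies asynchronous updates."""
    def __init__(self, dim, lr=0.01):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.theta.copy()

    def push(self, grad):
        with self.lock:
            self.theta -= self.lr * grad

def worker(server, X, y, n_steps=500):
    for _ in range(n_steps):
        theta = server.fetch()                     # possibly stale parameters
        grad = 2 * X.T @ (X @ theta - y) / len(y)  # least-squares gradient on this shard
        server.push(grad)                          # asynchronous update

rng = np.random.default_rng(0)
true_theta = rng.standard_normal(5)
server = ParameterServer(dim=5)
threads = []
for _ in range(3):                                 # three replicas, three data shards
    X = rng.standard_normal((100, 5))
    y = X @ true_theta
    t = threading.Thread(target=worker, args=(server, X, y))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(np.allclose(server.theta, true_theta, atol=1e-2))  # converges despite staleness
```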
Training A Model with 1B Parameters
Unsupervised learning with 1B parameters. Deep net:
- 3 stages; each stage consists of local filtering, L2 pooling, and LCN (local contrast normalization)
- 18x18 filters, 8 filters at each location
- L2 pooling and LCN over 5x5 neighborhoods
- the three layers are trained jointly by reconstructing the input of each layer, with sparsity on the code
Slide Courtesy: Marc'Aurelio Ranzato
Thank you!
Follow us: t.baidu-tech.com
Downloads and details: infoq.com/cn/zones/baidu-salon
Planned, organized, and run by InfoQ
Follow us: weibo.com/infoqchina
"Imagine, exchange, debate, meet" is the motto of Baidu Tech Salon, a regular offline technical exchange event organized by Baidu and InfoQ China. Its goal is to give mid- to senior-level engineers a relatively open platform for exchanging ideas and making connections. Each session focuses on a single topic and has two key parts: speaker talks and OpenSpace.
The talks and live Q&A share proven engineering practices from Baidu and other well-known websites; the OpenSpace session extends and deepens each session's theme, providing an open forum where any participant can raise a topic for discussion.
  • 73. Training A Model with 1 B Parameters! 13-1-12 73 Deep Net: – 3 stages – each stage consists of local filtering, L2 pooling, LCN - 18x18 filters - 8 filters at each location - L2 pooling and LCN over 5x5 neighborhoods – training jointly the three layers by: - reconstructing the input of each layer - sparsity on the code Unsupervised Learning With 1B Paramete Slide Courtesy: Marc'Aurelio Ranzato