Deep Learning
Unsupervised Learning
Russ Salakhutdinov
Machine Learning Department
Carnegie Mellon University
Canadian Institute for Advanced Research
Tutorial Roadmap
Part 1: Supervised (Discriminative) Learning: Deep Networks
Part 2: Unsupervised Learning: Deep Generative Models
Part 3: Open Research Questions
Unsupervised Learning
Non-probabilistic Models
Ø Sparse Coding
Ø Autoencoders
Ø Others (e.g. k-means)
Probabilistic (Generative) Models
Explicit Density p(x):
Ø Tractable Models: Fully observed Belief Nets, NADE, PixelRNN
Ø Non-Tractable Models: Boltzmann Machines, Variational Autoencoders, Helmholtz Machines, many others…
Implicit Density:
Ø Generative Adversarial Networks
Ø Moment Matching Networks
Tutorial Roadmap
• Basic Building Blocks:
Ø Sparse Coding
Ø Autoencoders
• Deep Generative Models
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
Sparse Coding
• Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection).
• Objective: Given a set of input data vectors {x₁, …, x_N}, learn a dictionary of bases {φ₁, …, φ_K} such that each data vector is represented as a sparse linear combination of the bases:
x_n ≈ Σ_k a_{nk} φ_k
Sparse: the coefficients a_{nk} are mostly zeros.
Sparse Coding
Natural Images → Learned bases: “Edges”
New example: x = 0.8 * [base] + 0.3 * [base] + 0.5 * [base]
[0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)
Slide Credit: Honglak Lee
Sparse Coding: Training
• Input image patches: x₁, …, x_N.
• Learn a dictionary of bases φ₁, …, φ_K by minimizing a reconstruction error plus a sparsity penalty:
min Σ_n ||x_n − Σ_k a_{nk} φ_k||² + λ Σ_{n,k} |a_{nk}|
• Alternating Optimization:
1. Fix the dictionary of bases and solve for activations a (a standard Lasso problem).
2. Fix activations a and optimize the dictionary of bases (a convex QP problem).
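A minimal sketch of this alternation in Python (hyperparameters `lam` and `n_iters` are illustrative; step 1 uses scikit-learn's Lasso, and an unconstrained least-squares update with renormalization stands in for the constrained QP of step 2):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_coding(X, K, lam=0.1, n_iters=20, seed=0):
    """Alternating optimization for sparse coding.
    X: (N, D) data matrix. Returns bases Phi (K, D) and codes A (N, K)."""
    rng = np.random.RandomState(seed)
    Phi = rng.randn(K, X.shape[1])
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
    for _ in range(n_iters):
        # 1. Fix the bases, solve for activations a: a standard Lasso problem.
        A = Lasso(alpha=lam, fit_intercept=False, max_iter=5000) \
                .fit(Phi.T, X.T).coef_               # shape (N, K)
        # 2. Fix the activations, update the bases by least squares;
        #    renormalizing rows approximates the norm constraint of the QP.
        Phi = np.linalg.lstsq(A, X, rcond=None)[0]
        Phi /= np.linalg.norm(Phi, axis=1, keepdims=True) + 1e-12
    return Phi, A
```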
Sparse Coding: Testing Time
• Input: a new image patch x*, and the K learned bases.
• Output: sparse representation a of the image patch x*:
x* = 0.8 * [base] + 0.3 * [base] + 0.5 * [base]
[0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)
Image Classification
Evaluated on the Caltech101 object category dataset (9K images, 101 classes).
Pipeline: Input Image → Learned bases → Features (coefficients) → Classification Algorithm (SVM)

Algorithm | Accuracy
Baseline (Fei-Fei et al., 2004) | 16%
PCA | 37%
Sparse Coding | 47%

Lee, Battle, Raina, Ng, 2006
Slide Credit: Honglak Lee
Interpreting Sparse Coding
x → [implicit nonlinear encoding a = f(x)] → sparse features a → [explicit linear decoding g(a)] → x’
• Sparse, over-complete representation a.
• Encoding a = f(x) is an implicit and nonlinear function of x.
• Reconstruction (or decoding) x’ = g(a) is linear and explicit.
Autoencoder
Input Image → Encoder → Feature Representation → Decoder
Feed-forward, bottom-up path; feed-back, generative, top-down path.
• Details of what goes inside the encoder and decoder matter!
• Need constraints to avoid learning an identity mapping.
Autoencoder
Input Image x → Encoder filters W, sigmoid function: z = σ(Wx) → Binary Features z → Decoder filters D, linear function: Dz
Autoencoder
• An autoencoder with D inputs, D outputs, and K hidden units, with K<D.
• Given an input x, its reconstruction is given by decoding the encoder output: Dz, with z = σ(Wx).
Autoencoder
• An autoencoder with D inputs, D outputs, and K hidden units, with K<D.
• We can determine the network parameters W and D by minimizing the reconstruction error:
E(W, D) = Σ_n ||x_n − D σ(W x_n)||²
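A minimal runnable sketch of this model in PyTorch (the sizes D_in and K, the optimizer, and the learning rate are illustrative choices, not from the slides):

```python
import torch
import torch.nn as nn

D_in, K = 784, 100                        # illustrative sizes with K < D_in

encoder = nn.Sequential(nn.Linear(D_in, K), nn.Sigmoid())  # z = sigma(Wx)
decoder = nn.Linear(K, D_in, bias=False)                   # linear decoder Dz

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(x):
    """One gradient step on the squared reconstruction error; x: (batch, D_in)."""
    z = encoder(x)                        # feature representation
    x_hat = decoder(z)                    # reconstruction
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```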
Autoencoder
• If the hidden and output layers are linear (z = Wx, linear features z), the network learns hidden units that are a linear function of the data and minimizes the squared error.
• The K hidden units will span the same space as the first K principal components. The weight vectors, however, need not be orthogonal.
• With nonlinear hidden units, we have a nonlinear generalization of PCA.
Another Autoencoder Model
Binary Input x → Encoder filters W, sigmoid function: z = σ(Wx) → Binary Features z → Decoder: σ(Wᵀz)
• Relates to Restricted Boltzmann Machines (later).
• Need additional constraints to avoid learning an identity mapping.
Predictive Sparse Decomposition
Real-valued Input x → Encoder filters W, sigmoid function: z = σ(Wx) → Binary Features z (L1 sparsity penalty) → Decoder filters D: Dz
At training time, both the encoder and the decoder paths are used.
Kavukcuoglu, Ranzato, Fergus, LeCun, 2009
Stacked Autoencoders
Input x → [Encoder ⇄ Decoder, sparsity] → Features → [Encoder ⇄ Decoder, sparsity] → Features → [Encoder ⇄ Decoder] → Class Labels
Greedy Layer-wise Learning.
Stacked Autoencoders
Input x → Encoder → Features → Encoder → Features → Encoder → Class Labels
• Remove the decoders and use the feed-forward part.
• Standard, or convolutional, neural network architecture.
• Parameters can be fine-tuned using backpropagation.
Deep Autoencoders
[Figure: Pretraining, Unrolling, Finetuning. Pretraining learns a stack of RBMs (layer sizes 2000, 1000, 500, 30). Unrolling stacks the RBMs into an encoder (weights W₁…W₄) and a mirrored decoder (transposed weights W₄ᵀ…W₁ᵀ) around a 30-dimensional code layer. Finetuning then adjusts all weights with backpropagation.]
Deep Autoencoders
• A (25×25)–2000–1000–500–30 autoencoder to extract 30-D real-valued codes for Olivetti face patches.
• Top: Random samples from the test dataset.
• Middle: Reconstructions by the 30-dimensional deep autoencoder.
• Bottom: Reconstructions by 30-dimensional PCA.
Information Retrieval
[Figure: documents mapped into a 2-D LSA space, with clusters labeled Legal/Judicial; Leading Economic Indicators; European Community Monetary/Economic; Accounts/Earnings; Interbank Markets; Government Borrowings; Disasters and Accidents; Energy Markets.]
• The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test).
• “Bag-of-words” representation: each article is represented as a vector containing the counts of the most frequently used 2000 words.
(Hinton and Salakhutdinov, Science 2006)
Tutorial Roadmap
• Basic Building Blocks:
Ø Sparse Coding
Ø Autoencoders
• Deep Generative Models
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
Fully Observed Models
• Maximum likelihood:
θ* = arg max_θ E_{x∼p_data} log p_model(x | θ)
• Tractable belief net: explicitly model the conditional probabilities:
p_model(x) = p_model(x₁) ∏_{i=2}^{n} p_model(x_i | x₁, …, x_{i−1})
Each conditional can be a complicated neural network.
• A number of successful models, including:
Ø NADE, RNADE (Larochelle et al., 2011)
Ø Pixel CNN (van den Oord et al., 2016)
Ø Pixel RNN (van den Oord et al., 2016)
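As a sketch of how such a conditional can be parameterized over images, here is the masked convolution at the heart of PixelCNN-style models, in PyTorch (a common formulation of the idea, not the authors' code; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d masked so each output pixel depends only on pixels above it
    and to its left. Mask type 'A' also hides the centre pixel (first
    layer); type 'B' allows it (later layers)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        self.register_buffer('mask', torch.ones_like(self.weight))
        _, _, kH, kW = self.weight.shape
        self.mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        self.mask[:, :, kH // 2 + 1:, :] = 0

    def forward(self, x):
        self.weight.data *= self.mask   # enforce the autoregressive ordering
        return super().forward(x)

# e.g. the first layer of a tiny model over 1-channel images:
layer = MaskedConv2d('A', in_channels=1, out_channels=32,
                     kernel_size=7, padding=3)
```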
Restricted Boltzmann Machines
Graphical Models: a powerful framework for representing dependency structure between random variables (Markov random fields, Boltzmann machines, log-linear models).
An RBM is a Markov Random Field with:
• Stochastic binary visible variables (image pixels)
• Stochastic binary hidden variables (feature detectors)
• Bipartite connections: pair-wise visible-hidden terms plus unary bias terms.
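The pair-wise and unary terms above appeared as an image in the original slide; for reference, the standard RBM energy and joint distribution they refer to are

```latex
E(\mathbf{v},\mathbf{h}) = -\mathbf{v}^\top W \mathbf{h}
  - \mathbf{b}^\top \mathbf{v} - \mathbf{a}^\top \mathbf{h},
\qquad
P(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\exp\!\big(-E(\mathbf{v},\mathbf{h})\big)
```

with the pair-wise term vᵀWh and unary bias terms bᵀv and aᵀh.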
Learning Features
Observed Data: subset of 25,000 handwritten characters.
Learned W: “edges” (subset of 1000 features).
New Image: represented as a sparse combination of learned edge features (sparse representations).
Logistic Function: suitable for modeling binary images.
RBMs for Real-valued & Count Data
• Real-valued data: learned features (out of 10,000) from 4 million unlabelled images.
• Count data: Reuters dataset, 804,414 unlabeled newswire stories, bag-of-words. Learned features are “topics”:
Ø russian, russia, moscow, yeltsin, soviet
Ø clinton, house, president, bill, congress
Ø computer, system, product, software, develop
Ø trade, country, import, world, economy
Ø stock, wall, street, point, dow
Collaborative Filtering
Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings.
Multinomial visible variables: user ratings. Binary hidden variables: user preferences.
State-of-the-art performance on the Netflix dataset.
Learned features are “genres”:
Ø Fahrenheit 9/11; Bowling for Columbine; The People vs. Larry Flynt; Canadian Bacon; La Dolce Vita
Ø Independence Day; The Day After Tomorrow; Con Air; Men in Black II; Men in Black
Ø Friday the 13th; The Texas Chainsaw Massacre; Children of the Corn; Child's Play; The Return of Michael Myers
Ø Scary Movie; Naked Gun; Hot Shots!; American Pie; Police Academy
(Salakhutdinov, Mnih, Hinton, ICML 2007)
Different Data Modalities
• Binary/Gaussian/Softmax RBMs: all have binary hidden variables but use them to model different kinds of data: binary, real-valued, and 1-of-K (e.g. [0 0 1 0 0]).
• It is easy to infer the states of the hidden variables.
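Inference is easy because the hidden units are conditionally independent given the visibles; in the binary case the standard conditional (shown here for reference, since the slide gave it as an image) is

```latex
p(h_j = 1 \mid \mathbf{v}) = \sigma\!\Big(a_j + \sum_i W_{ij}\, v_i\Big),
\qquad \sigma(x) = \frac{1}{1+e^{-x}}
```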
Product of Experts
The joint distribution is given by marginalizing over the hidden variables: a Product of Experts.
Learned topics (top words):
Ø government, authority, power, empire, federation
Ø clinton, house, president, bill, congress
Ø bribery, corruption, dishonesty, corrupt, fraud
Ø mafia, business, gang, mob, insider
Ø stock, wall, street, point, dow
…
Topics “government”, “corruption” and “mafia” can combine to give very high probability to the word “Silvio Berlusconi”.
[Figure: precision (%) vs. recall (%) curves for document retrieval on Reuters; Replicated Softmax 50-D outperforms LDA 50-D across recall levels.]
Deep Boltzmann Machines
Learn simpler representations, then compose more complex ones:
Input: Pixels → Low-level features: Edges → Higher-level features: Combination of edges
Built from unlabeled inputs.
(Salakhutdinov 2008; Salakhutdinov & Hinton 2009, 2012)
Model Formulation
[Figure: layered model, input v – W¹ – h¹ – W² – h² – W³ – h³.]
• Same as RBMs: model parameters W¹, W², W³; all connections are undirected.
• Dependencies between hidden variables.
• Bottom-up and top-down interactions.
• Hidden variables are dependent even when conditioned on the input.
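The energy appeared as an image; reconstructed for reference, the standard three-layer DBM energy is

```latex
E(\mathbf{v},\mathbf{h}^1,\mathbf{h}^2,\mathbf{h}^3;\theta)
= -\mathbf{v}^\top W^1 \mathbf{h}^1
  -{\mathbf{h}^1}^\top W^2 \mathbf{h}^2
  -{\mathbf{h}^2}^\top W^3 \mathbf{h}^3,
\qquad \theta = \{W^1, W^2, W^3\}
```

(bias terms omitted for brevity).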
Approximate Learning
(Approximate) Maximum Likelihood:
• The gradient involves a data-dependent expectation and a model expectation (see the reconstruction below).
• The posterior over the hidden variables is not factorial any more!
• Both expectations are intractable!
• Data-dependent expectation: Variational Inference. Model expectation: Stochastic Approximation (MCMC-based).
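The two expectations referred to above are those in the standard Boltzmann-machine likelihood gradient, reconstructed here for reference (the slide showed it as an image); e.g. for the first-layer weights:

```latex
\frac{\partial \log p(\mathbf{v};\theta)}{\partial W^1}
= \mathbb{E}_{P_{\text{data}}}\!\big[\mathbf{v}\,{\mathbf{h}^1}^\top\big]
- \mathbb{E}_{P_{\text{model}}}\!\big[\mathbf{v}\,{\mathbf{h}^1}^\top\big]
```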
Good Generative Model?
Handwritten Characters: Real Data vs. Simulated samples from the model.
Handwriting Recognition
MNIST Dataset: 60,000 examples of 10 digits (permutation-invariant version).
Learning Algorithm | Error
Logistic regression | 12.0%
K-NN | 3.09%
Neural Net (Platt 2005) | 1.53%
SVM (Decoste et al. 2002) | 1.40%
Deep Autoencoder (Bengio et al. 2007) | 1.40%
Deep Belief Net (Hinton et al. 2006) | 1.20%
DBM | 0.95%

Optical Character Recognition: 42,152 examples of 26 English letters.
Learning Algorithm | Error
Logistic regression | 22.14%
K-NN | 18.92%
Neural Net | 14.62%
SVM (Larochelle et al. 2009) | 9.70%
Deep Autoencoder (Bengio et al. 2007) | 10.05%
Deep Belief Net (Larochelle et al. 2009) | 9.68%
DBM | 8.40%
3-D Object Recognition
NORB Dataset: 24,000 examples.
Learning Algorithm | Error
Logistic regression | 22.5%
K-NN (LeCun 2004) | 18.92%
SVM (Bengio & LeCun 2007) | 11.6%
Deep Belief Net (Nair & Hinton 2009) | 9.0%
DBM | 7.2%
[Figure: pattern completion examples.]
Data – Collection of Modalities
• Multimedia content on the web: image + text + audio (e.g. tags: sunset, pacificocean, bakerbeach, seashore, ocean; car, automobile).
• Product recommendation systems.
• Robotics applications: audio, vision, touch sensors, motor control.
Challenges - I
Very different input representations:
• Images – real-valued, dense.
• Text – discrete, sparse (e.g. tags: sunset, pacific ocean, baker beach, seashore, ocean).
Difficult to learn cross-modal features from low-level representations.
Challenges - II
Noisy and missing data. Example image-text pairs:
Ø pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion
Ø mickikrimmel, mickipedia, headshot
Ø unseulpixel, naturey
Ø <no text>
Challenges - II
Text generated by the model for the same images:
Ø beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
Ø portrait, girl, woman, lady, blonde, pretty, gorgeous, expression, model
Ø night, notte, traffic, light, lights, parking, darkness, lowlight, nacht, glow
Ø fall, autumn, trees, leaves, foliage, forest, woods, branches, path
Multimodal DBM
• Image pathway: dense, real-valued image features → Gaussian model.
• Text pathway: word counts, 1-of-K (e.g. [0 0 1 0 0]) → Replicated Softmax.
(Srivastava & Salakhutdinov, NIPS 2012; JMLR 2014)
Text Generated from Images
Given image → Generated tags:
Ø canada, nature, sunrise, ontario, fog, mist, bc, morning
Ø insect, butterfly, insects, bug, butterflies, lepidoptera
Ø graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
Ø portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
Ø dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
Ø sea, france, boat, mer, beach, river, bretagne, plage, brittany
Generating Text from Images
Samples drawn after every 50 steps of Gibbs updates.
Text Generated from Images
Given image → Generated tags:
Ø water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
Ø portrait, women, army, soldier, mother, postcard, soldiers
Ø obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
Images from Text
Given text → Retrieved images:
Ø water, red, sunset
Ø nature, flower, red, green
Ø blue, green, yellow, colors
Ø chocolate, cake
MIR-Flickr Dataset (Huiskes et al.)
• 1 million images along with user-assigned tags, e.g.:
Ø sculpture, beauty, stone
Ø nikon, green, light, photoshop, apple, d70
Ø white, yellow, abstract, lines, bus, graphic
Ø sky, geotagged, reflection, cielo, bilbao, reflejo
Ø food, cupcake, vegan
Ø d80
Ø anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography
Ø nikon, abigfave, goldstaraward, d80, nikond80
Results
• Logistic regression on the top-level representation.
• Multimodal inputs. Labeled: 25K examples + 1 million unlabelled. MAP = Mean Average Precision.
Learning Algorithm | MAP | Precision@50
Random | 0.124 | 0.124
LDA [Huiskes et al.] | 0.492 | 0.754
SVM [Huiskes et al.] | 0.475 | 0.758
DBM-Labelled | 0.526 | 0.791
Deep Belief Net | 0.638 | 0.867
Autoencoder | 0.638 | 0.875
DBM | 0.641 | 0.873
Helmholtz Machines
• Hinton, G. E., Dayan, P., Frey, B. J. and Neal, R., Science 1995
[Figure: layered model over the input data, v – W¹ – h¹ – W² – h² – W³ – h³, with a top-down generative process and a bottom-up approximate-inference path.]
Related work:
• Kingma & Welling, 2014
• Rezende, Mohamed, Wierstra, 2014
• Mnih & Gregor, 2014
• Bornschein & Bengio, 2015
• Tang & Salakhutdinov, 2013
Helmholtz Machines vs. DBMs
[Figure: both are layered models v – W¹ – h¹ – W² – h² – W³ – h³ over the input data. The Helmholtz Machine has a directed generative process with a separate approximate-inference network; the Deep Boltzmann Machine is undirected.]
Variational Autoencoders (VAEs)
• The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:
p_θ(x, h¹, …, h^L) = p(h^L) p(h^{L−1} | h^L) ⋯ p(x | h¹)
• θ denotes the parameters of the VAE; L is the number of stochastic layers.
• Each conditional term may denote a complicated nonlinear relationship.
• Sampling and probability evaluation is tractable for each conditional.
VAE: Example
• The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers: here, a deterministic layer sits between two stochastic layers, and the conditional mapping one stochastic layer to the next denotes a one-layer neural net.
• θ denotes the parameters of the VAE; L is the number of stochastic layers.
• Sampling and probability evaluation is tractable for each conditional.
Variational Bound
• The VAE is trained to maximize the variational lower bound:
• Trading off the data log-likelihood and the KL divergence from the true posterior.
• Hard to optimize the variational bound with respect to the recognition network (high variance).
• Key idea of Kingma and Welling: use the reparameterization trick.
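The bound itself appeared as an image on the slide; the standard form it refers to is

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(h\mid x)}\big[\log p_\theta(x\mid h)\big]
\;-\;
\mathrm{KL}\big(q_\phi(h\mid x)\,\|\,p(h)\big)
```

where q_φ(h | x) is the recognition network.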
Reparameterization Trick
• Assume that the recognition distribution is Gaussian: q(h | x) = N(h; μ(x), Σ(x)), with mean and covariance computed from the state of the hidden units at the previous layer.
• Alternatively, we can express this in terms of an auxiliary variable: h = μ(x) + Σ^{1/2}(x) ε, with ε ∼ N(0, I).
• The recognition distribution can thus be expressed in terms of a deterministic mapping (a deterministic encoder); the distribution of ε does not depend on the encoder parameters.
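In code, the trick is essentially one line; a minimal PyTorch sketch (the names mu and log_var, and the diagonal-Gaussian parameterization, are illustrative conventions):

```python
import torch

def sample_h(mu, log_var):
    """Reparameterized sample h = mu + sigma * eps, with eps ~ N(0, I).
    Gradients flow into mu and log_var; eps itself carries no parameters."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```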
Computing the Gradients
• The gradient w.r.t. the parameters, both recognition and generative, can be computed by backprop: for fixed ε, the mapping h is a deterministic neural net.
Importance Weighted Autoencoders
• Can improve the VAE by using a k-sample importance weighting of the log-likelihood, where the h's are sampled from the recognition network and the ratios are unnormalized importance weights.
Burda, Grosse, Salakhutdinov, 2015
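The bound itself (an image on the slide) has the standard form

```latex
\mathcal{L}_k(x) =
\mathbb{E}_{h^{(1)},\dots,h^{(k)} \sim q(h\mid x)}
\left[\log \frac{1}{k}\sum_{i=1}^{k}
\frac{p(x, h^{(i)})}{q(h^{(i)}\mid x)}\right]
```

which recovers the usual variational bound at k = 1 and tightens monotonically as k grows.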
Generating Images from Captions
• Generative Model: a Stochastic Recurrent Network, a chained sequence of Variational Autoencoders with a single stochastic layer.
• Recognition Model: a Deterministic Recurrent Network.
Gregor et al. 2015; (Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
Motivating Example
• Can we generate images from natural language descriptions?
Ø A stop sign is flying in blue skies.
Ø A pale yellow school bus is flying in blue skies.
Ø A herd of elephants is flying in blue skies.
Ø A large commercial airplane is flying in blue skies.
(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
Flipping Colors
Ø A yellow school bus parked in the parking lot.
Ø A red school bus parked in the parking lot.
Ø A green school bus parked in the parking lot.
Ø A blue school bus parked in the parking lot.
(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
Novel Scene Compositions
Ø A toilet seat sits open in the bathroom.
Ø A toilet seat sits open in the grass field. Ask Google?
(Some) Open Problems
• Reasoning, Attention, and Memory
• Natural Language Understanding
• Deep Reinforcement Learning
• Unsupervised Learning / Transfer Learning / One-Shot Learning
Who-Did-What Dataset
• Document: “…arrested Illinois governor Rod Blagojevich and his chief of staff John Harris on corruption charges … included Blagojevich allegedly conspiring to sell or trade the senate seat left vacant by President-elect Barack Obama…”
• Query: President-elect Barack Obama said Tuesday he was not aware of alleged corruption by X who was arrested on charges of trying to sell Obama’s senate seat.
• Answer: Rod Blagojevich
Onishi, Wang, Bansal, Gimpel, McAllester, EMNLP, 2016
Recurrent Neural Network
[Figure: inputs x₁, x₂, x₃ feed hidden states h₁, h₂, h₃; each hidden state is a nonlinearity applied to the input at time step t and the hidden state at the previous time step.]
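The update the diagram depicts is the standard recurrence (W, U and the nonlinearity f, e.g. tanh, are the usual assumed names):

```latex
h_t = f(W x_t + U h_{t-1})
```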
Gated Attention Mechanism
• Use Recurrent Neural Networks (RNNs) to encode a document and a query.
Ø Use element-wise multiplication to model the interactions between document and query (see the sketch below).
(Dhingra, Liu, Yang, Cohen, Salakhutdinov, ACL 2017)
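A minimal sketch of such a gating step in PyTorch (tensor shapes and names are illustrative; this follows the common formulation of attending over query tokens and multiplying element-wise, not the authors' exact code):

```python
import torch

def gated_attention(doc, query):
    """doc: (B, D_len, dim) token encodings of the document;
    query: (B, Q_len, dim) token encodings of the query."""
    scores = torch.bmm(doc, query.transpose(1, 2))   # (B, D_len, Q_len)
    alpha = torch.softmax(scores, dim=-1)            # attention over query tokens
    q_tilde = torch.bmm(alpha, query)                # per-token query summary
    return doc * q_tilde                             # element-wise gating
```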
Multi-hop Architecture
• Reasoning requires several passes over the context.
(Dhingra, Liu, Yang, Cohen, Salakhutdinov, ACL 2017)
Analysis of Attention
• Context: “…arrested Illinois governor Rod Blagojevich and his chief of staff John Harris on corruption charges … included Blagojevich allegedly conspiring to sell or trade the senate seat left vacant by President-elect Barack Obama…”
• Query: “President-elect Barack Obama said Tuesday he was not aware of alleged corruption by X who was arrested on charges of trying to sell Obama’s senate seat.”
• Answer: Rod Blagojevich
[Figure: attention maps at Layer 1 and Layer 2.]
Code + Data: https://github.com/bdhingra/ga-reader
Incorporating Prior Knowledge
[Figure: the words of “Mary got the football. She went to the kitchen. She left the ball there.” encoded by an RNN, with coreference and hyper/hyponymy edges linking related tokens.]
Dhingra, Yang, Cohen, Salakhutdinov 2017
Memory as Acyclic Graph Encoding (MAGE) - RNN
[Figure: the same example processed by an RNN whose memory M_t stores past hidden states h₀ … h_{t−1}; at step t the model reads from memory along typed edges e₁ … e_|E| with attention to form g_t, then writes h_t into M_{t+1}.]
Dhingra, Yang, Cohen, Salakhutdinov 2017
Incorporating Prior Knowledge
Example: “Her plain face broke into a huge smile when she saw Terry. ‘Terry!’ she called out. She rushed to meet him and they embraced. ‘Hon, I want you to meet an old friend, Owen McKenna. Owen, please meet Emily.’ She gave me a quick nod and turned back to X”
Sources of prior knowledge feeding the text representation (Recurrent Neural Network):
Ø Coreference, dependency parses – Core NLP
Ø Entity relations – Freebase
Ø Word relations – WordNet
Neural Story Telling
Sample from the generative model (recurrent neural network):
“The sun had risen from the ocean, making her feel more alive than normal. She is beautiful, but the truth is that I do not know what to do. The sun was just starting to fade away, leaving people scattered around the Atlantic Ocean. She was in love with him for the first time in months, so she had no intention of escaping.”
(Kiros et al., NIPS 2015)
(Some) Open Problems
• Reasoning, Attention, and Memory
• Natural Language Understanding
• Deep Reinforcement Learning
• Unsupervised Learning / Transfer Learning / One-Shot Learning
Learning Behaviors
Learning to map sequences of observations to actions, for a particular goal.
[Figure: agent loop of observation and action.]
Reinforcement Learning with Memory
[Figure: the agent receives an observation/state and a reward, emits an action, and reads/writes a learned external memory.]
Differentiable Neural Computer, Graves et al., Nature, 2016; Neural Turing Machine, Graves et al., 2014
Learning a 3-D game without memory: Chaplot, Lample, AAAI 2017
Deep RL with Memory
[Figure: observation/state, action, reward, with a learned structured memory.]
Parisotto, Salakhutdinov, 2017

Random Maze with Indicator
• Indicator: either blue or pink.
Ø If blue, find the green block.
Ø If pink, find the red block.
• Negative reward if the agent does not find the correct block in N steps or goes to the wrong block.
Parisotto, Salakhutdinov, 2017
Random Maze with Indicator
[Figure: the agent writes into memory M_t, M_{t+1}, and reads with attention.]
Parisotto, Salakhutdinov, 2017
Building Intelligent Agents
[Figure: agent with observation/state, action, reward, a learned external memory, and a knowledge base.]
Goal: learning from fewer examples, fewer experiences.
Summary
• Efficient learning algorithms for Deep Unsupervised Models.
• Deep models improve the current state of the art in many application domains:
Ø Object recognition and detection, text and image retrieval, handwritten character and speech recognition, and others.
[Figure: example applications — Speech Recognition (HMM decoder); Multimodal Data (sunset, pacific ocean, beach, seashore); Object Detection; Text & image retrieval / Object recognition; Learning a Category Hierarchy; Image Tagging (mosque, tower, building, cathedral, dome, castle).]
Thank you

More Related Content

Similar to Deep learning unsupervised learning diapo

Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)Oswald Campesato
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureSanghamitra Deb
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakPyData
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowOswald Campesato
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with TransformersDatabricks
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksHakky St
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningCastLabKAIST
 
Machine Learning from a Software Engineer's perspective
Machine Learning from a Software Engineer's perspectiveMachine Learning from a Software Engineer's perspective
Machine Learning from a Software Engineer's perspectiveMarijn van Zelst
 
Machine learning from a software engineer's perspective - Marijn van Zelst - ...
Machine learning from a software engineer's perspective - Marijn van Zelst - ...Machine learning from a software engineer's perspective - Marijn van Zelst - ...
Machine learning from a software engineer's perspective - Marijn van Zelst - ...Codemotion
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection ProcessBenjamin Bengfort
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspectiveAnirban Santara
 
Alberto Massidda - Images and words: mechanics of automated captioning with n...
Alberto Massidda - Images and words: mechanics of automated captioning with n...Alberto Massidda - Images and words: mechanics of automated captioning with n...
Alberto Massidda - Images and words: mechanics of automated captioning with n...Codemotion
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOswald Campesato
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用CHENHuiMei
 

Similar to Deep learning unsupervised learning diapo (20)

Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)Diving into Deep Learning (Silicon Valley Code Camp 2017)
Diving into Deep Learning (Silicon Valley Code Camp 2017)
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and Future
 
Novi sad ai event 3-2018
Novi sad ai event 3-2018Novi sad ai event 3-2018
Novi sad ai event 3-2018
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Convolutional neural networks
Convolutional neural  networksConvolutional neural  networks
Convolutional neural networks
 
Android and Deep Learning
Android and Deep LearningAndroid and Deep Learning
Android and Deep Learning
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
 
MXNet Workshop
MXNet WorkshopMXNet Workshop
MXNet Workshop
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with Transformers
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Machine Learning from a Software Engineer's perspective
Machine Learning from a Software Engineer's perspectiveMachine Learning from a Software Engineer's perspective
Machine Learning from a Software Engineer's perspective
 
Machine learning from a software engineer's perspective - Marijn van Zelst - ...
Machine learning from a software engineer's perspective - Marijn van Zelst - ...Machine learning from a software engineer's perspective - Marijn van Zelst - ...
Machine learning from a software engineer's perspective - Marijn van Zelst - ...
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection Process
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
 
Alberto Massidda - Images and words: mechanics of automated captioning with n...
Alberto Massidda - Images and words: mechanics of automated captioning with n...Alberto Massidda - Images and words: mechanics of automated captioning with n...
Alberto Massidda - Images and words: mechanics of automated captioning with n...
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 

Recently uploaded

Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdfKamal Acharya
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf884710SadaqatAli
 
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDrGurudutt
 
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...Roi Lipman
 
Dairy management system project report..pdf
Dairy management system project report..pdfDairy management system project report..pdf
Dairy management system project report..pdfKamal Acharya
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfAbrahamGadissa
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Krakówbim.edu.pl
 
retail automation billing system ppt.pptx
retail automation billing system ppt.pptxretail automation billing system ppt.pptx
retail automation billing system ppt.pptxfaamieahmd
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Prakhyath Rai
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfKamal Acharya
 
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...Aimil Ltd
 
A case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfA case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfKamal Acharya
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdfKamal Acharya
 
Arduino based vehicle speed tracker project
Arduino based vehicle speed tracker projectArduino based vehicle speed tracker project
Arduino based vehicle speed tracker projectRased Khan
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxMd. Shahidul Islam Prodhan
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGKOUSTAV SARKAR
 
grop material handling.pdf and resarch ethics tth
grop material handling.pdf and resarch ethics tthgrop material handling.pdf and resarch ethics tth
grop material handling.pdf and resarch ethics tthAmanyaSylus
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringC Sai Kiran
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdfKamal Acharya
 
2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edgePaco Orozco
 

Recently uploaded (20)

Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdf
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf
 
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
 
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
 
Dairy management system project report..pdf
Dairy management system project report..pdfDairy management system project report..pdf
Dairy management system project report..pdf
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Kraków
 
retail automation billing system ppt.pptx
retail automation billing system ppt.pptxretail automation billing system ppt.pptx
retail automation billing system ppt.pptx
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
 
A case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfA case study of cinema management system project report..pdf
A case study of cinema management system project report..pdf
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdf
 
Arduino based vehicle speed tracker project
Arduino based vehicle speed tracker projectArduino based vehicle speed tracker project
Arduino based vehicle speed tracker project
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
 
grop material handling.pdf and resarch ethics tth
grop material handling.pdf and resarch ethics tthgrop material handling.pdf and resarch ethics tth
grop material handling.pdf and resarch ethics tth
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
 
2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge
 

Deep learning unsupervised learning diapo

  • 1. Deep Learning Unsupervised Learning Russ Salakhutdinov Machine Learning Department Carnegie Mellon University Canadian Institute for Advanced Research
  • 2. Tutorial Roadmap Part 1: Supervised (Discrimina>ve) Learning: Deep Networks Part 2: Unsupervised Learning: Deep Genera>ve Models Part 3: Open Research Ques>ons
  • 3. Unsupervised Learning Non-probabilis>c Models Ø Sparse Coding Ø Autoencoders Ø Others (e.g. k-means) Explicit Density p(x) Probabilis>c (Genera>ve) Models Tractable Models Ø Fully observed Belief Nets Ø NADE Ø PixelRNN Non-Tractable Models Ø Boltzmann Machines Ø Varia>onal Autoencoders Ø Helmholtz Machines Ø Many others… Ø Genera>ve Adversarial Networks Ø Moment Matching Networks Implicit Density
  • 4. Tutorial Roadmap • Basic Building Blocks: Ø Sparse Coding Ø Autoencoders • Deep Genera>ve Models Ø Restricted Boltzmann Machines Ø Deep Boltzmann Machines Ø Helmholtz Machines / Varia>onal Autoencoders • Genera>ve Adversarial Networks
  • 5. Tutorial Roadmap • Basic Building Blocks: Ø Sparse Coding Ø Autoencoders • Deep Genera>ve Models Ø Restricted Boltzmann Machines Ø Deep Boltzmann Machines Ø Helmholtz Machines / Varia>onal Autoencoders • Genera>ve Adversarial Networks
  • 6. Sparse Coding • Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detec>on). • Objec>ve: Given a set of input data vectors learn a dic>onary of bases such that: • Each data vector is represented as a sparse linear combina>on of bases. Sparse: mostly zeros
  • 7. Natural Images [0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representa>on) New example Sparse Coding Learned bases: “Edges” x = 0.8 * + 0.3 * + 0.5 * Slide Credit: Honglak Lee = 0.8 * + 0.3 * + 0.5 *
  • 8. Sparse Coding: Training • Input image patches: • Learn dic>onary of bases: Reconstruc>on error Sparsity penalty • Alterna>ng Op>miza>on: 1. Fix dic>onary of bases and solve for ac>va>ons a (a standard Lasso problem). 2. Fix ac>va>ons a, op>mize the dic>onary of bases (convex QP problem).
  • 9. Sparse Coding: Tes>ng Time • Input: a new image patch x* , and K learned bases • Output: sparse representa>on a of an image patch x*. = 0.8 * + 0.3 * + 0.5 * x* = 0.8 * + 0.3 * + 0.5 * [0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representa>on)
  • 10. Evaluated on Caltech101 object category dataset. Classification Algorithm (SVM) Algorithm Accuracy Baseline (Fei-Fei et al., 2004) 16% PCA 37% Sparse Coding 47% Input Image Features (coefficients) Learned bases Image Classifica>on 9K images, 101 classes Lee, Bakle, Raina, Ng, 2006 Slide Credit: Honglak Lee
  • 11. g(a) Interpre>ng Sparse Coding x’ Explicit Linear Decoding a f(x) Implicit nonlinear encoding x a • Sparse, over-complete representa>on a. • Encoding a = f(x) is implicit and nonlinear func>on of x. • Reconstruc>on (or decoding) x’ = g(a) is linear and explicit. Sparse features
  • 12. Autoencoder Encoder Decoder Input Image Feature Representation Feed-back, generative, top-down path Feed-forward, bottom-up • Details of what goes insider the encoder and decoder maker! • Need constraints to avoid learning an iden>ty.
  • 13. Autoencoder z=σ(Wx) Dz Input Image x Binary Features z Decoder filters D Linear function path Encoder filters W. Sigmoid function
  • 14. Autoencoder • An autoencoder with D inputs, D outputs, and K hidden units, with K<D. • Given an input x, its reconstruc>on is given by: Encoder z=σ(Wx) Dz Input Image x Binary Features z Decoder
  • 15. Autoencoder • An autoencoder with D inputs, D outputs, and K hidden units, with K<D. z=σ(Wx) Dz Input Image x Binary Features z • We can determine the network parameters W and D by minimizing the reconstruc>on error:
  • 16. Autoencoder • If the hidden and output layers are linear, it will learn hidden units that are a linear func>on of the data and minimize the squared error. • The K hidden units will span the same space as the first k principal components. The weight vectors may not be orthogonal. z=Wx Wz Input Image x Linear Features z • With nonlinear hidden units, we have a nonlinear generaliza>on of PCA.
  • 17. Another Autoencoder Model z=σ(Wx) σ(WTz) Binary Input x Binary Features z Decoder filters D path Encoder filters W. Sigmoid function • Relates to Restricted Boltzmann Machines (later). • Need addi>onal constraints to avoid learning an iden>ty.
  • 18. Predic>ve Sparse Decomposi>on z=σ(Wx) Dz Real-valued Input x Binary Features z Decoder filters D path Encoder filters W. Sigmoid function L1 Sparsity Encoder Decoder At training time path Kavukcuoglu, Ranzato, Fergus, LeCun, 2009
  • 19. Stacked Autoencoders Input x Features Encoder Decoder Class Labels Encoder Decoder Sparsity Features Encoder Decoder Sparsity
  • 20. Stacked Autoencoders Input x Features Encoder Decoder Features Class Labels Encoder Decoder Encoder Decoder Sparsity Sparsity Greedy Layer-wise Learning.
  • 21. Stacked Autoencoders Input x Features Encoder Features Class Labels Encoder Encoder • Remove decoders and use feed-forward part. • Standard, or convolu>onal neural network architecture. • Parameters can be fine-tuned using backpropaga>on.
  • 22. Deep Autoencoders W W W + W W W W W + W + W + W W + W + W + + W W W W W W 1 2000 RBM 2 2000 1000 500 500 1000 1000 500 1 1 2000 2000 500 500 1000 1000 2000 500 2000 T 4 T RBM Pretraining Unrolling 1000 RBM 3 4 30 30 Finetuning 4 4 2 2 3 3 4 T 5 3 T 6 2 T 7 1 T 8 Encoder 1 2 3 30 4 3 2 T 1 T Code layer Decoder RBM Top
  • 23. Deep Autoencoders • 25x25 – 2000 – 1000 – 500 – 30 autoencoder to extract 30-D real- valued codes for Olives face patches. • Top: Random samples from the test dataset. • Middle: Reconstrucons by the 30-dimensional deep autoencoder. • BoBom: Reconstrucons by the 30-dimennoal PCA.
  • 24. Informaon Retrieval 2-D LSA space Legal/Judicial Leading Economic Indicators European Community Monetary/Economic Accounts/ Earnings Interbank Markets Government Borrowings Disasters and Accidents Energy Markets • The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test). • “Bag-of-words” representaon: each arcle is represented as a vector containing the counts of the most frequently used 2000 words. (Hinton and Salakhutdinov, Science 2006)
  • 25. Tutorial Roadmap • Basic Building Blocks: Ø Sparse Coding Ø Autoencoders • Deep Generave Models Ø Restricted Boltzmann Machines Ø Deep Boltzmann Machines Ø Helmholtz Machines / Variaonal Autoencoders • Generave Adversarial Networks
  • 26. Fully Observed Models m likelihood ✓⇤ = arg max ✓ Ex⇠pdata log pmodel(x | ✓) ble belief net pmodel(x) = pmodel(x1) n Y i=2 pmodel(xi | x1, . . . , xi 1) • Explicitly model condional probabilies: Each condional can be a complicated neural network • A number of successful models, including Ø NADE, RNADE (Larochelle, et.al. 20011) Ø Pixel CNN (van den Ord et. al. 2016) Ø Pixel RNN (van den Ord et. al. 2016) Pixel CNN
  • 27. Restricted Boltzmann Machines RBM is a Markov Random Field with: • Stochasc binary hidden variables • Biparte connecons. Pair-wise Unary • Stochasc binary visible variables Markov random fields, Boltzmann machines, log-linear models. Image visible variables hidden variables Graphical Models: Powerful framework for represenng dependency structure between random variables. Feature Detectors
  • 28. Learned W: “edges” Subset of 1000 features Learning Features = …. New Image: Logisc Funcon: Suitable for modeling binary images Sparse representaNons Observed Data Subset of 25,000 characters
  • 29. RBMs for Real-valued Count Data Learned features (out of 10,000) 4 million unlabelled images Learned features: ``topics’’ russian russia moscow yeltsin soviet clinton house president bill congress computer system product sovware develop trade country import world economy stock wall street point dow Reuters dataset: 804,414 unlabeled newswire stories Bag-of-Words
  • 30. Learned features: ``genre’’ Fahrenheit 9/11 Bowling for Columbine The People vs. Larry Flynt Canadian Bacon La Dolce Vita Independence Day The Day Aver Tomorrow Con Air Men in Black II Men in Black Friday the 13th The Texas Chainsaw Massacre Children of the Corn Child's Play The Return of Michael Myers Scary Movie Naked Gun Hot Shots! American Pie Police Academy Nexlix dataset: 480,189 users 17,770 movies Over 100 million rangs State-of-the-art performance on the Nexlix dataset. Collaborave Filtering h v W1 Mulnomial visible: user rangs Binary hidden: user preferences (Salakhutdinov, Mnih, Hinton, ICML 2007)
  • 31. Different Data Modalies • It is easy to infer the states of the hidden variables: • Binary/Gaussian/Sovmax RBMs: All have binary hidden variables but use them to model different kinds of data. Binary Real-valued 1-of-K 0 0 1 0 0
  • 32. Product of Experts Marginalizing over hidden variables: Product of Experts The joint distribuon is given by: Silvio Berlusconi government authority power empire federaon clinton house president bill congress bribery corrupon dishonesty corrupt fraud mafia business gang mob insider stock wall street point dow … Topics “government”, ”corrupon” and ”mafia” can combine to give very high probability to a word “Silvio Berlusconi”.
  • 33. Product of Experts Marginalizing over hidden variables: Product of Experts The joint distribuon is given by: Silvio Berlusconi government authority power empire federaon clinton house president bill congress bribery corrupon dishonesty corrupt fraud mafia business gang mob insider stock wall street point dow … Topics “government”, ”corrupon” and ”mafia” can combine to give very high probability to a word “Silvio Berlusconi”. 0.001 0.006 0.051 0.4 1.6 6.4 25.6 100 10 20 30 40 50 Recall (%) Precision (%) Replicated Softmax 50−D LDA 50−D
  • 34. Image Low-level features: Edges Input: Pixels Built from unlabeled inputs. Deep Boltzmann Machines (Salakhutdinov 2008, Salakhutdinov Hinton 2012)
  • 35. Image Higher-level features: Combinaon of edges Low-level features: Edges Input: Pixels Built from unlabeled inputs. Deep Boltzmann Machines Learn simpler representaons, then compose more complex ones (Salakhutdinov 2008, Salakhutdinov Hinton 2009)
  • 36. Model Formulaon model parameters • Dependencies between hidden variables. • All connecons are undirected. h3 h2 h1 v W3 W2 W1 • Bokom-up and Top-down: Top-down Bokom-up Input • Hidden variables are dependent even when condiNoned on the input. Same as RBMs
  • 37. Approximate Learning (Approximate) Maximum Likelihood: Not factorial any more! h3 h2 h1 v W3 W2 W1 • Both expectaons are intractable!
  • 38. Data Approximate Learning (Approximate) Maximum Likelihood: h3 h2 h1 v W3 W2 W1 Not factorial any more!
  • 39. Approximate Learning (Approximate) Maximum Likelihood: Not factorial any more! h3 h2 h1 v W3 W2 W1 Variaonal Inference Stochasc Approximaon (MCMC-based)
  • 42. Good Generave Model? Handwriken Characters Real Data Simulated
  • 43. Good Generave Model? Handwriken Characters Real Data Simulated
  • 45. Handwring Recognion Learning Algorithm Error Logisc regression 12.0% K-NN 3.09% Neural Net (Plak 2005) 1.53% SVM (Decoste et.al. 2002) 1.40% Deep Autoencoder (Bengio et. al. 2007) 1.40% Deep Belief Net (Hinton et. al. 2006) 1.20% DBM 0.95% Learning Algorithm Error Logisc regression 22.14% K-NN 18.92% Neural Net 14.62% SVM (Larochelle et.al. 2009) 9.70% Deep Autoencoder (Bengio et. al. 2007) 10.05% Deep Belief Net (Larochelle et. al. 2009) 9.68% DBM 8.40% MNIST Dataset Opcal Character Recognion 60,000 examples of 10 digits 42,152 examples of 26 English lekers Permutaon-invariant version.
  • 46. 3-D object Recognion Learning Algorithm Error Logisc regression 22.5% K-NN (LeCun 2004) 18.92% SVM (Bengio LeCun 2007) 11.6% Deep Belief Net (Nair Hinton 2009) 9.0% DBM 7.2% Pakern Compleon NORB Dataset: 24,000 examples
  • 47. Data – Collecon of Modalies • Mulmedia content on the web - image + text + audio. • Product recommendaon systems. • Robocs applicaons. Audio Vision Touch sensors Motor control sunset, pacificocean, bakerbeach, seashore, ocean car, automobile
  • 48. Challenges - I Very different input representaons Image Text sunset, pacific ocean, baker beach, seashore, ocean • Images – real-valued, dense Difficult to learn cross-modal features from low-level representaons. Dense • Text – discrete, sparse Sparse
  • 49. Challenges - II Noisy and missing data Image Text pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion mickikrimmel, mickipedia, headshot unseulpixel, naturey no text
  • 50. Challenges - II Image Text Text generated by the model beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves portrait, girl, woman, lady, blonde, preky, gorgeous, expression, model night, noke, traffic, light, lights, parking, darkness, lowlight, nacht, glow fall, autumn, trees, leaves, foliage, forest, woods, branches, path pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion mickikrimmel, mickipedia, headshot unseulpixel, naturey no text
  • 51. 0 0 1 0 0 Dense, real-valued image features Gaussian model Replicated Sovmax Mulmodal DBM Word counts (Srivastava Salakhutdinov, NIPS 2012, JMLR 2014)
  • 52. Mulmodal DBM 0 0 1 0 0 Dense, real-valued image features Gaussian model Replicated Sovmax Word counts (Srivastava Salakhutdinov, NIPS 2012, JMLR 2014)
  • 53. Gaussian model Replicated Sovmax 0 0 1 0 0 Mulmodal DBM Word counts Dense, real-valued image features (Srivastava Salakhutdinov, NIPS 2012, JMLR 2014)
  • 54. Text Generated from Images canada, nature, sunrise, ontario, fog, mist, bc, morning insect, bukerfly, insects, bug, bukerflies, lepidoptera graffi, streetart, stencil, scker, urbanart, graff, sanfrancisco portrait, child, kid, ritrako, kids, children, boy, cute, boys, italy dog, cat, pet, kiken, puppy, ginger, tongue, kiky, dogs, furry sea, france, boat, mer, beach, river, bretagne, plage, brikany Given Generated Given Generated
  • 55. Generang Text from Images Samples drawn aver every 50 steps of Gibbs updates
  • 56. Text Generated from Images Given Generated water, glass, beer, bokle, drink, wine, bubbles, splash, drops, drop portrait, women, army, soldier, mother, postcard, soldiers obama, barackobama, elecon, polics, president, hope, change, sanfrancisco, convenon, rally
  • 57. Images from Text water, red, sunset nature, flower, red, green blue, green, yellow, colors chocolate, cake Given Retrieved
  • 58. MIR-Flickr Dataset Huiskes et. al. • 1 million images along with user-assigned tags. sculpture, beauty, stone nikon, green, light, photoshop, apple, d70 white, yellow, abstract, lines, bus, graphic sky, geotagged, reflecon, cielo, bilbao, reflejo food, cupcake, vegan d80 anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography nikon, abigfave, goldstaraward, d80, nikond80
  • 59. Results • Logisc regression on top-level representaon. • Mulmodal Inputs Learning Algorithm MAP Precision@50 Random 0.124 0.124 LDA [Huiskes et. al.] 0.492 0.754 SVM [Huiskes et. al.] 0.475 0.758 DBM-Labelled 0.526 0.791 Deep Belief Net 0.638 0.867 Autoencoder 0.638 0.875 DBM 0.641 0.873 Mean Average Precision Labeled 25K examples + 1 Million unlabelled
  • 60. Helmholtz Machines • Hinton, G. E., Dayan, P., Frey, B. J. and Neal, R., Science 1995 Input data h3 h2 h1 v W3 W2 W1 Generave Process Approximate Inference • Kingma Welling, 2014 • Rezende, Mohamed, Daan, 2014 • Mnih Gregor, 2014 • Bornschein Bengio, 2015 • Tang Salakhutdinov, 2013
  • 61. Helmholtz Machines vs. DBMs Input data h3 h2 h1 v W3 W2 W1 Generave Process Approximate Inference h3 h2 h1 v W3 W2 W1 Deep Boltzmann Machine Helmholtz Machine
  • 62. Variaonal Autoencoders (VAEs) • The VAE defines a generave process in terms of ancestral sampling through a cascade of hidden stochasc layers: h3 h2 h1 v W3 W2 W1 Each term may denote a complicated nonlinear relaonship • Sampling and probability evaluaon is tractable for each . Generave Process • denotes parameters of VAE. • is the number of stochasNc layers. Input data
  • 63. VAE: Example • The VAE defines a generave process in terms of ancestral sampling through a cascade of hidden stochasc layers: This term denotes a one-layer neural net. Determinisc Layer Stochasc Layer Stochasc Layer • denotes parameters of VAE. • Sampling and probability evaluaon is tractable for each . • is the number of stochasNc layers.
  • 64. Variaonal Bound • The VAE is trained to maximize the variaonal lower bound: Input data h3 h2 h1 v W3 W2 W1 • Hard to opmize the variaonal bound with respect to the recognion network (high-variance). • Key idea of Kingma and Welling is to use reparameterizaon trick. • Trading off the data log-likelihood and the KL divergence from the true posterior.
  • 65. Reparameterization Trick • Assume that the recognition distribution is Gaussian, q(h | x) = N(h; μ(x), σ²(x)), with mean and covariance computed from the state of the hidden units at the previous layer. • Alternatively, we can express h in terms of an auxiliary variable ε: h = μ(x) + σ(x) ⊙ ε, with ε ~ N(0, I).
  • 66. Reparameterization Trick (continued) • Assuming the Gaussian recognition distribution above, h can be expressed in terms of a deterministic mapping of x and the noise: h = f(x, ε) = μ(x) + σ(x) ⊙ ε, i.e. a deterministic encoder. • The distribution of ε does not depend on the parameters, which is what the trick exploits; see the sketch below.
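A minimal sketch of the trick itself, assuming a diagonal Gaussian recognition distribution; the function name `reparameterize` and the toy shapes are our own:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, diag(sigma^2)) as a deterministic map of (mu, logvar)
    plus noise eps ~ N(0, I); the noise distribution does not depend on the
    parameters, so gradients flow through mu and logvar."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

mu = torch.zeros(4, requires_grad=True)
logvar = torch.zeros(4, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                    # backprop through the sample
print(mu.grad, logvar.grad)           # both gradients are defined
```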
  • 67. Computing the Gradients • The gradient w.r.t. the parameters, both recognition and generative, can be computed by backprop: for fixed ε, the mapping h = f(x, ε) is a deterministic neural net, so the whole model trains like a (stochastic) autoencoder.
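Putting the pieces together, a hedged sketch of one single-sample bound computation with backprop through both the recognition and generative nets; the one-layer `enc`/`dec` nets, the Bernoulli likelihood, and the analytic KL against a standard-normal prior are illustrative choices, not the tutorial's model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 20)   # recognition net: outputs mean and log-variance
dec = nn.Linear(20, 784)       # generative net: Bernoulli logits over pixels

x = torch.rand(8, 784)                                     # toy batch
mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterized sample

# Single-sample Monte Carlo estimate of the negative bound: reconstruction
# term plus the analytic KL(q(z|x) || N(0, I)) for a diagonal Gaussian.
recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction='sum')
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = (recon + kl) / x.size(0)
loss.backward()   # gradients reach both enc and dec through z
```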
  • 68. Importance Weighted Autoencoders (Burda, Grosse, Salakhutdinov, 2015) • The VAE can be improved by using the following k-sample importance weighting of the log-likelihood: L_k = E[ log (1/k) Σ_{i=1..k} w_i ], where w_i = p(x, h_i) / q(h_i | x) are unnormalized importance weights and the h_i are sampled from the recognition network.
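A small sketch of the k-sample bound, assuming the unnormalized log-weights log p(x, h_i) − log q(h_i | x) have already been computed for samples from the recognition network; the placeholder weights below are random and purely illustrative:

```python
import math
import torch

def iwae_bound(log_w):
    """k-sample bound: log of the average of k unnormalized importance
    weights, computed stably in log space. log_w has shape (batch, k) with
    entries log p(x, h_i) - log q(h_i | x)."""
    k = log_w.size(1)
    return (torch.logsumexp(log_w, dim=1) - math.log(k)).mean()

log_w = torch.randn(8, 5)     # placeholder log-weights for k = 5 samples
print(iwae_bound(log_w))      # recovers the usual variational bound at k = 1
```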
  • 69. Generating Images from Captions (Mansimov, Parisotto, Ba, Salakhutdinov, 2015) • Generative Model: a stochastic recurrent network, a chained sequence of variational autoencoders with a single stochastic layer, building on DRAW (Gregor et al., 2015). • Recognition Model: a deterministic recurrent network.
  • 70. Motivating Example • Can we generate images from natural language descriptions? E.g.: "A stop sign is flying in blue skies"; "A pale yellow school bus is flying in blue skies"; "A herd of elephants is flying in blue skies"; "A large commercial airplane is flying in blue skies" (Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
  • 71. Flipping Colors: "A yellow school bus parked in the parking lot"; "A red school bus parked in the parking lot"; "A green school bus parked in the parking lot"; "A blue school bus parked in the parking lot" (Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
  • 72. Novel Scene Compositions: "A toilet seat sits open in the bathroom" (ask Google?) vs. "A toilet seat sits open in the grass field"
  • 73. (Some) Open Problems • Reasoning, Attention, and Memory • Natural Language Understanding • Deep Reinforcement Learning • Unsupervised Learning / Transfer Learning / One-Shot Learning
  • 74. (Some) Open Problems • Reasoning, Attention, and Memory • Natural Language Understanding • Deep Reinforcement Learning • Unsupervised Learning / Transfer Learning / One-Shot Learning
  • 75. Who-Did-What Dataset (Onishi, Wang, Bansal, Gimpel, McAllester, EMNLP, 2016) • Query: "President-elect Barack Obama said Tuesday he was not aware of alleged corruption by X who was arrested on charges of trying to sell Obama's senate seat." • Document: "…arrested Illinois governor Rod Blagojevich and his chief of staff John Harris on corruption charges … included Blagojevich allegedly conspiring to sell or trade the senate seat left vacant by President-elect Barack Obama…" • Answer: Rod Blagojevich
  • 76. Recurrent Neural Network • The hidden state at time step t is a nonlinearity applied to the input at time step t and the hidden state at the previous time step: h_t = f(W_hh h_{t−1} + W_xh x_t), e.g. with f = tanh, so inputs x1, x2, x3 produce hidden states h1, h2, h3; see the sketch below.
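A minimal sketch of this update rule with tanh as the nonlinearity; the weight shapes and the 0.1 scaling are arbitrary toy choices:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One recurrent update: the new hidden state is a nonlinearity (tanh)
    of the current input and the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(50, 32))   # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(32, 32))   # hidden-to-hidden weights
b = np.zeros(32)

h = np.zeros(32)                          # h0
for x_t in rng.normal(size=(3, 50)):      # x1, x2, x3
    h = rnn_step(x_t, h, W_xh, W_hh, b)   # h1, h2, h3
```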
  • 77. Gated Attention Mechanism (Dhingra, Liu, Yang, Cohen, Salakhutdinov, ACL 2017) • Use Recurrent Neural Networks (RNNs) to encode a document and a query. Ø Use element-wise multiplication to model the interactions between document and query; see the sketch below.
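A short sketch of one gated-attention step under these assumptions: soft attention over query token encodings followed by element-wise gating. The encoder outputs are faked with random tensors, and all dimensions are illustrative:

```python
import torch

def gated_attention(doc, query):
    """For each document token encoding, attend over the query token
    encodings, then modulate the document token by element-wise
    multiplication with the resulting query summary."""
    alpha = torch.softmax(doc @ query.T, dim=1)   # (doc_len, query_len)
    q_tilde = alpha @ query                       # (doc_len, dim)
    return doc * q_tilde                          # element-wise gating

doc = torch.randn(30, 64)    # stand-in for RNN-encoded document tokens
query = torch.randn(10, 64)  # stand-in for RNN-encoded query tokens
gated = gated_attention(doc, query)   # (30, 64), query-aware document
```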
  • 78. Multi-hop Architecture (Dhingra, Liu, Yang, Cohen, Salakhutdinov, ACL 2017) • Reasoning requires several passes over the context, so the encode-and-gate step is repeated across layers; see the sketch below.
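A hedged sketch of the multi-hop idea, re-encoding the gated document with a Bi-GRU at each hop; the hop count K = 3 and all sizes are illustrative, not the ACL 2017 configuration:

```python
import torch
import torch.nn as nn

def gated_attention(doc, query):
    # Same gating as in the previous sketch: attend over query tokens,
    # then modulate each document token by element-wise multiplication.
    alpha = torch.softmax(doc @ query.T, dim=1)
    return doc * (alpha @ query)

# K = 3 hops: each hop re-encodes the gated document with a Bi-GRU and
# applies the gated-attention step again, one pass over the context per hop.
grus = nn.ModuleList([nn.GRU(64, 32, bidirectional=True, batch_first=True)
                      for _ in range(3)])
doc = torch.randn(1, 30, 64)   # (batch, doc_len, dim)
query = torch.randn(10, 64)
for gru in grus:
    encoded, _ = gru(doc)                                  # (1, 30, 64)
    doc = gated_attention(encoded[0], query).unsqueeze(0)  # gate, keep batch dim
```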
  • 79. Analysis of Attention • Context: "…arrested Illinois governor Rod Blagojevich and his chief of staff John Harris on corruption charges … included Blagojevich allegedly conspiring to sell or trade the senate seat left vacant by President-elect Barack Obama…" • Query: "President-elect Barack Obama said Tuesday he was not aware of alleged corruption by X who was arrested on charges of trying to sell Obama's senate seat." • Answer: Rod Blagojevich (attention shown at Layer 1 and Layer 2)
  • 80. Analysis of Attention (continued): the same context, query, and answer, comparing Layer 1 and Layer 2 attention maps. Code + Data: https://github.com/bdhingra/ga-reader
  • 82. Incorporating Prior Knowledge: Memory as Acyclic Graph Encoding (MAGE)-RNN (Dhingra, Yang, Cohen, Salakhutdinov, 2017) • The RNN state h_t is augmented with a memory M_t of earlier mentions e_1, …, e_|E|, connected by typed edges such as coreference and hypernymy/hyponymy (figure example: "Mary got the football / She went to the kitchen / She left the ball there", with coreference links between the mentions).
  • 83. Incorporating Prior Knowledge • Example text: "Her plain face broke into a huge smile when she saw Terry. 'Terry!' she called out. She rushed to meet him and they embraced. 'Hon, I want you to meet an old friend, Owen McKenna. Owen, please meet Emily.' She gave me a quick nod and turned back to X" • A Recurrent Neural Network builds the text representation, enriched with coreference and dependency parses (CoreNLP), entity relations (Freebase), and word relations (WordNet).
  • 84. Neural Story Telling (Kiros et al., NIPS 2015). Sample from the generative model (recurrent neural network): "The sun had risen from the ocean, making her feel more alive than normal. She is beautiful, but the truth is that I do not know what to do. The sun was just starting to fade away, leaving people scattered around the Atlantic Ocean. She was in love with him for the first time in months, so she had no intention of escaping."
  • 85. (Some) Open Problems • Reasoning, Attention, and Memory • Natural Language Understanding • Deep Reinforcement Learning • Unsupervised Learning / Transfer Learning / One-Shot Learning
  • 86. Learning Behaviors: learning to map sequences of observations to actions, for a particular goal (an observation → action loop).
  • 87. Reinforcement Learning with Memory (Differentiable Neural Computer, Graves et al., Nature, 2016; Neural Turing Machine, Graves et al., 2014): the agent maps observation/state to action and reward while reading from and writing to a learned external memory.
  • 88. Reinforcement Learning with Memory (continued): learning a 3-D game without memory (Chaplot, Lample, AAAI 2017).
  • 90. Deep RL with Memory (Parisotto, Salakhutdinov, 2017): observation/state, action, and reward, with a learned structured memory.
  • 91. Random Maze with Indicator (Parisotto, Salakhutdinov, 2017) • Indicator: either blue or pink Ø If blue, find the green block Ø If pink, find the red block • Negative reward if the agent does not find the correct block within N steps or goes to the wrong block.
  • 93. Random Maze with Indicator
  • 94. Building Intelligent Agents: observation/state, action, and reward, with a learned external memory and a knowledge base.
  • 95. Building Intelligent Agents (continued): learning from fewer examples, fewer experiences.
  • 96. Summary • Efficient learning algorithms for deep unsupervised models • Deep models improve the current state of the art in many application domains: Ø Object recognition and detection, text and image retrieval, handwritten character and speech recognition, and others. (Figure panels: Speech Recognition with an HMM decoder; Multimodal Data, e.g. sunset, pacific ocean, beach, seashore; Object Detection; Text/image retrieval / Object Recognition; Learning a Category Hierarchy; Image Tagging, e.g. mosque, tower, building, cathedral, dome, castle.)