Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Building intelligent systems that are capable of extracting high-level representations from high-dimensional sensory data lies at the core of solving many AI-related tasks, including visual object or pattern recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires deep architectures that involve many layers of nonlinear processing.
Many existing learning algorithms use shallow architectures, including neural networks with only one hidden layer, support vector machines, kernel logistic regression, and many others. The internal representations learned by such systems are necessarily simple and are incapable of extracting some types of complex structure from high-dimensional input.
In the past few years, researchers across many different communities, from applied statistics to engineering, computer science, and neuroscience, have proposed several deep (hierarchical) models that are capable of extracting useful, high-level structured representations. An important property of these models is that they can extract complex statistical dependencies from high-dimensional sensory input and efficiently learn high-level representations by re-using and combining intermediate concepts, allowing these models to generalize well across a wide variety of tasks. The learned high-level representations have been shown to give state-of-the-art results in many challenging learning problems, where data patterns often exhibit a high degree of variation, and have been successfully applied in a wide variety of application domains, including visual object recognition, information retrieval, natural language processing, and speech perception. A few notable examples of such models include Deep Belief Networks, Deep Boltzmann Machines, Deep Autoencoders, and sparse coding-based methods.


  1. Deep Learning II. Ruslan Salakhutdinov. Department of Computer Science, University of Toronto → CMU. Canadian Institute for Advanced Research. Microsoft Machine Learning and Intelligence School.
  2. Talk Roadmap. Part 1: Advanced Deep Learning Models: • Learning Structured and Robust Models • Transfer and One-Shot Learning. Part 2: Multimodal Deep Learning: • Multimodal DBMs for images and text • Caption Generation with Recurrent RNNs • Skip-Thought vectors for NLP.
  3. Face Recognition. Due to extreme illumination variations, deep models perform quite poorly on this dataset. Yale B Extended Face Dataset: 4 subsets of increasing illumination variations.
  4. Consider more structured models: undirected + directed models. Deep Lambertian Model: deep, undirected + directed. Combines the elegant properties of the Lambertian model with the Gaussian DBM model. [Figure: observed image → inferred.] (Tang et al., ICML 2012; Tang et al., CVPR 2012)
  5. Lambertian Reflectance Model. • A simple model of the image formation process: image = albedo × (surface normal · light source), i.e. I_i = a_i (n_i · ℓ). • Albedo -- diffuse reflectivity of a surface; material dependent, illumination independent. • Surface normal -- perpendicular to the tangent plane at a point on the surface. • Images with different illumination can be generated by varying light directions.
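To make the image-formation equation concrete, here is a minimal numpy sketch of Lambertian shading (illustrative array names, not code from the talk): pixel intensity is per-pixel albedo times the clamped dot product between the surface normal and the light direction, and changing the light vector re-renders the same surface under new illumination.

    import numpy as np

    def lambertian_image(albedo, normals, light):
        """Render a Lambertian image.

        albedo:  (P,) diffuse reflectivity per pixel
        normals: (P, 3) unit surface normals per pixel
        light:   (3,) light-source direction
        """
        shading = np.clip(normals @ light, 0.0, None)  # Lambert's cosine law
        return albedo * shading

    # Varying the light direction generates differently illuminated images
    # of the same underlying albedo and surface normals.
    rng = np.random.default_rng(0)
    P = 4
    albedo = rng.uniform(0.2, 1.0, P)
    normals = rng.normal(size=(P, 3))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    img1 = lambertian_image(albedo, normals, np.array([0.0, 0.0, 1.0]))
    img2 = lambertian_image(albedo, normals, np.array([0.6, 0.0, 0.8]))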
  6. Deep Lambertian Model. [Diagram: the observed image decomposes into image albedo, surface normals, and a light source.]
  7. Deep Lambertian Model. [Same diagram.] The albedo is modeled by a Gaussian Deep Boltzmann Machine; the albedo DBM is pretrained on the Toronto Face Database (transfer learning). Inference: variational inference. Learning: stochastic approximation.
  8. Yale B Extended Face Dataset. • 38 subjects, ~45 images of varying illuminations per subject, divided into 4 subsets of increasing illumination variations. • 28 subjects for training, and 10 for testing.
  9. Face Relighting. [Figure: one test image; the observed image, the inferred albedo, and relit faces.]
  10. Recognition Results. Recognition as a function of the number of training images for 10 test subjects. One-shot recognition.
  11. Recognition Results. Recognition as a function of the number of training images for 10 test subjects. One-shot recognition. What about dealing with occlusions or structured noise?
  12. Robust Boltzmann Machines. • Build more structured models that can deal with occlusions or structured noise. [Figure: observed image, inferred image, truth, inferred binary mask.] (Tang et al., ICML 2012; Tang et al., CVPR 2012)
  13. Robust Boltzmann Machines. • Build more structured models that can deal with occlusions or structured noise. Components: a Gaussian RBM modeling clean faces; a binary RBM modeling occlusions; a binary pixel-wise mask; a Gaussian noise model for the observed image.
  14. Robust Boltzmann Machines. • Build more structured models that can deal with occlusions or structured noise. The same components as above; the resulting noise model is a heavy-tailed distribution. Inference: variational inference. Learning: stochastic approximation.
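A minimal sketch of the compositional assumption behind the RoBM, as I read this slide (toy numpy, not the authors' code): the binary pixel-wise mask gates each observed pixel between the clean face modeled by the Gaussian RBM and a heavy-tailed noise draw (a Student-t sample stands in for the slide's unnamed noise distribution).

    import numpy as np

    rng = np.random.default_rng(0)
    P = 6                                  # number of pixels (toy size)
    clean = rng.normal(0.0, 1.0, P)        # face pixels, modeled by a Gaussian RBM
    mask = rng.integers(0, 2, P)           # binary pixel-wise mask (1 = clean pixel visible)
    noise = rng.standard_t(df=2, size=P)   # stand-in for the heavy-tailed noise model

    # Observed image: the mask gates between the clean face and structured noise.
    observed = mask * clean + (1 - mask) * noise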
  15. [Figure: internal states of the RoBM during learning, shown after 1, 3, 5, 7, 10, 20, 30, 40, and 50 iterations; inference on the test subjects; inferred images.] Recognition results on the AR Face Database.
  16. [Same figure as slide 15.]
  17. [Same figure as slide 15.] Recognition results on the AR Face Database:
      Learning Algorithm | Sunglasses | Scarf
      Robust BM          | 84.5%      | 80.7%
      RBM                | 61.7%      | 32.9%
      Eigenfaces         | 66.9%      | 38.6%
      LDA                | 56.1%      | 27.0%
      Pixel              | 51.3%      | 17.5%
  18. Talk Roadmap. Part 1: Advanced Deep Learning Models: • Learning Structured and Robust Models • Transfer and One-Shot Learning. Part 2: Multimodal Deep Learning: • Multimodal DBMs for images and text • Caption Generation with Recurrent RNNs • Skip-Thought vectors.
  19. Transfer Learning. "zarc", "segway". How can we learn a novel concept -- a high-dimensional statistical object -- from few examples? (Lake et al., 2010)
  20. Supervised Learning. Segway vs. Motorcycle. Test: [examples].
  21. Learning to Learn. Millions of unlabeled images; some labeled images (Bicycle, Elephant, Dolphin, Tractor). Background knowledge: learn to transfer knowledge; learn a novel concept from one example. Test: [examples].
  22. One shot learning of simple visual concepts. Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology. [First page of the paper, including Figure 1 ("Test yourself on one shot learning. From the example boxed in red, can you find the others in the array? On the left is a Segway and on the right is the first character of the Bengali alphabet.") and Figure 2 ("Examples from a new 1600 character database.").] Hierarchical-Deep Model. HD models integrate hierarchical Bayesian models with deep models. Hierarchical Bayes: • Learn hierarchies of categories for sharing abstract knowledge (super-category; one-shot learning). Deep models: • Learn hierarchies of features. • Unsupervised feature learning -- no need to rely on human-crafted input features. Shared higher-level features; shared low-level features. (Salakhutdinov, Tenenbaum, Torralba, NIPS 2011; PAMI 2013)
  23. Hierarchical-Deep Model. Lower-level generic features: DBM model over images. • Edges, combinations of edges. Higher-level class-sensitive features: Hierarchical Dirichlet Process model. • Capture distinctive perceptual structure of a specific concept. [Classes: horse, cow, car, van, truck.]
  24. Hierarchical-Deep Model. Lower-level generic features (edges, combinations of edges); higher-level class-sensitive features capturing the distinctive perceptual structure of a specific concept (Hierarchical Dirichlet Process model). Hierarchical organization of categories: • modular data-parameter relations • express priors on the features that are typical of different kinds of concepts. ["animal", "vehicle"; horse, cow, car, van, truck.]
  25. Hierarchical-Deep Model. [Diagram: horse, cow, car, van, truck.]
  26. Hierarchical-Deep Model. Tree hierarchy of classes is learned. (Nested Chinese Restaurant Process) prior: a nonparametric prior over tree structures. ["animal", "vehicle"; horse, cow, car, van, truck.]
  27. Hierarchical-Deep Model. Tree hierarchy of classes is learned. (Nested Chinese Restaurant Process) prior: a nonparametric prior over tree structures. (Hierarchical Dirichlet Process) prior: a nonparametric prior allowing categories to share higher-level features, or parts.
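For intuition about the nested CRP prior over tree structures, here is a generic numpy sketch that draws root-to-leaf paths one datum at a time: at each level an existing branch is chosen in proportion to how often it has been used, and a new branch opens with probability proportional to a concentration parameter gamma. This illustrates the prior only, not the paper's inference.

    import numpy as np

    def crp_next(counts, gamma, rng):
        # CRP step: pick an existing branch with probability proportional to
        # its count, or a new branch with probability proportional to gamma.
        probs = np.array(counts + [gamma], dtype=float)
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    def ncrp_path(tree, depth, gamma, rng):
        # Sample one root-to-leaf path; `tree` maps each path prefix
        # (a tuple) to the visit counts of its children.
        path = ()
        for _ in range(depth):
            counts = tree.setdefault(path, [])
            k = crp_next(counts, gamma, rng)
            if k == len(counts):
                counts.append(0)        # open a new branch
            counts[k] += 1
            path += (k,)
        return path

    rng = np.random.default_rng(0)
    tree = {}
    paths = [ncrp_path(tree, depth=2, gamma=1.0, rng=rng) for _ in range(10)]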
  28. Hierarchical-Deep Model. Tree hierarchy of classes is learned. (Nested Chinese Restaurant Process) prior: a nonparametric prior over tree structures. (Hierarchical Dirichlet Process) prior: a nonparametric prior allowing categories to share higher-level features, or parts. Deep Boltzmann Machine: enforce (approximate) global consistency through many local constraints.
  29. CIFAR Object Recognition. 50,000 images of 100 classes; 4 million unlabeled images; 32 × 32 pixels × 3 RGB. Lower-level generic features; higher-level class-sensitive features; tree hierarchy of classes is learned. Inference: Markov chain Monte Carlo.
  30. CIFAR Object Recognition. 4 million images. Each image is made up of learned high-level features; each higher-level feature is made up of lower-level features. (Salakhutdinov, Tenenbaum, Torralba, NIPS 2011; PAMI 2013)
  31. Learning the Hierarchy. The model learns how to share the knowledge across many visual categories. [Diagram: learned super-class hierarchy -- "global" splits into "aquatic animal" (shark, ray, turtle, dolphin), "human" (woman, baby, man, girl), and "fruit" (sunflower, orange, apple) at the basic-level classes; below these sit learned higher-level class-sensitive features and learned low-level generic features.]
  32. Sharing Features. Learning to learn: shape and color features are shared within the "fruit" super-class (sunflower, orange, apple). [Plot: Sunflower ROC curve (detection rate vs. false-alarm rate) for 1, 3, and 5 examples, compared against pixel-space distance.] Learning a hierarchy for sharing parameters gives rapid learning of a novel concept. [Real images and reconstructions: dolphin, apple, orange.]
  33. Learning from 3 Examples. Given only 3 examples → generated samples. Willow tree; rocket.
  34. Handwritten Character Recognition. 25,000 characters. Learned higher-level features ("alphabet 1", "alphabet 2"); learned lower-level features (edges, strokes). [Characters: char 1 ... char 5.]
  35. Simulating New Characters. [Diagram: global level; super class 1 and super class 2; real data within a super class (class 1, class 2); a new class; simulated new characters.]
  36. Simulating New Characters. [Same diagram, another example.]
  37. Simulating New Characters. [Same diagram, another example.]
  38. Simulating New Characters. [Same diagram, another example.]
  39. Simulating New Characters. [Same diagram, another example.]
  40. Simulating New Characters. [Same diagram, another example.]
  41. Simulating New Characters. [Same diagram, another example.] The same model can be applied to speech, text, video, or any other high-dimensional data.
  42. Discriminative Transfer Learning with Deep Nets.
  43. Transfer Learning with Shared Priors. (Srivastava and Salakhutdinov, NIPS 2013)
  44. Talk Roadmap. Part 1: Advanced Deep Learning Models: • Learning Structured and Robust Models • Transfer and One-Shot Learning. Part 2: Multimodal Deep Learning: • Multimodal DBMs for images and text • Caption Generation with Recurrent RNNs • Skip-Thought vectors.
  45. Data -- Collection of Modalities. • Multimedia content on the web: image + text + audio. • Product recommendation systems. • Robotics applications: audio, vision, touch sensors, motor control. [Example tags: "sunset, pacificocean, bakerbeach, seashore, ocean"; "car, automobile".]
  46. Shared Concept. "Modality-free" representation vs. "modality-full" representation. "Concept": sunset, pacific ocean, baker beach, seashore, ocean.
  47. Multi-Modal Input. • Improve classification: "pentax, k10d, kangarooisland, southaustralia, sa, australia, australiansealion, 300mm" → SEA / NOT SEA. • Retrieve data from one modality when queried using data from another modality. • Fill in missing modalities: "beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves".
  48. Challenges - I. Very different input representations: images are real-valued and dense; text is discrete and sparse ("sunset, pacific ocean, baker beach, seashore, ocean"). Difficult to learn cross-modal features from low-level representations.
  49. Challenges - II. Noisy and missing data. [Image-text examples: "pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion"; "mickikrimmel, mickipedia, headshot"; "unseulpixel, naturey, crap"; <no text>.]
  50. Challenges - II. Text generated by the model for the same images: "beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves"; "portrait, girl, woman, lady, blonde, pretty, gorgeous, expression, model"; "night, none, traffic, light, lights, parking, darkness, lowlight, nacht, glow"; "fall, autumn, trees, leaves, foliage, forest, woods, branches, path".
  51. A Simple Multimodal Model. • Use a joint binary hidden layer. • Problem: inputs have very different statistical properties (real-valued vs. 1-of-K). • Difficult to learn cross-modal features.
  52. Multimodal DBM. A Gaussian model over dense, real-valued image features; a Replicated Softmax model over word counts. (Srivastava & Salakhutdinov, NIPS 2012)
  53. Multimodal DBM. [Same architecture, image pathway highlighted.] (Srivastava & Salakhutdinov, NIPS 2012)
  54. Multimodal DBM. [Same architecture, text pathway highlighted.] (Srivastava & Salakhutdinov, NIPS 2012)
  55. Multimodal DBM. A Gaussian model over dense, real-valued image features; a Replicated Softmax model over word counts. Bottom-up + top-down inference.
  56. Multimodal DBM. A Gaussian RBM over dense, real-valued image features; a Replicated Softmax model over word counts. Bottom-up + top-down inference.
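A toy numpy sketch of the two modality-specific visible distributions given a shared binary hidden configuration (biases, the intermediate layers, and learning are all omitted; sizes and weights are made up): Gaussian units for the dense image features, and Replicated Softmax word counts drawn as M samples from a softmax over the vocabulary.

    import numpy as np

    rng = np.random.default_rng(0)
    D, V, H, M = 8, 10, 4, 5            # image dims, vocab, hidden units, doc length
    W_img = rng.normal(0, 0.1, (D, H))  # image-pathway weights
    W_txt = rng.normal(0, 0.1, (V, H))  # text-pathway weights

    def sample_given_hidden(h):
        # Gaussian visible units for dense image features (unit variance).
        v_img = W_img @ h + rng.normal(0, 1.0, D)
        # Replicated Softmax visible units: draw M words from a softmax.
        logits = W_txt @ h
        p = np.exp(logits - logits.max()); p /= p.sum()
        word_counts = rng.multinomial(M, p)
        return v_img, word_counts

    h = rng.integers(0, 2, H)           # one binary hidden configuration
    v_img, counts = sample_given_hidden(h)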
  57. Text Generated from Images. Given → generated: "canada, nature, sunrise, ontario, fog, mist, bc, morning"; "insect, butterfly, insects, bug, butterflies, lepidoptera"; "graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco"; "portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy"; "dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry"; "sea, france, boat, mer, beach, river, bretagne, plage, brittany".
  58. Text Generated from Images. Given → generated: "water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop"; "portrait, women, army, soldier, mother, postcard, soldiers"; "obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally".
  59. Images from Text. Given → retrieved: "water, red, sunset"; "nature, flower, red, green"; "blue, green, yellow, colors"; "chocolate, cake".
  60. MIR-Flickr Dataset (Huiskes et al.). • 1 million images along with user-assigned tags, e.g. "sculpture, beauty, stone"; "nikon, green, light, photoshop, apple, d70"; "white, yellow, abstract, lines, bus, graphic"; "sky, geotagged, reflection, cielo, bilbao, reflejo"; "food, cupcake, vegan"; "d80"; "anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography"; "nikon, abigfave, goldstaraward, d80, nikond80".
  61. Data and Architecture. 12 million parameters. [Architecture: image input (3857 units) and word-count input (2000 units), each followed by two 1024-unit layers, joined by a 2048-unit top layer.] • 200 most frequent tags. • 25K labeled subset (15K training, 10K testing). • 38 classes: sky, tree, baby, car, cloud, ... • Additional 1 million unlabeled examples.
  62. Results. • Logistic regression on the top-level representation. • Multimodal inputs; 25K labeled examples. Mean Average Precision:
      Learning Algorithm    | MAP   | Precision@50
      Random                | 0.124 | 0.124
      LDA [Huiskes et al.]  | 0.492 | 0.754
      SVM [Huiskes et al.]  | 0.475 | 0.758
      DBM-Labelled          | 0.526 | 0.791
  63. Results. • Logistic regression on the top-level representation. • Multimodal inputs; 25K labeled examples + 1 million unlabeled. Mean Average Precision:
      Learning Algorithm    | MAP   | Precision@50
      Random                | 0.124 | 0.124
      LDA [Huiskes et al.]  | 0.492 | 0.754
      SVM [Huiskes et al.]  | 0.475 | 0.758
      DBM-Labelled          | 0.526 | 0.791
      Deep Belief Net       | 0.638 | 0.867
      Autoencoder           | 0.638 | 0.875
      DBM                   | 0.641 | 0.873
  64. Generating Sentences. • A more challenging problem. • How can we generate complete descriptions of images? Input [image] → output: "A man skiing down the snow covered mountain with a dark sky in the background."
  65. Talk Roadmap. Part 1: Advanced Deep Learning Models: • Learning Structured and Robust Models • Transfer and One-Shot Learning. Part 2: Multimodal Deep Learning: • Multimodal DBMs for images and text • Caption Generation with Recurrent RNNs • Skip-Thought vectors.
  66. Encode-Decode Framework. • Encoder: a CNN and a recurrent neural net for a joint image-sentence embedding. • Decoder: a neural language model that combines structure and content vectors for generating a sequence of words.
  67. Representation of Words. • Key idea: each word w is represented as a D-dimensional real-valued vector r_w ∈ R^D. [Semantic space: table, chair; dolphin, whale; November.] (Bengio et al., 2003; Mnih et al., 2008; Mikolov et al., 2009; Kiros et al., 2014)
  68. An Image-Text Encoder. • Learn a joint embedding space of images and text: - Can condition on anything (images, words, phrases, etc.). - Natural definition of a scoring function (inner products in the joint space). Encoder: ConvNet. [Example: ship, water.] (Socher 2013; Frome 2013; Kiros 2014)
  69. An Image-Text Encoder. A recurrent neural network (LSTM) over 1-of-V encodings of words w1, w2, w3 produces the sentence representation; a convolutional neural network produces the image representation. See Skip-Thought Vectors (Kiros et al., arXiv 2015).
  70. An Image-Text Encoder. Joint feature space. Images and text: "A castle and reflecting water"; "A ship sailing in the ocean"; "A plane flying in the sky". Minimize the following objective: [ranking objective; equation not legible in the transcript].
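The objective itself did not survive extraction; the form commonly used for such joint embeddings (e.g., in the cited Kiros et al. work) is a pairwise ranking loss that scores matching image-sentence pairs above mismatched ones by a margin. A numpy sketch under that assumption:

    import numpy as np

    def ranking_loss(img, sen, margin=0.2):
        """Pairwise ranking loss over a minibatch.

        img, sen: (B, K) L2-normalized image / sentence embeddings;
                  row i of each forms a matching pair.
        """
        scores = img @ sen.T                # cosine similarities
        pos = np.diag(scores)               # matching-pair scores
        # Hinge on every mismatched pair, in both directions.
        cost_im = np.maximum(0, margin - pos[:, None] + scores)
        cost_s = np.maximum(0, margin - pos[None, :] + scores)
        np.fill_diagonal(cost_im, 0); np.fill_diagonal(cost_s, 0)
        return cost_im.sum() + cost_s.sum()

    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
    sen = rng.normal(size=(4, 8)); sen /= np.linalg.norm(sen, axis=1, keepdims=True)
    loss = ranking_loss(img, sen)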
  71. Retrieving Sentences for Images.
  72. Tagging and Retrieval. "mosque, tower, building, cathedral, dome, castle"; "kitchen, stove, oven, refrigerator, microwave"; "ski, skiing, skiers, skiiers, snowmobile"; "bowl, cup, soup, cups, coffee"; "beach"; "snow".
  73. Retrieval with Adjectives. "fluffy"; "delicious".
  74. Multimodal Linguistic Regularities. Nearest images. (Kiros, Salakhutdinov, Zemel, TACL 2015)
  75. Multimodal Linguistic Regularities. Nearest images. (Kiros, Salakhutdinov, Zemel, TACL 2015)
  76. Can you solve the Zero-Shot Problem?
  77. Can you solve the Zero-Shot Problem? Canada Warbler; Yellow Warbler; Sharp-tailed Sparrow.
  78. Model Architecture. [Diagram: an image goes through a CNN to give visual features g_f (C × k); a Wikipedia article (e.g., "The Cardinals or Cardinalidae are a family of passerine birds found in North and South America. The South American cardinals in the genus ...") goes through TF-IDF and an MLP to give a 1 × k text embedding; their dot product gives the 1 × C class scores.]
  79. Alternative View. Joint semantic feature space. • Minimize the cross-entropy or hinge-loss objective. (L. Ba, K. Swersky, S. Fidler, Salakhutdinov, arXiv 2015)
  80. The Model. • Consider a binary one-vs.-all classifier for class c. • Assume that we are given an additional text feature for each class. • Simple idea: instead of learning a static weight vector w, the text feature can be used to predict the weights, via a mapping that transforms the text features to the visual image feature space. • We can use this idea to predict the output weights of a classifier (both the fully connected and convolutional layers of a CNN).
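A numpy sketch of this idea with hypothetical names (g, t_c, and the two weight matrices are illustrative, not from the slides): a small MLP maps each class's TF-IDF text features into the visual feature space, and the resulting vectors act as the classifier weights, so an unseen class is scored using nothing but its article.

    import numpy as np

    rng = np.random.default_rng(0)
    T, K, C = 50, 20, 7            # text-feature dim, image-feature dim, #classes
    W1 = rng.normal(0, 0.1, (64, T))
    W2 = rng.normal(0, 0.1, (K, 64))

    def g(t):
        # Maps TF-IDF text features t_c to image-feature space: w_c = g(t_c).
        return W2 @ np.tanh(W1 @ t)

    text_feats = rng.normal(size=(C, T))   # one article per class (incl. unseen)
    img_feat = rng.normal(size=K)          # CNN feature of a test image

    # Class scores are dot products between the image feature and the
    # text-predicted weight vectors; unseen classes need no image training data.
    scores = np.array([g(t) @ img_feat for t in text_feats])
    pred = int(np.argmax(scores))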
  81. Problem Setup. • The training set contains N images and their associated class labels; there are C distinct class labels. • During test time, we are given an additional number of previously unseen classes. • Our goal is to predict the previously unseen classes and perform well on the previously seen classes. • Interpretation: learning a good similarity kernel between images and encyclopedia articles.
  82. Caltech UCSD Bird and Oxford Flower Datasets. • CUB200-2010 contains 6,033 images from 200 bird species (about 30 images per class). • There is one Wikipedia article associated with each bird class (200 articles in total); the average number of words per article is 400. • Out of 200 classes, 40 classes are defined as unseen and the remaining 160 classes are defined as seen. • The Oxford Flower-102 dataset contains 102 classes with a total of 8,189 images. • Each class contains 40 to 260 images; 82 flower classes are used for training and 20 classes are used as unseen during testing.
  83. Results: Area Under ROC. The CUB200-2010 dataset:
      Learning Algorithm          | Unseen | Seen  | Mean
      DA [Elhoseiny et al., 2013] | 0.59   | -     | -
      DA (VGG features)           | 0.62   | -     | -
      Ours (fc)                   | 0.82   | 0.96  | 0.934
      Ours (conv)                 | 0.73   | 0.96  | 0.91
      Ours (fc+conv)              | 0.80   | 0.987 | 0.95
      (M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV 2013.)
  84. Results: Area Under ROC. The Oxford Flower dataset:
      Learning Algorithm          | Unseen | Seen  | Mean
      DA [Elhoseiny et al.]       | 0.62   | -     | -
      GPR+DA [Elhoseiny et al.]   | 0.68   | -     | -
      Ours (fc)                   | 0.70   | 0.987 | 0.90
      Ours (conv)                 | 0.65   | 0.97  | 0.85
      Ours (fc+conv)              | 0.71   | 0.989 | 0.93
      (M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV 2013.)
  85. Attribute Discovery. Word sensitivities of unseen classes: Tanagers -- Thraupidae; Scarlet -- Cardinalidae; Tanagers. Nearest neighbors: • the Wikipedia article for each class is projected onto its feature space, and the nearest image neighbors from the test set are shown.
  86. Attribute Discovery. Word sensitivities of unseen classes: rhizome -- depth; freezing -- compost; iris. Nearest neighbors: • the Wikipedia article for each class is projected onto its feature space, and the nearest image neighbors from the test set are shown. Bearded Iris.
  87. How About Generating Sentences! Input [image] → output: "A man skiing down the snow covered mountain with a dark sky in the background." We want to model the probability of a sentence given an image.
  88. Log-bilinear Neural Language Model. • Each word w is represented as a K-dimensional real-valued vector r_w ∈ R^K. • Let R denote the V × K matrix of word representation vectors, where V is the vocabulary size. • Given a tuple (w_1, ..., w_{n-1}) of n-1 words, where n-1 is the context size, the predicted next-word representation becomes r̂ = Σ_{i=1}^{n-1} C_i r_{w_i}, where the C_i are K × K context parameter matrices. • A feedforward neural network with a single linear hidden layer.
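A numpy sketch of the log-bilinear prediction step just described (toy sizes; biases limited to b): the context word vectors are combined through the per-position K × K matrices, and the next-word distribution is a softmax over inner products with every row of R, which is the expensive part the next slide mentions.

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, n = 100, 16, 4                   # vocab size, embedding dim, n-gram order
    R = rng.normal(0, 0.1, (V, K))         # word representation matrix
    C = rng.normal(0, 0.1, (n - 1, K, K))  # K x K context matrices
    b = np.zeros(V)

    def next_word_probs(context):          # context: n-1 word indices
        # Predicted next-word representation: r_hat = sum_i C_i r_{w_i}.
        r_hat = sum(C[i] @ R[w] for i, w in enumerate(context))
        logits = R @ r_hat + b             # score every word in the vocabulary
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()                 # the softmax over V can be expensive

    probs = next_word_probs([3, 17, 42])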
  89. Log-bilinear Neural Language Model. • The conditional probability of the next word is given by a softmax over the vocabulary: P(w_n = i | w_{1:n-1}) = exp(r̂ᵀ r_i + b_i) / Σ_j exp(r̂ᵀ r_j + b_j), where r̂ is the predicted representation of r_{w_n}. This can be expensive to compute. (Bengio et al., 2003)
  90. Multiplicative Model. • We represent words as a V × K × G tensor, where G is the number of tensor slices. • Re-represent the tensor in terms of 3 lower-rank matrices, where F is the number of pre-chosen factors. • Given an attribute vector u ∈ R^G (e.g., image features), we can compute attribute-gated word representations. (Kiros, Zemel, Salakhutdinov, NIPS 2014)
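A numpy sketch of the attribute-gated embeddings under one standard low-rank factorization (the factor-matrix names Wfv, Wfk, Wfg are mine, not from the slides): the full V × K × G tensor is never formed; an attribute vector u rescales F factors component-wise to produce an attribute-specific K × V embedding matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, G, F = 100, 16, 8, 12        # vocab, embedding dim, #attributes, #factors
    Wfv = rng.normal(0, 0.1, (F, V))   # factors-to-words
    Wfk = rng.normal(0, 0.1, (F, K))   # factors-to-embedding-dims
    Wfg = rng.normal(0, 0.1, (F, G))   # factors-to-attributes

    def gated_embeddings(u):
        # Attribute-gated word embeddings E(u), a K x V matrix:
        # E(u) = Wfk^T diag(Wfg @ u) Wfv -- each factor is scaled by the attributes.
        return Wfk.T @ (np.diag(Wfg @ u) @ Wfv)

    u = rng.normal(size=G)             # e.g., image features gating the language model
    E_u = gated_embeddings(u)          # the embedding of word w is column E_u[:, w]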
  91. Multiplicative Log-bilinear Model. • Let E denote a folded K × V matrix of word embeddings. • Then the predicted next-word representation is: [equation]. • Given the next-word representation r, the factor outputs are: [a component-wise product of low-rank projections, gated by the attributes u]. [Diagram: 1-of-V encodings of the context words w1, w2, w3 pass through low-rank embeddings E and gating attributes u to factors f and the prediction r for w4.] (Kiros, Zemel, Salakhutdinov, NIPS 2014)
  92. Multiplicative Log-bilinear Model. • The conditional probability of the next word is given by the same softmax as before, now using the attribute-gated word embeddings.
  93. Decoding: Neural Language Model. • Image features gate the hidden-to-output connections when predicting the next word! • We can also condition on POS tags when generating a sentence. (Kiros, Salakhutdinov, Zemel, ICML 2014)
  94. Decoding: Structured NLM.
  95. Decoding: Structured NLM.
  96. Decoding: Structured NLM.
  97. Decoding: Structured NLM.
  98. Decoding: Structured NLM.
  99. Caption Generation.
  100. Caption Generation.
  101. Caption Generation.
  102. Filling in the Blanks.
  103. Caption Generation. Model samples: • "Two men in a room talking on a table." • "Two men are sitting next to each other." • "Two men are having a conversation at a table." • "Two men sitting at a desk next to each other." TAGS: colleagues, waiters, waiter, entrepreneurs, busboy.
  104. Results. • R@K is Recall@K (high is good). • Med r is the median rank (low is good).
  105. Caption Generation with Visual Attention. "A woman is throwing a frisbee in a park." (Xu et al., ICML 2015)
  106. Caption Generation with Visual Attention. "A woman is throwing a frisbee in a park." (Xu et al., ICML 2015)
  107. Caption Generation with Visual Attention.
  108. Caption Generation with Visual Attention. (Xu et al., ICML 2015) • The Montreal/Toronto team takes 3rd place in the Microsoft COCO caption generation competition, finishing slightly behind Google and Microsoft, based on the human evaluation results.
  109. Recurrent Attention Model. Coarse image → recurrent neural network → classification or caption generation. Sample action: • Hard attention: sample the action (latent gaze location). • Soft attention: take the expectation instead of sampling.
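A numpy sketch contrasting the two options on this slide, for region features a_i and an attention distribution alpha computed from the RNN state (all values here are toy placeholders): hard attention samples one latent gaze location, while soft attention takes the expectation over locations instead.

    import numpy as np

    rng = np.random.default_rng(0)
    L, D = 5, 8                          # number of locations, feature dim
    a = rng.normal(size=(L, D))          # region features (e.g., a conv-net grid)
    e = rng.normal(size=L)               # attention scores from the RNN state
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # attention distribution

    # Hard attention: sample a latent gaze location (trained with
    # REINFORCE-style methods, since sampling is not differentiable).
    loc = rng.choice(L, p=alpha)
    z_hard = a[loc]

    # Soft attention: take the expectation instead of sampling (differentiable).
    z_soft = alpha @ a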
  110. Relationship to Helmholtz Machines. • A neural network with stochastic and deterministic units (coarse image input). • Goal: maximize the probability of the correct class (or sequence of words) by marginalizing over the actions (the latent gaze locations). (Ba et al., 2015)
  111. Relationship to Helmholtz Machines. • View this model as a conditional Helmholtz machine with stochastic and deterministic units. • Can use wake-sleep, variational autoencoders, and their variants to learn a good attention policy! [Diagram: a Helmholtz machine with input data v and layers h1, h2, h3 connected by weights W1, W2, W3.] (Ba et al., 2015)
  112. Relationship to Helmholtz Machines. • View this model as a conditional Helmholtz machine with stochastic and deterministic units. • Can use wake-sleep, variational autoencoders, and their variants to learn a good attention policy! Wake-Sleep Recurrent Attention Models. (Ba et al., 2015)
  113. Improving Action Recognition with Visual Attention. Cycling; horseback riding; basketball shooting; soccer juggling. (Sharma et al., 2015)
  114. Talk Roadmap. Part 1: Advanced Deep Learning Models: • Learning Structured and Robust Models • Transfer and One-Shot Learning. Part 2: Multimodal Deep Learning: • Multimodal DBMs for images and text • Caption Generation with Recurrent RNNs • Skip-Thought vectors.
  115. Motivation. • Goal: unsupervised learning of high-quality sentence representations. • Previous approaches: learning composition operators that map word vectors to sentence vectors, including recursive networks, recurrent nets, etc. • Instead of using a word to predict its surrounding context, we encode a sentence to predict the sentences around it. • We develop an objective that abstracts the skip-gram model to the sentence level.
  116. Sequence to Sequence Learning. • RNN encoder-decoders for machine translation (Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2013; Srivastava et al., 2015). Input sequence → encoder → learned representation → decoder → output sequence.
  117. Skip-Thought Model. • Given a tuple (s_{i-1}, s_i, s_{i+1}) of contiguous sentences: - the sentence s_i is encoded using an LSTM; - the decoder attempts to reconstruct the previous sentence s_{i-1} and the next sentence s_{i+1}. • Example input sentence triplet: - "I got back home." - "I could see the cat on the steps." - "This was strange." (Kiros et al., arXiv 2015)
  118. Skip-Thought Model. [Diagram: an encoder over the sentence; one decoder generates the forward sentence, another generates the previous sentence.]
  119. Learning Objective. • We are given a tuple (s_{i-1}, s_i, s_{i+1}) of contiguous sentences. • Objective: the sum of the log-probabilities for the next and previous sentences, conditioned on the representation h_i of the encoder: Σ_t log P(w_{i+1}^t | w_{i+1}^{<t}, h_i) + Σ_t log P(w_{i-1}^t | w_{i-1}^{<t}, h_i).
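A schematic numpy sketch of this objective with teacher forcing; the decoders are stubbed out (a uniform distribution stands in), since the slide specifies only that both condition on the encoder representation h_i:

    import numpy as np

    def log_prob_sentence(words, h, decoder):
        # Sum of log P(w_t | w_<t, h) under a decoder conditioned on the
        # encoder representation h (teacher forcing).
        total, prev = 0.0, []
        for w in words:
            p = decoder(prev, h)           # distribution over the vocabulary
            total += np.log(p[w])
            prev = prev + [w]
        return total

    def skip_thought_objective(prev_sent, next_sent, h, dec_prev, dec_next):
        # Log-probabilities of the previous and next sentences,
        # both conditioned on h = encoder(s_i).
        return (log_prob_sentence(next_sent, h, dec_next)
                + log_prob_sentence(prev_sent, h, dec_prev))

    # Toy stand-in decoder: uniform over a 10-word vocabulary.
    uniform = lambda prev, h: np.full(10, 0.1)
    h = np.zeros(4)
    obj = skip_thought_objective([1, 2], [3, 4, 5], h, uniform, uniform)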
  120. Book 11K corpus. • A query sentence along with its nearest neighbor from 500K sentences, using cosine similarity: - "He ran his hand inside his coat, double-checking that the unopened letter was still there." - "He slipped his hand between his coat and his shirt, where the folded copies lay in a brown envelope."
  121. Book 11K corpus. • Query sentences along with their nearest neighbors from 500K sentences, using cosine similarity.
  122. Semantic Relatedness. • SemEval 2014 Task 1: the semantic relatedness SICK dataset: given two sentences, produce a score of how semantically related these sentences are, based on human-generated scores (1 to 5). • The dataset comes with a predefined split of 4,500 training pairs, 500 development pairs, and 4,927 testing pairs. • Using skip-thought vectors for each sentence, train a linear regression to predict semantic relatedness: for a pair of sentences, compute component-wise features between the pair (e.g., |u - v|).
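A numpy sketch of the pipeline this slide describes, with plain ridge regression standing in for the linear model (the feature choice |u - v| and u * v follows the slide's "component-wise features"; everything else, including the toy data, is illustrative):

    import numpy as np

    def pair_features(u, v):
        # Component-wise features between a sentence pair, e.g. |u - v| and u * v.
        return np.concatenate([np.abs(u - v), u * v])

    def fit_ridge(X, y, lam=1.0):
        # Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    rng = np.random.default_rng(0)
    K, N = 16, 200                       # embedding dim, #training pairs (toy)
    U = rng.normal(size=(N, K))          # skip-thought vectors, sentence 1
    W = rng.normal(size=(N, K))          # skip-thought vectors, sentence 2
    y = rng.uniform(1, 5, N)             # human relatedness scores in [1, 5]

    X = np.array([pair_features(u, w) for u, w in zip(U, W)])
    coef = fit_ridge(X, y)
    pred = pair_features(U[0], W[0]) @ coef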
  123. Semantic Relatedness. • Our models outperform all previous systems from the SemEval 2014 competition. This is remarkable, given the simplicity of our approach and the lack of feature engineering. [Chart: SemEval 2014 submissions; results reported by Tai et al.; ours.]
  124. Semantic Relatedness. • Example predictions from the SICK test set; GT is the ground-truth relatedness, scored between 1 and 5. • The last few results: slight changes in sentences result in large changes in relatedness that we are unable to score correctly.
  125. Paraphrase Detection. • Microsoft Research Paraphrase Corpus: for two sentences, one must predict whether or not they are paraphrases. • The training set contains 4,076 sentence pairs (2,753 are positive). • The test set contains 1,725 pairs (1,147 are positive). [Chart: recursive autoencoders; best published results; ours.]
  126. Classification Benchmarks. • 5 datasets: movie review sentiment (MR), customer product reviews (CR), subjectivity/objectivity classification (SUBJ), opinion polarity (MPQA), and question-type classification (TREC). [Chart: bag-of-words; supervised; ours.]
  127. Summary. • We believe our model for learning skip-thought vectors only scratches the surface of possible objectives. • Many variations have yet to be explored, including: - deep encoders and decoders - larger context windows - encoding and decoding paragraphs - other encoders. • It is likely the case that more exploration of this space will result in even higher quality sentence representations. • Code and data are available online! http://www.cs.toronto.edu/~mbweb/
  128. Multi-Modal Models. Laser scans; images; video; text & language; time-series data; speech & audio. Develop learning systems that come closer to displaying human-like intelligence.
  129. Summary. • Efficient learning algorithms for hierarchical generative models; learning more adaptive, robust, and structured representations. • Deep models can improve the current state of the art in many application domains: object recognition and detection, text and image retrieval, handwritten character and speech recognition, and others. [Panels: text & image retrieval / object recognition; learning a category hierarchy; dealing with missing/occluded data; HMM decoder / speech recognition; multimodal data ("sunset, pacific ocean, beach, seashore"); caption generation.]
  130. Thank you. Code is available at: http://deeplearning.cs.toronto.edu/
  131. References. • J. Ba, V. Mnih, K. Kavukcuoglu, Multiple Object Recognition with Visual Attention, ICLR 2015. • L. Ba, K. Swersky, S. Fidler, R. Salakhutdinov, Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions, arXiv 2015. • Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A Neural Probabilistic Language Model, JMLR 2003. • Y. Bengio, Y. LeCun, Scaling Learning Algorithms towards AI, Large-Scale Kernel Machines, MIT Press, 2007. • Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009. • Y. Burda, R. Grosse, R. Salakhutdinov, Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing, AISTATS 2015. • K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014. • A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013. • G. Hinton, R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, 2006. • G. Hinton, S. Osindero, Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 18, 2006. • G. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, 2002.
  132. References. • H. Lee, A. Battle, R. Raina, A. Ng, Efficient Sparse Coding Algorithms, NIPS 2007. • H. Lee, R. Grosse, R. Ranganath, A. Y. Ng, Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009. • N. Kalchbrenner, P. Blunsom, Recurrent Continuous Translation Models, EMNLP 2013. • R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal Neural Language Models, ICML 2014. • R. Kiros, R. Salakhutdinov, R. Zemel, Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, TACL 2015. • R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, S. Fidler, Skip-Thought Vectors, arXiv 2015. • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 1998. • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013. • R. Salakhutdinov, J. Tenenbaum, A. Torralba, Learning with Hierarchical-Deep Models, IEEE PAMI, vol. 35, no. 8, Aug. 2013. • R. Salakhutdinov, Learning Deep Generative Models, Annual Review of Statistics and Its Application, Vol. 2, 2015. • R. Salakhutdinov, G. Hinton, An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, 2012, Vol. 24, No. 8. • R. Salakhutdinov, H. Larochelle, Efficient Learning of Deep Boltzmann Machines, AI and Statistics, 2010.
  133. References. • R. Salakhutdinov, G. Hinton, Replicated Softmax: an Undirected Topic Model, NIPS 2010. • R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007. • N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised Learning of Video Representations using LSTMs, ICML 2015. • N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR 2014. • I. Sutskever, O. Vinyals, Q. Le, Sequence to Sequence Learning with Neural Networks, NIPS 2014. • R. Socher, M. Ganjoo, C. Manning, A. Ng, Zero-Shot Learning Through Cross-Modal Transfer, NIPS 2013. • Y. Tang, R. Salakhutdinov, G. Hinton, Deep Lambertian Networks, ICML 2012. • Y. Tang, R. Salakhutdinov, G. Hinton, Robust Boltzmann Machines for Recognition and Denoising, CVPR 2012. • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015.
