P05: Deep Boltzmann Machines (CVPR 2012 Tutorial: Deep Learning Methods for Vision)

Transcript

  • 1. Deep Learning. Russ Salakhutdinov, Department of Statistics and Computer Science, University of Toronto.
  • 2. Mining for Structure. Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements: images & video, text & language, speech & audio, gene expression, relational data / social networks, climate-change and geological data, product recommendation. Mostly unlabeled. • Develop statistical models that can discover underlying structure, cause, or statistical correlation from data in an unsupervised or semi-supervised way. • Multiple application domains.
  • 3. Talk Roadmap. • Unsupervised Feature Learning: Restricted Boltzmann Machines, Deep Belief Networks, Deep Boltzmann Machines. • Transfer Learning with Deep Models. • Multimodal Learning. [Diagram: RBM and DBM schematics.]
  • 4. Restricted Boltzmann Machines. Bipartite structure: stochastic binary visible variables (image pixels) are connected to stochastic binary hidden variables. The energy of the joint configuration is defined in terms of the model parameters, and the probability of the joint configuration is given by the Boltzmann distribution, with its partition function and potential functions. Related models: Markov random fields, Boltzmann machines, log-linear models.
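(The equations on this slide are images and did not survive extraction; as a reference point, a minimal sketch of the standard binary-binary RBM formulation, in conventional notation rather than the slide's own, is:)

```latex
\begin{align*}
E(\mathbf{v},\mathbf{h};\theta) &= -\mathbf{v}^{\top} W \mathbf{h}
   - \mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h},
   & \theta &= \{W,\mathbf{a},\mathbf{b}\},\\
P(\mathbf{v},\mathbf{h};\theta) &= \frac{1}{\mathcal{Z}(\theta)}\,
   \exp\!\bigl(-E(\mathbf{v},\mathbf{h};\theta)\bigr),
   & \mathcal{Z}(\theta) &= \sum_{\mathbf{v}',\mathbf{h}'}
   \exp\!\bigl(-E(\mathbf{v}',\mathbf{h}';\theta)\bigr).
\end{align*}
```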
  • 5. Restricted Boltzmann Machines. Bipartite structure. Restricted: no interactions between hidden variables. Inferring the distribution over the hidden variables given the image (visible variables) is easy: it factorizes and is easy to compute, and similarly for the visible variables given the hidden ones. Related models: Markov random fields, Boltzmann machines, log-linear models.
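(Again a hedged reconstruction of the missing equations: because the graph is bipartite, both conditionals factorize over units; in standard notation,)

```latex
\begin{align*}
p(h_j = 1 \mid \mathbf{v}) &= \sigma\Bigl(b_j + \sum_i W_{ij}\, v_i\Bigr), &
p(v_i = 1 \mid \mathbf{h}) &= \sigma\Bigl(a_i + \sum_j W_{ij}\, h_j\Bigr), &
\sigma(x) &= \frac{1}{1 + e^{-x}}.
\end{align*}
```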
  • 6. Model Learning. Given a set of i.i.d. training examples, we want to learn the model parameters. Maximize the (penalized) log-likelihood objective, with a regularization term. Derivative of the log-likelihood: difficult to compute, since it involves exponentially many configurations.
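(The derivative referred to here is the usual difference of expectations for undirected models; a standard-notation sketch, not copied from the slide, is:)

```latex
\frac{\partial}{\partial W_{ij}} \frac{1}{N}\sum_{n=1}^{N} \log P\bigl(\mathbf{v}^{(n)};\theta\bigr)
  \;=\; \mathbb{E}_{P_{\text{data}}}\bigl[v_i h_j\bigr]
  \;-\; \mathbb{E}_{P_{\text{model}}}\bigl[v_i h_j\bigr],
```

where the second (model) expectation is the one that requires summing over exponentially many configurations.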
  • 7. Model Learning. Given a set of i.i.d. training examples, we want to learn the model parameters by maximizing the (penalized) log-likelihood objective. The derivative of the log-likelihood is intractable, so we use approximate maximum-likelihood learning: Contrastive Divergence (Hinton 2000), Pseudo-Likelihood (Besag 1977), MCMC-MLE estimator (Geyer 1991), Composite Likelihoods (Lindsay 1988; Varin 2008), Tempered MCMC (Salakhutdinov, NIPS 2009), Adaptive MCMC (Salakhutdinov, ICML 2010).
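(To make the contrastive-divergence option concrete, here is a minimal NumPy sketch of one CD-1 update for a binary-binary RBM; the array shapes, batch size, and learning rate are illustrative assumptions, not values from the talk:)

```python
# Minimal sketch (not the author's code): one CD-1 update for a binary-binary RBM.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.05):
    """One CD-1 step on a batch of binary visible vectors v0 (N x D); updates W, b_v, b_h in place."""
    # Positive phase: infer hidden units from the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step (reconstruction of visibles, then hiddens).
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # Approximate gradient: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)

# Toy usage: 784-dimensional binary inputs, 500 hidden units, a random batch of 64.
D, H = 784, 500
W = 0.01 * rng.standard_normal((D, H))
b_v, b_h = np.zeros(D), np.zeros(H)
v0 = (rng.random((64, D)) < 0.5).astype(float)
cd1_update(v0, W, b_v, b_h)
```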
  • 8. RBMs for Images. Gaussian-Bernoulli RBM: define energy functions for various data modalities; Gaussian visible variables (image pixels) with Bernoulli hidden variables.
  • 9. RBMs for Images. Gaussian-Bernoulli RBM. Interpretation: a mixture of an exponential number of Gaussians, where the hidden configuration defines an implicit prior over the Gaussian components.
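(The Gaussian-Bernoulli energy itself is not legible in this transcript; the commonly used form, given as an assumption rather than a quote from the slide, is:)

```latex
E(\mathbf{v},\mathbf{h};\theta) \;=\;
  \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2}
  \;-\; \sum_{i,j} W_{ij}\, h_j\, \frac{v_i}{\sigma_i}
  \;-\; \sum_j b_j h_j ,
\qquad
p(v_i \mid \mathbf{h}) = \mathcal{N}\Bigl(a_i + \sigma_i \sum_j W_{ij} h_j,\; \sigma_i^2\Bigr).
```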
  • 10. RBMs for Images and Text. Images: Gaussian-Bernoulli RBM; learned features (out of 10,000) from 4 million unlabeled images. Text: Multinomial-Bernoulli RBM over bag-of-words input; learned features act as ``topics'', e.g. {russian, russia, moscow, yeltsin, soviet}, {clinton, house, president, bill, congress}, {computer, system, product, software, develop}, {trade, country, import, world, economy}, {stock, wall, street, point, dow}. Reuters dataset: 804,414 unlabeled newswire stories.
  • 11. Collaborative Filtering. Multinomial visible variables: user ratings; Bernoulli hidden variables: user preferences. Learned features correspond to ``genres'', e.g. {Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita}, {Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black}, {Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers}, {Scary Movie, Naked Gun, Hot Shots!, American Pie, Police Academy}. Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings. State-of-the-art performance on the Netflix dataset. Relates to Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2007).
  • 12. Multiple Application Domains. • Natural images • Text/documents • Collaborative filtering / matrix factorization • Video (Langford et al., ICML 2009; Lee et al.) • Motion capture (Taylor et al., NIPS 2007) • Speech perception (Dahl et al., NIPS 2010; Lee et al., NIPS 2010). The same learning algorithm applies across multiple input domains. Limitation: only certain types of structure can be represented by a single layer of low-level features.
  • 13. Talk Roadmap. • Unsupervised Feature Learning: Restricted Boltzmann Machines, Deep Belief Networks, Deep Boltzmann Machines. • Transfer Learning with Deep Models. • Multimodal Learning.
  • 14. Deep Belief Network. Low-level features: edges, built from unlabeled inputs. Input: image pixels.
  • 15. Deep Belief Network. Unsupervised feature learning: internal representations capture higher-order statistical structure. Higher-level features: combinations of edges. Low-level features: edges, built from unlabeled inputs. Input: image pixels. (Hinton et al., Neural Computation 2006)
  • 16. Deep Belief Network. The joint probability distribution factorizes: an RBM over the top two layers (h2, h3 with weights W3), with sigmoid belief networks below (h2 to h1 via W2, and h1 to the visible layer v via W1).
  • 17. DBNs for Classification. [Figure: a 500-500-2000 stack of RBMs is pretrained layer by layer, unrolled, and fine-tuned with a 10-way softmax output.] • After layer-by-layer unsupervised pretraining, discriminative fine-tuning by backpropagation achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%. • Clearly unsupervised learning helps generalization: it ensures that most of the information in the weights comes from modeling the input data.
  • 18. Deep Autoencoders. [Figure: a stacked encoder-decoder architecture trained in three stages: layer-wise pretraining, unrolling, and fine-tuning; the layer sizes in the slide graphic are not recoverable from this transcript.]
  • 19. Deep Generative Model. Model P(document) on the Reuters dataset: 804,414 newswire stories, trained unsupervised on bag-of-words input. [Figure: 2-D embedding with clusters such as European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.] (Hinton & Salakhutdinov, Science 2006)
  • 20. Information Retrieval. [Figure: 2-D LSA space with the same topic clusters.] • The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test). • ``Bag-of-words'': each article is represented as a vector containing the counts of the 2000 most frequently used words in the training set.
  • 21. Semantic Hashing. [Figure: documents mapped into a semantic address space; semantically similar documents land at nearby addresses.] • Learn to map documents into semantic 20-D binary codes. • Retrieve similar documents stored at the nearby addresses with no search at all.
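(A minimal sketch of the retrieval step, not the talk's implementation: once an encoder has filled a table mapping 20-bit codes to document ids, similar documents are found by probing all addresses within a small Hamming ball. The `memory` table and document ids below are hypothetical.)

```python
# Semantic-hashing-style lookup: probe every address within a Hamming ball of the query code.
from itertools import combinations

def hamming_ball(code: int, radius: int, n_bits: int = 20):
    """Yield all n_bits-wide codes within the given Hamming distance of `code`."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            yield flipped

def retrieve(memory: dict, query_code: int, radius: int = 2):
    """Collect document ids stored at all addresses near the query code (no linear search)."""
    hits = []
    for address in hamming_ball(query_code, radius):
        hits.extend(memory.get(address, []))
    return hits

# Toy usage with a hypothetical code table produced by the document encoder.
memory = {0b0101: ["doc_17"], 0b0111: ["doc_42"]}
print(retrieve(memory, 0b0101))  # finds doc_17 and its near-neighbour doc_42
```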
  • 22. Searching a Large Image Database Using Binary Codes. • Map images into binary codes for fast retrieval. • Small Codes: Torralba, Fergus, Weiss, CVPR 2008. • Spectral Hashing: Y. Weiss, A. Torralba, R. Fergus, NIPS 2008. • Kulis and Darrell, NIPS 2009; Gong and Lazebnik, CVPR 2011. • Norouzi and Fleet, ICML 2011.
  • 23. Talk Roadmap. • Unsupervised Feature Learning: Restricted Boltzmann Machines, Deep Belief Networks, Deep Boltzmann Machines. • Transfer Learning with Deep Models. • Multimodal Learning.
  • 24. DBNs vs. DBMs. Deep Belief Network vs. Deep Boltzmann Machine (layers h1, h2, h3 with weights W1, W2, W3 over the visible layer v). DBNs are hybrid models: • Inference in DBNs is problematic due to explaining away. • Only greedy pretraining, no joint optimization over all layers. • Approximate inference is feed-forward: no bottom-up and top-down interaction. We therefore introduce a new class of models called Deep Boltzmann Machines.
  • 25. Mathematical Formulation. Deep Boltzmann Machine (layers h1, h2, h3; model parameters W1, W2, W3). • Dependencies between hidden variables. • All connections are undirected. • Bottom-up and top-down: the input drives bottom-up inference while higher layers provide top-down feedback. Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio et al.), Deep Belief Nets (Hinton et al.).
  • 26. Mathematical Formulation. Deep Boltzmann Machine vs. Deep Belief Network vs. feed-forward Neural Network: the same stack of layers h1, h2, h3, but with undirected, mixed directed/undirected, and purely input-to-output connectivity, respectively. Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio), Deep Belief Nets (Hinton).
  • 27. Mathematical Formulation (continued). The same comparison, highlighting inference: in the DBM, inference is not a single feed-forward pass. Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio), Deep Belief Nets (Hinton).
  • 28. Mathematical Formulation. Deep Boltzmann Machine with model parameters W1, W2, W3. • Dependencies between hidden variables. Maximum likelihood learning follows the standard rule for undirected graphical models (MRFs, CRFs, factor graphs): the gradient is a difference between a data-dependent expectation and a model expectation. Problem: both expectations are intractable!
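(For reference, a standard-notation sketch of the DBM energy and gradient that the slide's images presumably show; this is the conventional form, with biases omitted for brevity:)

```latex
\begin{align*}
E(\mathbf{v},\mathbf{h}^1,\mathbf{h}^2,\mathbf{h}^3;\theta)
  &= -\mathbf{v}^{\top} W^1 \mathbf{h}^1
     -{\mathbf{h}^1}^{\top} W^2 \mathbf{h}^2
     -{\mathbf{h}^2}^{\top} W^3 \mathbf{h}^3,\\
\frac{\partial \log P(\mathbf{v};\theta)}{\partial W^1}
  &= \mathbb{E}_{P_{\text{data}}}\bigl[\mathbf{v}\,{\mathbf{h}^1}^{\top}\bigr]
   - \mathbb{E}_{P_{\text{model}}}\bigl[\mathbf{v}\,{\mathbf{h}^1}^{\top}\bigr],
\end{align*}
```

where, unlike in an RBM, even the data-dependent expectation is intractable because the posterior over the hidden layers no longer factorizes.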
  • 29. Previous Work. Many approaches for learning Boltzmann machines have been proposed over the last 20 years: Hinton and Sejnowski (1983); Peterson and Anderson (1987); Galland (1991); Kappen and Rodriguez (1998); Lawrence, Bishop, and Jordan (1998); Tanaka (1998); Welling and Hinton (2002); Zhu and Liu (2002); Welling and Teh (2003); Yasuda and Tanaka (2009). Real-world applications involve thousands of hidden and observed variables with millions of parameters. Many of the previous approaches were not successful for learning general Boltzmann machines with hidden variables, and algorithms based on Contrastive Divergence, Score Matching, Pseudo-Likelihood, Composite Likelihood, MCMC-MLE, and Piecewise Learning cannot handle multiple layers of hidden variables.
  • 30. New Learning Algorithm. Posterior inference (conditional): approximate the conditional distribution. Simulating from the model (unconditional): approximate the joint distribution. (Salakhutdinov, 2008; NIPS 2009)
  • 31. New Learning Algorithm. Posterior inference (conditional): approximate the conditional, giving the data-dependent statistics. Simulating from the model (unconditional): approximate the joint distribution, giving the data-independent statistics. Learning matches the two.
  • 32. New Learning Algorithm. Posterior inference (conditional): mean-field approximation of the conditional, for the data-dependent term. Simulating from the model (unconditional): Markov chain Monte Carlo approximation of the joint distribution, for the data-independent term. Learning matches the two. Key idea of our approach: data-dependent term via variational inference and mean-field theory; data-independent term via stochastic approximation and MCMC.
  • 33. Sampling from DBMs. Sampling from a two-hidden-layer DBM by running a Markov chain: randomly initialize, … , sample.
  • 34. Stochastic Approximation. At each time step t = 1, 2, 3, …, update the chain state and the parameters sequentially: • Generate the next chain state by simulating from a Markov chain that leaves the current model distribution invariant (e.g. a Gibbs or Metropolis-Hastings sampler). • Update the parameters by replacing the intractable model expectation with a point estimate from the chain. In practice we simulate several Markov chains in parallel. Robbins and Monro, Ann. Math. Stats, 1957; L. Younes, Probability Theory 1989; Tieleman, ICML 2008.
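(A minimal NumPy sketch of this persistent-chain update, shown for a single-layer RBM with biases omitted so the idea stays visible; for a DBM the data-dependent statistics would instead come from the mean-field step described on the later slides. Shapes and the learning rate are illustrative assumptions.)

```python
# Stochastic approximation: persistent Markov chains supply the model ("data-independent")
# statistics instead of one-step reconstructions.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sap_update(v_data, v_chains, W, lr=0.01):
    """One parameter update; v_chains holds the states of the persistent chains. Updates W in place."""
    # Data-dependent statistics (exact for an RBM; mean-field for a DBM).
    ph_data = sigmoid(v_data @ W)
    # Advance each persistent chain by one Gibbs sweep (leaves the current model distribution invariant).
    h_chain = (rng.random((len(v_chains), W.shape[1])) < sigmoid(v_chains @ W)).astype(float)
    v_chains = (rng.random(v_chains.shape) < sigmoid(h_chain @ W.T)).astype(float)
    ph_chain = sigmoid(v_chains @ W)
    # Replace the intractable model expectation with the chains' point estimate.
    W += lr * (v_data.T @ ph_data / len(v_data) - v_chains.T @ ph_chain / len(v_chains))
    return v_chains

# Toy usage: 64 data cases and 64 persistent chains over 784 visible units, 500 hidden units.
D, H = 784, 500
W = 0.01 * rng.standard_normal((D, H))
v_data = (rng.random((64, D)) < 0.5).astype(float)
v_chains = (rng.random((64, D)) < 0.5).astype(float)
for _ in range(10):                      # each step: one Gibbs update per chain + one parameter update
    v_chains = sap_update(v_data, v_chains, W)
```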
  • 35. Stochastic Approximation. The update rule decomposes into the true gradient plus a noise term, with almost-sure convergence guarantees as the learning rate decays. Problem: for high-dimensional data the energy landscape is highly multimodal, which makes the Markov chain Monte Carlo step difficult. Key insight: the transition operator can be any valid transition operator, e.g. tempered transitions or parallel/simulated tempering. Connections to the theory of stochastic approximation and adaptive MCMC.
  • 36. Variational Inference. Posterior inference: approximate the intractable posterior distribution with a simpler, tractable (mean-field) distribution, and maximize the resulting variational lower bound. Equivalently, minimize the KL divergence between the approximating and true distributions with respect to the variational parameters. (Salakhutdinov & Larochelle, AI & Statistics 2010)
  • 37. Variational Inference. Approximate the intractable posterior with a simpler, tractable distribution and maximize the variational lower bound. Mean-field: choose a fully factorized approximating distribution. Variational inference: maximize the lower bound with respect to the variational parameters, which yields nonlinear fixed-point equations.
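(The fixed-point equations themselves are not legible in the transcript; for a two-hidden-layer DBM in standard notation, with μ denoting the mean-field parameters q(h = 1) and biases omitted, they take the form:)

```latex
\begin{align*}
\mu^{1}_{j} &\leftarrow \sigma\Bigl(\sum_i W^{1}_{ij}\, v_i \;+\; \sum_k W^{2}_{jk}\, \mu^{2}_{k}\Bigr),\\
\mu^{2}_{k} &\leftarrow \sigma\Bigl(\sum_j W^{2}_{jk}\, \mu^{1}_{j}\Bigr),
\end{align*}
```

iterated to convergence for each data vector v; the converged μ's supply the data-dependent statistics in the stochastic-approximation update above.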
  • 38. Variational Inference. Combine the two steps: 1. Variational inference: maximize the lower bound with respect to the variational parameters (fast mean-field inference for the data-dependent term). 2. MCMC: apply stochastic approximation to update the model parameters (unconditional simulation with persistent Markov chains). Learning can scale to millions of examples, with almost-sure convergence guarantees to an asymptotically stable point.
  • 39. Good Generative Model? Handwritten characters.
  • 40. Good Generative Model? Handwritten characters.
  • 41. Good Generative Model? Handwritten characters: simulated vs. real data.
  • 42. Good Generative Model? Handwritten characters: real data vs. simulated.
  • 43. Good Generative Model? Handwritten characters.
  • 44. Good Generative Model? MNIST handwritten digit dataset.
  • 45. Deep Boltzmann Machine. Model P(image) trained on 25,000 handwritten characters (e.g. Sanskrit) from 50 alphabets around the world. • 3,000 hidden variables • 784 observed variables (28 by 28 images) • Over 2 million parameters. A Bernoulli Markov random field.
  • 46. Deep Boltzmann Machine. Conditional simulation: P(image | partial image). A Bernoulli Markov random field.
  • 47. Handwriting Recognition (permutation-invariant version).
    MNIST dataset (60,000 examples of 10 digits), error rates: Logistic regression 12.0%; K-NN 3.09%; Neural Net (Platt 2005) 1.53%; SVM (Decoste et al. 2002) 1.40%; Deep Autoencoder (Bengio et al. 2007) 1.40%; Deep Belief Net (Hinton et al. 2006) 1.20%; DBM 0.95%.
    Optical Character Recognition (42,152 examples of 26 English letters), error rates: Logistic regression 22.14%; K-NN 18.92%; Neural Net 14.62%; SVM (Larochelle et al. 2009) 9.70%; Deep Autoencoder (Bengio et al. 2007) 10.05%; Deep Belief Net (Larochelle et al. 2009) 9.68%; DBM 8.40%.
  • 48. Deep Boltzmann Machine. A Gaussian-Bernoulli Markov random field modeling P(image) for 96 by 96 stereo-pair images (e.g. planes); 12,000 latent variables; 24,000 training images.
  • 49. Generative Model of 3-D Objects. 24,000 examples, 5 object categories, 5 different objects within each category, 6 lighting conditions, 9 elevations, 18 azimuths.
  • 50. 3-D Object Recognition (permutation-invariant version; pattern completion). Error rates: Logistic regression 22.5%; K-NN (LeCun 2004) 18.92%; SVM (Bengio & LeCun 2007) 11.6%; Deep Belief Net (Nair & Hinton 2009) 9.0%; DBM 7.2%.
  • 51. Learning a Part-Based Hierarchy. Combinations of edges at the lower level, object parts at the higher level; trained from multiple classes (cars, faces, motorbikes, airplanes). Lee et al., ICML 2009.
  • 52. Robust Boltzmann Machines. • Build more complex models that can deal with occlusions or structured noise. A Gaussian RBM models clean faces and a binary RBM models occlusions; an inferred binary pixel-wise mask separates the observed image into a clean face plus Gaussian noise. Relates to Le Roux, Heess, Shotton, and Winn, Neural Computation, 2011; Eslami, Heess, Winn, CVPR 2012; Tang et al., CVPR 2012.
  • 53. Robust Boltzmann Machines. Internal states of the RoBM during learning; inference on the test subjects; comparison to other denoising algorithms.
  • 54. Spoken Query Detection. • 630-speaker TIMIT corpus: 3,696 training and 944 test utterances. • 10 query keywords were randomly selected, and 10 examples of each keyword were extracted from the training set. • Goal: for each keyword, rank all 944 utterances by the probability that the utterance contains that keyword. • Performance measure: the average equal error rate (EER). Results (average EER): GMM unsupervised 16.4%; DBM unsupervised 14.7%; DBM (1% labels) 13.3%; DBM (30% labels) 10.5%; DBM (100% labels) 9.7%. [Plot: average EER as a function of the training ratio.] (Yaodong Zhang et al., ICASSP 2012)
  • 55. Learning Hierarchical Representations. Deep Boltzmann Machines learn hierarchical structure in the features: edges, then combinations of edges. • Perform well in many application domains • Combine bottom-up and top-down inference • Fast inference: a fraction of a second • Learning scales to millions of examples. So far: many examples, few categories. Next: few examples, many categories, i.e. transfer learning.
  • 56. Talk Roadmap. • Unsupervised Feature Learning: Restricted Boltzmann Machines, Deep Belief Networks, Deep Boltzmann Machines. • Transfer Learning with Deep Models. • Multimodal Learning. [Figure: category hierarchy with ``animal'' (horse, cow) and ``vehicle'' (car, van, truck).]
  • 57. One-Shot Learning. ``zarc'', ``segway'': how can we learn a novel concept, a high-dimensional statistical object, from few examples?
  • 58. Learning from Few Examples. SUN database: classes sorted by frequency (car, van, truck, bus, …). Rare objects are similar to frequent objects.
  • 59. Traditional Supervised Learning. Training classes: segway, motorcycle. Test: what is this?
  • 60. Learning to Transfer. Background knowledge: millions of unlabeled images plus some labeled images (bicycle, dolphin, elephant, tractor). Learn to transfer that knowledge, then learn a novel concept from one example. Test: what is this?
  • 61. Learning to Transfer. Background knowledge: millions of unlabeled images plus some labeled images (bicycle, dolphin, elephant, tractor); learn to transfer knowledge and acquire a novel concept from one example. This is a key problem in computer vision, speech perception, natural language processing, and many other domains.
  • 62. One-Shot Learning: Hierarchical Bayesian Models. [Figure: a three-level hierarchical prior. Level 3: global hyperparameters {τ0, α0}; Level 2: super-categories (animal, vehicle, …) with parameters {µk, τk, αk}; Level 1: basic classes (cow, horse, sheep, truck, car) with parameters {µc, τc}.] The posterior probability of the parameters given the training data D combines the probability of the observed data given the parameters with the prior probability of the weight vector W. • Fei-Fei, Fergus, and Perona, TPAMI 2006 • E. Bart, I. Porteous, P. Perona, and M. Welling, CVPR 2007 • Miller, Matsakis, and Viola, CVPR 2000 • Sivic, Russell, Zisserman, Freeman, and Efros, CVPR 2008.
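(Reading the figure's levels into Bayes' rule, and writing it only as a generic sketch rather than the specific model of any one cited paper:)

```latex
\begin{align*}
p(W \mid D) &\;\propto\; p(D \mid W)\; p(W \mid \mu_c, \tau_c),\\
(\mu_c, \tau_c) &\sim p(\,\cdot \mid \mu_k, \tau_k, \alpha_k), \qquad
(\mu_k, \tau_k, \alpha_k) \sim p(\,\cdot \mid \tau_0, \alpha_0),
\end{align*}
```

so a new class with few examples borrows statistical strength through its super-category's level-2 parameters.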
  • 63. Hierarchical-Deep Models. HD models compose hierarchical Bayesian models with deep networks, two influential approaches from unsupervised learning. Deep networks (part-based hierarchy, cf. Marr and Nishihara 1978): • learn multiple layers of nonlinearities; • are trained in an unsupervised fashion (unsupervised feature learning), with no need to rely on human-crafted input representations; • labeled data is used only to slightly adjust the model for a specific task. Hierarchical Bayes (category-based hierarchy, cf. Collins & Quillian 1969): • explicitly represents category hierarchies for sharing abstract knowledge; • explicitly identifies only a small number of parameters that are relevant to the new concept being learned.
  • 64. Motivation. Learning to transfer knowledge about a segway: • Super-category (hierarchical level): ``a segway looks like a funny kind of vehicle''. • Higher-level features, or parts, shared with other classes: wheel, handle, post. • Lower-level (deep) features: edges, compositions of edges.
  • 65. Hierarchical Generative Model. A Hierarchical Latent Dirichlet Allocation model over categories (``animal'': horse, cow; ``vehicle'': car, van, truck) sits on top of a DBM model of images, whose lower-level generic features are edges and combinations of edges. (Salakhutdinov, Tenenbaum, Torralba, 2011)
  • 66. Hierarchical Generative Model. Hierarchical Latent Dirichlet Allocation model on top of a DBM. Hierarchical organization of categories: • expresses priors on the features that are typical of different kinds of concepts; • gives modular data-parameter relations. Higher-level class-sensitive features: • capture the distinctive perceptual structure of a specific concept. Lower-level generic features: • edges, combinations of edges. (Salakhutdinov, Tenenbaum, Torralba, 2011)
  • 67. Intuition. Words correspond to activations of the DBM's top-level units; topics are distributions over top-level units, i.e. higher-level parts. Each topic is made up of words, and each document is made up of topics; here the DBM's generic features play the role of words and the LDA high-level features play the role of topics. [Figure: LDA graphical model with hyperparameter α, topic proportions π = Pr(topic | doc), assignments z, topics θ, and words w with Pr(word | topic).]
  • 68. Hierarchical Deep Model. [Figure: a tree of categories (``animal'': horse, cow; ``vehicle'': car, van, truck) over K topics over the DBM.]
  • 69. Hierarchical Deep Model. The tree hierarchy of classes is learned, using a Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
  • 70. Hierarchical Deep Model. The tree hierarchy of classes is learned. A Nested Chinese Restaurant Process prior provides a nonparametric prior over tree structures; a Hierarchical Dirichlet Process prior provides a nonparametric prior allowing categories to share higher-level features, or parts (the K topics).
  • 71. Hierarchical Deep Model. The tree hierarchy of classes is learned. Unlike standard statistical models, in addition to inferring the parameters we also infer the hierarchy for sharing those parameters: a Nested Chinese Restaurant Process prior over tree structures and a Hierarchical Dirichlet Process prior that lets categories share higher-level features, or parts, over a Conditional Deep Boltzmann Machine. This enforces (approximate) global consistency through many local constraints.
  • 72. CIFAR Object Recognition. 50,000 images of 100 classes (32 x 32 pixels x 3 RGB channels), plus 4 million unlabeled images. The tree hierarchy of classes is learned; higher-level class-sensitive features sit on top of lower-level generic features. Inference: Markov chain Monte Carlo (details later).
  • 73. Learning to Learn. The model learns how to share knowledge across many visual categories. [Figure: learned super-class hierarchy with a ``global'' root and super-classes such as ``aquatic animal'' (dolphin, turtle, shark, ray), ``fruit'' (apple, orange, sunflower), and ``human'' (girl, baby, man, woman) over basic-level classes, together with learned higher-level class-sensitive features and learned low-level generic features.]
  • 74. Learning to Learn. The model learns how to share knowledge across many visual categories. [Figure: the full learned super-class hierarchy over basic-level classes including crocodile, spider, snake, lizard, castle, road, squirrel, bridge, kangaroo, skyscraper, bus, house, leopard, dolphin, fox, turtle, shark, ray, apple, orange, sunflower, truck, train, tiger, tank, lion, wolf, tractor, streetcar, otter, skunk, shrew, porcupine, pine, oak, maple, willow, bear, whale, camel, bottle, elephant, cattle, can, lamp, bowl, cup, chimpanzee, beaver, mouse, raccoon, pear, pepper, hamster, rabbit, possum, man, boy, girl, woman, baby, together with learned higher-level class-sensitive features and low-level generic features.]
  • 75. Sharing Features. ``Fruit'' super-class: real images, reconstructions, and the learned shape and color components for apple, orange, and sunflower (with dolphin as a contrast). Learning to learn: learning a hierarchy for sharing parameters enables rapid learning of a novel concept. [Figure: ROC curves (detection rate vs. false-alarm rate) for the sunflower class with 1, 3, and 5 training examples, compared against pixel-space distance.]
  • 76. Object Recognition. Area under the ROC curve for same/different (1 new class vs. 99 distractor classes), averaged over 40 test classes, as a function of the number of training examples (1, 3, 5, 10, 50): HDP-DBM outperforms LDA (class-conditional), DBM, HDP-DBM without super-classes, and GIST. Our model outperforms standard computer vision features (e.g. GIST).
  • 77. Handwritten Character Recognition. [Figure: learned hierarchy over 25,000 characters: ``alphabet 1'' and ``alphabet 2'' super-classes over individual characters, with learned higher-level features (strokes) and learned lower-level features (edges).]
  • 78. Handwritten Character Recognition. Area under the ROC curve for same/different (1 new class vs. 1000 distractor classes), averaged over 40 test classes, as a function of the number of training examples (1, 3, 5, 10): HDP-DBM outperforms LDA (class-conditional), DBM, HDP-DBM without super-classes, and raw pixels.
  • 79. Simulating New Characters. [Figure: real data within a super-class (global root, super-classes 1 and 2, classes 1 and 2) shown next to simulated characters from a new class.]
  • 80. Simulating New Characters (same layout: real data within the super-class vs. simulated new characters).
  • 81. Simulating New Characters (same layout).
  • 82. Simulating New Characters (same layout).
  • 83. Simulating New Characters (same layout).
  • 84. Simulating New Characters (same layout).
  • 85. Simulating New Characters (same layout).
  • 86. Learning from very few examples. Three examples of a new class; conditional samples in the same class; inferred super-class.
  • 87. Learning from very few examples
  • 88. Learning from very few examples
  • 89. Learning from very few examples
  • 90. Learning from very few examples
  • 91. Learning from very few examples
  • 92. Learning from very few examples
  • 93. Learning from very few examples
  • 94. Learning from very few examples
  • 95. Hierarchical-Deep. So far we have considered directed + undirected models: a directed hierarchy of topics and categories on top of the deep undirected part, with low-level features replacing GIST or SIFT. Deep Lambertian Networks combine the elegant properties of the Lambertian model with Gaussian RBMs (and Deep Belief Nets, Deep Boltzmann Machines). Tang et al., ICML 2012.
  • 96. Deep Lambertian Networks. Model specifics: the observed image is explained by inferred surface normals, albedo, and light source, with a Deep Lambertian Net prior. Inference: Gibbs sampler. Learning: stochastic approximation.
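(The generative equation on the slide is an image; the Lambertian reflectance rule it builds on, written here as an assumption in standard form, relates each observed pixel to its albedo, surface normal, and the light direction:)

```latex
v_i \;=\; a_i\,\bigl(\mathbf{n}_i^{\top}\boldsymbol{\ell}\bigr) \;+\; \epsilon_i,
\qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),
```

with deep (Gaussian-RBM-based) priors placed over the albedo a and the surface normals n, and the light source ℓ inferred along with them by the Gibbs sampler.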
  • 97. Deep Lambertian Networks. Yale B Extended Database: face relighting from one or two test images.
  • 98. Deep Lambertian Networks. Recognition as a function of the number of training images for 10 test subjects; one-shot recognition.
  • 99. Recursive Neural Networks. Recursive structure learning: local recursive networks predict whether to merge the two inputs as well as predicting the label. Uses max-margin estimation. Socher et al., ICML 2011.
  • 100. Recursive Neural Networks. Recursive structure learning (continued). Socher et al., ICML 2011.
  • 101. Learning from Few Examples. SUN database: classes sorted by frequency (car, van, truck, bus, …). Rare objects are similar to frequent objects.
  • 102. Learning from Few Examples. Classes sorted by frequency: chair, armchair, swivel chair, deck chair, …
  • 103. Generative Model of Classifier Parameters. Many state-of-the-art object detection systems use sophisticated models based on multiple parts with separate appearance and shape components. Detect objects by testing sub-windows and scoring the corresponding test patches with a linear function. We define a hierarchical prior over the parameters of the discriminative model and learn the hierarchy. Image-specific representation: concatenation of the HOG feature pyramid at multiple scales. Felzenszwalb, McAllester & Ramanan, 2008.
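(A compact way to read this slide: each category's detector is a linear scoring function over HOG features, and the hierarchical prior ties the weight vectors of related categories; the following is a schematic form, not the exact model of the cited papers:)

```latex
\text{score}(x) = \mathbf{w}_c^{\top}\,\phi_{\text{HOG}}(x),
\qquad
\mathbf{w}_c \sim \mathcal{N}\bigl(\mathbf{w}_{\text{super}(c)},\, \sigma_c^2 I\bigr),
\qquad
\mathbf{w}_{\text{super}(c)} \sim \mathcal{N}\bigl(\mathbf{w}_0,\, \sigma_0^2 I\bigr),
```

so categories with only a handful of training examples fall back on their super-category's weights.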
  • 104. Generative Model of Classifier Parameters: Hierarchical Bayes. By learning the hierarchical structure, we can improve on the current state of the art. SUN dataset: 32,855 examples of 200 categories. [Figure: detections from the hierarchical model vs. a single-class model for categories with 185, 27, and 12 training examples.]
  • 105. Talk Roadmap. • Unsupervised Feature Learning: Restricted Boltzmann Machines, Deep Belief Networks, Deep Boltzmann Machines. • Transfer Learning with Deep Models. • Multimodal Learning.
  • 106. Multi-Modal Input. Learning systems that combine multiple input domains: images, text & language, video, laser scans, speech & audio, time-series data. Develop learning systems that come closer to displaying human-like intelligence. One of the key challenges: inference.
  • 107. Multi-Modal Input. Learning systems that combine multiple input domains (image + text) give more robust perception. Ngiam et al., ICML 2011 used deep autoencoders (video + speech). • Guillaumin, Verbeek, and Schmid, CVPR 2011 • Huiskes, Thomee, and Lew, Multimedia Information Retrieval, 2010 • Xing, Yan, and Hauptmann, UAI 2005.
  • 108. Training Data. Samples from the MIR Flickr dataset (Creative Commons license): images with user tags such as ``pentax, k10d, kangarooisland, southaustralia, sa, australia, australiansealion, 300mm''; ``camera, jahdakine, lightpainting, reflection, doublepaneglass, wowiekazowie''; ``sandbanks, lake, lakeontario, sunset, walking, beach, purple, sky, water, clouds, overtheexcellence''; ``top20butterflies''; ``mickikrimmel, mickipedia, headshot''; and images with no text at all.
  • 109. Multi-Modal Input. • Improve classification (e.g. tags such as ``pentax, k10d, kangarooisland, southaustralia, sa, australia, australiansealion, 300mm'' help decide SEA / NOT SEA). • Fill in missing modalities (e.g. generate tags such as ``beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves''). • Retrieve data from one modality when queried using data from another modality.
  • 110. Multi-Modal Deep Belief Net. A Gaussian RBM models the dense image input and a Replicated Softmax model handles the sparse word-count (text) input; the two pathways are joined at higher layers.
  • 111. Multi-Modal Deep Belief Net. • Flickr data: 1 million images along with text tags; 25K annotated.
  • 112. Recognition Results. • Multimodal inputs (images + text), 38 classes, mean average precision: Image-text SVM 0.475; Image-text LDA 0.492; Multimodal DBN 0.566. • Unimodal inputs (images only), mean average precision: Image-SVM 0.375; Image-LDA 0.315; Image DBN 0.413.
  • 113. Pattern Completion. Given a test image, we generate the associated text, which achieves far better classification results.
  • 114. Thank you. Code for learning RBMs, DBNs, and DBMs is available at: http://www.mit.edu/~rsalakhu/