Deep Learning Primer

Anantharaman Narayana Iyer

7th June 2014
What is Deep Learning?

Deep learning is a Machine Learning technique distinguished by 2 defining characteristics:

1.  Deep Architecture
•  Multiple layers of learning.
•  Methodologies to train these layers that get close to the global optimum, alleviating the effect of local minima arising from the non-convex objective function.

2.  Feature Learning (aka Representation Learning)
•  Traditional machine learning system designs, such as Logistic Regression, involve manual feature design. In contrast, a deep learning system automatically learns the features given the input.
  
Automatic Feature Extraction

[Diagram: a Machine Learning System takes Input, automatically extracts Features, and produces Output]
  
Why is there a phenomenal interest?

•  Considered the next big thing in Machine Learning by several experts
•  Breakthrough results reported in:
   –  Speech Recognition
      •  Microsoft Audio Video Indexing Service (MAVIS) reduced word error rates by about 30% on 4 major benchmarks
   –  Object Recognition
      •  MNIST digit recognition: error rate of 0.27%
      •  Successful image recognition by Google
   –  Natural Language Processing
      •  SENNA system, which reported state-of-the-art results in tasks like POS tagging, Chunking, Named Entity Recognition, etc.
•  Substantial investments in this technology recently by top technology companies
  
Building a deep learning system

•  There are many ways to build a deep learning system, with the defining characteristics being:
   –  Multiple layers, where each layer performs a nonlinear transformation of the output generated by its preceding layer (see the sketch after this list).
   –  Automatic feature learning, where the features are progressively more abstract.
   –  Hierarchical in nature.
•  Broad approaches/categorizations:
   –  Unsupervised or generative models
   –  Supervised discriminative models
   –  Hybrid (use an unsupervised model as an aid to perform superior discrimination)
•  Common building blocks for unsupervised and hybrid approaches:
   –  Restricted Boltzmann Machines (RBM)
   –  Autoencoders
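To make the first characteristic concrete, here is a minimal sketch in Python/NumPy of layers that each apply a nonlinear transformation to the output of the preceding layer. The layer sizes and the sigmoid nonlinearity are illustrative assumptions, not something the slides prescribe:

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic nonlinearity (one common choice; tanh or ReLU also work).
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # One layer: affine map of the preceding layer's output, then a nonlinearity.
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.random(64)                                   # raw input (e.g. pixel values)

# Illustrative sizes: 64 -> 32 -> 16; each layer transforms the one below it.
W1, b1 = 0.1 * rng.standard_normal((32, 64)), np.zeros(32)
W2, b2 = 0.1 * rng.standard_normal((16, 32)), np.zeros(16)

h1 = layer(x, W1, b1)      # first, less abstract representation
h2 = layer(h1, W2, b2)     # second, more abstract representation
print(h1.shape, h2.shape)  # (32,) (16,)
```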
  
Application Example

Problem: Suppose we need to build a deep learning system to detect whether a given digital image contains a human face or not. The inputs are the image pixels and the output is binary.

•  We can think of the human face as being composed of a few key facial constituents such as ears, eyes, nose, etc. These in turn can be thought of as contours with well-defined edges, which are themselves constituted by specific patterns of pixels.
•  We can think of this as generating edges from the input pixels, generating the facial parts from the edges, and detecting a human face from those parts.
•  The role of a hidden layer in this system is to perform a nonlinear transform of its inputs (a lower level of abstraction) and produce a more abstract output (e.g. generating a nose object from the given contours).
•  Thus we progressively move up in abstraction, starting from raw pixels and ending up with a face object, as in the sketch below.
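The abstraction hierarchy can be written down as a forward pass. This is a conceptual sketch only: the layer sizes are hypothetical, the weights are random rather than learned, biases are omitted for brevity, and a real network learns its edge and part detectors instead of having them built in:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes for a small 28x28 grayscale image.
n_pixels, n_edges, n_parts = 28 * 28, 200, 50

rng = np.random.default_rng(1)
pixels = rng.random(n_pixels)                          # flattened input image

W_edges = 0.01 * rng.standard_normal((n_edges, n_pixels))
W_parts = 0.01 * rng.standard_normal((n_parts, n_edges))
w_face  = 0.01 * rng.standard_normal(n_parts)

edges  = sigmoid(W_edges @ pixels)   # pixels -> edge-like features
parts  = sigmoid(W_parts @ edges)    # edges  -> facial-part features
p_face = sigmoid(w_face @ parts)     # parts  -> probability the image is a face
print(f"P(face) = {p_face:.3f}")
```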
  	
  
High level implementation steps

•  Suppose we implement the given application as a deep neural network as follows:
   –  The pixel values constitute the input layer.
   –  A single output unit constitutes the output layer.
   –  We will have 2 hidden layers.
•  We will use a stacked autoencoder as the basic building block.
   –  An autoencoder (AE) neural network learns to produce an output that is the same as its input, using unsupervised learning. Thus, given pixel values x as input, the goal of the AE is to produce an output image that is the same as the input (see the sketch below).
   –  As we have 2 hidden layers, we will require 2 AEs (say AE1 and AE2). We will create a bottleneck by having a smaller number of hidden units compared to the number of input units.
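A minimal NumPy sketch of such a bottleneck autoencoder, trained to reconstruct its input with a squared-error loss and plain gradient descent. All sizes, the sigmoid activation, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_hidden = 64, 25          # bottleneck: fewer hidden units than inputs
X = rng.random((500, n_in))      # stand-in for flattened image data

# Encoder and decoder weights (untied here for simplicity).
W1 = 0.1 * rng.standard_normal((n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_in)); b2 = np.zeros(n_in)

lr = 0.5
for epoch in range(200):
    H = sigmoid(X @ W1 + b1)         # encode: compressed representation
    X_hat = sigmoid(H @ W2 + b2)     # decode: reconstruction of the input
    d_out = (X_hat - X) * X_hat * (1 - X_hat)   # backprop through output sigmoid
    d_hid = (d_out @ W2.T) * H * (1 - H)        # backprop through hidden sigmoid
    W2 -= lr * H.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(axis=0)

print("reconstruction MSE:", np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - X) ** 2))
```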
  
•  Layerwise pretraining (sketched below):
   –  Train AE1 with the available images (which may or may not contain a human face), unsupervised. The outputs of the hidden units of AE1 now constitute the "learnt" features at an abstraction higher than the input pixels (e.g. edges from pixels).
   –  Cascade the output of the hidden layer of the AE in the previous step into AE2, and train AE2 to learn more abstract features (e.g. facial components from edges).
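A sketch of the layerwise pretraining loop, under the same illustrative assumptions as the autoencoder sketch above: AE1 is trained on the raw pixels, then AE2 is trained on AE1's hidden activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.5, epochs=200, seed=0):
    """Train one bottleneck AE on X; return its encoder weights and hidden codes."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = 0.1 * rng.standard_normal((n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = 0.1 * rng.standard_normal((n_hidden, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)
        X_hat = sigmoid(H @ W2 + b2)
        d_out = (X_hat - X) * X_hat * (1 - X_hat)
        d_hid = (d_out @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(axis=0)
    return W1, b1, sigmoid(X @ W1 + b1)

rng = np.random.default_rng(3)
images = rng.random((500, 64))                 # stand-in for unlabeled images

# AE1 learns low-level features (e.g. edges) directly from the pixels ...
W1, b1, features1 = train_autoencoder(images, n_hidden=32, seed=1)
# ... then AE2 is trained on AE1's hidden codes to learn more abstract
# features (e.g. facial components from edges).
W2, b2, features2 = train_autoencoder(features1, n_hidden=16, seed=2)
print(features1.shape, features2.shape)        # (500, 32) (500, 16)
```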
  
•  Add a logistic regression layer as the output layer, and stack the 2 AEs and the output layer to constitute a Neural Network.
•  Fine-tune this network using backpropagation with a smaller number of labeled images (a sketch follows).
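Finally, a sketch of stacking the two pretrained encoders with a logistic output unit and fine-tuning the whole network by backpropagation on labeled data. In practice W1, b1, W2, b2 come from the pretraining sketch above; here they are random, and the labels are placeholders, only so the snippet runs standalone:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.random((100, 64))                  # small labeled set of flattened images
y = rng.integers(0, 2, 100)                # placeholder face / no-face labels

# Random stand-ins for the pretrained AE1 and AE2 encoder weights.
W1 = 0.1 * rng.standard_normal((64, 32)); b1 = np.zeros(32)
W2 = 0.1 * rng.standard_normal((32, 16)); b2 = np.zeros(16)
w3 = 0.1 * rng.standard_normal(16);       b3 = 0.0   # logistic regression output unit

lr = 0.5
for epoch in range(300):
    # Forward pass through the stacked network.
    H1 = sigmoid(X @ W1 + b1)
    H2 = sigmoid(H1 @ W2 + b2)
    p = sigmoid(H2 @ w3 + b3)              # P(face) for each image

    # Backpropagate the cross-entropy loss through all layers.
    d3 = (p - y) / len(X)                  # gradient at the logistic output
    d2 = np.outer(d3, w3) * H2 * (1 - H2)
    d1 = (d2 @ W2.T) * H1 * (1 - H1)

    w3 -= lr * H2.T @ d3; b3 -= lr * d3.sum()
    W2 -= lr * H1.T @ d2; b2 -= lr * d2.sum(axis=0)
    W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(axis=0)

print("training accuracy:", np.mean((p > 0.5) == y))
```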
  
