Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)


https://telecombcn-dl.github.io/2017-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both an algorithmic and a computational perspective.


1. [course site] Optimization for neural network training. Day 4 Lecture 1. #DLUPC
Verónica Vilaplana (veronica.vilaplana@upc.edu), Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia)
2. Previously in DLAI…
• Multilayer perceptron
• Training: (stochastic / mini-batch) gradient descent
• Backpropagation
But… What type of optimization problem is this? Do local minima and saddle points cause problems? Does gradient descent perform well? How do we set the learning rate? How do we initialize the weights? How does batch size affect training?
3. Index
• Optimization for a machine learning task; differences between learning and pure optimization
• Expected and empirical risk
• Surrogate loss functions and early stopping
• Batch and mini-batch algorithms
• Challenges
• Local minima
• Saddle points and other flat regions
• Cliffs and exploding gradients
• Practical algorithms
• Stochastic Gradient Descent
• Momentum
• Nesterov momentum
• Learning rate
• Adaptive learning rates: AdaGrad, RMSProp, Adam
• Approximate second-order methods
• Parameter initialization
• Batch Normalization
4. Differences between learning and pure optimization
5. Optimization for NN training
• Goal: find the parameters that minimize the expected risk (generalization error)
J(θ) = E_(x,y)∼p_data [ L(f(x;θ), y) ]
where x is the input, f(x;θ) the predicted output, y the target output, E the expectation, p_data the true (unknown) data distribution, and L the loss function (how wrong predictions are).
• But we only have a training set of samples, so we minimize the empirical risk, the average loss on a finite dataset D:
J(θ) = E_(x,y)∼p̂_data [ L(f(x;θ), y) ] = (1/|D|) Σ_{(x⁽ⁱ⁾,y⁽ⁱ⁾)∈D} L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
where p̂_data is the empirical distribution and |D| is the number of examples in D.
6. Surrogate loss
• Often minimizing the real loss is intractable
• e.g. the 0-1 loss (0 if correctly classified, 1 if not) is intractable even for linear classifiers (Marcotte 1992)
• Minimize a surrogate loss instead
• e.g. the negative log-likelihood as a surrogate for the 0-1 loss
• Sometimes the surrogate loss may learn more
• the test 0-1 loss keeps decreasing even after the training 0-1 loss reaches zero
• by further pushing the classes apart from each other
Figures: 0-1 loss (blue) and surrogate losses (square, hinge, logistic); 0-1 loss (blue) and negative log-likelihood (red)
7. Surrogate loss functions
Binary classifier
• Probabilistic: outputs the probability of class 1, g(x) ≈ P(y=1 | x); the probability of class 0 is 1 − g(x).
Binary cross-entropy loss: L(g(x), y) = −( y log g(x) + (1−y) log(1−g(x)) )
Decision function: f(x) = I[g(x) > 0.5]
• Non-probabilistic: outputs a «score» g(x) for class 1; the score for the other class is −g(x).
Hinge loss: L(g(x), t) = max(0, 1 − t g(x)), where t = 2y − 1
Decision function: f(x) = I[g(x) > 0]
Multiclass classifier
• Probabilistic: outputs a vector of probabilities g(x) ≈ ( P(y=0|x), …, P(y=m−1|x) ).
Negative conditional log-likelihood loss: L(g(x), y) = −log g(x)_y
Decision function: f(x) = argmax(g(x))
• Non-probabilistic: outputs a vector g(x) of real-valued scores for the m classes.
Multiclass margin loss: L(g(x), y) = max(0, 1 + max_{k≠y} g(x)_k − g(x)_y)
Decision function: f(x) = argmax(g(x))
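To make the binary losses above concrete, here is a minimal NumPy sketch (not part of the slides; the function names and toy values are illustrative) that evaluates the binary cross-entropy on predicted probabilities and the hinge loss on raw scores:

```python
import numpy as np

def binary_cross_entropy(g, y, eps=1e-12):
    """L(g(x), y) = -(y log g(x) + (1 - y) log(1 - g(x))) for g(x) in (0, 1)."""
    g = np.clip(g, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def hinge_loss(g, y):
    """L(g(x), t) = max(0, 1 - t g(x)) with t = 2y - 1 in {-1, +1}."""
    t = 2 * y - 1
    return np.maximum(0.0, 1 - t * g)

# Toy check: confident correct predictions have low loss, wrong ones high loss.
y = np.array([1, 0, 1])
g_prob = np.array([0.9, 0.2, 0.4])        # probabilistic classifier outputs
g_score = np.array([2.0, -1.5, -0.3])     # raw scores for the non-probabilistic case
print(binary_cross_entropy(g_prob, y))
print(hinge_loss(g_score, y))
```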
8. Early stopping
• Training algorithms usually do not halt at a local minimum
• Early stopping:
• based on the true underlying loss (e.g. the 0-1 loss) measured on a validation set
• the number of training steps is a hyperparameter controlling the effective capacity of the model
• simple and effective, but we must keep a copy of the best parameters
• acts as a regularizer (Bishop 1995, …)
Figure: the training error decreases steadily while the validation error begins to increase; return the parameters at the point with the lowest validation error.
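A minimal sketch of the early-stopping loop described above. It assumes hypothetical `train_one_epoch` and `validation_error` callables and a `model.parameters` attribute; it keeps a copy of the best parameters and stops after `patience` epochs without improvement on the validation set:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=10):
    """Keep the parameters with the lowest validation error (early stopping)."""
    best_error = float("inf")
    best_params = copy.deepcopy(model.parameters)      # assumed attribute
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                         # one pass of (mini-batch) SGD
        error = validation_error(model)                # e.g. 0-1 loss on the validation set
        if error < best_error:
            best_error = error
            best_params = copy.deepcopy(model.parameters)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience: # validation error stopped improving
                break
    model.parameters = best_params                     # restore the best checkpoint
    return model, best_error
```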
9. Batch and mini-batch algorithms
• In most optimization methods used in ML the objective function decomposes as a sum over a training set
• Gradient descent, with examples {x⁽ⁱ⁾}_{i=1…m} from the training set and corresponding targets {y⁽ⁱ⁾}_{i=1…m}:
∇_θ J(θ) = E_(x,y)∼p̂_data [ ∇_θ L(f(x;θ), y) ] = (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• Using the complete training set can be very expensive: the gain from using more samples is less than linear (the standard error of the mean drops proportionally to sqrt(m)), and the training set may be redundant, so use a subset of the training set
• How many samples in each update step?
• Deterministic or batch gradient methods: process all training samples in one large batch
• Stochastic methods: use a single example at a time
• online methods: samples are drawn from a stream of continually created samples
• Mini-batch stochastic methods: use several (but not all) samples
10. Batch and mini-batch algorithms
Mini-batch size?
• Larger batches: more accurate estimate of the gradient, but with less than linear return
• Very small batches: multicore architectures are under-utilized
• If samples are processed in parallel, memory scales with batch size
• Smaller batches provide noisier gradient estimates
• Small batches may offer a regularizing effect (they add noise)
• but they may require a small learning rate
• and may increase the number of steps needed for convergence
Mini-batches should be selected randomly (shuffle the samples) to obtain unbiased estimates of the gradient.
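A small NumPy experiment on an assumed toy least-squares problem (not from the slides) illustrating the points above: the random mini-batch gradient is an unbiased but noisy estimate of the full-batch gradient, and its error shrinks only like 1/sqrt(m):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))                       # toy dataset
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10000)
theta = np.zeros(5)

def gradient(X_s, y_s, theta):
    """Average gradient of the squared error 0.5*(x·θ − y)² over the given samples."""
    return X_s.T @ (X_s @ theta - y_s) / len(y_s)

full = gradient(X, y, theta)                          # deterministic (batch) gradient
for m in (10, 100, 1000):
    errs = []
    for _ in range(200):
        idx = rng.choice(len(y), size=m, replace=False)   # random mini-batch
        errs.append(np.linalg.norm(gradient(X[idx], y[idx], theta) - full))
    # error drops roughly like 1/sqrt(m): less than linear return for larger batches
    print(m, np.mean(errs))
```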
11. Challenges in NN optimization
12. Local minima
• Convex optimization
• any local minimum is a global minimum
• there are several (polynomial-time) optimization algorithms
• Non-convex optimization
• the objective function in deep networks is non-convex
• deep models may have several local minima
• but this is not necessarily a major problem!
13. Local minima and saddle points
• Critical points: for f: ℝⁿ → ℝ, points where ∇_x f(x) = 0
• For high-dimensional loss functions, local minima are rare compared to saddle points
• Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j); it is real and symmetric, so it has an eigenvector/eigenvalue decomposition
• Intuition from the eigenvalues of the Hessian matrix:
• local minimum/maximum: all positive / all negative eigenvalues, which becomes exponentially unlikely as n grows
• saddle point: both positive and negative eigenvalues
Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
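A short NumPy illustration of the eigenvalue intuition above: classify a critical point from the spectrum of its (symmetric) Hessian. The example Hessian is an assumption for illustration, taken from f(x, y) = x² − y², which has a saddle at the origin:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its symmetric Hessian H."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"                 # all eigenvalues positive
    if np.all(eigvals < -tol):
        return "local maximum"                 # all eigenvalues negative
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"                  # mixed signs
    return "degenerate (some eigenvalues close to 0)"

H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])             # Hessian of f(x, y) = x^2 - y^2 at (0, 0)
print(classify_critical_point(H_saddle))       # -> "saddle point"
```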
14. Local minima and saddle points
• It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to the global optimum
• Finding a local minimum is good enough
• For many random functions, local minima are more likely to have low cost than high cost.
Figure: value of the local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points. As the number of parameters increases, local minima tend to cluster more tightly.
Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
15. Saddle points
• How to escape from saddle points?
• First-order methods
• are initially attracted to saddle points, but unless they hit one exactly, they are repelled when close
• hitting the critical point exactly is unlikely (the estimated gradient is noisy)
• saddle points are very unstable: the noise of stochastic gradient descent helps convergence, and the trajectory escapes quickly
• Second-order methods:
• Newton's method can jump to saddle points (where the gradient is 0)
SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it. (Slide credit: K. McGuinness)
16. Other difficulties
• Cliffs and exploding gradients
• Nets with many layers / recurrent nets can contain very steep regions (cliffs, arising from the multiplication of several parameters): gradient descent can move the parameters too far, jumping off the cliff (solution: gradient clipping)
• Long-term dependencies:
• the computational graph becomes very deep: vanishing and exploding gradients
17. Algorithms
18. Stochastic Gradient Descent (SGD)
• The most used algorithm for deep learning
• Do not confuse it with deterministic gradient descent: stochastic uses mini-batches
Algorithm
• Require: learning rate α, initial parameter θ
• while the stopping criterion is not met do
• sample a mini-batch of m examples {x⁽ⁱ⁾}_{i=1…m} from the training set with corresponding targets {y⁽ⁱ⁾}_{i=1…m}
• compute the gradient estimate: ĝ ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• apply the update: θ ← θ − α ĝ
• end while
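A minimal NumPy sketch of the SGD loop above, applied to an assumed toy linear least-squares problem (a fixed number of epochs stands in for the stopping criterion):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # toy dataset
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta, X_b, y_b):
    """Average gradient of 0.5*(x·θ − y)² over the mini-batch."""
    return X_b.T @ (X_b @ theta - y_b) / len(y_b)

theta = np.zeros(5)
alpha, batch_size = 0.05, 32
for epoch in range(20):                              # stopping criterion: fixed #epochs
    perm = rng.permutation(len(y))                   # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]         # mini-batch of m examples
        g_hat = grad(theta, X[idx], y[idx])          # gradient estimate ĝ
        theta -= alpha * g_hat                       # θ ← θ − α ĝ
print(theta)                                         # close to true_theta
```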
19. Momentum
• Designed to accelerate learning, especially for high-curvature, small-but-consistent gradients or noisy gradients
• Momentum aims to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient
Figure: contour lines of a quadratic loss with a poorly conditioned Hessian; the path (red) followed by SGD (left) and by momentum (right).
20. Momentum
• New variable v (velocity): the direction and speed at which the parameters move, an exponentially decaying average of the negative gradient
Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update rule:
• compute the gradient estimate: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• compute the velocity update: v ← λv − αg
• apply the update: θ ← θ + v
• Typical values: λ = 0.5, 0.9, 0.99 (in [0,1))
• The size of the step depends on how large and how aligned a sequence of gradients is.
• Read the physical analogy in the Deep Learning book (Goodfellow et al.)
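A minimal NumPy sketch of the momentum update above, applied (as an assumption, not taken from the slides) to a quadratic loss with a poorly conditioned Hessian like the one in the previous figure:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.005, lam=0.9):
    """One momentum update: v ← λv − αg,  θ ← θ + v."""
    v = lam * v - alpha * grad
    return theta + v, v

# Quadratic J(θ) = 0.5 θᵀ diag(1, 100) θ: high curvature in one direction, low in the other
A = np.diag([1.0, 100.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = A @ theta                         # exact gradient of the quadratic
    theta, v = momentum_step(theta, v, g)
print(theta)                              # approaches the minimum at the origin
```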
21. Nesterov accelerated gradient (NAG)
• A variant of momentum, where the gradient is evaluated after the current velocity is applied:
• approximate where the parameters will be on the next time step using the current velocity
• update the velocity using the gradient at the point where we predict the parameters will be
Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update:
• apply the interim update: θ̃ ← θ + λv
• compute the gradient (at the interim point): g ← (1/m) Σ_i ∇_θ̃ L(f(x⁽ⁱ⁾;θ̃), y⁽ⁱ⁾)
• compute the velocity update: v ← λv − αg
• apply the update: θ ← θ + v
• Interpretation: add a correction factor to momentum (a NumPy sketch follows after the next slide)
22. Nesterov accelerated gradient (NAG)
Figure: standard momentum uses the gradient at the current location, ∇L(w_t), to go from v_t to v_{t+1}; Nesterov uses the gradient at the location predicted from velocity alone, ∇L(w_t + γv_t). (Slide credit: K. McGuinness)
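A minimal NumPy sketch of the NAG update from slide 21, again on an assumed poorly conditioned quadratic; the only difference from plain momentum is where the gradient is evaluated:

```python
import numpy as np

def nag_step(theta, v, grad_fn, alpha=0.005, lam=0.9):
    """Nesterov update: evaluate the gradient at the interim point θ̃ = θ + λv."""
    theta_interim = theta + lam * v       # look ahead using the current velocity
    g = grad_fn(theta_interim)            # gradient where we predict the parameters will be
    v = lam * v - alpha * g               # v ← λv − αg
    return theta + v, v                   # θ ← θ + v

A = np.diag([1.0, 100.0])                 # same toy quadratic as in the momentum sketch
grad_fn = lambda th: A @ th
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    theta, v = nag_step(theta, v, grad_fn)
print(theta)
```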
23. SGD: learning rate
• The learning rate is a crucial parameter for SGD
• Too large: overshoots the local minimum, the loss increases
• Too small: makes very slow progress, can get stuck
• Good learning rate: makes steady progress toward the local minimum
• In practice it is necessary to gradually decrease the learning rate (t = iteration number):
• step decay (e.g. decay by half every few epochs)
• exponential decay: α = α₀ e^(−kt)
• 1/t decay: α = α₀ / (1 + kt)
• Sufficient conditions for convergence: Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t² < ∞
• Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
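The three decay schedules above written as small Python functions (a sketch; the constants are illustrative choices, not values from the slides):

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    """Step decay: multiply by `drop` every `every` epochs."""
    return alpha0 * drop ** (epoch // every)

def exponential_decay(alpha0, t, k=0.01):
    """Exponential decay: α = α₀ · exp(−k t)."""
    return alpha0 * np.exp(-k * t)

def inv_t_decay(alpha0, t, k=0.01):
    """1/t decay: α = α₀ / (1 + k t)."""
    return alpha0 / (1.0 + k * t)

alpha0 = 0.1
for t in (0, 10, 100, 1000):
    print(t, step_decay(alpha0, t), exponential_decay(alpha0, t), inv_t_decay(alpha0, t))
```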
24. Adaptive learning rates
• The learning rate is one of the hyperparameters that is most difficult to set, and it has a significant impact on model performance
• The cost is often sensitive to some directions in parameter space and insensitive to others
• Momentum/Nesterov mitigate this issue but introduce another hyperparameter
• Solution: use a separate learning rate for each parameter and automatically adapt it over the course of learning
• Algorithms (mini-batch based):
• AdaGrad
• RMSProp
• Adam
• RMSProp with Nesterov momentum
25. AdaGrad
• Adapts the learning rate of each parameter based on the sizes of its previous updates:
• scales updates to be larger for parameters that are updated less
• scales updates to be smaller for parameters that are updated more
• The net effect is greater progress in the more gently sloped directions of parameter space
• Desirable theoretical properties, but empirically (for deep models) it can result in a premature and excessive decrease of the effective learning rate
• Require: learning rate α, initial parameter θ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
• compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• accumulate the squared gradient: r ← r + g ⊙ g (sum of all previous squared gradients)
• compute the update: Δθ ← −(α / (δ + √r)) ⊙ g (updates inversely proportional to the square root of the sum)
• apply the update: θ ← θ + Δθ
Duchi et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
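A minimal NumPy sketch of the AdaGrad update above, run on the same assumed toy quadratic used earlier (so the exact gradient replaces the mini-batch estimate):

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha=0.1, delta=1e-7):
    """AdaGrad: r ← r + g⊙g,  Δθ ← −α g / (δ + √r)."""
    r = r + grad * grad                            # accumulate all squared gradients
    theta = theta - alpha * grad / (delta + np.sqrt(r))
    return theta, r

A = np.diag([1.0, 100.0])                          # gently vs. steeply sloped directions
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, A @ theta)
print(theta)                                       # per-parameter step sizes shrink over time
```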
26. Root Mean Square Propagation (RMSProp)
• Modifies AdaGrad to perform better on non-convex surfaces, where its learning rates decay too aggressively
• Changes the gradient accumulation into an exponentially decaying average of the squared gradients
• Require: learning rate α, initial parameter θ, decay rate ρ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
• compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• accumulate the squared gradient: r ← ρr + (1−ρ) g ⊙ g
• compute the update: Δθ ← −(α / (δ + √r)) ⊙ g
• apply the update: θ ← θ + Δθ
• It can be combined with Nesterov momentum
Geoff Hinton, unpublished
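The same sketch with the RMSProp accumulation rule; compared with the AdaGrad sketch, essentially only the update of r changes (plus a smaller step size, an illustrative choice):

```python
import numpy as np

def rmsprop_step(theta, r, grad, alpha=0.01, rho=0.9, delta=1e-7):
    """RMSProp: r ← ρr + (1−ρ) g⊙g,  Δθ ← −α g / (δ + √r)."""
    r = rho * r + (1.0 - rho) * grad * grad        # exponentially decaying average
    theta = theta - alpha * grad / (delta + np.sqrt(r))
    return theta, r

A = np.diag([1.0, 100.0])                          # assumed toy quadratic, as before
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = rmsprop_step(theta, r, A @ theta)
print(theta)                                       # moves steadily toward the origin
```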
27. ADAptive Moments (Adam)
• A combination of RMSProp and momentum, but:
• it keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
• it includes bias corrections (first and second moments) to account for their initialization at the origin
• Update (typical values δ = 10⁻⁸, ρ₁ = 0.9, ρ₂ = 0.999; t = time step):
• compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• update the biased first moment estimate: s ← ρ₁s + (1−ρ₁)g
• update the biased second moment: r ← ρ₂r + (1−ρ₂) g ⊙ g
• correct the biases: ŝ ← s / (1−ρ₁ᵗ), r̂ ← r / (1−ρ₂ᵗ)
• compute the update: Δθ ← −α ŝ / (δ + √r̂) (operations applied elementwise)
• apply the update: θ ← θ + Δθ
Kingma et al., Adam: a Method for Stochastic Optimization. ICLR 2015
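A minimal NumPy sketch of the Adam update above (t is the time step used in the bias corrections; the toy quadratic is again an assumption for illustration):

```python
import numpy as np

def adam_step(theta, s, r, t, grad, alpha=0.01, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: decaying first/second moment estimates with bias correction."""
    s = rho1 * s + (1.0 - rho1) * grad             # first moment (momentum-like)
    r = rho2 * r + (1.0 - rho2) * grad * grad      # second moment (RMSProp-like)
    s_hat = s / (1.0 - rho1 ** t)                  # bias corrections, t = 1, 2, ...
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta - alpha * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r

A = np.diag([1.0, 100.0])
theta, s, r = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    theta, s, r = adam_step(theta, s, r, t, A @ theta)
print(theta)
```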
28. Example: test function (Beale's function). Image credit: Alec Radford.
29. Example: saddle point. Image credit: Alec Radford.
30. Second-order optimization
Figure: first-order vs. second-order approximation.
31. Second-order optimization
• Second-order Taylor expansion:
J(θ) ≈ J(θ₀) + (θ−θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ−θ₀)ᵀ H (θ−θ₀)
• Solving for the critical point we obtain the Newton parameter update:
θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)
• Problem: the Hessian has O(N²) elements and inverting H costs O(N³) (N = number of parameters, ≈ millions)
• Alternatives:
• Quasi-Newton methods (BFGS, Broyden–Fletcher–Goldfarb–Shanno): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time, O(N²) each
• L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian
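A small NumPy example of the Newton update above on an assumed quadratic objective, where a single step lands exactly on the minimum (the linear system is solved instead of explicitly inverting H):

```python
import numpy as np

# Quadratic J(θ) = 0.5 θᵀAθ − bᵀθ, with gradient Aθ − b and constant Hessian A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

def hessian(theta):
    return A                                    # constant for a quadratic

theta0 = np.array([5.0, 5.0])
# Newton update θ* = θ₀ − H⁻¹ ∇J(θ₀), computed by solving H x = ∇J(θ₀)
theta_star = theta0 - np.linalg.solve(hessian(theta0), grad(theta0))
print(theta_star)                               # equals the exact minimizer A⁻¹ b
print(np.linalg.solve(A, b))
```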
32. Parameter initialization
• Weights
• Can't initialize the weights to 0 (the gradients would be 0)
• Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; we need to break symmetry)
• Small random numbers, e.g. from a uniform or Gaussian distribution N(0, 10⁻²)
• if the weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
• Calibrating the variances with 1/sqrt(n) (Xavier initialization)
• each neuron: w = randn(n) / sqrt(n), with n inputs
• He initialization (for ReLU activations): sqrt(2/n)
• each neuron: w = randn(n) * sqrt(2.0/n), with n inputs
• Biases
• initialize all to 0 (except the output unit for skewed distributions, or 0.01 to avoid saturating ReLUs)
• Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs, or trained on an unrelated task
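The Xavier and He rules from this slide as a short NumPy sketch (the layer sizes are illustrative); the printed empirical standard deviations should match the target scales 1/sqrt(n) and sqrt(2/n):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier: standard Gaussian scaled by 1/sqrt(n_in), n_in = inputs per neuron."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out):
    """He (for ReLU activations): standard Gaussian scaled by sqrt(2/n_in)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

W_xavier = xavier_init(512, 256)
W_he = he_init(512, 256)
biases = np.zeros(256)                          # biases initialized to 0
print(W_xavier.std(), 1.0 / np.sqrt(512))
print(W_he.std(), np.sqrt(2.0 / 512))
```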
33. Batch normalization
• As learning progresses, the distribution of the layer inputs changes due to the parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
• Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
• Add a learnable scale and bias term to allow the network to still use the nonlinearity
Layer layout: FC / Conv → Batch norm → ReLU → FC / Conv → Batch norm → ReLU
Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"
34. Batch normalization
• Can be applied to any input or hidden layer
• For a mini-batch of N activations of a D-dimensional layer (an N×D matrix X):
1. compute the empirical mean and variance for each dimension
2. normalize: x̂⁽ᵏ⁾ = (x⁽ᵏ⁾ − E[x⁽ᵏ⁾]) / √var(x⁽ᵏ⁾)
• Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime)
• So let the network learn the identity: scale and shift, y⁽ᵏ⁾ = γ⁽ᵏ⁾ x̂⁽ᵏ⁾ + β⁽ᵏ⁾
• To recover the identity mapping the network can learn γ⁽ᵏ⁾ = √var(x⁽ᵏ⁾) and β⁽ᵏ⁾ = E[x⁽ᵏ⁾]
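A minimal NumPy sketch of the training-mode batch-norm transform above for an N×D batch of activations; keeping running averages of the batch statistics is an assumption about how the fixed test-time estimates of slide 36 are obtained:

```python
import numpy as np

def batchnorm_train(X, gamma, beta, running_mean, running_var,
                    momentum=0.9, eps=1e-5):
    """Training-mode BN for an N×D batch X: normalize per dimension, then scale and shift."""
    mean = X.mean(axis=0)                            # 1. empirical mean per dimension
    var = X.var(axis=0)                              #    and empirical variance
    X_hat = (X - mean) / np.sqrt(var + eps)          # 2. normalize
    out = gamma * X_hat + beta                       # learnable scale γ and shift β
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var

rng = np.random.default_rng(0)
N, D = 64, 8
X = 3.0 + 2.0 * rng.standard_normal((N, D))          # activations with non-zero mean/variance
gamma, beta = np.ones(D), np.zeros(D)
out, rm, rv = batchnorm_train(X, gamma, beta, np.zeros(D), np.ones(D))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per dimension
```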
35. Batch normalization
1. Improves gradient flow through the network
2. Allows higher learning rates
3. Reduces the strong dependency on initialization
4. Reduces the need for regularization
36. Batch normalization
At test time BN layers function differently:
1. The mean and std are not computed on the batch.
2. Instead, a single fixed empirical mean and std of the activations, computed during training, is used (it can be estimated with running averages).
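A companion sketch of the test-time behaviour described above: the stored running estimates replace the batch statistics (the particular running values below are assumed for illustration):

```python
import numpy as np

def batchnorm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test-mode BN: use fixed running estimates instead of batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
D = 8
x = 3.0 + 2.0 * rng.standard_normal((1, D))            # a single test example
gamma, beta = np.ones(D), np.zeros(D)
running_mean = 3.0 * np.ones(D)                        # assumed estimates from training
running_var = 4.0 * np.ones(D)
print(batchnorm_test(x, gamma, beta, running_mean, running_var))
```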
37. Summary
• Optimization for NN is different from pure optimization:
• GD with mini-batches
• early stopping
• non-convex surface, local minima and saddle points
• The learning rate has a significant impact on model performance
• Several extensions to SGD can improve convergence
• Adaptive learning-rate methods are likely to achieve the best results
• RMSProp, Adam
• Weight initialization: He, w = randn(n) * sqrt(2/n)
• Batch normalization to reduce the internal covariate shift
38. Bibliography
• Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
• Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
• Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
• Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.
