Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona 2018


Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.

Published in: Data & Analytics

  1. Optimization for neural network training
     Day 3, Lecture 2. #DLUPC
     Verónica Vilaplana, Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia)
  2. Previously in DLAI…
     • Multilayer perceptron
     • Training: (stochastic / mini-batch) gradient descent
     • Backpropagation
     • Loss function
     but…
     • What type of optimization problem?
     • Do local minima and saddle points cause problems?
     • Does gradient descent perform well?
     • How to set the learning rate?
     • How to initialize weights?
     • How does batch size affect training?
  3. Index
     • Optimization for a machine learning task; difference between learning and pure optimization
       • Expected and empirical risk
       • Surrogate loss functions and early stopping
       • Batch and mini-batch algorithms
     • Challenges for deep models
       • Local minima
       • Saddle points and other flat regions
       • Cliffs and exploding gradients
     • Practical algorithms
       • Stochastic Gradient Descent
       • Momentum
       • Nesterov Momentum
       • Learning rate
       • Adaptive learning rates: AdaGrad, RMSProp, Adam
     • Parameter initialization
     • Batch Normalization
  4. Differences between learning and pure optimization
  5. Optimization for NN training
     • Goal: find the parameters $\theta$ that minimize the expected risk (generalization error)
       $J(\theta) = \mathbb{E}_{(x,y) \sim p_{data}} L(f_\theta(x), y)$
       • $x$ input, $f_\theta(x)$ predicted output, $y$ target output, $\mathbb{E}$ expectation
       • $p_{data}$ true (unknown) data distribution, $L$ loss function (how wrong predictions are)
     • But we only have a training set of samples: we minimize the empirical risk, the average loss on a finite dataset $D$
       $J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{data}} L(f_\theta(x), y) = \frac{1}{|D|} \sum_{(x^{(i)}, y^{(i)}) \in D} L(f_\theta(x^{(i)}), y^{(i)})$
       where $\hat{p}_{data}$ is the empirical distribution and $|D|$ is the number of examples in $D$
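The empirical risk above can be sketched in a few lines of Python. This is only an illustration: the linear model `f_theta`, the squared loss, and the toy dataset `D` are assumptions made for the example and do not appear in the slides.

```python
def f_theta(theta, x):
    """Hypothetical linear model with theta = (weight, bias)."""
    w, b = theta
    return w * x + b


def squared_loss(prediction, target):
    """Assumed loss L: squared difference between prediction and target."""
    return (prediction - target) ** 2


def empirical_risk(theta, dataset, loss=squared_loss):
    """Average loss over the finite dataset D: (1/|D|) * sum of L(f_theta(x), y)."""
    return sum(loss(f_theta(theta, x), y) for x, y in dataset) / len(dataset)


# Toy dataset generated by y = 2x + 1 (assumed for illustration)
D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
risk_good = empirical_risk((2.0, 1.0), D)   # parameters that fit D exactly
risk_bad = empirical_risk((0.0, 0.0), D)    # parameters that predict 0 everywhere
```

The expected risk cannot be computed this way because `p_data` is unknown; the average over `D` is the quantity the training algorithm actually sees.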
  6. Surrogate loss
     • Often minimizing the real loss is intractable (it can't be used with gradient descent)
       • e.g. 0-1 loss (0 if correctly classified, 1 if it is not): $L(f(x), y) = I(f(x) \neq y)$
         (intractable even for linear classifiers (Marcotte 1992))
     • Minimize a surrogate loss instead
       • e.g. for the 0-1 loss:
         hinge: $L(f(x), y) = \max(0, 1 - y f(x))$
         square: $L(f(x), y) = (1 - y f(x))^2$
         logistic: $L(f(x), y) = \log(1 + e^{-y f(x)})$
     [Figure: 0-1 loss (blue) and surrogate losses (green: square, purple: hinge, yellow: logistic)]
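A minimal sketch of the three surrogate losses, written as functions of a real-valued score. It assumes labels `y` in {-1, +1}, and it applies the 0-1 loss to the sign of the score; both conventions are assumptions for this example.

```python
import math


def zero_one_loss(score, y):
    """0-1 loss on the sign of the score (assumes y in {-1, +1})."""
    return 0.0 if score * y > 0 else 1.0


def hinge_loss(score, y):
    """max(0, 1 - y f(x))"""
    return max(0.0, 1.0 - y * score)


def square_loss(score, y):
    """(1 - y f(x))^2"""
    return (1.0 - y * score) ** 2


def logistic_loss(score, y):
    """log(1 + exp(-y f(x)))"""
    return math.log(1.0 + math.exp(-y * score))


# A confidently correct score, a weakly correct score, and a wrong score for y = +1
for score in (2.0, 0.5, -1.0):
    print(score, zero_one_loss(score, 1), hinge_loss(score, 1),
          square_loss(score, 1), logistic_loss(score, 1))
```

Unlike the 0-1 loss, the surrogates still change with the score when an example is already classified correctly but with a small margin, which is what gives gradient descent a usable signal.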
  7. Surrogate loss functions
     • Probabilistic classifier
       • Binary classifier
         • Outputs the probability of class 1: f(x) ≈ P(y=1 | x); the probability for class 0 is 1 - f(x)
         • Binary cross-entropy loss: L(f(x), y) = -(y log(f(x)) + (1-y) log(1-f(x)))
         • Decision function: F(x) = I(f(x) > 0.5)
       • Multiclass classifier
         • Outputs a vector of probabilities: f(x) ≈ (P(y=0|x), ..., P(y=m-1|x))
         • Negative conditional log-likelihood loss: L(f(x), y) = -log f(x)_y
         • Decision function: F(x) = argmax(f(x))
     • Non-probabilistic classifier
       • Binary classifier
         • Outputs a «score» f(x) for class 1; the score for the other class is -f(x)
         • Hinge loss: L(f(x), t) = max(0, 1 - t f(x)), where t = 2y - 1
         • Decision function: F(x) = I(f(x) > 0)
       • Multiclass classifier
         • Outputs a vector f(x) of real-valued scores for the m classes
         • Multiclass margin loss: L(f(x), y) = max(0, 1 + max_{k≠y} f(x)_k - f(x)_y)
         • Decision function: F(x) = argmax(f(x))
  8. Early stopping
     • Training algorithms usually do not halt at a local minimum
     • Convergence criterion based on early stopping:
       • based on the surrogate loss or the true underlying loss (e.g. 0-1 loss) measured on a validation set
       • number of training steps = hyperparameter controlling the capacity of the model
       • simple; must keep a copy of the best parameters
       • acts as a regularizer (Bishop 1995, …)
     [Figure: training error decreases steadily; validation error begins to increase. Return the parameters at the point with the lowest validation error.]
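The early-stopping criterion can be sketched as follows. The `patience` parameter (how many non-improving evaluations to tolerate) and the list of validation errors are assumptions for this example; the slide only states that the best parameters must be kept.

```python
def early_stopping(validation_errors, patience=3):
    """Return (best_step, best_error) for a sequence of validation errors.

    Stops once `patience` consecutive evaluations fail to improve on the best
    error seen so far. In real training, a copy of the parameters would be
    saved every time best_step is updated.
    """
    best_step, best_error = None, float("inf")
    waited = 0
    for step, error in enumerate(validation_errors):
        if error < best_error:
            best_step, best_error = step, error
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_step, best_error


# Hypothetical validation curve: improves, then starts to increase
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.39, 0.30]
result = early_stopping(errors, patience=3)
```

With `patience=3` the loop halts before reaching the last entry, so the late improvement is never seen; this is the usual trade-off when choosing the patience value.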
  9. Batch and mini-batch algorithms
     • Gradient descent at each iteration computes gradients over the whole dataset for one update
       $\nabla_\theta J(\theta) = \frac{1}{m} \sum_i \nabla_\theta L(f_\theta(x^{(i)}), y^{(i)})$
       • ↑ Gradients are stable
       • ↓ Using the complete training set can be very expensive
         • the gain of using more samples is less than linear: the standard error of the mean estimated from m samples is $SE = \sigma / \sqrt{m}$ (σ is the true std)
       • ↓ Training set may be redundant
     • Use a subset of the training set (mini-batch gradient descent)
       Loop:
       1. sample a subset of data
       2. forward prop through the network
       3. backprop to calculate gradients
       4. update parameters using gradients
  10. Batch and mini-batch algorithms
     • How many samples in each update step?
       • Deterministic or batch gradient methods: process all training samples in a large batch
       • Mini-batch stochastic methods: use several (not all) samples
       • Stochastic methods: use a single example at a time
         • online methods: samples are drawn from a stream of continually created samples
     [Figure: batch vs mini-batch gradient descent]
  11. Batch and mini-batch algorithms
     Mini-batch size?
     • Larger batches: more accurate estimate of the gradient, but with less than linear return
     • Very small batches: multicore architectures are under-utilized
     • Smaller batches provide noisier gradient estimates
     • Small batches may offer a regularizing effect (they add noise)
       • but may require a small learning rate
       • may increase the number of steps for convergence
     • If the training set is small, use batch gradient descent
     • If the training set is large, use mini-batches
     • Mini-batches should be selected randomly (shuffle samples)
       • unbiased estimate of gradients
     • Typical mini-batch size: 32, 64, 128, 256
       • (powers of 2, $2^p$; make sure the mini-batch fits in CPU/GPU memory)
  12. Challenges in deep NN optimization
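  13. Convex / Non-convex optimization
     A function $f: X \to \mathbb{R}$ defined on an n-dim interval is convex if for any $x, x' \in X$ and $\lambda \in [0, 1]$:
     $f(\lambda x + (1 - \lambda) x') \le \lambda f(x) + (1 - \lambda) f(x')$
     [Figure: the chord value $\lambda f(x) + (1 - \lambda) f(x')$ lies above the function value $f(\lambda x + (1 - \lambda) x')$]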
  14. Convex / Non-convex optimization
     • Convex optimization
       • any local minimum is a global minimum
       • there are several optimization algorithms (polynomial-time)
     • Non-convex optimization
       • the objective function in deep networks is non-convex
       • deep models may have several local minima
       • but this is not necessarily a major problem!
  15. Local minima and saddle points
     • Critical points of $f: \mathbb{R}^n \to \mathbb{R}$: $\nabla_x f(x) = 0$
     • For high-dimensional loss functions, local minima are rare compared to saddle points
     • Hessian matrix: $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$; real and symmetric, so it has an eigenvector/eigenvalue decomposition
     • Intuition: eigenvalues of the Hessian matrix
       • local minimum / maximum: all positive / all negative eigenvalues; unlikely as n grows
       • saddle points: both positive and negative eigenvalues
     Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
  16. Local minima and saddle points
     • It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to the global optimum
     • Finding a local minimum is good enough
     • For many random functions, local minima are more likely to have low cost than high cost
     [Figure: value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different starting points. As the number of parameters increases, local minima tend to cluster more tightly.]
     Choromanska et al. The loss surfaces of multilayer networks. AISTATS 2015
  17. Saddle points
     How to escape from saddle points?
     • First-order methods
       • are attracted to saddle points, but unless they hit one exactly, they are repelled when close
       • hitting the point exactly is unlikely (the estimated gradient is noisy)
       • saddle points are very unstable: noise (stochastic gradient descent) helps convergence, and the trajectory escapes quickly
     • Second-order methods
       • Newton's method can jump to saddle points (where the gradient is 0)
     [Figure: SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it]
     Slide credit: K. McGuinness
  18. Other difficulties
     • Cliffs and exploding gradients
       • Nets with many layers / recurrent nets can contain very steep regions (cliffs): gradient descent can move the parameters too far, jumping off the cliff (solution: gradient clipping)
     • Long-term dependencies
       • the computational graph becomes very deep (deep nets / recurrent nets): vanishing and exploding gradients
     [Figure: cost function of a highly non-linear deep net or recurrent net (Pascanu 2013)]
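Gradient clipping is named above as the remedy for cliffs. One common variant, clipping by L2 norm, can be sketched as follows; the slide does not say which variant is meant, so the choice of norm clipping and the threshold `max_norm` are assumptions.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm.

    The direction is preserved; only the step length is limited.
    """
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)   # norm 50, rescaled to norm 5
unchanged = clip_by_norm([0.3, 0.4], max_norm=5.0)   # norm 0.5, left as it is
```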
  19. Algorithms
  20. Mini-batch Gradient Descent
     • Most used algorithm for deep learning
     Algorithm
     • Require: parameter $\theta$, learning rate $\alpha$
     • while stopping criterion not met do
       • sample a minibatch of m examples $\{x^{(i)}\}_{i=1 \ldots m}$ from the training set, with corresponding targets $\{y^{(i)}\}_{i=1 \ldots m}$
       • compute gradient estimate: $g \leftarrow \frac{1}{m} \sum_i \nabla_\theta L(f_\theta(x^{(i)}), y^{(i)})$
       • apply update: $\theta \leftarrow \theta - \alpha g$
     • end while
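The algorithm above can be sketched for a one-dimensional linear model with squared loss. The model, the loss, the synthetic data, and the fixed step count used in place of a stopping criterion are all assumptions for this example.

```python
import random


def minibatch_gd(data, theta, alpha=0.05, m=4, steps=2000, seed=0):
    """Mini-batch gradient descent for the model w*x + b with squared loss.

    theta = [w, b]; a fixed number of steps stands in for the stopping criterion.
    """
    rng = random.Random(seed)
    w, b = theta
    for _ in range(steps):
        batch = rng.sample(data, m)  # sample a minibatch of m examples
        # gradient estimate g = (1/m) * sum of per-example gradients
        gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / m
        gb = sum(2.0 * (w * x + b - y) for x, y in batch) / m
        # apply update: theta <- theta - alpha * g
        w, b = w - alpha * gw, b - alpha * gb
    return [w, b]


# Synthetic noiseless data from y = 3x - 1 on x in [-1, 1]
data = [(x / 10.0, 3.0 * (x / 10.0) - 1.0) for x in range(-10, 11)]
w, b = minibatch_gd(data, theta=[0.0, 0.0])
```

Because the data here are noiseless, the estimate converges to the generating parameters; with noisy data the iterates would keep fluctuating around the minimum unless the learning rate is decayed.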
  21. Problems with GD
     • GD can be very slow
     • Can get stuck in local minima or saddle points
     • If the loss changes quickly in one direction and slowly in another, GD makes slow progress along the shallow dimension and jitters along the steep direction
     [Figure: the loss function has a high condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large]
  22. Momentum
     • Momentum is designed to accelerate learning, especially for high curvature, small but consistent gradients, or noisy gradients
     • New variable: velocity $v$ (direction and speed at which the parameters move)
       • decaying average of the gradient
     Algorithm
     • Require: parameter $\theta$, learning rate $\alpha$, momentum parameter $\lambda$, initial velocity $v$
     • Update rule ($g$ is the gradient estimate):
       • compute velocity update: $v \leftarrow \lambda v - \alpha g$
       • apply update: $\theta \leftarrow \theta + v$
     • Typical values: $v_0 = 0$; $\lambda$ = 0.5, 0.9, 0.99 (in $[0, 1)$)
     • Read the physical analogy in the Deep Learning book (Goodfellow et al.): velocity = momentum of a unit-mass particle
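A minimal sketch of the momentum update rule, applied to an ill-conditioned quadratic. The test function `f(x, y) = 0.5 * (x**2 + 25 * y**2)` and the hyperparameter values are assumptions chosen for illustration.

```python
def momentum_step(theta, v, grad, alpha=0.1, lam=0.9):
    """One momentum update: v <- lam*v - alpha*g, then theta <- theta + v."""
    g = grad(theta)
    v = [lam * vi - alpha * gi for vi, gi in zip(v, g)]
    theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta, v


def grad(theta):
    """Gradient of the assumed quadratic f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]


theta, v = [1.0, 1.0], [0.0, 0.0]
first_theta, first_v = momentum_step(theta, v, grad, alpha=0.02, lam=0.9)
for _ in range(300):
    theta, v = momentum_step(theta, v, grad, alpha=0.02, lam=0.9)
```

The velocity accumulates along the shallow `x` direction, where successive gradients agree, while contributions along the steep `y` direction partly cancel as the sign of the gradient alternates.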
  23. Nesterov accelerated gradient (NAG)
     • A variant of momentum, where the gradient is evaluated after the current velocity is applied:
       • Approximate where the parameters will be on the next time step using the current velocity
       • Update the velocity using the gradient at the point where we predict the parameters will be
     Algorithm
     • Require: parameter $\theta$, learning rate $\alpha$, momentum parameter $\lambda$, velocity $v$
     • Update:
       • apply interim update: $\tilde{\theta} \leftarrow \theta + \lambda v$
       • compute gradient (at interim point): $g \leftarrow \frac{1}{m} \sum_i \nabla_{\tilde{\theta}} L(f_{\tilde{\theta}}(x^{(i)}), y^{(i)})$
       • compute velocity update: $v \leftarrow \lambda v - \alpha g$
       • apply update: $\theta \leftarrow \theta + v$
     • Interpretation: add a correction factor to momentum
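The NAG update differs from plain momentum only in where the gradient is evaluated. A minimal sketch, assuming the one-dimensional objective `f(x) = 0.5 * x**2` and illustrative hyperparameters:

```python
def nesterov_step(theta, v, grad, alpha=0.1, lam=0.9):
    """One NAG update with the gradient evaluated at the interim point."""
    interim = [ti + lam * vi for ti, vi in zip(theta, v)]  # theta~ <- theta + lam*v
    g = grad(interim)                                      # gradient at interim point
    v = [lam * vi - alpha * gi for vi, gi in zip(v, g)]    # v <- lam*v - alpha*g
    theta = [ti + vi for ti, vi in zip(theta, v)]          # theta <- theta + v
    return theta, v


def grad(theta):
    """Gradient of the assumed objective f(x) = 0.5 * x**2."""
    return [theta[0]]


# Interim point is 1.0 + 0.5 * (-0.1) = 0.95, so the gradient used is 0.95
# (plain momentum would use the gradient at 1.0 instead).
nag_theta, nag_v = nesterov_step([1.0], [-0.1], grad, alpha=0.1, lam=0.5)
```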
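  24. Nesterov accelerated gradient (NAG)
     [Figure: momentum vs Nesterov update. Momentum: current location $w_t$, velocity $v_t$, gradient $\nabla L(w_t)$, new velocity $v_{t+1}$. Nesterov: predicted location based on velocity alone $w_t + \gamma v_t$, gradient $\nabla L(w_t + \gamma v_t)$, velocity $v_t$, new velocity $v_{t+1}$.]
     Slide credit: K. McGuinness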
  25. GD: learning rate
     • Learning rate is a crucial parameter for GD
       • Too large: overshoots the local minimum, loss increases
       • Too small: makes very slow progress, can get stuck
       • Good learning rate: makes steady progress toward the local minimum
     [Figure: trajectories with a learning rate that is too small and one that is too large]
  26. GD: learning rate decay
     • In practice it is necessary to gradually decrease the learning rate to speed up training
       • step decay (e.g. reduce by half every few epochs)
       • exponential decay: $\alpha = \alpha_0 e^{-kt}$
       • 1/t decay: $\alpha = \frac{\alpha_0}{1 + kt}$
       • manual decay
       ($k$ decay rate, $t$ iteration number, $\alpha_0$ initial learning rate)
     • Sufficient conditions for convergence: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$
     • Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
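The decay schedules can be written as small functions of the iteration number `t`. The parameter names `drop` and `every` for step decay are assumptions for this sketch; the exponential and 1/t forms follow the formulas on the slide.

```python
import math


def step_decay(alpha0, t, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` iterations."""
    return alpha0 * drop ** (t // every)


def exponential_decay(alpha0, t, k):
    """alpha = alpha0 * exp(-k * t)"""
    return alpha0 * math.exp(-k * t)


def inverse_time_decay(alpha0, t, k):
    """alpha = alpha0 / (1 + k * t)"""
    return alpha0 / (1.0 + k * t)


for t in (0, 10, 20, 40):
    print(t, step_decay(0.1, t), exponential_decay(0.1, t, 0.05),
          inverse_time_decay(0.1, t, 0.05))
```

Of the three, only the 1/t schedule satisfies both sufficient conditions for convergence: its sum diverges while the sum of its squares is finite. The exponential and step schedules have a finite sum.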
  27. Adaptive learning rates
     • The cost is often sensitive to some directions and insensitive to others
       • Momentum/Nesterov mitigate this issue but introduce another hyperparameter
     • Solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
     • Algorithms (mini-batch based)
       • AdaGrad
       • RMSProp
       • Adam
  28. AdaGrad
     • Adapts the learning rate of each parameter based on the sizes of previous updates:
       • scales updates to be larger for parameters that are updated less
       • scales updates to be smaller for parameters that are updated more
     • The net effect is greater progress in the more gently sloped directions of parameter space
     • Require: parameter $\theta$, learning rate $\alpha$, small constant $\delta$ (e.g. $10^{-7}$) for numerical stability
     • Update:
       • accumulate squared gradient: $r \leftarrow r + g \odot g$ (sum of all previous squared gradients)
       • compute update: $\Delta\theta \leftarrow -\frac{\alpha}{\delta + \sqrt{r}} \odot g$ (updates inversely proportional to the square root of the sum; $\odot$ is elementwise multiplication)
       • apply update: $\theta \leftarrow \theta + \Delta\theta$
     Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
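A minimal sketch of one AdaGrad step on plain Python lists. The gradient values in the example are assumptions chosen to show the per-parameter scaling.

```python
import math


def adagrad_step(theta, r, g, alpha=0.1, delta=1e-7):
    """One AdaGrad update; returns the new parameters and accumulator."""
    r = [ri + gi * gi for ri, gi in zip(r, g)]            # r <- r + g*g (elementwise)
    delta_theta = [-alpha / (delta + math.sqrt(ri)) * gi  # -alpha / (delta + sqrt(r)) * g
                   for ri, gi in zip(r, g)]
    theta = [ti + di for ti, di in zip(theta, delta_theta)]
    return theta, r


# Two parameters whose gradients differ by a factor of 100
theta, r = adagrad_step([0.0, 0.0], [0.0, 0.0], g=[10.0, 0.1])
```

On this first step both parameters move by about `-alpha`, even though their gradients differ by two orders of magnitude. Because `r` only grows, later steps shrink for every parameter, which is the premature decrease in learning rate that RMSProp addresses.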
  29. Root Mean Square Propagation (RMSProp)
     • AdaGrad can result in a premature and excessive decrease in the effective learning rate
     • RMSProp modifies AdaGrad to perform better on non-convex surfaces
     • Replaces the gradient accumulation with an exponentially decaying average of the squared gradients
     • Require: parameter $\theta$, learning rate $\alpha$, decay rate $\rho$, small constant $\delta$ (e.g. $10^{-7}$)
     • Update:
       • accumulate squared gradient: $r \leftarrow \rho r + (1 - \rho) g \odot g$
       • compute update: $\Delta\theta \leftarrow -\frac{\alpha}{\delta + \sqrt{r}} \odot g$
       • apply update: $\theta \leftarrow \theta + \Delta\theta$
     Geoff Hinton, unpublished
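A sketch of one RMSProp step, identical to the AdaGrad step except for the accumulator. The constant gradient used in the example is an assumption for illustration.

```python
import math


def rmsprop_step(theta, r, g, alpha=0.01, rho=0.9, delta=1e-7):
    """One RMSProp update; returns the new parameters and accumulator."""
    r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]  # decaying average
    delta_theta = [-alpha / (delta + math.sqrt(ri)) * gi
                   for ri, gi in zip(r, g)]
    theta = [ti + di for ti, di in zip(theta, delta_theta)]
    return theta, r


theta, r = rmsprop_step([0.0], [0.0], g=[2.0])
first_theta, first_r = theta, r
for _ in range(200):
    theta, r = rmsprop_step(theta, r, g=[2.0])  # same gradient at every step
```

With a constant gradient the accumulator settles near `g**2` instead of growing without bound, so the effective step size levels off at about `alpha` rather than decaying to zero.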
  30. ADAptive Moments (Adam)
     • Combination of RMSProp and momentum, but:
       • Keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
       • Includes bias corrections (first and second moments) to account for their initialization at the origin
     Update (operations applied elementwise; $t$ is the time step, starting at 1):
     • update biased first moment estimate: $s \leftarrow \rho_1 s + (1 - \rho_1) g$
     • update biased second moment estimate: $r \leftarrow \rho_2 r + (1 - \rho_2) g \odot g$
     • correct biases: $\hat{s} \leftarrow \frac{s}{1 - \rho_1^t}$, $\hat{r} \leftarrow \frac{r}{1 - \rho_2^t}$
     • compute update: $\Delta\theta \leftarrow -\alpha \frac{\hat{s}}{\delta + \sqrt{\hat{r}}}$
     • apply update: $\theta \leftarrow \theta + \Delta\theta$
     Typical values: $\delta = 10^{-8}$, $\rho_1 = 0.9$, $\rho_2 = 0.999$
     Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015
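A minimal sketch of one Adam step on Python lists, with the bias correction depending on the step counter `t` (starting at 1). The gradient value in the example is an assumption for illustration.

```python
import math


def adam_step(theta, s, r, g, t, alpha=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update at time step t (t = 1 for the first step)."""
    s = [rho1 * si + (1.0 - rho1) * gi for si, gi in zip(s, g)]       # first moment
    r = [rho2 * ri + (1.0 - rho2) * gi * gi for ri, gi in zip(r, g)]  # second moment
    s_hat = [si / (1.0 - rho1 ** t) for si in s]                      # bias correction
    r_hat = [ri / (1.0 - rho2 ** t) for ri in r]
    theta = [ti - alpha * sh / (delta + math.sqrt(rh))
             for ti, sh, rh in zip(theta, s_hat, r_hat)]
    return theta, s, r


theta, s, r = adam_step([0.0], [0.0], [0.0], g=[5.0], t=1)
```

On the first step the corrected moments are `s_hat = g` and `r_hat = g**2`, so the parameter moves by almost exactly `alpha` regardless of the gradient scale. Without the correction, the zero-initialized averages would bias both moment estimates toward zero during the early steps.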
  31. Example: test function
     [Figure: optimizer trajectories on Beale's function. Image credit: Alec Radford.]
  32. Example: saddle point
     [Figure: optimizer trajectories around a saddle point. Image credit: Alec Radford.]
  33. Initialization - Normalization
  34. Parameter initialization
     • Weights
       • Can't initialize weights to 0 (gradients would be 0)
       • Can't initialize all weights to the same value (all hidden units in a layer will always behave the same; need to break symmetry)
       • Small random numbers, e.g. from a uniform or Gaussian distribution
         • if weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
       • Xavier initialization (calibrates variances; for tanh, scale by sqrt(1/n))
         • each neuron: w = randn(n) / sqrt(n), n inputs
       • He initialization (for ReLU, scale by sqrt(2/n))
         • each neuron: w = randn(n) * sqrt(2.0 / n), n inputs
     • Biases
       • initialize all to 0 (except for the output unit for skewed distributions; 0.01 to avoid saturating ReLU)
     • Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs, or by a model trained on an unrelated task
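The two per-neuron formulas can be sketched with the standard library, using `random.gauss` in place of `randn`. The helper names `xavier_init` and `he_init`, the seed, and the layer size are assumptions for this example.

```python
import math
import random


def xavier_init(n_inputs, rng):
    """Weights for one neuron: standard normal samples scaled by sqrt(1/n)."""
    return [rng.gauss(0.0, 1.0) / math.sqrt(n_inputs) for _ in range(n_inputs)]


def he_init(n_inputs, rng):
    """Weights for one neuron: standard normal samples scaled by sqrt(2/n)."""
    return [rng.gauss(0.0, 1.0) * math.sqrt(2.0 / n_inputs) for _ in range(n_inputs)]


rng = random.Random(0)
n = 10000
w_xavier = xavier_init(n, rng)
w_he = he_init(n, rng)

# Empirical second moments: close to 1/n and 2/n respectively
var_xavier = sum(w * w for w in w_xavier) / n
var_he = sum(w * w for w in w_he) / n
```

Scaling by the number of inputs keeps the variance of the weighted sum roughly independent of `n`; the extra factor of 2 in He initialization compensates for ReLU zeroing about half of the inputs.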
  35. Normalizing inputs
     • Normalizing inputs to speed up learning
       • For input layers: data preprocessing (mean = 0, std = 1)
       • For hidden layers: batch normalization
     [Figure: original data; data with mean = 0; data with mean = 0 and std = 1. Loss surface for unnormalized data vs loss surface for normalized data.]
  36. Batch normalization
     • As learning progresses, the distribution of the layer inputs changes due to parameter updates (internal covariate shift)
     • This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
     • Batch normalization is a technique to reduce this effect
       • Explicitly forces the layer inputs to have zero mean and unit variance w.r.t. running batch estimates
       • Adds a learnable scale and bias term to allow the network to still use the nonlinearity
     [Figure: FC / Conv → Batch norm → ReLU → FC / Conv → Batch norm → ReLU]
     Ioffe and Szegedy, 2015. “Batch normalization: accelerating deep network training by reducing internal covariate shift”
  37. Batch normalization
     • Can be applied to any input or hidden layer
     • For a mini-batch $B = \{x_i\}_{i=1 \ldots m}$ of m inputs to the layer, each of dimension D:
       1. Compute the empirical mean and variance for each dimension:
          $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$, $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$
       2. Normalize: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$
       3. Scale and shift: $y_i = \gamma \hat{x}_i + \beta$ (two learnable parameters, $\gamma$ and $\beta$)
     Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to the linear regime). To recover the identity mapping, the network can learn $\gamma = \sqrt{\sigma_B^2 + \varepsilon}$ and $\beta = \mu_B$. Then $y_i = x_i$.
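The three steps can be sketched for a single feature dimension over a mini-batch of scalars. The function name `batch_norm`, the value of `eps`, and the toy batch are assumptions for this example; only the training-time computation is shown.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature dimension over a mini-batch of scalars."""
    m = len(batch)
    mu = sum(batch) / m                                       # 1. empirical mean
    var = sum((x - mu) ** 2 for x in batch) / m               #    and variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]  # 2. normalize
    y = [gamma * xh + beta for xh in x_hat]                   # 3. scale and shift
    return y, mu, var


batch = [1.0, 2.0, 3.0, 4.0]
y, mu, var = batch_norm(batch)

# Choosing gamma = sqrt(var + eps) and beta = mu recovers the identity mapping
identity, _, _ = batch_norm(batch, gamma=math.sqrt(var + 1e-5), beta=mu)
```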
  38. Batch normalization
     Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the hidden layer's activations within that mini-batch, which has a slight regularization effect.
     Benefits:
     • Improves gradient flow through the network
     • Allows higher learning rates
     • Reduces the strong dependency on initialization
     • Reduces the need for regularization
     At test time BN layers function differently:
     • Mean and std are not computed on the batch
     • Instead, a single fixed empirical mean and std computed during training is used (can be estimated with decaying weighted averages)
  39. Summary
     • Optimization for NN is different from pure optimization:
       • GD with mini-batches
       • early stopping
       • non-convex surface, saddle points
     • Learning rate has a significant impact on model performance
     • Several extensions to GD can improve convergence
     • Adaptive learning-rate methods are likely to achieve the best results
       • RMSProp, Adam
     • Weight initialization: He, w = randn(n) * sqrt(2/n)
     • Batch normalization to reduce the internal covariate shift
  40. Bibliography
     • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
     • Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
     • Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
     • Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
     • Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
     • Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
     • Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
     • Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
     • Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.