Backpropagation in Convolutional Neural Network

  • I have started programming the backpropagation procedure and I ran into a particular problem. I plan to train the network I have built by doing one complete backward pass after accumulating errors for a complete batch. But the max pooling layer contains only the maximum location for the last processed sample of the batch. If the maximum value location changes from sample to sample, will backpropagation using the maximum value location from the last sample introduce errors? Thanks. Jean
  • Thanks a lot for the clarification. Regards, Jean
  • @Jean Vézina You are correct that we don't need the flip for forward propagation through convolution layers, as on page 9; the link explains the forward propagation only. As you said, flip or rot90() is just for MATLAB's conv2(). However, we do need it for backward propagation, as on page 10; this is independent of MATLAB. I am sorry that the link confused you.
  • Dear Mr Kuwajima, I checked the link you supplied and the flip operation is done only in order to follow the MATLAB convolution function conventions. If we don't use MATLAB, then the flip is unnecessary.


1. Memo: Backpropagation in Convolutional Neural Network
   Hiroshi Kuwajima
   Created 13-03-2014, revised 14-08-2014

2. Note
   - Purpose
     The purpose of this memo is to understand and recall the backpropagation algorithm in
     Convolutional Neural Networks, based on a discussion with Prof. Masayuki Tanaka.
   - Table of Contents
     In this memo, backpropagation algorithms in different neural networks are explained in the
     following order:
       - Single neuron (slide 3)
       - Multi-layer neural network (slide 5)
       - General cases (slide 7)
       - Convolution layer (slide 9)
       - Pooling layer (slide 11)
       - Convolutional Neural Network (slide 13)
   - Notation
     This memo follows the notation in the UFLDL tutorial (http://ufldl.stanford.edu/tutorial).

3. Neural Network as a Composite Function
   A neural network is decomposed into a composite function in which each function element
   corresponds to a differentiable operation.

   - Single neuron (the simplest neural network) example
     A single neuron is decomposed into a composite function of an affine function element,
     parameterized by W and b, and an activation function element f, which we choose to be the
     sigmoid function:

       h_{W,b}(x) = f(W^T x + b) = sigmoid(affine_{W,b}(x)) = (sigmoid ∘ affine_{W,b})(x)

     Derivatives of both the affine and the sigmoid function elements w.r.t. both inputs and
     parameters are known. Note that the sigmoid function has neither parameters nor derivatives
     w.r.t. parameters. The sigmoid function is applied element-wise; '•' denotes the Hadamard
     (element-wise) product.

       ∂a/∂z = a • (1 - a),  where a = h_{W,b}(x) = sigmoid(z) = 1 / (1 + exp(-z))
       ∂z/∂x = W,  ∂z/∂W = x,  ∂z/∂b = I,  where z = affine_{W,b}(x) = W^T x + b and I is the identity matrix

   [Figure: a neuron in the standard network representation (inputs x1, x2, x3, +1 feeding
    h_{W,b}(x)) and in the composite function representation (Affine, then Activation, e.g.
    sigmoid, with intermediate variables z and a).]

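   As a concrete check of this decomposition, here is a minimal NumPy sketch of the single-neuron
   forward pass and the element derivatives listed above; the helper names and the toy values of
   x, W, and b are illustrative, not taken from the memo.

    import numpy as np

    def affine(x, W, b):
        """Affine function element: z = W^T x + b."""
        return W.T @ x + b

    def sigmoid(z):
        """Sigmoid activation applied element-wise: a = 1 / (1 + exp(-z))."""
        return 1.0 / (1.0 + np.exp(-z))

    # Single neuron as a composite function h_{W,b}(x) = (sigmoid o affine_{W,b})(x).
    x = np.array([0.5, -1.0, 2.0])          # 3 inputs
    W = np.array([[0.1], [0.2], [-0.3]])    # weights, shape (3, 1)
    b = np.array([0.05])                    # bias

    z = affine(x, W, b)      # pre-activation
    a = sigmoid(z)           # activation (network output)

    # Known derivatives of the function elements, as on this slide:
    da_dz = a * (1.0 - a)    # d(sigmoid)/dz in element-wise (Hadamard) form
    # dz/dx = W, dz/dW = x, dz/db = I
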
4. Chain Rule of Error Signals and Gradients
   Error signals are defined as the derivatives of any cost function J, which we choose to be the
   square error. Error signals are computed (propagated backward) by the chain rule of derivatives
   and are useful for computing the gradient of the cost function.

   - Single neuron example
     Suppose we have m labeled training examples {(x^(1), y^(1)), ..., (x^(m), y^(m))}. The square
     error cost function for each example is given below; the overall cost function is the sum of
     the per-example cost functions.

       J(W,b; x, y) = (1/2) ||y - h_{W,b}(x)||^2

     Error signals of the square error cost function for each example are propagated using the
     derivatives of the function elements w.r.t. their inputs:

       δ^(a) = ∂J(W,b; x, y)/∂a = -(y - a)
       δ^(z) = ∂J(W,b; x, y)/∂z = (∂J/∂a)(∂a/∂z) = δ^(a) • a • (1 - a)

     The gradient of the cost function w.r.t. the parameters for each example is computed using the
     error signals and the derivatives of the function elements w.r.t. their parameters. Summing
     the per-example gradients gives the overall gradient.

       ∇_W J(W,b; x, y) = ∂J(W,b; x, y)/∂W = (∂J/∂z)(∂z/∂W) = δ^(z) x^T
       ∇_b J(W,b; x, y) = ∂J(W,b; x, y)/∂b = (∂J/∂z)(∂z/∂b) = δ^(z)

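   A matching NumPy sketch of the error signals and gradients for one example; the toy values of
   x, W, b, and the target y are made up, and the outer product is the slide's δ^(z) x^T written
   so that it has the same shape as W in this sketch.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0, 2.0])
    W = np.array([[0.1], [0.2], [-0.3]])
    b = np.array([0.05])
    y = np.array([1.0])                      # target for this single example

    # Forward pass (the composite function of the previous slide).
    z = W.T @ x + b
    a = sigmoid(z)

    # Square-error cost J(W,b;x,y) = 1/2 ||y - h_{W,b}(x)||^2.
    J = 0.5 * np.sum((y - a) ** 2)

    # Error signals, propagated backward by the chain rule.
    delta_a = -(y - a)                       # dJ/da
    delta_z = delta_a * a * (1.0 - a)        # dJ/dz = (dJ/da)(da/dz)

    # Gradients w.r.t. the parameters of the affine element.
    grad_W = np.outer(x, delta_z)            # same shape as W
    grad_b = delta_z
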
5. Decomposition of Multi-Layer Neural Network
   - Composite function representation of a multi-layer neural network

       h_{W,b}(x) = (sigmoid ∘ affine_{W^(2),b^(2)} ∘ sigmoid ∘ affine_{W^(1),b^(1)})(x)
       a^(1) = x,  a^(l_max) = h_{W,b}(x)

   - Derivatives of the function elements w.r.t. inputs and parameters

       ∂a^(l+1)/∂z^(l+1) = a^(l+1) • (1 - a^(l+1)),  where a^(l+1) = sigmoid(z^(l+1)) = 1 / (1 + exp(-z^(l+1)))
       ∂z^(l+1)/∂a^(l) = W^(l),  ∂z^(l+1)/∂W^(l) = a^(l),  ∂z^(l+1)/∂b^(l) = I,  where z^(l+1) = (W^(l))^T a^(l) + b^(l)

   [Figure: a two-layer network in the standard network representation (Layer 1, Layer 2) and in
    the composite function representation (Affine 1, Sigmoid 1, Affine 2, Sigmoid 2), with
    intermediate variables z^(2), a^(2), z^(3), a^(3).]

6. Error Signals and Gradients in Multi-Layer NN
   - Error signals of the square error cost function for each example

       δ^(a(l)) = ∂J(W,b; x, y)/∂a^(l)
                = -(y - a^(l))                                              for l = l_max
                = (∂J/∂z^(l+1))(∂z^(l+1)/∂a^(l)) = (W^(l))^T δ^(z(l+1))     otherwise
       δ^(z(l)) = ∂J(W,b; x, y)/∂z^(l) = (∂J/∂a^(l))(∂a^(l)/∂z^(l)) = δ^(a(l)) • a^(l) • (1 - a^(l))

   - Gradient of the cost function w.r.t. the parameters for each example

       ∇_{W^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂W^(l)) = δ^(z(l+1)) (a^(l))^T
       ∇_{b^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂b^(l)) = δ^(z(l+1))

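   A minimal two-layer sketch of slides 5 and 6 under the same square-error cost; the layer sizes,
   random initialization, and variable names are illustrative only, and the arrays are arranged so
   that each gradient has the same shape as its parameter.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    x = rng.normal(size=3)
    y = np.array([1.0])

    W1 = rng.normal(size=(3, 3)) * 0.1      # W(1): a(1) (size 3) -> z(2) (size 3)
    b1 = np.zeros(3)
    W2 = rng.normal(size=(3, 1)) * 0.1      # W(2): a(2) (size 3) -> z(3) (size 1)
    b2 = np.zeros(1)

    # Forward propagation: a(1) = x, z(l+1) = W(l)^T a(l) + b(l), a(l+1) = sigmoid(z(l+1)).
    a1 = x
    z2 = W1.T @ a1 + b1
    a2 = sigmoid(z2)
    z3 = W2.T @ a2 + b2
    a3 = sigmoid(z3)                        # a(l_max) = h_{W,b}(x)

    # Backward propagation of error signals.
    delta_a3 = -(y - a3)                    # top-layer error signal
    delta_z3 = delta_a3 * a3 * (1 - a3)
    delta_a2 = W2 @ delta_z3                # propagate through the affine element of layer 2
    delta_z2 = delta_a2 * a2 * (1 - a2)

    # Per-example gradients (outer products match the shapes of W1 and W2 used here).
    grad_W2 = np.outer(a2, delta_z3)
    grad_b2 = delta_z3
    grad_W1 = np.outer(a1, delta_z2)
    grad_b1 = delta_z2
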
7. Backpropagation in General Cases
   1. Decompose the operations in the layers of a neural network into function elements whose
      derivatives w.r.t. inputs are known by symbolic computation.
   2. Backpropagate the error signals corresponding to a differentiable cost function by numerical
      computation (starting from the cost function, plug in error signals backward).
   3. Use the backpropagated error signals to compute gradients w.r.t. parameters, only for the
      function elements that have parameters and whose derivatives w.r.t. parameters are known by
      symbolic computation.
   4. Sum the gradients over all examples to get the overall gradient.

       h_θ(x) = (f^(l_max) ∘ … ∘ f^(l)_{θ^(l)} ∘ … ∘ f^(2)_{θ^(2)} ∘ f^(1))(x),
                where f^(1) = x, f^(l_max) = h_θ(x), and ∂f^(l+1)/∂f^(l) is known for all l
       δ^(l) = ∂J(θ; x, y)/∂f^(l) = (∂J/∂f^(l+1))(∂f^(l+1)/∂f^(l)) = δ^(l+1) ∂f^(l+1)/∂f^(l),  where ∂J/∂f^(l_max) is known
       ∇_{θ^(l)} J(θ; x, y) = ∂J(θ; x, y)/∂θ^(l) = (∂J/∂f^(l))(∂f^(l)_{θ^(l)}/∂θ^(l)) = δ^(l) ∂f^(l)_{θ^(l)}/∂θ^(l),  where ∂f^(l)_{θ^(l)}/∂θ^(l) is known
       ∇_{θ^(l)} J(θ) = Σ_{i=1}^{m} ∇_{θ^(l)} J(θ; x^(i), y^(i))

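   One possible way to phrase this recipe as code. This is a sketch under the assumption (not from
   the memo) that each function element is an object exposing forward(input), backward(delta)
   returning dJ/d(input) given dJ/d(output), and param_grads(delta) when it has parameters.

    def backprop(layers, x, y, cost_grad):
        """Run steps 1-3 for a single example; return per-layer parameter gradients."""
        # Forward pass; each layer is assumed to cache what it needs for its backward pass.
        out = x
        for layer in layers:
            out = layer.forward(out)

        # Step 2: start from the cost derivative and plug error signals in backward.
        delta = cost_grad(out, y)                        # dJ/d f(l_max)
        grads = {}
        for layer in reversed(layers):
            if hasattr(layer, "param_grads"):
                grads[layer] = layer.param_grads(delta)  # step 3, only where parameters exist
            delta = layer.backward(delta)                # chain rule: delta(l) = delta(l+1) df(l+1)/df(l)
        return grads

    # Step 4, over a batch: sum the per-example gradients returned by backprop().
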
8. Convolutional Neural Network
   A convolution-pooling layer in a Convolutional Neural Network is a composite function decomposed
   into the function elements f^(conv), f^(sigm), and f^(pool). Let x be the output from the
   previous layer. The sigmoid nonlinearity is optional.

       (f^(pool) ∘ f^(sigm) ∘ f^(conv)_w)(x)

   [Figure: forward propagation x through Convolution, Sigmoid, Pooling, and backward propagation
    through the same elements in reverse.]

9. Derivatives of Convolution
   - Discrete convolution parameterized by a feature w, and its derivatives
     Let x be the input and y the output of the convolution layer. Here we focus on a single
     feature vector w, although a convolution layer usually has multiple features
     W = [w_1 w_2 … w_n]. n indexes x and y, where 1 ≤ n ≤ |x| for x_n and
     1 ≤ n ≤ |y| = |x| - |w| + 1 for y_n; i indexes w, where 1 ≤ i ≤ |w|. (f * g)[n] denotes the
     n-th element of f * g.

       y = x * w = [y_n],  y_n = (x * w)[n] = Σ_{i=1}^{|w|} x_{n+i-1} w_i = w^T x_{n:n+|w|-1}
       ∂y_{n-i+1}/∂x_n = w_i,  ∂y_n/∂w_i = x_{n+i-1}  for 1 ≤ i ≤ |w|

   [Figure: y_n has incoming connections from x_{n:n+|w|-1}. From the standpoint of a fixed x_n,
    x_n has outgoing connections to y_{n-|w|+1:n}, i.e., all of y_{n-|w|+1:n} have derivatives
    w.r.t. x_n. Note that the y and w indices run in reverse order.]

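   The definition and the derivative ∂y_n/∂w_i can be checked numerically in a few lines of NumPy.
   conv_valid below implements the slide's (un-flipped) convolution, and the toy x and w are
   arbitrary.

    import numpy as np

    def conv_valid(x, w):
        """The slide's 'valid' convolution: y[n] = sum_i x[n+i-1] w[i] (1-based indices).
        Note this applies no kernel flip, matching the slide's definition."""
        lw = len(w)
        return np.array([x[n:n + lw] @ w for n in range(len(x) - lw + 1)])

    x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
    w = np.array([0.2, -0.5, 0.1])

    y = conv_valid(x, w)                        # |y| = |x| - |w| + 1 = 3
    assert np.allclose(y, np.correlate(x, w, mode="valid"))

    # dy_n/dw_i = x_{n+i-1}: perturbing w_i changes y_n by x_{n+i-1} * eps.
    eps = 1e-6
    i = 1                                       # 0-based index into w
    w_pert = w.copy(); w_pert[i] += eps
    num = (conv_valid(x, w_pert) - y) / eps
    print(num)                                  # approximately x[i : i + |y|] = [2.0, -1.0, 0.5]
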
10. Backpropagation in Convolution Layer
    Error signals and the gradient for each example are computed by convolution, using the
    commutativity of convolution and the multivariable chain rule. We focus on single elements of
    the error signals and of the gradient w.r.t. w.

       δ^(x)_n = ∂J/∂x_n = (∂J/∂y)(∂y/∂x_n) = Σ_{i=1}^{|w|} (∂J/∂y_{n-i+1})(∂y_{n-i+1}/∂x_n)
               = Σ_{i=1}^{|w|} δ^(y)_{n-i+1} w_i = (δ^(y) * flip(w))[n]      (reverse-order linear combination)
       δ^(x) = [δ^(x)_n] = δ^(y) * flip(w)                                   (full convolution)

       ∂J/∂w_i = (∂J/∂y)(∂y/∂w_i) = Σ_{n=1}^{|x|-|w|+1} (∂J/∂y_n)(∂y_n/∂w_i) = Σ_{n=1}^{|x|-|w|+1} δ^(y)_n x_{n+i-1} = (δ^(y) * x)[i]
       ∂J/∂w = [∂J/∂w_i] = δ^(y) * x = x * δ^(y)                             (valid convolution)

    [Figure: forward propagation x * w = y (valid convolution); backward propagation
     δ^(x) = flip(w) * δ^(y) (full convolution); gradient computation x * δ^(y) = ∂J/∂w
     (valid convolution).]

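    A small NumPy sketch of both formulas, with a numerical check of ∂J/∂w. The cost
    J = Σ_n c_n y_n and its coefficients c are made up purely so that δ^(y) is easy to write down.

    import numpy as np

    def conv_valid(x, w):
        lw = len(w)
        return np.array([x[n:n + lw] @ w for n in range(len(x) - lw + 1)])

    def conv_full(x, w):
        """'Full' convolution in the slide's (no-flip) sense: pad x with |w|-1 zeros per side."""
        return conv_valid(np.pad(x, len(w) - 1), w)

    x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
    w = np.array([0.2, -0.5, 0.1])
    y = conv_valid(x, w)

    c = np.array([0.3, -1.0, 0.7])              # J = sum(y * c), so dJ/dy = c
    delta_y = c

    delta_x = conv_full(delta_y, w[::-1])       # delta_x = delta_y * flip(w)   (full convolution)
    grad_w  = conv_valid(x, delta_y)            # dJ/dw   = x * delta_y         (valid convolution)

    # Numerical check of dJ/dw.
    eps = 1e-6
    num = np.array([(np.sum(conv_valid(x, w + eps * np.eye(len(w))[i]) * c) - np.sum(y * c)) / eps
                    for i in range(len(w))])
    assert np.allclose(grad_w, num, atol=1e-4)
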
11. Derivatives of Pooling
    The pooling layer subsamples statistics to obtain summary statistics with an aggregate function
    (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation
    like convolution; however, g is applied to disjoint (non-overlapping) regions.

    - Definition: subsample (or downsample)
      Let m be the size of the pooling region, x the input, and y the output of the pooling layer.
      subsample(x, g)[n] denotes the n-th element of subsample(x, g).

        y_n = subsample(x, g)[n] = g(x_{(n-1)m+1:nm}),  y = subsample(x, g) = [y_n]

      Possible aggregate functions g and their derivatives:

        mean pooling: g(x) = (Σ_{k=1}^{m} x_k) / m,  ∂g/∂x = 1/m
        max pooling:  g(x) = max(x),  ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise
        Lp pooling:   g(x) = ||x||_p = (Σ_{k=1}^{m} x_k^p)^{1/p},  ∂g/∂x_i = (Σ_{k=1}^{m} x_k^p)^{1/p-1} x_i^{p-1}
        or any other differentiable R^m → R function

    [Figure: pooling applies g to each region x_{(n-1)m+1:nm} of size m to produce y_n.]

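    The three aggregate functions and the subsample operation might look like this in NumPy. This
    is a sketch: ties in max pooling are handled naively, and the example values are arbitrary.

    import numpy as np

    # Each aggregate returns (value, derivative vector dg/dx), as listed on this slide.
    def mean_pool(x):
        return np.mean(x), np.full_like(x, 1.0 / len(x))           # dg/dx_i = 1/m

    def max_pool(x):
        g = np.max(x)
        return g, (x == g).astype(float)                            # 1 at the max, 0 elsewhere

    def lp_pool(x, p=2):
        s = np.sum(x ** p)
        return s ** (1.0 / p), s ** (1.0 / p - 1.0) * x ** (p - 1)  # dg/dx_i

    def subsample(x, g, m):
        """y[n] = g(x[(n-1)m+1 : nm]) over disjoint regions of size m (1-based in the slide)."""
        return np.array([g(x[n * m:(n + 1) * m])[0] for n in range(len(x) // m)])

    x = np.array([1.0, 3.0, 2.0, 0.5, -1.0, 4.0])
    print(subsample(x, max_pool, m=2))      # [3.0, 2.0, 4.0]
    print(subsample(x, mean_pool, m=2))     # [2.0, 1.25, 1.5]
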
12. Backpropagation in Pooling Layer
    Error signals for each example are computed by upsampling. Upsampling is an operation which
    backpropagates (distributes) the error signals over the aggregate function g using its
    derivatives g′_n = ∂g/∂x_{(n-1)m+1:nm}. g′_n can change depending on the pooling region n.
    - In max pooling, the unit which was the max at forward propagation receives all the error at
      backward propagation, and that unit differs from region to region.

    - Definition: upsample
      upsample(f, g)[n] denotes the n-th element of upsample(f, g).

        δ^(x)_{(n-1)m+1:nm} = upsample(δ^(y), g′)[n] = δ^(y)_n g′_n = δ^(y)_n ∂g/∂x_{(n-1)m+1:nm}
                            = (∂J/∂y_n)(∂y_n/∂x_{(n-1)m+1:nm}) = ∂J/∂x_{(n-1)m+1:nm}
        δ^(x) = upsample(δ^(y), g′) = [δ^(x)_{(n-1)m+1:nm}]

    [Figure: forward propagation (subsampling) subsample(x, g) = y; backward propagation
     (upsampling) δ^(x) = upsample(δ^(y), g′).]

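    A matching sketch of upsample for max pooling, reusing the derivatives g′_n cached during the
    forward pass; the error signal δ^(y) is illustrative.

    import numpy as np

    def max_pool(x):
        g = np.max(x)
        return g, (x == g).astype(float)

    def upsample(delta_y, grads, m):
        """delta_x[(n-1)m+1 : nm] = delta_y[n] * g'_n, region by region."""
        delta_x = np.zeros(len(delta_y) * m)
        for n, (d, gprime) in enumerate(zip(delta_y, grads)):
            delta_x[n * m:(n + 1) * m] = d * gprime
        return delta_x

    # Forward pass over disjoint regions of size m, caching g'_n for the backward pass.
    x = np.array([1.0, 3.0, 2.0, 0.5, -1.0, 4.0])
    m = 2
    regions = [x[n * m:(n + 1) * m] for n in range(len(x) // m)]
    y, grads = zip(*(max_pool(r) for r in regions))

    # Backward pass: each region receives its error signal scaled by g'_n.
    delta_y = np.array([0.1, -0.2, 0.5])
    print(upsample(delta_y, grads, m))      # [0.0, 0.1, -0.2, 0.0, 0.0, 0.5]
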
13. Backpropagation in CNN (Summary)
    1. Propagate the error signals δ^(pool) through the pooling layer and the sigmoid:

         δ^(conv) = upsample(δ^(pool), g′) • f^(sigm) • (1 - f^(sigm))        (derivative of sigmoid)

    2. Propagate the error signals δ^(conv) through the convolution layer:

         δ^(x) = δ^(conv) * flip(w)                                           (full convolution)

    3. Compute the gradient ∇_w J:

         x * δ^(conv) = ∇_w J                                                 (valid convolution)

    [Figure: the three steps plugged into the Convolution, Sigmoid, Pooling pipeline, with the
     error signals δ^(pool), δ^(sigm), δ^(conv) flowing backward.]

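    Putting the three steps together for one convolution, sigmoid, mean-pooling layer. This is a
    sketch: the input, feature, pooling size, and incoming error signal δ^(pool) are all
    illustrative.

    import numpy as np

    def conv_valid(x, w):
        lw = len(w)
        return np.array([x[n:n + lw] @ w for n in range(len(x) - lw + 1)])

    def conv_full(x, w):
        return conv_valid(np.pad(x, len(w) - 1), w)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Forward pass through one convolution-pooling layer (mean pooling, region size m).
    x = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -0.5, 1.5, 0.2])
    w = np.array([0.2, -0.5, 0.1])
    m = 2

    z = conv_valid(x, w)                              # f(conv), length |x| - |w| + 1 = 6
    a = sigmoid(z)                                    # f(sigm)
    pooled = a.reshape(-1, m).mean(axis=1)            # f(pool), length 3

    # Error signal arriving from the layer above (illustrative values).
    delta_pool = np.array([0.3, -1.0, 0.7])

    # 1. Through the pooling layer and the sigmoid: upsample, then Hadamard with a(1-a).
    delta_conv = np.repeat(delta_pool / m, m) * a * (1.0 - a)

    # 2. Through the convolution layer: error signal for the previous layer's output.
    delta_x = conv_full(delta_conv, w[::-1])          # full convolution with flip(w)

    # 3. Gradient w.r.t. the feature w.
    grad_w = conv_valid(x, delta_conv)                # valid convolution
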
14. Remarks
    - References
      - UFLDL Tutorial, http://ufldl.stanford.edu/tutorial
      - Chain Rule of Neural Network is Error Back Propagation,
        http://like.silk.to/studymemo/ChainRuleNeuralNetwork.pdf
    - Acknowledgement
      This memo was written thanks to a good discussion with Prof. Masayuki Tanaka.
