Deep Feedforward
Networks
Lecture slides for Chapter 6 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last updated 2016-10-04
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
XOR is not linearly separable
Figure 6.1, left: the original x space (x1 vs. x2) with the four XOR input points; no single line separates the two classes, so we learn a new feature space h.
(Goodfellow 2017)
Rectified Linear Activation
Figure 6.3: the rectified linear activation, g(z) = max{0, z}, plotted as a function of z.
(Goodfellow 2017)
Network Diagrams
Figure 6.2: An example of a feedforward network, drawn in two different styles: one node per layer vector (x, h, y with weights W, w) and one node per unit (x1, x2, h1, h2, y).
(Goodfellow 2017)
Solving XOR
The activation function g is typically chosen to be a function applied element-wise, with $h_i = g(\mathbf{x}^\top W_{:,i} + c_i)$. In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function $g(z) = \max\{0, z\}$, depicted in figure 6.3.

We can now specify our complete network as
$$f(\mathbf{x}; W, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^\top \max\{0,\, W^\top \mathbf{x} + \mathbf{c}\} + b. \quad (6.3)$$

We can then specify a solution to the XOR problem. Let
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad (6.4) \qquad
\mathbf{c} = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad (6.5) \qquad
\mathbf{w} = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad (6.6)$$
and $b = 0$. We can now walk through how the model processes a batch of inputs by letting $X$ be the design matrix containing all four points in the binary input space, one example per row.
(Goodfellow 2017)
Solving XOR
Figure 6.1: Solving the XOR problem by learning a representation. (Left) The original x space. (Right) The learned h space: in the learned representation, the two classes become linearly separable.
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Gradient-Based Learning
• Specify
• Model
• Cost
• Design model and cost so cost is smooth
• Minimize the cost using gradient descent or related techniques (a minimal sketch follows below)
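The sketch below (our own illustration; the function names, learning rate, hidden width, and initialization are assumptions, not from the slides) shows this recipe end to end in Python/JAX: specify a model and a smooth cost, then minimize the cost with plain gradient descent.

```python
import jax
import jax.numpy as jnp

def model(params, x):
    # One ReLU hidden layer with a linear output, as in the XOR example.
    W, c, w, b = params
    h = jnp.maximum(0., x @ W + c)
    return h @ w + b

def cost(params, x, y):
    # Mean squared error: a smooth cost in the parameters (almost everywhere).
    return jnp.mean((model(params, x) - y) ** 2)

def fit(params, x, y, lr=0.05, steps=2000):
    grad_cost = jax.grad(cost)            # gradient of the cost w.r.t. params
    for _ in range(steps):
        grads = grad_cost(params, x, y)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Tiny demonstration on the XOR data from the earlier slides.
x = jnp.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = jnp.array([0., 1., 1., 0.])
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = [jax.random.normal(k1, (2, 8)), jnp.zeros(8),
          jax.random.normal(k2, (8,)), jnp.array(0.)]
print(cost(params, x, y), cost(fit(params, x, y), x, y))  # cost before vs. after; it should drop substantially
```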
(Goodfellow 2017)
Conditional Distributions and
Cross-Entropy
Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by
$$J(\theta) = -\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x}). \quad (6.12)$$
The specific form of the cost function changes from model to model, depending on the specific form of $\log p_{\text{model}}$. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if $p_{\text{model}}(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; f(\mathbf{x};\theta), I)$, then we recover the mean squared error cost,
$$J(\theta) = \frac{1}{2}\,\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\hat{p}_{\text{data}}} \left\| \mathbf{y} - f(\mathbf{x};\theta) \right\|^2 + \text{const}. \quad (6.13)$$
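As a numerical illustration (ours, not the slides'), the following Python/JAX snippet checks, for a single example, that the Gaussian negative log-likelihood in equation 6.12 equals the half squared error of equation 6.13 plus a constant that does not depend on the model output.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

# Illustrative values; f_x stands in for the model prediction f(x; theta).
y = jnp.array([0.5, -1.2, 2.0])
f_x = jnp.array([0.3, -1.0, 1.5])

# Negative log-likelihood of y under N(y; f(x; theta), I), summed over dimensions.
nll = -jnp.sum(norm.logpdf(y, loc=f_x, scale=1.0))

# Half squared error plus the constant (d/2) log(2 pi) that training discards.
half_se = 0.5 * jnp.sum((y - f_x) ** 2)
const = 0.5 * y.shape[0] * jnp.log(2 * jnp.pi)

print(nll, half_se + const)      # the two quantities agree
```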
(Goodfellow 2017)
Output Types
Output Type | Output Distribution  | Output Layer                 | Cost Function
Binary      | Bernoulli            | Sigmoid                      | Binary cross-entropy
Discrete    | Multinoulli          | Softmax                      | Discrete cross-entropy
Continuous  | Gaussian             | Linear                       | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussians | Mixture density              | Cross-entropy
Continuous  | Arbitrary            | See part III: GAN, VAE, FVBN | Various
(Goodfellow 2017)
Mixture Density Outputs
Figure 6.4: Samples drawn from a neural network with a mixture density output layer (input x on the horizontal axis, output y on the vertical axis).
(Goodfellow 2017)
Don’t mix and match
Plot: loss as a function of the pre-activation z for a sigmoid output unit with a target of 1, comparing the cross-entropy loss with the MSE loss.
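The point of the plot can be reproduced in a few lines of Python/JAX (our own sketch): for a sigmoid output unit with a target of 1, the gradient of the cross-entropy loss stays large while the prediction is wrong, whereas the gradient of the MSE loss vanishes exactly when the prediction is most badly wrong.

```python
import jax
import jax.numpy as jnp

def cross_entropy_loss(z):
    # -log sigmoid(z) for a target of 1, i.e. softplus(-z).
    return jax.nn.softplus(-z)

def mse_loss(z):
    # Squared error between sigmoid(z) and the target 1.
    return (1.0 - jax.nn.sigmoid(z)) ** 2

for z in [-5.0, -2.0, 0.0, 2.0, 5.0]:
    print(z,
          jax.grad(cross_entropy_loss)(z),   # near -1 while the answer is wrong
          jax.grad(mse_loss)(z))             # tiny when the answer is badly wrong
```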
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Hidden units
• Use ReLUs, 90% of the time
• For RNNs, see Chapter 10
• For some research projects, get creative
• Many hidden units perform comparably to ReLUs.
New hidden units that perform comparably are
rarely interesting.
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Architecture Basics

Figure 6.2 (revisited): the feedforward network used to solve the XOR example, with a single hidden layer containing two units. Depth: the number of layers. Width: the number of units in each hidden layer.
(Goodfellow 2017)
Universal Approximator
Theorem
• One hidden layer is enough to represent (not learn)
an approximation of any function to an arbitrary
degree of accuracy
• So why deeper?
• Shallow net may need (exponentially) more width
• Shallow net may overfit more
(Goodfellow 2017)
Exponential Representation
Advantage of Depth
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first to obtain further symmetries. A network with absolute value rectification thus creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions, which can capture all kinds of regular (e.g., repeating) patterns.
(Goodfellow 2017)
Better Generalization with
Greater Depth
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. (Horizontal axis: number of hidden layers, 3 to 11; vertical axis: test accuracy, percent.)
(Goodfellow 2017)
Large, Shallow Models Overfit More
Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth and type of each network: 3 layers, convolutional; 3 layers, fully connected; 11 layers, convolutional. (Horizontal axis: number of parameters, x10^8; vertical axis: test accuracy, percent.)
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Back-Propagation
• Back-propagation is “just the chain rule” of calculus
• But it’s a particular implementation of the chain rule
• Uses dynamic programming (table filling)
• Avoids recomputing repeated subexpressions
• Speed vs memory tradeoff
The chain rule states that
$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}. \quad (6.44)$$
We can generalize this beyond the scalar case. Suppose that $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$, $g$ maps from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ maps from $\mathbb{R}^n$ to $\mathbb{R}$. If $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y})$, then
$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}. \quad (6.45)$$
In vector notation, this may be equivalently written as
$$\nabla_{\mathbf{x}} z = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^{\top} \nabla_{\mathbf{y}} z, \quad (6.46)$$
where $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the $n \times m$ Jacobian matrix of $g$.

From this we see that the gradient of a variable $\mathbf{x}$ can be obtained by multiplying a Jacobian matrix $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ by a gradient $\nabla_{\mathbf{y}} z$. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.

Usually we apply the back-propagation algorithm to tensors of arbitrary dimensionality, not merely to vectors. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients. To denote the gradient of a value $z$ with respect to a tensor $\mathsf{X}$, we write $\nabla_{\mathsf{X}} z$.
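A minimal Python/JAX sketch of equation 6.46 (the example functions and shapes are our own): compute the Jacobian of g, transpose it, multiply by the gradient with respect to y, and check the result against what automatic back-propagation returns.

```python
import jax
import jax.numpy as jnp

def g(x):
    # g: R^3 -> R^2 (an arbitrary illustrative choice)
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

def f(y):
    # f: R^2 -> R
    return jnp.sum(y ** 2)

x = jnp.array([1.0, 2.0, 0.5])
y = g(x)

J = jax.jacobian(g)(x)           # the n x m (here 2 x 3) Jacobian of g at x
grad_y = jax.grad(f)(y)          # gradient of z = f(y) with respect to y
grad_x_manual = J.T @ grad_y     # eq. 6.46: (dy/dx)^T times grad_y z

grad_x_auto = jax.grad(lambda x: f(g(x)))(x)   # what back-propagation computes
print(grad_x_manual)
print(grad_x_auto)               # identical to the manual Jacobian-gradient product
```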
(Goodfellow 2017)
Simple Back-Prop Example

Figure 6.2 (revisited): the XOR network drawn in two different styles. Forward prop: compute the activations, then compute the loss. Back-prop: compute the derivatives.
(Goodfellow 2017)
Computation Graphs
Figure 6.8: Examples of computational graphs. (a) Multiplication: the graph using the x operation to compute z = xy. (b) Logistic regression: the graph for the prediction ŷ = σ(x⊤w + b), built from a dot product, an addition, and a sigmoid. (c) A ReLU layer: H = relu(XW + b), built from matmul, addition, and relu nodes. (d) Linear regression and weight decay: a graph that computes both the prediction ŷ = x⊤w and a weight decay penalty from the squared, summed weights.
(Goodfellow 2017)
Repeated Subexpressions
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let $w \in \mathbb{R}$ be the input to the graph. We use the same function $f : \mathbb{R} \to \mathbb{R}$ as the operation that we apply at every step of a chain: $x = f(w)$, $y = f(x)$, $z = f(y)$. To compute $\frac{\partial z}{\partial w}$, we apply equation 6.44 and obtain:
$$\frac{\partial z}{\partial w} \quad (6.50)$$
$$= \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}\,\frac{\partial x}{\partial w} \quad (6.51)$$
$$= f'(y)\,f'(x)\,f'(w) \quad (6.52)$$
$$= f'(f(f(w)))\,f'(f(w))\,f'(w). \quad (6.53)$$
Equation 6.52 suggests an implementation in which we compute the value of $f(w)$ only once and store it in the variable $x$. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.53, where the subexpression $f(w)$ appears more than once; there, $f(w)$ is recomputed each time it is needed. When the memory required to store the values of these expressions is low, the stored approach of equation 6.52 is preferable because of its reduced runtime, while recomputation is useful when memory is limited.

Back-prop avoids computing this twice.
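A concrete Python sketch of the two evaluation strategies (our own choice of f; any smooth scalar function works): both give the same value of dz/dw, but equation 6.52 stores the intermediates x and y, while equation 6.53 recomputes f(w) wherever it appears.

```python
import math

# f and its derivative; exp is convenient because f' = f, but any smooth
# scalar function would do.
def f(t):
    return math.exp(t)

def f_prime(t):
    return math.exp(t)

w = 0.5

# Back-prop style (eq. 6.52): compute each intermediate once and store it.
x = f(w)
y = f(x)
dz_dw_stored = f_prime(y) * f_prime(x) * f_prime(w)

# Memory-saving alternative (eq. 6.53): recompute f(w) wherever it appears.
dz_dw_recomputed = f_prime(f(f(w))) * f_prime(f(w)) * f_prime(w)

print(dz_dw_stored, dz_dw_recomputed)   # same value, different time/memory cost
```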
(Goodfellow 2017)
Symbol-to-Symbol
Differentiation
Figure 6.10: The symbol-to-symbol approach to differentiation. Back-propagation takes the graph on the left (the chain w, x, y, z built from repeated applications of f) and adds nodes to it, on the right, that describe the derivatives symbolically: f′ nodes produce dz/dy, dy/dx, and dx/dw, and multiplication nodes chain them into dz/dx and dz/dw.
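In an autodiff library this "derivatives are just more graph" view is directly visible. Here is a small Python/JAX illustration (our own, with tanh standing in for f): the derivative is returned as a new function that can itself be evaluated or differentiated again.

```python
import jax
import jax.numpy as jnp

f = jnp.tanh                              # stand-in for the f nodes in figure 6.10

# jax.grad returns a *function* -- a new graph for dz/dw -- not a number.
dz_dw = jax.grad(lambda w: f(f(f(w))))

# Because the derivative is itself a graph, it can be differentiated again.
d2z_dw2 = jax.grad(dz_dw)

print(dz_dw(0.3), d2z_dw2(0.3))
```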
(Goodfellow 2017)
Neural Network Loss Function
Figure 6.11: The computational graph used to compute the cost used to train our example of a single-layer MLP with weight decay: U(1) = XW(1) and H = relu(U(1)); U(2) = HW(2) is compared with the targets y by a cross_entropy node to give J_MLE; the entries of W(1) and W(2) are squared and summed to form a weight decay penalty, which is scaled and added to J_MLE to give the total cost J.
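A hedged Python/JAX sketch of the same kind of cost (the shapes, the value of lambda, and the use of a softmax cross-entropy here are our assumptions; the slide's graph does not fix them):

```python
import jax
import jax.numpy as jnp

def cost(W1, W2, X, y, lam=1e-3):
    H = jax.nn.relu(X @ W1)                                 # U(1) = X W(1), H = relu(U(1))
    logits = H @ W2                                         # U(2) = H W(2)
    log_p = jax.nn.log_softmax(logits)
    j_mle = -jnp.mean(log_p[jnp.arange(len(y)), y])         # cross-entropy term J_MLE
    penalty = lam * (jnp.sum(W1 ** 2) + jnp.sum(W2 ** 2))   # weight decay term
    return j_mle + penalty                                  # total cost J

# Toy shapes, purely for illustration.
X = jnp.ones((4, 3))
y = jnp.array([0, 1, 1, 0])
W1 = 0.1 * jnp.ones((3, 5))
W2 = 0.1 * jnp.ones((5, 2))

# Back-prop through the whole cost graph, as in figure 6.11.
gW1, gW2 = jax.grad(cost, argnums=(0, 1))(W1, W2, X, y)
print(gW1.shape, gW2.shape)
```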
(Goodfellow 2017)
Hessian-vector Products
Krylov methods are a set of iterative techniques for performing operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products. To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix $H$ and an arbitrary vector $\mathbf{v}$. A straightforward technique (Christianson, 1992) for doing so is to compute
$$H\mathbf{v} = \nabla_{\mathbf{x}}\!\left[\left(\nabla_{\mathbf{x}} f(\mathbf{x})\right)^{\top}\mathbf{v}\right]. \quad (6.59)$$
Both of the gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression is a gradient of a function of the inner gradient expression. If $\mathbf{v}$ is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced $\mathbf{v}$.
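The identity translates almost literally into an autodiff library. Below is a small Python/JAX sketch (our own; the slides do not prescribe a library), with stop_gradient playing the role of "do not differentiate through the graph that produced v", checked against an explicitly formed Hessian.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Any scalar-valued function of x; this particular choice is illustrative.
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(f, x, v):
    # Eq. 6.59: Hv = grad_x [ (grad_x f(x))^T v ]. stop_gradient ensures the
    # library does not differentiate through whatever graph produced v.
    v = jax.lax.stop_gradient(v)
    return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)

x = jnp.array([0.3, -1.0, 2.0])
v = jnp.array([1.0, 0.0, -0.5])

print(hvp(f, x, v))
print(jax.hessian(f)(x) @ v)   # same product, via the explicit Hessian
```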
(Goodfellow 2017)
Questions
