Deep Feedforward
Networks
Lecture slides for Chapter 6 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last updated 2016-10-04
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
XOR is not linearly separable
Figure 6.1, left: the original x space (x1 vs. x2) with the four XOR input points; no single line separates the two classes, so we learn a new feature space h.
(Goodfellow 2017)
Rectified Linear Activation
Figure 6.3: the rectified linear activation, g(z) = max{0, z}, plotted as a function of z.
(Goodfellow 2017)
Network Diagrams
Figure 6.2: An example of a feedforward network, drawn in two different styles: one node per layer vector (x, h, y with weights W, w) and one node per unit (x1, x2, h1, h2, y).
(Goodfellow 2017)
Solving XOR
The activation function g is typically chosen to be a function applied element-wise, with $h_i = g(\mathbf{x}^\top W_{:,i} + c_i)$. In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function $g(z) = \max\{0, z\}$, depicted in figure 6.3.

We can now specify our complete network as
$$f(\mathbf{x}; W, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^\top \max\{0,\, W^\top \mathbf{x} + \mathbf{c}\} + b. \quad (6.3)$$

We can then specify a solution to the XOR problem. Let
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad (6.4) \qquad
\mathbf{c} = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad (6.5) \qquad
\mathbf{w} = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad (6.6)$$
and $b = 0$. We can now walk through how the model processes a batch of inputs by letting $X$ be the design matrix containing all four points in the binary input space, one example per row.
(Goodfellow 2017)
Solving XOR
Figure 6.1: Solving the XOR problem by learning a representation. (Left) The original x space. (Right) The learned h space: in the learned representation, the two classes become linearly separable.
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Gradient-Based Learning
• Specify
• Model
• Cost
• Design model and cost so cost is smooth
• Minimize the cost using gradient descent or related techniques (a minimal sketch follows below)
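The sketch below (our own illustration; the function names, learning rate, hidden width, and initialization are assumptions, not from the slides) shows this recipe end to end in Python/JAX: specify a model and a smooth cost, then minimize the cost with plain gradient descent.

```python
import jax
import jax.numpy as jnp

def model(params, x):
    # One ReLU hidden layer with a linear output, as in the XOR example.
    W, c, w, b = params
    h = jnp.maximum(0., x @ W + c)
    return h @ w + b

def cost(params, x, y):
    # Mean squared error: a smooth cost in the parameters (almost everywhere).
    return jnp.mean((model(params, x) - y) ** 2)

def fit(params, x, y, lr=0.05, steps=2000):
    grad_cost = jax.grad(cost)            # gradient of the cost w.r.t. params
    for _ in range(steps):
        grads = grad_cost(params, x, y)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Tiny demonstration on the XOR data from the earlier slides.
x = jnp.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = jnp.array([0., 1., 1., 0.])
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = [jax.random.normal(k1, (2, 8)), jnp.zeros(8),
          jax.random.normal(k2, (8,)), jnp.array(0.)]
print(cost(params, x, y), cost(fit(params, x, y), x, y))  # cost before vs. after; it should drop substantially
```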
(Goodfellow 2017)
Conditional Distributions and
Cross-Entropy
Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by
$$J(\theta) = -\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x}). \quad (6.12)$$
The specific form of the cost function changes from model to model, depending on the specific form of $\log p_{\text{model}}$. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if $p_{\text{model}}(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; f(\mathbf{x};\theta), I)$, then we recover the mean squared error cost,
$$J(\theta) = \frac{1}{2}\,\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\hat{p}_{\text{data}}} \left\| \mathbf{y} - f(\mathbf{x};\theta) \right\|^2 + \text{const}. \quad (6.13)$$
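As a numerical illustration (ours, not the slides'), the following Python/JAX snippet checks, for a single example, that the Gaussian negative log-likelihood in equation 6.12 equals the half squared error of equation 6.13 plus a constant that does not depend on the model output.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

# Illustrative values; f_x stands in for the model prediction f(x; theta).
y = jnp.array([0.5, -1.2, 2.0])
f_x = jnp.array([0.3, -1.0, 1.5])

# Negative log-likelihood of y under N(y; f(x; theta), I), summed over dimensions.
nll = -jnp.sum(norm.logpdf(y, loc=f_x, scale=1.0))

# Half squared error plus the constant (d/2) log(2 pi) that training discards.
half_se = 0.5 * jnp.sum((y - f_x) ** 2)
const = 0.5 * y.shape[0] * jnp.log(2 * jnp.pi)

print(nll, half_se + const)      # the two quantities agree
```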
(Goodfellow 2017)
Output Types
Output Type | Output Distribution  | Output Layer                 | Cost Function
Binary      | Bernoulli            | Sigmoid                      | Binary cross-entropy
Discrete    | Multinoulli          | Softmax                      | Discrete cross-entropy
Continuous  | Gaussian             | Linear                       | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussians | Mixture density              | Cross-entropy
Continuous  | Arbitrary            | See part III: GAN, VAE, FVBN | Various
(Goodfellow 2017)
Mixture Density Outputs
Figure 6.4: Samples drawn from a neural network with a mixture density output layer (input x on the horizontal axis, output y on the vertical axis).
(Goodfellow 2017)
Don’t mix and match
Plot: loss as a function of the pre-activation z for a sigmoid output unit with a target of 1, comparing the cross-entropy loss with the MSE loss.
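The point of the plot can be reproduced in a few lines of Python/JAX (our own sketch): for a sigmoid output unit with a target of 1, the gradient of the cross-entropy loss stays large while the prediction is wrong, whereas the gradient of the MSE loss vanishes exactly when the prediction is most badly wrong.

```python
import jax
import jax.numpy as jnp

def cross_entropy_loss(z):
    # -log sigmoid(z) for a target of 1, i.e. softplus(-z).
    return jax.nn.softplus(-z)

def mse_loss(z):
    # Squared error between sigmoid(z) and the target 1.
    return (1.0 - jax.nn.sigmoid(z)) ** 2

for z in [-5.0, -2.0, 0.0, 2.0, 5.0]:
    print(z,
          jax.grad(cross_entropy_loss)(z),   # near -1 while the answer is wrong
          jax.grad(mse_loss)(z))             # tiny when the answer is badly wrong
```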
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Hidden units
• Use ReLUs, 90% of the time
• For RNNs, see Chapter 10
• For some research projects, get creative
• Many hidden units perform comparably to ReLUs.
New hidden units that perform comparably are
rarely interesting.
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Architecture Basics

Figure 6.2 (revisited): the feedforward network used to solve the XOR example, with a single hidden layer containing two units. Depth: the number of layers. Width: the number of units in each hidden layer.
(Goodfellow 2017)
Universal Approximator
Theorem
• One hidden layer is enough to represent (not learn)
an approximation of any function to an arbitrary
degree of accuracy
• So why deeper?
• Shallow net may need (exponentially) more width
• Shallow net may overfit more
(Goodfellow 2017)
Exponential Representation
Advantage of Depth
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first to obtain further symmetries. A network with absolute value rectification thus creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions, which can capture all kinds of regular (e.g., repeating) patterns.
(Goodfellow 2017)
Better Generalization with
Greater Depth
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. (Horizontal axis: number of hidden layers, 3 to 11; vertical axis: test accuracy, percent.)
(Goodfellow 2017)
Large, Shallow Models Overfit More
Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth and type of each network: 3 layers, convolutional; 3 layers, fully connected; 11 layers, convolutional. (Horizontal axis: number of parameters, x10^8; vertical axis: test accuracy, percent.)
(Goodfellow 2017)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2017)
Back-Propagation
• Back-propagation is “just the chain rule” of calculus
• But it’s a particular implementation of the chain rule
• Uses dynamic programming (table filling)
• Avoids recomputing repeated subexpressions
• Speed vs memory tradeoff
The chain rule states that
$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}. \quad (6.44)$$
We can generalize this beyond the scalar case. Suppose that $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$, $g$ maps from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ maps from $\mathbb{R}^n$ to $\mathbb{R}$. If $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y})$, then
$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}. \quad (6.45)$$
In vector notation, this may be equivalently written as
$$\nabla_{\mathbf{x}} z = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^{\top} \nabla_{\mathbf{y}} z, \quad (6.46)$$
where $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the $n \times m$ Jacobian matrix of $g$.

From this we see that the gradient of a variable $\mathbf{x}$ can be obtained by multiplying a Jacobian matrix $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ by a gradient $\nabla_{\mathbf{y}} z$. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.

Usually we apply the back-propagation algorithm to tensors of arbitrary dimensionality, not merely to vectors. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients. To denote the gradient of a value $z$ with respect to a tensor $\mathsf{X}$, we write $\nabla_{\mathsf{X}} z$.
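A minimal Python/JAX sketch of equation 6.46 (the example functions and shapes are our own): compute the Jacobian of g, transpose it, multiply by the gradient with respect to y, and check the result against what automatic back-propagation returns.

```python
import jax
import jax.numpy as jnp

def g(x):
    # g: R^3 -> R^2 (an arbitrary illustrative choice)
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

def f(y):
    # f: R^2 -> R
    return jnp.sum(y ** 2)

x = jnp.array([1.0, 2.0, 0.5])
y = g(x)

J = jax.jacobian(g)(x)           # the n x m (here 2 x 3) Jacobian of g at x
grad_y = jax.grad(f)(y)          # gradient of z = f(y) with respect to y
grad_x_manual = J.T @ grad_y     # eq. 6.46: (dy/dx)^T times grad_y z

grad_x_auto = jax.grad(lambda x: f(g(x)))(x)   # what back-propagation computes
print(grad_x_manual)
print(grad_x_auto)               # identical to the manual Jacobian-gradient product
```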
(Goodfellow 2017)
Simple Back-Prop Example

Figure 6.2 (revisited): the XOR network drawn in two different styles. Forward prop: compute the activations, then compute the loss. Back-prop: compute the derivatives.
(Goodfellow 2017)
Computation Graphs
Figure 6.8: Examples of computational graphs. (a) Multiplication: the graph using the x operation to compute z = xy. (b) Logistic regression: the graph for the prediction ŷ = σ(x⊤w + b), built from a dot product, an addition, and a sigmoid. (c) A ReLU layer: H = relu(XW + b), built from matmul, addition, and relu nodes. (d) Linear regression and weight decay: a graph that computes both the prediction ŷ = x⊤w and a weight decay penalty from the squared, summed weights.
(Goodfellow 2017)
Repeated Subexpressions
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let $w \in \mathbb{R}$ be the input to the graph. We use the same function $f : \mathbb{R} \to \mathbb{R}$ as the operation that we apply at every step of a chain: $x = f(w)$, $y = f(x)$, $z = f(y)$. To compute $\frac{\partial z}{\partial w}$, we apply equation 6.44 and obtain:
$$\frac{\partial z}{\partial w} \quad (6.50)$$
$$= \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}\,\frac{\partial x}{\partial w} \quad (6.51)$$
$$= f'(y)\,f'(x)\,f'(w) \quad (6.52)$$
$$= f'(f(f(w)))\,f'(f(w))\,f'(w). \quad (6.53)$$
Equation 6.52 suggests an implementation in which we compute the value of $f(w)$ only once and store it in the variable $x$. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.53, where the subexpression $f(w)$ appears more than once; there, $f(w)$ is recomputed each time it is needed. When the memory required to store the values of these expressions is low, the stored approach of equation 6.52 is preferable because of its reduced runtime, while recomputation is useful when memory is limited.

Back-prop avoids computing this twice.
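A concrete Python sketch of the two evaluation strategies (our own choice of f; any smooth scalar function works): both give the same value of dz/dw, but equation 6.52 stores the intermediates x and y, while equation 6.53 recomputes f(w) wherever it appears.

```python
import math

# f and its derivative; exp is convenient because f' = f, but any smooth
# scalar function would do.
def f(t):
    return math.exp(t)

def f_prime(t):
    return math.exp(t)

w = 0.5

# Back-prop style (eq. 6.52): compute each intermediate once and store it.
x = f(w)
y = f(x)
dz_dw_stored = f_prime(y) * f_prime(x) * f_prime(w)

# Memory-saving alternative (eq. 6.53): recompute f(w) wherever it appears.
dz_dw_recomputed = f_prime(f(f(w))) * f_prime(f(w)) * f_prime(w)

print(dz_dw_stored, dz_dw_recomputed)   # same value, different time/memory cost
```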
(Goodfellow 2017)
Symbol-to-Symbol
Differentiation
Figure 6.10: The symbol-to-symbol approach to differentiation. Back-propagation takes the graph on the left (the chain w, x, y, z built from repeated applications of f) and adds nodes to it, on the right, that describe the derivatives symbolically: f′ nodes produce dz/dy, dy/dx, and dx/dw, and multiplication nodes chain them into dz/dx and dz/dw.
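In an autodiff library this "derivatives are just more graph" view is directly visible. Here is a small Python/JAX illustration (our own, with tanh standing in for f): the derivative is returned as a new function that can itself be evaluated or differentiated again.

```python
import jax
import jax.numpy as jnp

f = jnp.tanh                              # stand-in for the f nodes in figure 6.10

# jax.grad returns a *function* -- a new graph for dz/dw -- not a number.
dz_dw = jax.grad(lambda w: f(f(f(w))))

# Because the derivative is itself a graph, it can be differentiated again.
d2z_dw2 = jax.grad(dz_dw)

print(dz_dw(0.3), d2z_dw2(0.3))
```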
(Goodfellow 2017)
Neural Network Loss Function
Figure 6.11: The computational graph used to compute the cost used to train our example of a single-layer MLP with weight decay: U(1) = XW(1) and H = relu(U(1)); U(2) = HW(2) is compared with the targets y by a cross_entropy node to give J_MLE; the entries of W(1) and W(2) are squared and summed to form a weight decay penalty, which is scaled and added to J_MLE to give the total cost J.
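A hedged Python/JAX sketch of the same kind of cost (the shapes, the value of lambda, and the use of a softmax cross-entropy here are our assumptions; the slide's graph does not fix them):

```python
import jax
import jax.numpy as jnp

def cost(W1, W2, X, y, lam=1e-3):
    H = jax.nn.relu(X @ W1)                                 # U(1) = X W(1), H = relu(U(1))
    logits = H @ W2                                         # U(2) = H W(2)
    log_p = jax.nn.log_softmax(logits)
    j_mle = -jnp.mean(log_p[jnp.arange(len(y)), y])         # cross-entropy term J_MLE
    penalty = lam * (jnp.sum(W1 ** 2) + jnp.sum(W2 ** 2))   # weight decay term
    return j_mle + penalty                                  # total cost J

# Toy shapes, purely for illustration.
X = jnp.ones((4, 3))
y = jnp.array([0, 1, 1, 0])
W1 = 0.1 * jnp.ones((3, 5))
W2 = 0.1 * jnp.ones((5, 2))

# Back-prop through the whole cost graph, as in figure 6.11.
gW1, gW2 = jax.grad(cost, argnums=(0, 1))(W1, W2, X, y)
print(gW1.shape, gW2.shape)
```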
(Goodfellow 2017)
Hessian-vector Products
Krylov methods are a set of iterative techniques for performing operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products. To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix $H$ and an arbitrary vector $\mathbf{v}$. A straightforward technique (Christianson, 1992) for doing so is to compute
$$H\mathbf{v} = \nabla_{\mathbf{x}}\!\left[\left(\nabla_{\mathbf{x}} f(\mathbf{x})\right)^{\top}\mathbf{v}\right]. \quad (6.59)$$
Both of the gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression is a gradient of a function of the inner gradient expression. If $\mathbf{v}$ is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced $\mathbf{v}$.
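The identity translates almost literally into an autodiff library. Below is a small Python/JAX sketch (our own; the slides do not prescribe a library), with stop_gradient playing the role of "do not differentiate through the graph that produced v", checked against an explicitly formed Hessian.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Any scalar-valued function of x; this particular choice is illustrative.
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(f, x, v):
    # Eq. 6.59: Hv = grad_x [ (grad_x f(x))^T v ]. stop_gradient ensures the
    # library does not differentiate through whatever graph produced v.
    v = jax.lax.stop_gradient(v)
    return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)

x = jnp.array([0.3, -1.0, 2.0])
v = jnp.array([1.0, 0.0, -0.5])

print(hvp(f, x, v))
print(jax.hessian(f)(x) @ v)   # same product, via the explicit Hessian
```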
(Goodfellow 2017)
Questions
