Goodfellow, Bengio, Courville (2016) "Deep Learning", Chap. 6

Deep Feedforward Networks
Goodfellow, Bengio, & Courville (2016) Deep Learning, Chap 6.
Shigeru ONO (Insight Factory)
DL Reading Group: 2020/07

TOC
6.1 Example: Learning XOR
6.2 Gradient-Based Learning
6.3 Hidden Units
6.4 Architecture Design
6.5 Back-Propagation and Other Differentiation Algorithms
6.6 Historical Notes

(introduction)
deep feedforward network
aka: feedforward neural network, multilayer perceptron (MLP)
Purpose: to approximate some function f∗
Information flows from x to the output with no feedback connections
Why is it called a "network"?
It is represented by composing many different functions
1st layer, 2nd layer, ..., output layer
Why is it called "neural"?
Hidden layers consist of vector-to-vector functions
We can think of a layer as consisting of many units (vector-to-scalar functions) that act in parallel
Each unit resembles a neuron

(introduction)
To extend linear models, we can apply the linear model to a nonlinearly transformed input ϕ(x). How do we choose ϕ?
Use a generic ϕ, e.g. the infinite-dimensional ϕ implicitly used by kernel machines.
Manually engineer ϕ.
Learn ϕ: the strategy of DL.

6.1 Example: Learning XOR
Target function $f^*$: the XOR function
Training set: $\mathbb{X} = \{[0, 0]^\top, [0, 1]^\top, [1, 0]^\top, [1, 1]^\top\}$
Model: $f(x; \theta)$
Loss function: MSE $J(\theta) = \frac{1}{4} \sum_{x \in \mathbb{X}} (f^*(x) - f(x; \theta))^2$
↓
A linear model cannot represent the XOR function.
↓
Learn a different feature space, where a linear model can represent the solution.
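The failure of the linear model can be checked numerically. The sketch below is an illustration rather than part of the slides: it assumes the linear model f(x; w, b) = x⊤w + b and fits it to the four XOR points by least squares with NumPy. The fitted model outputs 0.5 for every input.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so that the bias b is fitted together with w.
A = np.hstack([X, np.ones((4, 1))])

# Least-squares (MSE-minimizing) solution of the linear model x^T w + b.
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = params[:2], params[2]
predictions = A @ params

print("w =", w, "b =", b)
print("predictions =", predictions)
```

Because the best linear fit is w = 0 and b = 0.5, no threshold on its output separates the two classes.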

We introduce a simple feedforward network:
$f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x; W, c); w, b)$
In most neural networks, an affine transformation controlled by learned parameters, followed by an activation function g, is used in $f^{(1)}$:
$f^{(1)}(x; W, c) = g(W^\top x + c)$
g is typically an element-wise function. In modern neural networks, the default recommendation for g is the rectified linear unit (ReLU):
$g(z) = \max\{0, z\}$
Our complete network:
$f(x; W, c, w, b) = w^\top \max\{0, W^\top x + c\} + b$

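Let X be the design matrix, with one example per row:
$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$
Set
$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$
Then the matrix of hidden features $\max\{0, XW + c^\top\}$ (c is added to every row) is
$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$
Multiplying by w gives the correct answers $[0, 1, 1, 0]^\top$.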
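The hand-set solution can be verified in a few lines. This is a minimal sketch, assuming the rows of X are the four XOR inputs [0,0], [0,1], [1,0], [1,1] and that b = 0; the variable names H and y_hat are introduced here for illustration only.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # design matrix
W = np.array([[1, 1], [1, 1]], dtype=float)
c = np.array([0, -1], dtype=float)
w = np.array([1, -2], dtype=float)
b = 0.0  # assumed bias of the output layer

# Hidden features: the affine map followed by ReLU, one row per example.
H = np.maximum(0, X @ W + c)

# Linear output layer applied to the learned feature space.
y_hat = H @ w + b

print(H)
print(y_hat)
```

In the hidden feature space the inputs [0,1] and [1,0] collapse onto the same point [1,0], which is what lets a linear output layer separate the classes.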

6.2 Gradient-Based Learning
The largest difference between linear models and NNs is that the nonlinearity of a NN causes most interesting loss functions to become nonconvex.
NNs are usually trained with iterative, gradient-based optimizers (rather than linear equation solvers or convex optimizers).
Convergence is not guaranteed.
The result is sensitive to the initial parameter values.

6.2.1 Cost Functions
1. learning conditional distributions
Most modern NN models define a distribution p(y|x; θ) and simply use the maximum likelihood (ML) principle.
The cost function is the negative log-likelihood (= the cross-entropy between the training data and the model distribution):
$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$
Advantage: specifying a model p(y|x) automatically determines a cost function based on log p(y|x).
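As a worked instance of this point, assume (this choice is not on the slide) a Gaussian model with identity covariance whose mean is the network output f(x; θ). The negative log-likelihood then reduces to the mean squared error up to a scale factor and a constant that does not depend on θ:

```latex
p_{\text{model}}(y \mid x) = \mathcal{N}\bigl(y;\, f(x; \theta),\, I\bigr)
\;\Longrightarrow\;
J(\theta) = \frac{1}{2}\, \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}}
            \bigl\| y - f(x; \theta) \bigr\|^2 + \text{const}
```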

2. learning conditional statistics
In some cases we want to learn just one conditional statistic of y given x.
We can view the cost function as a functional (= a mapping from functions to real numbers) rather than a function.
For example, when we wish to predict the mean of y, we can design the cost functional so that its minimum lies at the function that maps x to E[y|x].
Solving an optimization problem with respect to a function requires the calculus of variations.

(cont’d)
Suppose the optimization problem is
$f^* = \arg\min_f \mathbb{E}_{x, y \sim p_{\text{data}}} \|y - f(x)\|^2$
If the minimizing function lies within the class we optimize over, this yields
$f^*(x) = \mathbb{E}_{y \sim p_{\text{data}}(y \mid x)}[y]$
i.e. if we could train on infinitely many samples, minimizing the MSE cost $\|y - f(x)\|^2$ would give a function that predicts E[y|x] for each value of x.
Similarly, minimizing the MAE cost $\|y - f(x)\|_1$ would give a function that predicts the median of y for each value of x.
MSE and MAE often lead to poor results when used with gradient-based optimization, because some output units saturate and produce very small gradients when combined with these costs. Cross-entropy cost is more popular, even when we do not need to estimate the whole distribution p(y|x).
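The mean/median result can be illustrated empirically in the simplest case, where there is no input x and the "function" is a single constant prediction c. The sketch below is a hypothetical setup, not taken from the slides: it draws skewed samples, evaluates both costs on a grid of candidate constants, and compares the minimizers with the sample mean and median.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed samples, so that the mean and the median differ noticeably.
y = rng.exponential(scale=1.0, size=501)

# Candidate constant predictions c on a fine grid (step 0.0025).
candidates = np.linspace(0.0, 5.0, 2001)
residuals = y[:, None] - candidates[None, :]

mse = np.mean(residuals ** 2, axis=0)      # squared-error cost for each c
mae = np.mean(np.abs(residuals), axis=0)   # absolute-error cost for each c

best_mse = candidates[np.argmin(mse)]
best_mae = candidates[np.argmin(mae)]

print("MSE minimizer:", best_mse, " sample mean:  ", y.mean())
print("MAE minimizer:", best_mae, " sample median:", np.median(y))
```

Up to the grid resolution, the MSE minimizer coincides with the sample mean and the MAE minimizer with the sample median.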

6.2.2 Output Units
1. linear units
output units based on an affine transformation with no nonlinearity:
$\hat{y} = W^\top h + b$
often used to produce the mean of a conditional Gaussian distribution.
do not saturate, so they are suitable for gradient-based optimization.

6.2.2 Output Units
2. sigmoid units
$\hat{y} = \sigma(w^\top h + b)$
where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ (logistic sigmoid function).
For binary y, the NN needs to predict only P(y = 1|x).
The ML approach is to define a Bernoulli distribution over y conditioned on x.
One possibility is to define
$P(y = 1 \mid x) = \max\{0, \min\{1, w^\top h + b\}\}$
but its gradient is 0 whenever $w^\top h + b$ falls outside [0, 1].
Instead, let $z = w^\top h + b$ and let $\tilde{P}(y)$ be an unnormalized probability of y. Assume that $\log \tilde{P}(y) = yz$ (i.e. $\tilde{P}(y = 0) = \exp(0) = 1$, $\tilde{P}(y = 1) = \exp(z)$). Normalizing yields a Bernoulli distribution:
$P(y) = \frac{\exp(yz)}{\exp(0) + \exp(z)} = \sigma((2y - 1)z)$

6.2.2 Output Units
(cont’d)
The loss function for ML is
J(θ) = − log σ((2y − 1)z) = ζ((1 − 2y)z)
where ζ(x) = log(1 + exp(x)) (softplus function). It saturates only when
(1 − 2y)z is very negative (i.e. the model already has the right answer).
Other loss functions (e.g. MSE) can saturate anytime σ(z) saturates. So ML
is preferred.
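The identity between the negative log-likelihood and the softplus, and the saturation behavior, can be checked numerically. This is a minimal sketch; the helper names sigmoid and softplus and the test logits are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # log(1 + exp(x)), computed stably as log(exp(0) + exp(x)).
    return np.logaddexp(0.0, x)

z = np.array([-3.0, -0.5, 0.0, 2.0, 4.0])

# Largest difference between -log sigma((2y-1)z) and zeta((1-2y)z) over both labels.
max_gap = 0.0
for y in (0, 1):
    nll_direct = -np.log(sigmoid((2 * y - 1) * z))
    nll_softplus = softplus((1 - 2 * y) * z)
    max_gap = max(max_gap, float(np.max(np.abs(nll_direct - nll_softplus))))

# Confidently wrong prediction (y = 1, z = -20): large loss, gradient dJ/dz near -1.
y = 1
loss_wrong = softplus((1 - 2 * y) * -20.0)
grad_wrong = (1 - 2 * y) * sigmoid((1 - 2 * y) * -20.0)

# Confidently right prediction (y = 1, z = 20): the loss saturates near 0.
loss_right = softplus((1 - 2 * y) * 20.0)

print(max_gap, loss_wrong, grad_wrong, loss_right)
```

The gradient stays close to -1 when the model is confidently wrong, so learning can still correct the mistake; the loss only flattens out once the answer is already right.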

6.2.2 Output Units
3. softmax units
Now we wish to generalize the sigmoid function to the case of a discrete variable with n values. We need to produce a vector $\hat{y}$ with $\hat{y}_i = P(y = i \mid x)$.
Assume we can predict a vector z of unnormalized log probabilities, $z_i = \log \tilde{P}(y = i \mid x)$. We can obtain the desired $\hat{y}$ as
$\hat{y}_i = \text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)}$
In the ML approach we wish to maximize the log-likelihood $\log P(y = i; z) = \log \text{softmax}(z)_i$, where
$\log \text{softmax}(z)_i = z_i - \log \sum_j \exp(z_j)$
Many objective functions other than the log-likelihood do not work as well with the softmax function.
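The log-softmax formula can be sketched in NumPy as follows. Subtracting max(z) before exponentiating is a standard numerical-stability step added here as an assumption (it is not on the slide); it does not change the result, because softmax is invariant to adding a constant to every element of z.

```python
import numpy as np

def log_softmax(z):
    # z_i - log sum_j exp(z_j), with max(z) subtracted first to avoid overflow.
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

def softmax(z):
    return np.exp(log_softmax(z))

z = np.array([2.0, 1.0, 0.1])
y_hat = softmax(z)                 # predicted class probabilities
nll_class0 = -log_softmax(z)[0]    # negative log-likelihood if the true class is 0

# Large logits would overflow a naive exp(z), but the shifted version stays finite.
z_large = np.array([1000.0, 1001.0, 1002.0])
y_hat_large = softmax(z_large)

print(y_hat, nll_class0)
print(y_hat_large)
```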

6.2.2 Output Units
(cont’d)
z can be produced as $z = W^\top h + b$, but this actually overparameterizes the distribution: the n outputs must sum to 1, so only n − 1 parameters are necessary.
Alternatively, we can impose a requirement that one element of z be fixed. In practice it rarely makes much difference.
Origin of the name: "soft" means that it is continuous and differentiable. It would perhaps be better to call it "softargmax".
4. Other output types
...skipped ...

6.3 Hidden Units
ReLU is an excellent default choice.
In practice we can disregard the fact that some activation functions are not differentiable at every input point.
Most hidden units
(1) accept a vector x,
(2) compute an affine transformation $z = W^\top x + b$, and
(3) apply an element-wise nonlinear function g(z).
They are distinguished from each other only by the choice of g(z).

6.3.1 ReLU and their generalizations
ReLU (rectified linear unit):
the activation function is g(z) = max{0, z}
typically used on top of an affine transformation:
$h = g(W^\top x + b)$
when initializing, it can be good practice to set all elements of b to a small positive value (e.g. 0.1)
Drawback: a ReLU cannot learn via gradient-based methods on examples for which its activation is zero

6.3.1 ReLU and their generalizations
Generalizations of ReLU use a nonzero slope $\alpha_i$ when $z_i < 0$:
$h_i = g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i)$
Absolute value rectification: fixes $\alpha_i = -1$, giving g(z) = |z|. Used for object recognition from images.
leaky ReLU: fixes $\alpha_i$ to a small value like 0.01
parametric ReLU (PReLU): treats $\alpha_i$ as a learnable parameter
maxout units: $g(z)_i = \max_{j \in \mathbb{G}^{(i)}} z_j$, where $\mathbb{G}^{(i)}$ is the i-th group of k elements of z.
A maxout unit can learn a piecewise linear, convex function with up to k pieces
each unit is parameterized by k weight vectors
each unit is driven by multiple filters; this redundancy helps it resist "catastrophic forgetting" (forgetting how to perform tasks the network was trained on in the past)
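The ReLU variants above differ only in the slope used for negative inputs, which a short sketch makes concrete. The function names are introduced here for illustration, and grouping z into consecutive blocks of k elements for maxout is an assumption of this sketch.

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0, z) + alpha * np.minimum(0, z)

def maxout(z, k):
    # Split z into consecutive groups of k elements and keep the max of each group.
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0, -1.0])

relu = generalized_relu(z, 0.0)        # alpha = 0: plain ReLU
abs_rect = generalized_relu(z, -1.0)   # alpha = -1: absolute value rectification
leaky = generalized_relu(z, 0.01)      # alpha = 0.01: leaky ReLU
groups = maxout(z, k=2)                # three maxout units, each over 2 elements

print(relu, abs_rect, leaky, groups, sep="\n")
```

A parametric ReLU would use the same generalized_relu function but treat alpha as an array that is updated by the optimizer.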

6.3.2 Logistic Sigmoid and Hyperbolic Tangent
logistic sigmoid activation function: g(z) = σ(z)
mainly used prior to the introduction of ReLU
saturates across most of its domain, which makes gradient-based learning difficult
now discouraged for hidden units; used mainly for output units (or in settings other than feedforward networks)
hyperbolic tangent activation function: g(z) = tanh(z) = 2σ(2z) − 1
mainly used prior to the introduction of ReLU
typically performs better than the logistic sigmoid
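Both the tanh identity and the saturation of these units can be checked numerically. The sketch below uses the standard derivative formulas σ'(z) = σ(z)(1 − σ(z)) and tanh'(z) = 1 − tanh(z)², which are not stated on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6.0, 6.0, 13)  # -6, -5, ..., 6

# Identity from the slide: tanh(z) = 2 * sigma(2z) - 1.
identity_gap = np.max(np.abs(np.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)))

# Derivatives of both activations on the same grid.
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))
tanh_grad = 1.0 - np.tanh(z) ** 2

print("identity gap:", identity_gap)
print("sigmoid'(z):", np.round(sigmoid_grad, 4))
print("tanh'(z):   ", np.round(tanh_grad, 4))
```

The sigmoid derivative peaks at 0.25 and is nearly 0 a few units away from the origin, which is the saturation that makes gradient-based learning difficult; tanh has slope 1 at the origin but saturates in the same way.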

6.3.3 Other Hidden Units
A wide variety of differentiable functions perform well.
identity function (it is acceptable for some layers to be purely linear)
softmax function
radial basis function (RBF): $h_i = \exp\left(-\frac{1}{\sigma_i^2} \|W_{:,i} - x\|^2\right)$
softplus function (generally discouraged)
hard tanh: g(a) = max(−1, min(1, a))

6.4 Architecture Design
architecture: the overall structure of network.
Most NN are organized into layers.
1st layer: $h^{(1)} = g^{(1)}(W^{(1)\top} x + b^{(1)})$
2nd layer: $h^{(2)} = g^{(2)}(W^{(2)\top} h^{(1)} + b^{(2)})$
...
Main considerations:
Depth: number of layers
Width: number of units in each layer
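The layered structure can be written as a loop over (W, b) pairs. This is a minimal sketch under stated assumptions: every layer uses ReLU as its activation g, the weights are small random values, and the helper names init_params and forward are hypothetical. The list layer_sizes fixes both the depth and the width of each layer.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def init_params(layer_sizes, rng):
    # One (W, b) pair per layer; W has shape (inputs, outputs) so that W.T @ h is valid.
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(scale=0.1, size=(n_in, n_out))
        b = np.full(n_out, 0.1)
        params.append((W, b))
    return params

def forward(params, x):
    # h^(l) = g(W^(l).T @ h^(l-1) + b^(l)), starting from h^(0) = x.
    h = x
    for W, b in params:
        h = relu(W.T @ h + b)
    return h

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 4, 2]   # input dimension 3, then layers of width 5, 4 and 2
params = init_params(layer_sizes, rng)
depth = len(params)          # number of layers

x = np.array([1.0, -2.0, 0.5])
output = forward(params, x)
print(depth, output)
```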

6.4.1 Universal Approximation Properties and Depth
Universal approximation theorem:
if a feedforward network has
(1) a linear output layer,
(2) at least one hidden layer with any "squashing" activation function, and
(3) enough hidden units,
then it can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error
"squashing" function ... e.g. the logistic sigmoid
Borel measurable functions ... include any continuous function on a closed and bounded subset of $\mathbb{R}^n$
In other words: a large enough feedforward network will be able to represent any function we are trying to learn

6.4.1 Universal Approximation Properties and Depth
But... "represent" ≠ "learn".
Training an MLP may fail: the optimization algorithm may not find the right parameters, or the training algorithm may choose the wrong function as a result of overfitting.
Even if one hidden layer is enough, the layer may be infeasibly large.
Towards deeper models:
In many circumstances, using deeper models can reduce the number of units required.
Statistical reason: choosing a deep model means that we believe the learning problem consists of discovering a set of underlying factors of variation, which can in turn be described in terms of other, simpler underlying factors
Some experiments suggest that deep architectures express a useful prior
6.4.2 Other Architectural Considerations
...skipped...

6.5 Back-Propagation and Other Differentiation Algorithms
forward propagation:
The input x provides the initial information
It propagates up through the hidden layers and finally produces ˆy
During training, forward propagation can continue until it produces a scalar cost J(θ)
back-propagation algorithm (backprop):
The cost J(θ) provides the initial information
It flows backward through the network in order to compute the gradient
It is a simple and inexpensive procedure, whereas numerically evaluating an analytical expression for the gradient can be computationally expensive
It is not the whole learning algorithm, only the method for computing the gradient. Another algorithm (e.g. stochastic gradient descent) uses this gradient to perform learning.

6.5.1 Computational Graphs
Graph representation of a computation
node: a variable
edge from x to y: y is computed by applying an operation to the variable x

6.5.2 Chain Rule of Calculus
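Let x be a real number. Suppose that y = g(x), z = f(y). Then
$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
Suppose that $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, y = g(x), z = f(y). Then
$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$
In vector notation,
$\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^\top \nabla_y z$
where $\frac{\partial y}{\partial x}$ is the n × m Jacobian matrix of g.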
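The vector form of the chain rule can be checked against finite differences. In the sketch below, the functions g (from R^3 to R^2) and f (from R^2 to R) are arbitrary illustrative choices, and their Jacobian and gradient are written out by hand.

```python
import numpy as np

def g(x):
    # y = g(x), mapping R^3 -> R^2.
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0]])

def f(y):
    # z = f(y), mapping R^2 -> R.
    return y[0] ** 2 + 3.0 * y[1]

def jacobian_g(x):
    # n x m = 2 x 3 matrix with entries dy_j / dx_i.
    return np.array([[x[1], x[0], 0.0],
                     [1.0, 0.0, np.cos(x[2])]])

def grad_f(y):
    return np.array([2.0 * y[0], 3.0])

x = np.array([0.5, -1.2, 0.3])

# Chain rule in vector notation: grad_x z = J^T grad_y z.
grad_chain = jacobian_g(x).T @ grad_f(g(x))

# Central finite differences on the composed function as an independent check.
eps = 1e-6
grad_numeric = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(3)
])

print(grad_chain)
print(grad_numeric)
```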

6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
Setting:
Consider a computational graph describing how to compute a single scalar $u^{(n)}$ (e.g. the loss on a training example)
We want to obtain the gradient of $u^{(n)}$ with respect to the $n_i$ input nodes $u^{(1)}, \ldots, u^{(n_i)}$
The nodes of the graph have been ordered in such a way that we can compute their outputs one after the other, starting at $u^{(n_i + 1)}$ and going up to $u^{(n)}$.
$u^{(i)} = f(\mathbb{A}^{(i)})$, where $\mathbb{A}^{(i)}$ is the set of all parent nodes of $u^{(i)}$

Backprop
Algorithm for the forward propagation computation (Algorithm 6.1 in the book)

Backprop
Algorithm for the backward pass, which specifies the actual gradient computation (Algorithm 6.2 in the book)

6.5.4 Backprop Computation in Fully Connected MLP
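Point: apply the chain rule to obtain the derivative $\frac{\partial u^{(n)}}{\partial u^{(j)}}$:
$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{i \in \text{children of } u^{(j)}} \frac{\partial u^{(n)}}{\partial u^{(i)}} \frac{\partial u^{(i)}}{\partial u^{(j)}}$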
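The recurrence can be sketched for a tiny scalar graph. This is an illustrative toy, not the book's algorithm verbatim: the node table format and the example graph are hypothetical, nodes are 0-indexed, and each node pushes its gradient to its parents, which accumulates the same sum over children as the formula.

```python
import math

# Each non-input node: (function, partial derivatives w.r.t. each parent, parent indices).
# Graph: u0, u1 are inputs; u2 = u0 * u1; u3 = sin(u0); u4 = u2 + u3 is the output.
nodes = [
    None,
    None,
    (lambda a, b: a * b, [lambda a, b: b, lambda a, b: a], [0, 1]),
    (lambda a: math.sin(a), [lambda a: math.cos(a)], [0]),
    (lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0], [2, 3]),
]

def forward(inputs):
    # Compute node values in the given (topological) order.
    u = list(inputs) + [None] * (len(nodes) - len(inputs))
    for i in range(len(inputs), len(nodes)):
        fn, _, parents = nodes[i]
        u[i] = fn(*(u[p] for p in parents))
    return u

def backward(u, n_inputs):
    # grad[j] accumulates d u_out / d u_j; visit nodes from the output downward.
    grad = [0.0] * len(nodes)
    grad[-1] = 1.0
    for i in range(len(nodes) - 1, n_inputs - 1, -1):
        _, partials, parents = nodes[i]
        args = [u[p] for p in parents]
        for partial, p in zip(partials, parents):
            grad[p] += grad[i] * partial(*args)
    return grad[:n_inputs]

u = forward([2.0, 3.0])
grads = backward(u, n_inputs=2)
print(u[-1], grads)
```

Here u0 reaches the output through two paths (via u2 and via u3), so its gradient is the sum of both contributions: u1 + cos(u0).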

Consider a computational graph of a fully connected multilayer perceptron (MLP).

6.5.5 Symbol-to-Symbol Derivatives
symbol-to-number differentiation
takes a computational graph and a set of numerical input values
returns a set of numerical gradient values at those input values
used by Torch and Caffe
symbol-to-symbol differentiation
takes a computational graph
adds nodes to the graph that give a symbolic description of the desired derivatives
used by Theano and TensorFlow

6.5.6 General Back-Propagation
To compute the gradient of a scalar z with respect to one of its ancestors x:
the gradient of z with respect to z: $\frac{dz}{dz} = 1$. This is the current gradient.
the gradient of z with respect to its parent: (the current gradient) × (Jacobian of the operation that produced z)
the gradient of z with respect to its grandparent: (the current gradient) × (Jacobian of the operation that produced the parent)
...
if we reach a node through multiple paths, simply sum the gradients arriving from the different paths

More formally...
Assume the subroutines below:
get_operation(V): returns the operation that computes V (represented by the edges coming into V)
get_consumers(V, g): returns the list of V's children in the graph g
get_inputs(V, g): returns the list of V's parents in the graph g
Each operation op has the methods below:
op.f(inputs): implementation of the operation
op.bprop(inputs, X, G): implementation of the chain rule.
X: the input whose gradient we wish to compute.
G: the gradient on the output of the operation.
returns $\sum_i (\nabla_X \text{op.f}(\text{inputs})_i) G_i$.
E.g. suppose we use an operation (op) C = AB, and G is the gradient of a scalar z with respect to C. We can call op.bprop((A, B), A, G) to get the gradient with respect to A, which is given by $G B^\top$.
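The matrix multiplication example can be sketched as a small class that follows the op.f / op.bprop interface described above. The class name MatMul and the scalar test function z = sum(R * C) are hypothetical choices made for this illustration; the gradient with respect to B, A⊤G, is the standard counterpart of GB⊤.

```python
import numpy as np

class MatMul:
    """Hypothetical operation C = A B with the f / bprop interface from the text."""

    def f(self, inputs):
        A, B = inputs
        return A @ B

    def bprop(self, inputs, X, G):
        # G is the gradient of a scalar z with respect to the output C.
        A, B = inputs
        if X is A:
            return G @ B.T   # gradient with respect to A
        if X is B:
            return A.T @ G   # gradient with respect to B
        raise ValueError("X must be one of the operation's inputs")

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
R = rng.normal(size=(2, 4))   # z = sum(R * C), so dz/dC = R

op = MatMul()
G = R
grad_A = op.bprop((A, B), A, G)
grad_B = op.bprop((A, B), B, G)

def z_of(A_, B_):
    return np.sum(R * op.f((A_, B_)))

# Central finite differences on each entry of A as an independent check.
eps = 1e-6
numeric_A = np.zeros_like(A)
for idx in np.ndindex(*A.shape):
    step = np.zeros_like(A)
    step[idx] = eps
    numeric_A[idx] = (z_of(A + step, B) - z_of(A - step, B)) / (2 * eps)

print(np.max(np.abs(grad_A - numeric_A)))
```

Each gradient has the same shape as the input it refers to, which is a quick sanity check when writing bprop methods for new operations.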

6.5.7 Example: Back-Propagation for MLP Training
...skipped...

6.5.8 Complications
Actual implementations of backprop have to be more complex...
Most implementations need to support operations that can return more than one tensor.
controlling memory consumption
handling various data types (32-bit floating point, 64-bit floating point, integer, ...)
tracking operations whose gradients are undefined

6.5.9 Differentiation outside the DL Community
...skipped...

6.5.10 Higher Order Derivatives
...skipped...

6.6 Historical Notes
17th century: the chain rule
19th century: the gradient descent technique
1940s: machine learning models (e.g. the perceptron) based on linear models.
Critics (e.g. Minsky) pointed out the flaws of the linear model family, which led to a backlash against the entire NN approach.
1960s-70s: efficient applications of the chain rule
1980s: applying the chain rule to the learning of nonlinear functions in NNs
PDP (Rumelhart et al., 1986): popularization of backprop and multilayer NNs, and of "connectionism".
early 1990s: a peak of NN research.
2006: beginning of the modern DL renaissance

Why did NN performance improve between 1986 and 2015?
Two main factors are:
larger datasets
larger networks, made possible by more powerful computers and better software
A small number of algorithmic changes have also improved NN performance.
loss function: replacement of MSE with the cross-entropy family
hidden units: replacement of sigmoid units with piecewise linear units (e.g. ReLU)

Even after 2006, feedforward networks continued to have a bad reputation.
It was widely believed that feedforward networks would not perform well unless they were assisted by other models
Since 2012, feedforward networks with gradient-based learning have been viewed as a powerful technology. They continue to have unfulfilled potential.
