Deep Feedforward Networks
Goodfellow, Bengio, & Courville (2016) Deep Learning, Chap 6.
Shigeru ONO (Insight Factory)
DL 読書会: 2020/07
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 1 / 50
TOC
1 6.1 Example: Learning XOR
2 6.2 Gradient-Based Learning
3 6.3 Hidden Units
4 6.4 Architecture Design
5 6.5 Back-Propagation and Other Differentiation Algorithms
6 6.6 Historical Notes
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 2 / 50
(introduction)
deep feedforward network
aka: feedforward neural network, multilayer perceptrons (MLP)
Purpose: to approximate some function f∗
no feedback connections
Why is it called ”network”?
represented by composing many different functions
1st layer, 2nd layer, ..., output layer
Why is it called ”neural”?
Hidden layers consist of vector-to-vector functions
We can think of the layer as consisting of many units (vector-to-scalar
functions) that act in parallel
Each unit resembles a neuron
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 3 / 50
(introduction)
To extend linear models, we can apply the linear model to a nonlinearly
transformed input ϕ(x). How to choose ϕ?
use a generic ϕ. E.g. infinite-dimensional ϕ in kernel machines
manually engineer ϕ.
learn ϕ ... the strategy of DL
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 4 / 50
6.1 Example: Learning XOR
Target function f∗: the XOR function
Training set: X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, [1, 1]⊤}
Model: f(x; θ)
Loss function: MSE, J(θ) = (1/4) ∑_{x∈X} (f∗(x) − f(x; θ))²
↓
Linear model cannot represent the XOR function.
↓
Learn a different feature space, where a linear model can represent the solution.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 5 / 50
6.1 Example: Learning XOR
We introduce a simple feedforward network:
f(x; W, c, w, b) = f^(2)(f^(1)(x; W, c); w, b)
In most neural networks, f^(1) applies an affine transformation controlled by learned parameters, followed by a nonlinearity:
f^(1)(x; W, c) = g(W⊤x + c)
g is typically an element-wise function. In modern neural networks, the default recommendation for g is the rectified linear unit (ReLU):
g(z) = max{0, z}
Our complete network:
f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 6 / 50
6.1 Example: Learning XOR
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 7 / 50
6.1 Example: Learning XOR
Let X be the design matrix whose rows are the four input points:
X = [0 0; 0 1; 1 0; 1 1]
Set
W = [1 1; 1 1],  c = [0, −1]⊤,  w = [1, −2]⊤
Applying max{0, W⊤x + c} to each row x of X gives the hidden representations
[0 0; 1 0; 1 0; 2 1]
Multiplying by w, we get the correct answers [0 1 1 0]⊤.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 8 / 50
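A minimal NumPy sketch (my own check, not code from the slides or the book) that verifies the hand-picked parameters above solve XOR:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # design matrix
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

H = np.maximum(0, X @ W + c)     # hidden layer: ReLU applied row-wise
y_hat = H @ w + b                # linear output layer
print(H)       # [[0 0] [1 0] [1 0] [2 1]]
print(y_hat)   # [0 1 1 0], the XOR of the two inputs
```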
6.2 Gradient-Based Learning
The largest difference between linear models and NNs is that most interesting loss functions for NNs are nonconvex.
NNs are usually trained by gradient-based optimizers (rather than the linear equation solvers or convex optimizers used for linear models).
convergence is not guaranteed.
sensitive to the initial parameter values.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 9 / 50
6.2.1 Cost Functions
1. learning conditional distribution
Most modern NN models define p(y|x; θ) and simply use the ML principle.
The cost function is the negative log-likelihood (= the cross-entropy between the training data and the model distribution):
J(θ) = −E_{x,y∼p̂_data} log p_model(y|x)
Advantage: specifying a model p(y|x) automatically determines a cost function log p(y|x).
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 10 / 50
6.2.1 Cost Functions
2. learning conditional statistics
In some cases we want to learn just one conditional statistic of y given x.
We can view the cost function as being a functional (= a mapping from functions to real numbers) rather than a function.
For example, when we wish to predict the mean of y, we can design the cost functional to have its minimum lie at the function that maps x to E[y|x].
Solving an optimization problem with respect to a function requires the calculus of variations (変分法).
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 11 / 50
6.2.1 Cost Functions
(cont’d)
Suppose the optimization problem is
f∗ = arg min_f E_{x,y∼p_data} ||y − f(x)||²
If this function lies within the class we optimize over, it yields
f∗(x) = E_{y∼p_data(y|x)}[y]
i.e.: if we could train on infinitely many samples, minimizing the MSE cost ||y − f(x)||² would give a function that predicts E[y|x] for each value of x.
Similarly, minimizing the MAE cost ||y − f(x)||₁ would give a function that predicts the median of y for each value of x.
MSE & MAE often lead to poor results when used with gradient-based optimization. The cross-entropy cost is more popular, even when we do not need to estimate p(y|x).
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 12 / 50
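A quick numerical sketch of the claim above, using a single constant predictor on toy data (an assumed setup of my own, not from the book): minimizing MSE selects the mean, minimizing MAE selects the median.

```python
import numpy as np

y = np.array([0.0, 1.0, 1.0, 10.0])          # skewed toy data
cands = np.linspace(-1, 11, 2401)            # candidate constant predictors

mse = ((y[None, :] - cands[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - cands[:, None]).mean(axis=1)

print(cands[mse.argmin()], y.mean())         # 3.0  3.0  (the mean)
print(cands[mae.argmin()], np.median(y))     # 1.0  1.0  (the median)
```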
6.2.2 Output Units
1. linear units
output units based on an affine transformation w/o nonlinearity:
ŷ = W⊤h + b
often used to produce the mean of a conditional Gaussian distribution.
do not saturate. Suitable for gradient-based optimization.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 13 / 50
6.2.2 Output Units
2. sigmoid units
ŷ = σ(w⊤h + b)
where σ(x) = 1 / (1 + exp(−x)) (the logistic sigmoid function).
For binary y, the NN needs to predict P(y = 1|x).
The ML approach is to define a Bernoulli distribution conditioned on x.
One possibility is to define
P(y = 1|x) = max{0, min{1, w⊤h + b}}
but it has no gradient outside [0, 1].
Let P̃(y) be the unnormalized probability of y. Assume that log P̃(y) = yz (i.e. log P̃(y = 0|z) = 0, log P̃(y = 1|z) = z). Then we can derive the Bernoulli distribution:
P(y) = exp(yz) / (exp(0) + exp(z)) = σ((2y − 1)z)
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 14 / 50
6.2.2 Output Units
(cont’d)
The loss function for ML is
J(θ) = − log σ((2y − 1)z) = ζ((1 − 2y)z)
where ζ(x) = log(1 + exp(x)) (the softplus function). It saturates only when (1 − 2y)z is very negative (i.e. when the model already has the right answer).
Other loss functions (e.g. MSE) can saturate anytime σ(z) saturates. So ML
is preferred.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 15 / 50
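A sketch of this loss written through the softplus identity −log σ((2y − 1)z) = ζ((1 − 2y)z); the function names are mine and the log1p trick is a standard stability device, not the book's code.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)) computed without overflow for large |x|
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

def bernoulli_nll(y, z):
    # y in {0, 1}, z = w^T h + b (the logit)
    return softplus((1 - 2 * y) * z)

print(bernoulli_nll(1, 10.0))   # ~0: confident and correct, loss saturates near 0
print(bernoulli_nll(0, 10.0))   # ~10: confident and wrong, gradient does not vanish
```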
6.2.2 Output Units
3. softmax units
Now we wish to generalize the sigmoid function to the case of a discrete variable with n values. We need to produce a vector ŷ with ŷ_i = P(y = i|x).
Assume we can predict an unnormalized log-probability vector z with z_i = log P̃(y = i|x). We can obtain the desired ŷ as
ŷ_i = softmax(z)_i = exp(z_i) / ∑_{j=1}^{n} exp(z_j)
In the ML approach we wish to maximize the log-likelihood log P(y = i|z) = log softmax(z)_i, where
log softmax(z)_i = z_i − log ∑_j exp(z_j)
Many objective functions other than the log-likelihood do not work as well
with the softmax function.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 16 / 50
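A small sketch of computing log softmax(z)_i = z_i − log ∑_j exp(z_j) in a numerically stable way by subtracting max(z) first; this is a common trick, not code from the book.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                       # shift; softmax is invariant to it
    return z - np.log(np.exp(z).sum())

z = np.array([1000.0, 1001.0, 1002.0])    # naive exp(z) would overflow
print(log_softmax(z))                     # approx [-2.41 -1.41 -0.41]
print(np.exp(log_softmax(z)).sum())       # 1.0
```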
6.2.2 Output Units
(cont’d)
z can be produced as z = W⊤h + b, but this actually overparameterizes the distribution.
Alternatively, we can impose a requirement that one element of z be fixed. In practice it rarely makes a difference.
Origin of the name: "soft" means it is continuous and differentiable. It would perhaps be better to call it "softargmax".
4. Other output types
...skipped ...
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 17 / 50
6.3 Hidden Units
ReLU is an excellent default choice
We can disregard whether the activation function is differentiable at all input points or not
most hidden units:
(1) accept a vector x,
(2) compute an affine transformation z = W⊤x + b, and
(3) apply an element-wise nonlinear function g(z).
They are distinguished only by the choice of g(z).
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 18 / 50
6.3.1 ReLU and their generalizations
ReLU (Rectified linear units):
the activation function is g(z) = max{0, z}
typically used after an affine transformation:
h = g(W⊤x + b)
when initializing, set all elements of b to a small positive value (e.g. 0.1)
Drawback: ReLU units cannot learn on examples for which their activation is zero
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 19 / 50
6.3.1 ReLU and their generalizations
Generalization of ReLU:
h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
Absolute value rectification: α_i = −1; g(z) = |z|. Used for object recognition.
leaky ReLU: fixes α_i to a small value like 0.01
parametric ReLU: treats α_i as a learnable parameter
maxout units: g(z)_i = max_{j∈G^(i)} z_j, where G^(i) is a group of k elements of z.
maxout units can learn a piecewise linear, convex function
each unit is parameterized by k weight vectors
each unit is driven by multiple filters, so it resists "catastrophic forgetting" (forgetting how to perform previously trained tasks)
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 20 / 50
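A short sketch of the generalization h_i = max(0, z_i) + α_i min(0, z_i) and of a maxout unit over groups of k elements; the function names and the grouping scheme are my own illustration, not the book's code.

```python
import numpy as np

def general_relu(z, alpha):
    return np.maximum(0, z) + alpha * np.minimum(0, z)

def maxout(z, k):
    # g(z)_i = max over a group of k consecutive elements of z
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(general_relu(z, 0.0))    # ReLU:        [ 0.    0.    0.5  2. ]
print(general_relu(z, 0.01))   # leaky ReLU:  [-0.02 -0.005 0.5  2. ]
print(general_relu(z, -1.0))   # |z|:         [ 2.    0.5   0.5  2. ]
print(maxout(z, 2))            # groups of 2: [-0.5  2. ]
```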
6.3.2 Logistic Sigmoid and Hyperbolic Tangent
logistic sigmoid activation function: g(z) = σ(z)
mainly used prior to the introduction of ReLU
saturate across most of the domain. Gradient-Based learning is difficult
now mainly used for output units (or in settings other than feedforward networks)
hyperbolic tangent activation function: g(z) = tanh(z) = 2σ(2z) − 1
mainly used prior to the introduction of ReLU
performs better than the logistic sigmoid
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 21 / 50
6.3.3 Other Hidden Units
A wide variety of differentiable functions perform well.
identity function. (It is acceptable for some layers to be purely linear)
softmax function
radial basis function: h_i = exp(−(1/σ_i²) ||W_{:,i} − x||²)
softplus function (generally discouraged)
hard tanh g(a) = max(−1, min(1, a))
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 22 / 50
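A sketch of the listed nonlinearities; the parameter names are mine, and this is illustrative code rather than the book's.

```python
import numpy as np

def rbf(x, W, sigma):
    # h_i = exp(-||W[:, i] - x||^2 / sigma_i^2); columns of W are centers
    d2 = ((W - x[:, None]) ** 2).sum(axis=0)
    return np.exp(-d2 / sigma ** 2)

def softplus(z):
    return np.logaddexp(0.0, z)          # log(1 + exp(z)), computed stably

def hard_tanh(a):
    return np.maximum(-1.0, np.minimum(1.0, a))

x = np.array([0.5, -0.5])
W = np.array([[0.0, 1.0], [0.0, -1.0]])  # two units with centers (0,0) and (1,-1)
print(rbf(x, W, np.array([1.0, 1.0])))
print(softplus(np.array([-2.0, 0.0, 2.0])))
print(hard_tanh(np.array([-3.0, 0.2, 3.0])))
```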
6.4 Architecture Design
architecture: the overall structure of network.
Most NN are organized into layers.
1st layer: h^(1) = g^(1)(W^(1)⊤ x + b^(1))
2nd layer: h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2))
...
Main considerations:
Depth: number of layers
Width: number of units in each layer
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 23 / 50
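A minimal sketch of the generic layered forward pass h^(k) = g^(k)(W^(k)⊤ h^(k−1) + b^(k)), with depth and width set by the parameter list; shapes and names are my own toy choices.

```python
import numpy as np

def forward(x, params, activations):
    # params: list of (W, b) per layer; activations: list of g per layer
    h = x
    for (W, b), g in zip(params, activations):
        h = g(W.T @ h + b)
    return h

relu = lambda z: np.maximum(0, z)
identity = lambda z: z

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 4)), np.zeros(4)),   # layer 1: 3 inputs -> 4 units
          (rng.normal(size=(4, 1)), np.zeros(1))]   # layer 2: 4 units -> 1 output
print(forward(rng.normal(size=3), params, [relu, identity]))
```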
6.4.1 Universal Approximation Properties and Depth
Universal approximation theorem:
if a feedforward network has
(1) a linear output layer,
(2) at least one hidden layer with any ”squashing” activation function, and
(3) enough hidden units,
then it can approximate any Borel measurable function from one
finite-dimensional space to another, with any desired nonzero amount of error
"squashing" function ... e.g. the logistic sigmoid
Borel measurable function ... including any continuous function on a closed and bounded subset of R^n
In other words: a large feedforward network will be able to represent any
function we are trying to learn
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 24 / 50
6.4.1 Universal Approximation Properties and Depth
But... "represent" ≠ "learn".
MLP may fail to find parameters or choose the wrong functions.
Even if one hidden layer is enough, the layer may be infeasibly large.
Towards deeper models:
In many circumstances, using deeper models can reduce the number of units.
Statistical reason: Choosing a deep model means that we believe the learning
problem consists of discovering a set of underlying factors, which can be
described in terms of simpler underlying factors
Some experiments suggest that deep architectures express a useful prior
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 25 / 50
6.4.3 Other Architectural Considerations
...skipped...
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 26 / 50
6.5 Back-Propagation and Other Differentiation Algorithms
forward propagation (順伝播):
The input x provides the initial information
It propagates up through the hidden layers and finally produces ŷ
Forward propagation can continue until it produces a scalar cost J(θ)
back-propagation algorithm (backprop, 誤差逆伝播法):
The cost J(θ) provides the initial information
It flows backward through the network in order to compute the gradients
a simpler procedure than evaluating the gradient analytically
It is not the whole learning algorithm, but the method for computing the
gradient. Another algorithm (e.g. stochastic gradient descent) is used to
perform learning.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 27 / 50
6.5.1 Computational Graphs
Graph expression of computation
node: a variable
edge from x to y: an operation applied to the variable x that computes y
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 28 / 50
6.5.1 Computational Graphs
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 29 / 50
6.5.2 Chain Rule of Calculus
Let x be a real number. Suppose that y = g(x), z = f(y). Then
dz/dx = (dz/dy)(dy/dx)
Suppose that x ∈ R^m, y ∈ R^n, y = g(x), z = f(y). Then
∂z/∂x_i = ∑_j (∂z/∂y_j)(∂y_j/∂x_i)
In vector notation,
∇_x z = (∂y/∂x)⊤ ∇_y z
where ∂y/∂x is the n × m Jacobian matrix of g.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 30 / 50
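A numeric sketch of the vector chain rule ∇_x z = (∂y/∂x)⊤ ∇_y z for a hand-picked g: R² → R³ and f: R³ → R, checked against finite differences; the example functions are my own, not from the book.

```python
import numpy as np

def g(x):                      # R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def g_jacobian(x):             # 3 x 2 Jacobian dy/dx
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2 * x[1]]])

def f(y):                      # R^3 -> R
    return (y ** 2).sum()

def grad_f(y):                 # gradient of z w.r.t. y
    return 2 * y

x = np.array([0.7, -1.3])
grad_x = g_jacobian(x).T @ grad_f(g(x))       # the chain rule

eps = 1e-6                                    # finite-difference check
num = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                for e in np.eye(2)])
print(grad_x, num)                            # the two should agree closely
```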
6.5.3 Recursively applying the Chain Rule to Obtain
Backprop
Setting:
Consider a computational graph describing how to compute a single scalar u^(n) (e.g. the loss of a training example)
We want to obtain u^(n)'s gradient with respect to the n_i input nodes u^(1), ..., u^(n_i)
The nodes of the graph have been ordered in such a way that we can compute their outputs one after the other, starting at u^(n_i + 1) and going up to u^(n).
u^(i) = f(A^(i)), where A^(i) is the set of all parent nodes of u^(i)
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 31 / 50
6.5.3 Recursively applying the Chain Rule to Obtain
Backprop
algorithm for the forward propagation computation:
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 32 / 50
6.5.3 Recursively applying the Chain Rule to Obtain
Backprop
algorithm for the backprop that specifies the actual gradient computation:
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 33 / 50
6.5.3 Recursively applying the Chain Rule to Obtain
Backprop
algorithm for the backprop that specifies the actual gradient computation:
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 34 / 50
6.5.4 Backprop Computation in Fully Connected MLP
Point: apply the chain rule to get the derivative ∂u^(n)/∂u^(j):
∂u^(n)/∂u^(j) = ∑_{i ∈ children of u^(j)} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 35 / 50
6.5.4 Backprop Computation in Fully Connected MLP
Consider the computational graph of a fully connected MLP.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 36 / 50
6.5.4 Backprop Computation in Fully Connected MLP
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 37 / 50
6.5.5 Symbol-to-Symbol Derivatives
symbol-to-number differentiation
take a computational graph and a set of numerical input values
return a set of gradient values at those input values
used by Torch and Caffe
symbol-to-symbol derivatives approach
take a computational graph
add additional nodes of symbolic descriptions of the desired derivatives
used by Theano and TensorFlow
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 38 / 50
6.5.5 Symbol-to-Symbol Derivatives
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 39 / 50
6.5.6 General Back-Propagation
To compute the gradient of z with respect to its ancestors x:
the gradient of z with respect to z: dz/dz = 1. This is the current gradient.
the gradient of z with respect to its parent: (the current gradient) × (Jacobian of the operation that produced z)
the gradient of z with respect to its grandparent: (the current gradient) × (Jacobian of the operation that produced the parent)
...
if we reach a node through multiple paths, simply sum the gradients
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 40 / 50
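A toy scalar-only sketch of the procedure just described: start from dz/dz = 1 at the output, multiply by local derivatives while walking back toward the ancestors, and sum contributions that arrive through multiple paths. Class and function names are my own; this is not the book's pseudocode.

```python
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0                 # accumulated d(output)/d(self)

def add(a, b): return Node(a.value + b.value, (a, b), (1.0, 1.0))
def mul(a, b): return Node(a.value * b.value, (a, b), (b.value, a.value))

def backprop(out):
    # order nodes so every node is processed only after all of its consumers
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(out)
    out.grad = 1.0                                 # dz/dz = 1
    for node in reversed(order):                   # from z back toward the inputs
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local       # sum over multiple paths

x = Node(3.0)
z = add(mul(x, x), x)        # z = x*x + x; x reaches z through two paths
backprop(z)
print(z.value, x.grad)       # 12.0  7.0  (dz/dx = 2x + 1)
```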
6.5.6 General Back-Propagation
More formally...
Assume the subroutines below:
get_operation(V): returns the operation that computes V (the edges into V)
get_consumers(V, g): returns the list of V's children in the graph g
get_inputs(V, g): returns the list of V's parents in the graph g
Each operation op has the methods below:
op.f(inputs): the implementation of the operation
op.bprop(inputs, X, G): the implementation of the chain rule.
X: the input whose gradient we wish to compute.
G: the gradient on the output of the operation.
returns ∑_i (∇_X op.f(inputs)_i) G_i
E.g. suppose an operation (op) computes C = AB, and G is the gradient of a scalar z with respect to C. You can call op.bprop((A, B), A, G) to get the gradient with respect to A, which is given by GB⊤.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 41 / 50
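A minimal check of the C = AB example: a hypothetical bprop for the matmul op returns GB⊤ for A and A⊤G for B (the function name and shapes are mine, not a real library API).

```python
import numpy as np

def matmul_bprop(inputs, X, G):
    # hypothetical bprop of the op C = AB: returns the gradient of z w.r.t. X
    A, B = inputs
    if X is A:
        return G @ B.T      # dz/dA = G B^T
    if X is B:
        return A.T @ G      # dz/dB = A^T G

rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 3)), rng.normal(size=(3, 4))
G = np.ones((2, 4))         # pretend z = sum(C), so dz/dC is all ones
print(matmul_bprop((A, B), A, G).shape)   # (2, 3), same shape as A
print(matmul_bprop((A, B), B, G).shape)   # (3, 4), same shape as B
```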
6.5.6 General Back-Propagation
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 42 / 50
6.5.6 General Back-Propagation
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 43 / 50
6.5.7 Example: Back-Propagation for MLP Training
...skipped...
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 44 / 50
6.5.8 Complications
An actual implementation of backprop has to be more complex...
Most implementations need to support operations that can return more than one tensor.
how to control memory consumption
handling of various data types (32-bit fp, 64-bit fp, int, ...)
tracking undefined gradients
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 45 / 50
6.5.9 Differentiation outside the DL Community
...skipped...
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 46 / 50
6.5.10 Higher Order Derivatives
...skipped...
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 47 / 50
6.6 Historical Notes
17c: the chain rule
19c: the gradient descent technique
1940s: machine learning models (e.g. the perceptron) based on linear models.
Critics (e.g. Minsky) pointed out the flaws of the linear model, which led to a backlash against the entire NN approach.
1960s-70s: efficient applications of the chain rule
1980s: applying the chain rule for learning of nonlinear functions in NN
PDP (Rumelhart et al, 1986). Popularization of backprop & multilayer NN,
and of ”connectionism”.
early 1990s: a peak of NN research.
2006: renaissance of modern DL
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 48 / 50
6.6 Historical Notes
Why did NN performance improve in 1986-2015?
Two main factors are:
larger datasets
larger networks with powerful computers and better software
A small number of algorithmic changes have also improved NNs.
loss function: replacement of MSE with the cross-entropy family
hidden units: replacement of sigmoid units with piecewise linear functions (e.g. ReLU)
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 49 / 50
6.6 Historical Notes
Even after 2006, feedforward networks continued to have a bad reputation.
It was widely believed that feedforward networks would not perform well unless they were assisted by other models
Since 2012, feedforward networks with gradient-based learning have been viewed as a powerful technology. They continue to have unfulfilled potential.
Shigeru ONO (Insight Factory) DL Chap.6 DL 読書会: 2020/07 50 / 50
