Deep Feedforward Networks
Goodfellow, Bengio, & Courville (2016) Deep Learning, Chap 6.
Shigeru ONO (Insight Factory)
DL Reading Group: 2020/07
TOC
1 6.1 Example: Learning XOR
2 6.2 Gradient-Based Learning
3 6.3 Hidden Units
4 6.4 Architecture Design
5 6.5 Back-Propagation and Other Differentiation Algorithms
6 6.6 Historical Notes
(introduction)
deep feedforward network
aka: feedforward neural networks, multilayer perceptrons (MLPs)
Purpose: to approximate some function f∗
no feedback connections
Why is it called a "network"?
represented by composing many different functions
1st layer, 2nd layer, ..., output layer
Why is it called "neural"?
Hidden layers consist of vector-to-vector functions
We can think of the layer as consisting of many units (vector-to-scalar functions) that act in parallel
Each unit resembles a neuron
(introduction)
To extend linear models, we can apply the linear model to a nonlinearly
transformed input ϕ(x). How to choose ϕ?
use a generic ϕ. E.g. infinite-dimensional ϕ in kernel machines
manually engineer ϕ.
learn ϕ ... the strategy of DL
6.1 Example: Learning XOR
Target function f∗: the XOR function
Training set: X = { [0, 0]⊤, [0, 1]⊤, [1, 0]⊤, [1, 1]⊤ }
Model: f(x; θ)
Loss function: MSE, J(θ) = (1/4) ∑_{x∈X} (f∗(x) − f(x; θ))²
↓
Linear model cannot represent the XOR function.
↓
Learn a different feature space, where a linear model can represent the solution.
6.1 Example: Learning XOR
We introduce a simple feedforward network:
f(x; W, c, w, b) = f^(2)(f^(1)(x; W, c); w, b)
In most neural networks, an affine transformation controlled by learned parameters is used in f^(1):
f^(1)(x; W, c) = g(W⊤x + c)
g is typically an element-wise function. In modern neural networks, the default recommendation for g is the rectified linear unit (ReLU):
g(z) = max{0, z}
Our complete network:
f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b
6.1 Example: Learning XOR
Let X be the design matrix, with one example per row:
X =
[ 0 0 ]
[ 0 1 ]
[ 1 0 ]
[ 1 1 ]
Set
W =
[ 1 1 ]
[ 1 1 ]
c = [0, −1]⊤, w = [1, −2]⊤
Then the matrix whose rows are max{0, W⊤x + c} for each example x is
[ 0 0 ]
[ 1 0 ]
[ 1 0 ]
[ 2 1 ]
Multiplying by w, we get the correct answers [0 1 1 0]⊤.
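As a sanity check, here is a minimal NumPy sketch of this worked example (the weights are those given above; b is not specified on the slide, so zero is assumed):

```python
import numpy as np

# Verify the worked XOR example above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # design matrix, one example per row
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0                                            # output bias (zero works here)

# For row vectors x^T, W^T x + c becomes x^T W + c, i.e. X @ W + c row-wise.
H = np.maximum(0, X @ W + c)   # rows: [[0,0],[1,0],[1,0],[2,1]]
y_hat = H @ w + b              # [0, 1, 1, 0], the XOR outputs
print(H)
print(y_hat)
```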
6.2 Gradient-Based Learning
The largest difference between linear models and NNs is that the most interesting loss functions for NNs are nonconvex.
NNs are usually trained by gradient-based optimizers (rather than the linear equation solvers or convex optimizers used for linear models).
convergence is not guaranteed.
sensitive to the initial parameter values.
6.2.1 Cost Functions
1. learning conditional distribution
Most modern NN models define p(y|x; θ) and simply use the ML principle.
The cost function is the negative log-likelihood (= the cross-entropy between the training data and the model distribution):
J(θ) = −E_{x,y∼p̂_data} log p_model(y|x)
Advantage: specifying a model p(y|x) automatically determines the cost function −log p(y|x).
6.2.1 Cost Functions
2. learning conditional statistics
In some cases we want to learn just one conditional statistic of y given x.
We can view the cost function as a functional (= a mapping from functions to real numbers) rather than a function.
For example, when we wish to predict the mean of y, we can design the cost functional to have its minimum lie at the function that maps x to E[y|x].
Solving an optimization problem with respect to a function requires the calculus of variations.
6.2.1 Cost Functions
(cont’d)
Suppose the optimization problem is
f∗ = arg min_f E_{x,y∼p_data} ||y − f(x)||²
If this function lies within the class we optimize over, it yields
f∗(x) = E_{y∼p_data(y|x)}[y]
i.e.: if we could train on infinitely many samples, minimizing the MSE cost ||y − f(x)||² would give a function that predicts E[y|x] for each value of x.
Similarly, minimizing the MAE cost ||y − f(x)||_1 would give a function that predicts the median of y for each value of x.
MSE & MAE often lead to poor results when used with gradient-based optimization. Cross-entropy cost is more popular, even when we do not need to estimate p(y|x).
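A small numerical illustration of this claim (my own sketch, not from the slides; NumPy assumed): restricted to constant predictors, the MSE-optimal constant is the mean of y and the MAE-optimal constant is the median.

```python
import numpy as np

# Skewed data, so mean != median; scan constant predictors c.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)

candidates = np.linspace(0, 10, 2001)
mse = [np.mean((y - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y - c)) for c in candidates]

print("MSE minimizer:", candidates[np.argmin(mse)], "mean:", y.mean())
print("MAE minimizer:", candidates[np.argmin(mae)], "median:", np.median(y))
```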
6.2.2 Output Units
1. linear units
output units based on an affine transformation w/o nonlinearity: ŷ = W⊤h + b
often used to produce the mean of a conditional Gaussian distribution.
do not saturate. Suitable for gradient-based optimization.
6.2.2 Output Units
2. sigmoid units
ŷ = σ(w⊤h + b), where σ(x) = 1/(1 + exp(−x)) (the logistic sigmoid function).
For binary y, the NN needs to predict P(y = 1|x).
The ML approach is to define a Bernoulli distribution conditioned on x.
One possibility is to define
P(y = 1|x) = max{0, min{1, w⊤h + b}}
but its gradient is zero whenever w⊤h + b lies outside [0, 1].
Instead, let z = w⊤h + b and let P̃(y) be an unnormalized probability of y. Assume that log P̃(y) = yz (i.e. log P̃(y = 0|z) = 0, log P̃(y = 1|z) = z). Normalizing then yields a Bernoulli distribution:
P(y) = exp(yz) / (exp(0) + exp(z)) = σ((2y − 1)z)
6.2.2 Output Units
(cont’d)
The loss function for ML is
J(θ) = − log σ((2y − 1)z) = ζ((1 − 2y)z)
where ζ(x) = log(1 + exp(x)) (softplus function). It saturates only when
(1 − 2y)z is very negative (i.e. the model already has the right answer).
Other loss functions (e.g. MSE) can saturate anytime σ(z) saturates. So ML
is preferred.
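A hedged sketch of this loss in code (my own illustration, not the book's): the negative log-likelihood −log σ((2y − 1)z) equals ζ((1 − 2y)z), and writing softplus in a stable form avoids overflow for large |z|.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)) computed stably for large |x|
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

def bce_with_logits(y, z):
    # J = softplus((1 - 2y) * z), the maximum-likelihood loss above
    return softplus((1 - 2 * y) * z)

y = np.array([0, 1, 1, 0])
z = np.array([-3.0, 2.5, -40.0, 40.0])   # logits; the last two are badly wrong
print(bce_with_logits(y, z))             # small losses for correct, ~40 for wrong
```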
6.2.2 Output Units
3. softmax units
Now we wish to generalize the sigmoid to the case of a discrete variable with n values. We need to produce a vector ŷ with ŷ_i = P(y = i|x).
Assume we can predict a vector z of unnormalized log probabilities, z_i = log P̃(y = i|x). We can obtain the desired ŷ as
ŷ_i = softmax(z)_i = exp(z_i) / ∑_{j=1}^n exp(z_j)
In the ML approach we wish to maximize the log-likelihood log P(y = i|z) = log softmax(z)_i, where
log softmax(z)_i = z_i − log ∑_j exp(z_j)
Many objective functions other than the log-likelihood do not work as well
with the softmax function.
6.2.2 Output Units
(cont’d)
z can be produced as z = W⊤h + b, but this actually overparameterizes the distribution.
Alternatively, we can impose a requirement that one element of z be fixed. In practice it rarely makes a difference.
Origin of the name: "soft" means it is continuous and differentiable. It would perhaps be better to call it "softargmax".
4. Other output types
...skipped ...
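A minimal sketch (an illustration, not from the slides) of computing log softmax(z)_i = z_i − log ∑_j exp(z_j) in a numerically stable way; subtracting max(z) first uses the shift invariance softmax(z) = softmax(z − c).

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)                       # shift; result is unchanged
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1000.0, 1001.0, 1002.0])      # naive exp(z) would overflow
print(log_softmax(z))                       # ~[-2.41, -1.41, -0.41]
print(np.exp(log_softmax(z)).sum())         # ~1.0
```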
6.3 Hidden Units
ReLU is an excellent default choice
We can disregard whether the activation function is differentiable at all input points or not
Most hidden units
(1) accept a vector x,
(2) compute an affine transformation z = W⊤x + b, and
(3) apply an element-wise nonlinear function g(z).
They are distinguished only by the choice of g(z).
6.3.1 ReLU and their generalizations
ReLU (Rectified linear units):
the activation function is g(z) = max{0, z}
typically used after an affine transformation: h = g(W⊤x + b)
when initializing, set all elements of b to a small positive value (e.g. 0.1)
Drawback: ReLUs cannot learn via gradient-based methods on examples for which their activation is zero
6.3.1 ReLU and their generalizations
Generalization of ReLU:
h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
Absolute value rectification: α_i = −1, giving g(z) = |z|. Used for object recognition.
leaky ReLU: fixes α_i to a small value like 0.01
parametric ReLU: treats α_i as a learnable parameter
maxout units: g(z)_i = max_{j∈G^(i)} z_j, where G^(i) is a group of k elements of z
maxout units can learn a piecewise linear, convex function
each unit is parameterized by k weight vectors
each unit is driven by multiple filters, so it resists "catastrophic forgetting" (forgetting how to perform previously trained tasks)
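The sketch below (my own illustration; the inputs and group size are made up) instantiates the generalization g(z, α)_i = max(0, z_i) + α_i min(0, z_i) and a maxout unit over groups of size k.

```python
import numpy as np

def generalized_relu(z, alpha):
    return np.maximum(0, z) + alpha * np.minimum(0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, 0.0))    # ReLU
print(generalized_relu(z, -1.0))   # absolute value rectification, equals |z|
print(generalized_relu(z, 0.01))   # leaky ReLU

def maxout(z, k):
    # group consecutive elements of z into groups of size k and take the max
    return z.reshape(-1, k).max(axis=1)

print(maxout(np.array([3.0, -1.0, 0.5, 2.0]), k=2))   # [3.0, 2.0]
```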
6.3.2 Logistic Sigmoid and Hyperbolic Tangent
logistic sigmoid activation function: g(z) = σ(z)
mainly used prior to the introduction of ReLU
saturates across most of its domain, so gradient-based learning is difficult
now used mainly for output units (or in settings other than feedforward networks)
hyperbolic tangent activation function: g(z) = tanh(z) = 2σ(2z) − 1
mainly used prior to the introduction of ReLU
performs better than the logistic sigmoid
6.3.3 Other Hidden Units
A wide variety of differentiable functions perform well.
identity function (it is acceptable for some layers to be purely linear)
softmax function
radial basis function: h_i = exp(−(1/σ_i²) ||W_{:,i} − x||²)
softplus function (generally discouraged)
hard tanh: g(a) = max(−1, min(1, a))
6.4 Architecture Design
architecture: the overall structure of network.
Most NN are organized into layers.
1st layer: h^(1) = g^(1)(W^(1)⊤ x + b^(1))
2nd layer: h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2))
...
Main considerations:
Depth: number of layers
Width: number of units in each layer
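A minimal sketch of this chain-of-layers structure (NumPy; the widths, random weights, and the choice of g = ReLU are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def forward(x, params, g=relu):
    # each layer applies h = g(W^T h_prev + b)
    h = x
    for W, b in params:
        h = g(W.T @ h + b)
    return h

# depth 2, widths 3 -> 4 -> 2
params = [(rng.normal(size=(3, 4)), np.zeros(4)),
          (rng.normal(size=(4, 2)), np.zeros(2))]
print(forward(rng.normal(size=3), params))
```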
6.4.1 Universal Approximation Properties and Depth
Universal approximation theorem:
if a feedforward network has
(1) a linear output layer,
(2) at least one hidden layer with any "squashing" activation function, and
(3) enough hidden units,
then it can approximate any Borel measurable function from one finite-dimensional space to another, with any desired nonzero amount of error
"squashing" function ... e.g. the logistic sigmoid
Borel measurable function ... includes any continuous function on a closed and bounded subset of R^n
In other words: a large feedforward network will be able to represent any
function we are trying to learn
6.4.1 Universal Approximation Properties and Depth
But... "represent" ≠ "learn".
MLP may fail to find parameters or choose the wrong functions.
Even if one hidden layer is enough, the layer may be infeasibly large.
Towards deeper models:
In many circumstances, using deeper models can reduce the number of units.
Statistical reason: Choosing a deep model means that we believe the learning
problem consists of discovering a set of underlying factors, which can be
described in terms of simpler underlying factors
Some experiments suggest that deep architectures express a useful prior
6.4.3 Other Architectural Considerations
...skipped...
6.5 Back-Propagation and Other Differentiation Algorithms
forward propagation:
The input x provides the initial information
It propagates up through the hidden layers and finally produces ŷ
Forward propagation can continue until it produces a scalar cost J(θ)
back-propagation algorithm (backprop):
The cost J(θ) provides the initial information
It flows backward through the network in order to compute the gradients
a simpler and cheaper procedure than evaluating an analytical expression for the gradient
It is not the whole learning algorithm, but a method for computing the gradient. Another algorithm (e.g. stochastic gradient descent) is used to perform learning.
6.5.1 Computational Graphs
Graph representation of a computation
node: a variable
edge from x to y: an operation applied to variable x that computes y
6.5.2 Chain Rule of Calculus
Let x be a real number. Suppose that y = g(x), z = f(y). Then
dz/dx = (dz/dy)(dy/dx)
Suppose that x ∈ R^m, y ∈ R^n, y = g(x), z = f(y). Then
∂z/∂x_i = ∑_j (∂z/∂y_j)(∂y_j/∂x_i)
In vector notation,
∇_x z = (∂y/∂x)⊤ ∇_y z
where ∂y/∂x is the n × m Jacobian matrix of g.
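A numerical sketch of the vector chain rule ∇_x z = (∂y/∂x)⊤ ∇_y z (the specific g and f below are made up for illustration); the finite-difference check should agree with the chain-rule gradient.

```python
import numpy as np

def g(x):                      # R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(y):                      # R^2 -> R
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([1.0, 2.0, 0.5])
y = g(x)

J = np.array([[x[1], x[0], 0.0],                  # Jacobian of g at x (2 x 3)
              [0.0, 0.0, np.cos(x[2])]])
grad_y = np.array([2.0 * y[0], 3.0])              # gradient of f at y
grad_x = J.T @ grad_y                             # chain rule

eps = 1e-6                                        # finite-difference check
fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
               for e in np.eye(3)])
print(grad_x, fd)
```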
6.5.3 Recursively applying the Chain Rule to Obtain
Backprop
Setting:
Consider a computational graph describing how to compute a single scalar u^(n) (e.g. the loss on a training example)
We want to obtain the gradient of u^(n) with respect to the n_i input nodes u^(1), ..., u^(n_i)
The nodes of the graph have been ordered in such a way that we can compute their outputs one after the other, starting at u^(n_i+1) and going up to u^(n)
u^(i) = f(A^(i)), where A^(i) is the set of all parent nodes of u^(i)
6.5.3 Recursively applying the Chain Rule to Obtain
Backprop
algorithm for the forward propagation computation:
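The algorithm itself appears only as a figure in the slides; below is a rough Python sketch of the idea (my own rendering with a made-up graph representation, not the book's pseudocode): visit the nodes in topological order and compute each u^(i) = f(A^(i)) from its already-computed parents.

```python
import numpy as np

def forward_propagation(inputs, nodes):
    """inputs: values of u^(1..n_i); nodes: list of (f_i, parent_indices)
    in topological order. Returns all node values; the last one is u^(n)."""
    u = list(inputs)
    for f_i, parents in nodes:
        u.append(f_i(*[u[j] for j in parents]))
    return u

# example graph (0-based indices): u2 = u0 * u1, u3 = sin(u2)
nodes = [(lambda a, b: a * b, (0, 1)),
         (np.sin, (2,))]
print(forward_propagation([1.5, 2.0], nodes))   # [1.5, 2.0, 3.0, sin(3.0)]
```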
6.5.3 Recursively applying the Chain Rule to Obtain
Backprop
algorithm for the backprop that specifies the actual gradient computation:
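The back-propagation procedure is likewise shown only as a figure; the sketch below (again my own rendering, same assumed graph representation as above) walks the non-input nodes in reverse order and accumulates grad[j] += grad[i] · ∂u^(i)/∂u^(j) over the children i of each node j, which is the recursion stated in 6.5.4 below.

```python
import numpy as np

def back_propagation(u, nodes, n_inputs):
    """nodes[k] = (parent_indices, partial_fns) describes node n_inputs + k;
    partial_fns[p](*parent_values) is d(node)/d(parent p).
    Returns the gradient of the final node u^(n) w.r.t. each input."""
    grad = [0.0] * len(u)
    grad[-1] = 1.0                                   # du^(n)/du^(n) = 1
    for k in reversed(range(len(nodes))):            # reverse topological order
        i = n_inputs + k
        parents, partial_fns = nodes[k]
        vals = [u[j] for j in parents]
        for p, j in enumerate(parents):
            grad[j] += grad[i] * partial_fns[p](*vals)
    return grad[:n_inputs]

# same example graph: u2 = u0 * u1, u3 = sin(u2)
u = [1.5, 2.0, 3.0, np.sin(3.0)]
nodes = [((0, 1), [lambda a, b: b, lambda a, b: a]),   # d(ab)/da = b, d(ab)/db = a
         ((2,),   [np.cos])]                           # d sin(c)/dc = cos(c)
print(back_propagation(u, nodes, n_inputs=2))          # [2*cos(3), 1.5*cos(3)]
```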
6.5.4 Backprop Computation in Fully Connected MLP
Point: apply the chain rule to get the derivative ∂u^(n)/∂u^(j):
∂u^(n)/∂u^(j) = ∑_{i ∈ children of u^(j)} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))
6.5.4 Backprop Computation in Fully Connected MLP
Consider the computational graph of a fully connected multilayer perceptron (MLP).
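The fully connected case is shown only as figures on these slides; as a purely illustrative instance, here is backprop written out by hand for a one-hidden-layer ReLU network with loss J = ½||ŷ − y||² (all shapes and data are made up), checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

# forward
a1 = W1.T @ x + b1
h1 = np.maximum(0, a1)
y_hat = W2.T @ h1 + b2
J = 0.5 * np.sum((y_hat - y) ** 2)

# backward (chain rule, layer by layer)
g = y_hat - y                      # dJ/dy_hat
dW2 = np.outer(h1, g)              # dJ/dW2
db2 = g
g = W2 @ g                         # dJ/dh1
g = g * (a1 > 0)                   # dJ/da1 (ReLU derivative)
dW1 = np.outer(x, g)
db1 = g

# finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Jp = 0.5 * np.sum((W2.T @ np.maximum(0, W1p.T @ x + b1) + b2 - y) ** 2)
print(dW1[0, 0], (Jp - J) / eps)   # should roughly agree
```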
6.5.5 Symbol-to-Symbol Derivatives
symbol-to-number differentiation
take a computational graph and a set of numerical input values
return a set of gradient values at those input values
used by Torch and Caffe
symbol-to-symbol derivatives approach
take a computational graph
add additional nodes holding a symbolic description of the desired derivatives
used by Theano and TensorFlow
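As a loose analogy only (not how Theano or TensorFlow are implemented), SymPy shows the symbol-to-symbol flavor: differentiation returns another symbolic expression, which can then be evaluated at any input or differentiated again.

```python
import sympy as sp

# Symbol-to-symbol flavor: the derivative is itself a symbolic expression.
x = sp.symbols('x')
z = sp.sin(x) * x**2
dz_dx = sp.diff(z, x)          # x**2*cos(x) + 2*x*sin(x), a new expression
print(dz_dx)
print(dz_dx.subs(x, 1.0))      # evaluate the derivative expression numerically
print(sp.diff(dz_dx, x))       # higher-order derivatives by differentiating again
```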
6.5.6 General Back-Propagation
To compute the gradient of z with respect to its ancestor x:
the gradient of z with respect to z: dz/dz = 1. This is the current gradient.
the gradient of z with respect to its parent: (the current gradient) × (Jacobian of the operation that produced z)
the gradient of z with respect to its grandparent: (the current gradient) × (Jacobian of the operation that produced the parent)
...
if we reach a node through multiple paths, simply sum the gradients
6.5.6 General Back-Propagation
More formally...
Assume the subroutines below:
get_operation(V): returns the operation that computes V (the edges coming into V)
get_consumers(V, g): returns the list of V's children in the graph g
get_inputs(V, g): returns the list of V's parents in the graph g
Each operation op has the methods below:
op.f(inputs): implementation of the operation
op.bprop(inputs, X, G): implementation of the chain rule.
X: the input whose gradient we wish to compute.
G: the gradient on the output of the operation.
returns ∑_i (∇_X op.f(inputs)_i) G_i
E.g. consider an operation (op) computing C = AB, and let G be the gradient of a scalar z with respect to C. Calling op.bprop((A, B), A, G) gives the gradient with respect to A, which is GB⊤.
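A minimal sketch of this op interface for the matmul example (the class and method names follow the slide's description; it is an illustration, not any particular library's API):

```python
import numpy as np

class MatMul:
    def f(self, A, B):
        return A @ B                # C = AB

    def bprop(self, inputs, X, G):
        # chain rule for C = AB: dz/dA = G B^T, dz/dB = A^T G
        A, B = inputs
        if X is A:
            return G @ B.T
        if X is B:
            return A.T @ G
        raise ValueError("X must be one of the inputs")

op = MatMul()
A, B = np.ones((2, 3)), np.ones((3, 4))
G = np.ones((2, 4))                  # gradient of some scalar z w.r.t. C
print(op.bprop((A, B), A, G).shape)  # (2, 3), same shape as A
```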
6.5.7 Example: Back-Propagation for MLP Training
...skipped...
6.5.8 Complications
Actual implementation of backprop has to be more complex...
Most implementations need to support operations that can return more than one tensor.
how to control memory consumption
handling of various data types (32-bit fp, 64-bit fp, int, ...)
tracking undefined gradients
6.5.9 Differentiation outside the DL Community
...skipped...
6.5.10 Higher Order Derivatives
...skipped...
6.6 Historical Notes
17c: the chain rule
19c: the gradient descent technique
1940s: machine learning models (e.g. the perceptron) based on linear models.
Critics (e.g. Minsky) pointed out the flaws of linear models, which led to a backlash against the entire NN approach.
1960s-70s: efficient applications of the chain rule
1980s: applying the chain rule for learning of nonlinear functions in NN
PDP (Rumelhart et al, 1986). Popularization of backprop & multilayer NN,
and of ”connectionism”.
early 1990s: a peak of NN research.
2006: renaissance of modern DL
6.6 Historical Notes
Why did NN performance improve in 1986-2015?
Two main factors are:
larger datasets
larger networks with powerful computers and better software
A small number of algorithmic changes have also improved NNs.
loss function: replacement of MSE with the cross-entropy family
hidden units: replacement of sigmoids with piecewise linear functions (e.g. ReLU)
6.6 Historical Notes
Even after 2006, feedforward networks continued to have a bad reputation.
It was widely believed that feedforward networks would not perform well unless they were assisted by other models.
Since 2012, feedforward networks with gradient-based learning have been viewed as a powerful technology. They continue to have unfulfilled potential.