Regularization for Deep Learning
Goodfellow, Bengio, & Courville (2016) Deep Learning, Chap 7.
Shigeru ONO (Insight Factory)
DL Reading Group: 2020/08
Shigeru ONO (Insight Factory) DL Chap.7 DL 読書会: 2020/08 1 / 40
TOC
1 7.1 Parameter Norm Penalties
2 7.2 Norm Penalties as Constrained Optimization
3 7.3 Regularization and Under-Constrained Problems
4 7.4 Dataset Augmentation
5 7.5 Noise Robustness
6 7.6 Semi-Supervised Learning
7 7.7 Multitask Learning
8 7.8 Early Stopping
9 7.9 Parameter Tying and Parameter Sharing
10 7.10 Sparse Representation
11 7.11 Bagging and Other Ensemble Methods
12 7.12 Dropout
13 7.13 Adversarial Training
14 7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier
(introduction)
Regularization:
any modification we make to a learning algorithm that is intended to reduce
its generalization error
possibly at the expense of increasing training error
In the context of DL, most regularization strategies are based on regularizing
estimators
Possible situations (see Chap. 5), where DGP stands for data-generating process:
(1) the model family excludes the true DGP (underfitting)
(2) the model family matches the true DGP
(3) the model family includes the true DGP but also many other possible DGPs
(overfitting)
The goal of regularization is to take the model from (3) into (2). But...
In most applications of DL, the true DGP is outside the model family (=(1)).
Controlling the complexity of the model is not to find the model of the right
size, but to find the model with appropriate regularization in which
generalization error is minimized.
7.1 Parameter Norm Penalties
Adding a parameter norm penalty Ω(θ) to the objective function J.
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$$
α(≥ 0): weight of the relative contribution of Ω.
For NN, we typically choose Ω that penalizes only w (the weights of the
affine transformation at each layer).
A separate α per layer is possible, but it is reasonable to use the same α at all layers to keep the hyperparameter search small.
7.1.1 L2 Parameter Regularization
$$\Omega(\theta) = \frac{1}{2} \|w\|_2^2$$
aka. weight decay, ridge regression, Tikhonov regularization.
Bayesian interpretation: MAP inference with a Gaussian prior on the weights.
(See 5.6.1)
Total objective function:
$$\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2} w^\top w$$
Parameter gradient:
$$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$$
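What happens in a single gradient step (with learning rate ϵ)? The learning rule is modified to
shrink w by the constant factor (1 − ϵα) on each step, before the usual gradient update.
$$w \leftarrow (1 - \epsilon\alpha) w - \epsilon \nabla_w J(w; X, y)$$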
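The equivalence between "gradient step on the penalized objective" and "shrink, then ordinary gradient step" can be checked numerically. The following is a minimal sketch; the toy loss `grad_J` and all numeric values are assumptions made only for illustration.

```python
import numpy as np

def grad_J(w):
    # Gradient of a hypothetical unregularized loss J(w) = 0.5 * ||w - 1||^2
    return w - 1.0

def step_penalized(w, eps, alpha):
    # Plain gradient step on J~(w) = J(w) + (alpha / 2) * w^T w
    return w - eps * (alpha * w + grad_J(w))

def step_weight_decay(w, eps, alpha):
    # The same step written as "shrink w, then take the usual gradient step"
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

w0 = np.array([2.0, -3.0, 0.5])
a = step_penalized(w0, eps=0.1, alpha=0.01)
b = step_weight_decay(w0, eps=0.1, alpha=0.01)
```

Both functions return the same vector, which is why L2 regularization is also called weight decay.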
7.1.1 L2 Parameter Regularization
What happens over the entire course of training? (in general)
Unregularized:
Let $w^*$ be the weights that minimize the unregularized objective function:
$$w^* = \arg\min_w J(w)$$
Make a quadratic approximation to $J(w)$ in the neighborhood of $w^*$:
$$\hat{J}(w) = J(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*)$$
where $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. There is no first-order term because the gradient vanishes at the minimum $w^*$.
The minimum of $\hat{J}$ occurs where its gradient $\nabla_w \hat{J}(w) = H(w - w^*)$ is 0.
7.1.1 L2 Parameter Regularization (Cont’d)
Regularized:
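Let $\tilde{w}$ be the weights that minimize the regularized objective function $\tilde{J}$.
The minimum of $\tilde{J}$ occurs where $\alpha \tilde{w} + H(\tilde{w} - w^*) = 0$.
It follows that $\tilde{w} = (H + \alpha I)^{-1} H w^*$.
$H$ is real and symmetric, so it has an eigenvalue decomposition $H = Q \Lambda Q^\top$.
Then $\tilde{w} = Q (\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*$, i.e. weight decay rescales $w^*$ along the axes
defined by the eigenvectors of $H$: the component along the i-th eigenvector is scaled by $\lambda_i / (\lambda_i + \alpha)$.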
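As a sanity check of the two expressions for the regularized solution, here is a small sketch. The Hessian `H` is a randomly generated symmetric positive definite matrix and `w_star` is random; both are stand-ins chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)      # hypothetical symmetric positive definite Hessian
w_star = rng.normal(size=4)        # hypothetical unregularized optimum
alpha = 0.5

# Direct form: solve (H + alpha I) w = H w*
w_direct = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Eigen form: rescale each eigen-direction by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
scale = lam / (lam + alpha)
w_eig = Q @ (scale * (Q.T @ w_star))
```

Directions with large eigenvalues (where the loss is strongly curved) have a scale close to 1 and are barely affected, while directions with eigenvalues much smaller than α are shrunk toward zero.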
7.1.1 L2 Parameter Regularization
What happens over the entire course of training? (in the case of linear regression)
Unregularized:
Cost function: $(Xw - y)^\top (Xw - y)$
Solution: $w = (X^\top X)^{-1} X^\top y$
Regularized:
Cost function: $(Xw - y)^\top (Xw - y) + \frac{1}{2} \alpha w^\top w$
Solution: $w = (X^\top X + \alpha I)^{-1} X^\top y$ (with a constant factor absorbed into α).
i.e. the regularization causes the algorithm to "perceive" that X has higher
variance (than the variance it really has).
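The closed-form solutions can be compared on synthetic data. This is a sketch under stated assumptions: the data are random, and the cost is taken as 0.5·||Xw − y||² + 0.5·α·||w||² (a factor of 1/2 on both terms) so that the formula with X⊤X + αI is the exact minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # hypothetical design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
alpha = 5.0

# Unregularized least squares: (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# L2-regularized solution: (X^T X + alpha I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# Gradient of 0.5*||Xw - y||^2 + 0.5*alpha*||w||^2 at the regularized solution
grad_at_ridge = X.T @ (X @ w_ridge - y) + alpha * w_ridge
```

The gradient vanishes at `w_ridge`, and the regularized weights have a smaller norm than the least-squares weights.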
7.1.2 L1 Regularization
$$\Omega(\theta) = \|w\|_1 = \sum_i |w_i|$$
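Total objective function:
$$\tilde{J}(w; X, y) = J(w; X, y) + \alpha \|w\|_1$$
Parameter gradient:
$$\nabla_w \tilde{J}(w; X, y) = \alpha\,\mathrm{sign}(w) + \nabla_w J(w; X, y)$$
This does not necessarily admit a clean algebraic solution.
For a simple linear model with a quadratic cost function, the gradient of the unregularized cost is
$$\nabla_w \hat{J}(w) = H(w - w^*)$$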
7.1.2 L1 Regularization (Cont’d)
Assume the Hessian is diagonal, H = diag([H11, . . . , Hnn]) (i.e. no correlation
between the input features)
Then we have a quadratic approximation of the regularized cost function:
$$\hat{J}(w; X, y) = J(w^*; X, y) + \sum_i \left[ \frac{1}{2} H_{ii} (w_i - w_i^*)^2 + \alpha |w_i| \right]$$
The solution is:
$$w_i = \mathrm{sign}(w_i^*) \max\left\{ |w_i^*| - \frac{\alpha}{H_{ii}}, 0 \right\}$$
Consider the situation where $w_i^* > 0$ for all i. Then
When $w_i^* \le \alpha / H_{ii}$, the optimal value is $w_i = 0$.
When $w_i^* > \alpha / H_{ii}$, the optimal value is just shifted toward zero by a distance $\alpha / H_{ii}$.
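The per-coordinate solution above is a soft-thresholding rule. A minimal sketch, assuming a diagonal Hessian with positive entries; the function name `l1_solution` and the numeric values are hypothetical.

```python
import numpy as np

def l1_solution(w_star, H_diag, alpha):
    # w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0), applied elementwise
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.3, -2.0, 1.5, -0.1])   # hypothetical unregularized optimum
H_diag = np.array([1.0, 2.0, 0.5, 1.0])     # hypothetical diagonal of the Hessian
w = l1_solution(w_star, H_diag, alpha=0.5)
# The first and last coordinates fall below their thresholds and become zero;
# the other two are shifted toward zero by alpha / H_ii.
```

Unlike the L2 case, where every coordinate is only rescaled, coordinates with small |w*_i| relative to α / H_ii are set exactly to zero.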
7.1.2 L1 Regularization (Cont’d)
In short, the solution is more sparse (i.e. some parameters have an optimal
value of zero).
It has been used as a feature selection mechanism. E.g. LASSO
Bayesian interpretation: MAP inference with an isotropic Laplace prior on the
weights.
7.2 Norm Penalties as Constrained Optimization
We can think of the penalties as constraints.
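Cost function:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$$
If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized
Lagrangian
$$\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha (\Omega(\theta) - k)$$
The solution is
$$\theta^* = \arg\min_\theta \max_{\alpha, \alpha \ge 0} \mathcal{L}(\theta, \alpha)$$
We can fix α at its optimal value α*:
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \alpha^*) = \arg\min_\theta J(\theta; X, y) + \alpha^* \Omega(\theta)$$
This is the same as the problem of minimizing $\tilde{J}$ (the constant $-\alpha^* k$ does not depend on θ).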
7.2 Norm Penalties as Constrained Optimization
Sometimes we may wish to use explicit constraints rather than penalties:
when we know the appropriate value of k
when the penalties can cause optimization to get stuck in local minima
corresponding to small θ.
when we wish to impose some stability on the optimization procedure
Approach:
Srebro & Shraibman (2005): constraining the norm of each column of the
weight matrix of a layer
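One way to apply a per-column norm constraint is to reproject the weight matrix after each gradient step. This is only a sketch of that idea: the function name `clip_column_norms`, the bound `k`, and the small constant guarding against division by zero are assumptions for illustration, and the projection enforces norm ≤ k.

```python
import numpy as np

def clip_column_norms(W, k):
    # Rescale every column whose L2 norm exceeds k back onto the ball of radius k;
    # columns already inside the ball are left unchanged.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, k / np.maximum(norms, 1e-12))
    return W * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3)) * 3.0          # hypothetical layer weights after a step
W_proj = clip_column_norms(W, k=2.0)
W_small = 0.5 * np.eye(3)                  # already satisfies the constraint
```

In a training loop this function would be called right after each parameter update, so the weights never leave the constraint region.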
7.3 Regularization and Under-Constrained Problems
Sometimes regularization is necessary for ML problems to be properly defined.
when the problem depends on $(X^\top X)^{-1}$ but $X^\top X$ is singular. The regularized matrix $X^\top X + \alpha I$ is guaranteed to be invertible.
when a problem with no closed-form solution is underdetermined. E.g., logistic regression
applied to a problem where the classes are linearly separable. If a weight vector w can
achieve perfect classification, 2w will also achieve it, with higher likelihood.
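The separable logistic regression case can be illustrated with a few points. The data set and the separating vector `w` below are made up for this sketch; labels are coded as −1 and +1.

```python
import numpy as np

def log_likelihood(w, X, y):
    # Labels y are in {-1, +1}; log sigmoid(y * x.w) = -log(1 + exp(-y * x.w))
    margins = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -margins))

# Hypothetical linearly separable data
X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])                  # classifies every point correctly

# Log-likelihood keeps increasing as w is scaled up: w, 2w, 4w, 8w
lls = [log_likelihood(c * w, X, y) for c in (1.0, 2.0, 4.0, 8.0)]
```

Because the likelihood increases with every doubling, an unregularized iterative optimizer would keep growing the weights; a norm penalty gives the problem a finite solution.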
7.4 Dataset Augmentation
Idea: Create fake data and add it to the training set.
an effective technique particularly for object recognition. E.g. translating the
training images a few pixels in each direction.
Injecting noise in the input to a NN. It can improve the robustness of NNs.
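Translating an image by a few pixels can be sketched as follows. The helper `translate` and the zero fill for uncovered pixels are assumptions for illustration; real pipelines may pad or crop differently.

```python
import numpy as np

def translate(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling uncovered pixels with 0."""
    h, w = img.shape
    out = np.zeros_like(img)
    # Part of the source image that remains visible after the shift
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

img = np.arange(16.0).reshape(4, 4)       # hypothetical 4x4 training image
# One shifted copy per offset in {-1, 0, 1} x {-1, 0, 1}, all sharing the label of img
augmented = [translate(img, dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
```

Each shifted copy keeps the original label, so the classifier is encouraged to be invariant to small translations.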
7.5 Noise Robustness
Idea: Add noise to the weights.
It can be interpreted as a stochastic implementation of Bayesian inference
over the weights.
Noise reflects our uncertainty about the model weights.
It can also be interpreted as equivalent to a more traditional form of
regularization.
Suppose we wish to train a function $\hat{y}(x)$ using the least-squares cost function
$$J = \mathbb{E}_{p(x,y)}[(\hat{y}(x) - y)^2]$$
Assume that we also include a random perturbation $\epsilon_W \sim \mathcal{N}(\epsilon; 0, \eta I)$ of the
network weights.
The objective function becomes
$$\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)}[(\hat{y}_{\epsilon_W}(x) - y)^2]$$
For small η, it is equivalent to J with a regularization term
$\eta\, \mathbb{E}_{p(x,y)}[\|\nabla_W \hat{y}(x)\|^2]$.
It pushes the model into regions where the model is relatively insensitive to small
variations in the weights, finding points that are not merely minima, but
minima surrounded by flat regions.
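For a linear model the relation between weight noise and the gradient-norm penalty can be checked by simulation. This sketch assumes ŷ(x) = w·x, for which ∇_W ŷ(x) = x and the relation holds exactly rather than only for small η; the single data point and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0, 0.5])            # hypothetical input
y = 0.2                                   # hypothetical target
w = np.array([0.4, 0.1, -0.3])            # hypothetical weights
eta = 0.01                                # variance of the weight noise

clean_loss = (w @ x - y) ** 2
# For y_hat = w.x the gradient with respect to w is x, so the penalty is eta * ||x||^2
penalty = eta * np.sum(x ** 2)

# Monte Carlo estimate of the loss with Gaussian noise added to the weights
noise = rng.normal(scale=np.sqrt(eta), size=(200_000, 3))
noisy_loss = np.mean(((w + noise) @ x - y) ** 2)
```

Up to sampling error, `noisy_loss` matches `clean_loss + penalty`.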
7.5.1 Injecting Noise at the Output Targets
Idea: Explicitly model the noise on the y labels.
label smoothing: regularize a model based on a softmax with k output values
by replacing classification target
0 with ϵ/(k − 1)
1 with 1 − ϵ
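The target replacement can be written in a few lines. A minimal sketch; the function name `smooth_labels` and the example values are illustrative.

```python
import numpy as np

def smooth_labels(labels, k, eps):
    # Every wrong class gets eps / (k - 1); the correct class gets 1 - eps
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Two examples with classes 0 and 2, k = 3 output values, eps = 0.1
targets = smooth_labels(np.array([0, 2]), k=3, eps=0.1)
# targets[0] is [0.9, 0.05, 0.05]; each row still sums to 1
```

Training the softmax against these soft targets prevents it from chasing probabilities of exactly 0 and 1, which it can never reach with finite weights.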
7.6 Semi-Supervised Learning
Idea: Use both unlabeled examples (from P(x)) and labeled examples (from P(x, y))
in order to estimate P(y|x)
In the context of DL, semi-supervised learning usually refers to learning a
representation h = f(x).
The goal is to learn a representation so that examples from the same class
have similar representations.
Construct models in which a generative model of either P(x) or P(x, y) shares
parameters with a discriminative model of P(y|x)
One can find a better trade-off between two types of criteria:
The supervised criterion: − log P(y|x)
The unsupervised (generative) criterion: − log P(x) or − log P(x, y)
7.7 Multitask Learning
Idea: Pool the examples arising out of several tasks
The model can be divided into two parts:
Task-specific parameters
Generic parameters, shared across the tasks
It can improve generalization and generalization error bounds
7.8 Early Stopping
Idea : Obtain a model with the parameters at the point in time with the lowest
validation set error (rather than with the latest parameters in the training process)
the most commonly used form of regularization in DL
can be interpreted as a hyperparameter (the number of training steps)
selection algorithm
requires a validation set, which means some training data is not fed to the model
One can perform extra training (where all training data is used) after initial
learning (with early stopping). Two basic strategies:
Initialize the model again and retrain on all the data (for the same number of
steps as the first round)
Keep the parameters and continue training (but now using all the data). This avoids
retraining from scratch, but it is not as well behaved.
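The basic idea of returning the parameters with the lowest validation error can be sketched on a toy linear regression problem. Everything here (the data, the learning rate, the number of steps) is an assumption chosen for illustration; a practical version would also stop once the validation error has not improved for a while.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=20)
X_train, X_val = rng.normal(size=(30, 20)), rng.normal(size=(200, 20))
y_train = X_train @ w_true + rng.normal(size=30)
y_val = X_val @ w_true + rng.normal(size=200)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(20)
best_w, best_err, best_step = w.copy(), mse(w, X_val, y_val), 0
for step in range(1, 2001):
    # Gradient descent on the training error only
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.01 * grad
    err = mse(w, X_val, y_val)
    if err < best_err:
        # Keep a copy of the best parameters seen so far
        best_w, best_err, best_step = w.copy(), err, step

final_err = mse(w, X_val, y_val)
```

`best_w` is what early stopping returns, and `best_step` is the selected value of the "number of training steps" hyperparameter, which the first retraining strategy above would reuse.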
7.8 Early Stopping
How early stopping acts as regularizer:
Restricting both the number of iterations and the learning rate limits the
volume of parameter space reachable from the initial parameter value.
In a simple linear model with a quadratic error function and simple gradient
descent, early stopping is equivalent to L2 regularization. [...skipped...]
7.9 Parameter Tying and Parameter Sharing
Sometimes we may know there should be some dependencies between the
parameters.
Parameter Tying:
E.g. two models performing the same classification task but with different
input distributions:
$$\hat{y}^{(A)} = f(w^{(A)}, x), \quad \hat{y}^{(B)} = f(w^{(B)}, x)$$
We believe the model parameters should be close to each other
We can use a penalty $\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$
Parameter Sharing:
force sets of parameters to be equal
can lead to significant reduction of memory
The most popular use: convolutional neural networks (CNNs) (See Chap.9)
7.10 Sparse Representation
Idea: place a penalty on the activation of the unit (rather than on the parameters)
Norm penalty regularization of representation:
h: sparse representation of the data x
add a norm penalty on the representation Ω(h) to the loss function J:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(h)$$
We can use an L1 penalty $\Omega(h) = \|h\|_1$ or other types of penalties
Orthogonal matching pursuit (OMP-k):
encodes x with h that solves the constrained optimization problem
$$\arg\min_{h, \|h\|_0 < k} \|x - Wh\|^2$$
where $\|h\|_0$ is the number of nonzero entries of h
OMP-1 can be a very effective feature extractor for DL
7.11 Bagging and Other Ensemble Methods
Ensemble methods
combine several models (trained separately) in order to reduce generalization
error
an example of a general strategy called model averaging
On average, the ensemble will perform at least as well as any of its members.
If the members make independent errors, the ensemble will perform
significantly better.
Bagging (bootstrap aggregating)
construct k different datasets of the same size as the original by sampling with replacement
from the original dataset
Model i is trained on dataset i
Boosting:
constructs an ensemble with higher capacity than the individual models
Boosting of NN: incrementally add NNs to the ensemble
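The bagging procedure can be sketched with a simple base learner. Here the members are ordinary least-squares fits on synthetic data; the data, the value of `k`, and the helper names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                            # hypothetical dataset
y = X @ np.array([1.5, -0.5]) + 0.3 * rng.normal(size=100)

def fit_least_squares(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

k = 10
members = []
for _ in range(k):
    # Bootstrap dataset i: same size as the original, sampled with replacement
    idx = rng.integers(0, len(y), size=len(y))
    members.append(fit_least_squares(X[idx], y[idx]))

def ensemble_predict(X_new):
    # Model averaging: mean of the member predictions
    return np.mean([X_new @ w for w in members], axis=0)

preds = ensemble_predict(X[:5])
```

Each bootstrap sample omits some examples and repeats others, which is what makes the members differ from one another.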
7.12 Dropout
Background:
Bagging involves training multiple models and evaluating them on each test
example.
This seems impractical when each model is a large NN.
Dropout can be thought of as a method of making bagging practical.
What is Dropout?
Dropout trains the ensemble of all subnetworks that can be formed by removing nonoutput units from
a base network
In many cases, we can remove a unit by multiplying its output value by zero.
Let µ be a binary mask vector, which is applied to all the input and hidden
units.
train them with a minibatch-based algorithm
Each time we load an example into a minibatch, we randomly sample µ and
apply it.
Typically, an input unit is included with probability 0.8, and a hidden unit is
included with 0.5.
Run forward propagation, back-propagation, and the learning update.
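The mask sampling and the masked forward pass can be sketched for a network with one hidden layer. The architecture, the ReLU nonlinearity, and all names are assumptions for illustration; back-propagation and the learning update are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_masks(n_in, n_hidden, p_in=0.8, p_hidden=0.5):
    # Each unit is kept independently with its inclusion probability
    mu_in = (rng.random(n_in) < p_in).astype(float)
    mu_hidden = (rng.random(n_hidden) < p_hidden).astype(float)
    return mu_in, mu_hidden

def forward(x, W1, W2, mu_in, mu_hidden):
    # Removing a unit = multiplying its output by zero
    h = np.maximum(0.0, W1 @ (mu_in * x))     # ReLU hidden layer on the masked input
    return W2 @ (mu_hidden * h)

W1, W2 = rng.normal(size=(6, 4)), rng.normal(size=(2, 6))
x = rng.normal(size=4)
mu_in, mu_hidden = sample_masks(4, 6)         # a fresh mask for each example
out = forward(x, W1, W2, mu_in, mu_hidden)
```

All subnetworks share the parameters `W1` and `W2`; each sampled mask simply selects which of them take part in the current forward pass.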
7.12 Dropout
How to make a prediction:
At training time, µ is sampled from the probability distribution p(µ)
Each submodel defined by µ defines a probability distribution p(y|x, µ)
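To make a prediction from all submodels, we can use the arithmetic mean:
$$\sum_\mu p(\mu) p(y|x, \mu)$$
But the geometric mean admits a cheap approximation and performs comparably. Let $\tilde{p}_{\text{ensemble}}(y|x)$ be the geometric
mean of p(y|x, µ) over the masks µ.
$\tilde{p}_{\text{ensemble}}(y|x)$ is not guaranteed to be a probability distribution. We must
renormalize:
$$p_{\text{ensemble}}(y|x) = \frac{\tilde{p}_{\text{ensemble}}(y|x)}{\sum_{y'} \tilde{p}_{\text{ensemble}}(y'|x)}$$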
7.12 Dropout
Weight scaling inference rule:
We can approximate pensemble by evaluating p(y|x) in one model
This model uses all units, but with the weights going out of unit i multiplied
by the probability of including unit i
If the inclusion probability of a unit is 1/2, either the weights going out of the unit are multiplied
by 1/2 at the end of training, or the states of the unit are multiplied by 2 during
training
There is not yet any theoretical argument for the accuracy of this rule in deep nonlinear networks, but empirically it
performs well
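For a model without nonlinear hidden units the rule can be checked exactly. The sketch below assumes a softmax regression classifier with dropout applied to its inputs with inclusion probability 1/2, so all 2^n masks are equally likely; the sizes and random parameters are hypothetical. It computes the renormalized geometric mean over every mask and compares it with a single forward pass using halved weights.

```python
import itertools
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, k = 4, 3                                # n inputs, k classes
W, b, v = rng.normal(size=(n, k)), rng.normal(size=k), rng.normal(size=n)

# Geometric mean over all 2^n input masks, computed in log space, then renormalized
masks = [np.array(m, dtype=float) for m in itertools.product([0, 1], repeat=n)]
log_p = np.mean([np.log(softmax(W.T @ (d * v) + b)) for d in masks], axis=0)
p_ensemble = np.exp(log_p) / np.exp(log_p).sum()

# Weight scaling rule: one forward pass with the weights multiplied by 1/2
p_scaled = softmax((0.5 * W).T @ v + b)
```

In this special case the two distributions coincide; with nonlinear hidden layers the rule is only an approximation.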
7.12 Dropout
Advantages of dropout:
very computationally cheap
it does not significantly limit the type of model or training procedure
Limitations:
it reduces the effective capacity of a model. To offset this effect, we must
increase the size of the model.
it is less effective when extremely few labeled training examples are available.
When additional unlabeled data is available, unsupervised feature learning
can gain an advantage over dropout.
7.12 Dropout
fast dropout:
analytical approximations to the sum over all submodels
more principled approach than the weight scaling inference rule.
Interpretation of dropout:
an experiment using "dropout boosting"
use exactly the same mask noise as dropout
trains the entire ensemble to jointly (not independently) maximize the
log-likelihood on the training set
shows almost no regularization effect
This suggests that the interpretation of dropout as bagging has value beyond its
interpretation as robustness to noise.
7.12 Dropout
Other approaches inspired by dropout:
DropConnect: each product between a single scalar weight and a single
hidden unit state is considered a unit that can be dropped
Stochastic pooling: build ensembles of CNNs
real valued mask: multiplying the weights by µ ∼ N(1, I) can outperform
dropout
Another view of dropout:
Dropout regularizes each hidden unit to be not merely a good feature but a
feature that is good in many contexts.
Masking can be seen as a form of highly intelligent, adaptive destruction of
the information content of the input (rather than destruction of the raw
input). It allows the model to make use of all the knowledge about the input
distribution which it has acquired so far.
7.13 Adversarial Training
Adversarial example:
an input x′ near a data point x such that the model output is very different at x′
In many cases, a human observer cannot tell the difference between x and x′
One of the causes of these examples is excessive linearity in NN. The value of
a linear function can change very rapidly if it has numerous inputs.
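The effect of high input dimension on a linear function can be shown directly. In this sketch the weights and input are random, and moving every input by ε in the direction of sign(w) is an illustrative worst-case perturbation, not something spelled out on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of inputs
w = rng.normal(size=n)                     # hypothetical weights of a linear function
x = rng.normal(size=n)                     # hypothetical data point
eps = 0.01                                 # maximum change allowed per input

# Move every input by eps in the direction that increases w.x
x_adv = x + eps * np.sign(w)
change = w @ x_adv - w @ x                 # equals eps * ||w||_1
```

No single input moves by more than ε, yet the output shifts by ε times the L1 norm of w, which grows with the number of inputs.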
7.13 Adversarial Training
Adversarial Training:
training on adversarially perturbed examples from the training set
a way of explicitly introducing a local constancy prior into NN
Virtual adversarial example:
Suppose the model assigns some label ˆy at a point x which has no true label.
We can seek an adversarial example x′ that causes the model to output a
label y′ (≠ ŷ)
We can train the model to assign the same label to x and x′
This encourages the model to learn a function which is robust to small changes
This provides a means of semi-supervised learning
7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier
Manifold hypothesis:
the data lies near a low-dimensional manifold
Tangent distance algorithm
non-parametric nearest neighbor algorithm, where the distance between
points x1 and x2 is the distance between the manifolds M1 and M2 to which
they respectively belong
approximate Mi by its tangent plane at xi
The user has to specify the tangent vectors
7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier
Tangent prop algorithm:
[...skipped...]
double backprop:
[...skipped...]
Shigeru ONO (Insight Factory) DL Chap.7 DL 読書会: 2020/08 40 / 40

More Related Content

What's hot

Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applications
Buhwan Jeong
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descent
Muhammad Rasel
 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial Networks
Mustafa Yagmur
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial Networks
Dong Heon Cho
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
Larry Guo
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
Jie-Han Chen
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoder
Jun Lang
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
Knoldus Inc.
 
linear classification
linear classificationlinear classification
linear classification
nep_test_account
 
Explicit Density Models
Explicit Density ModelsExplicit Density Models
Explicit Density Models
Sangwoo Mo
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
Balázs Hidasi
 
Flow based generative models
Flow based generative modelsFlow based generative models
Flow based generative models
수철 박
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
NAVER Engineering
 
Introduction to object detection
Introduction to object detectionIntroduction to object detection
Introduction to object detection
Brodmann17
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
Adri Jovin
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
Brief intro : Invariance and Equivariance
Brief intro : Invariance and EquivarianceBrief intro : Invariance and Equivariance
Brief intro : Invariance and Equivariance
홍배 김
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
ananth
 
Ensemble learning Techniques
Ensemble learning TechniquesEnsemble learning Techniques
Ensemble learning Techniques
Babu Priyavrat
 
Bayesian Global Optimization
Bayesian Global OptimizationBayesian Global Optimization
Bayesian Global Optimization
Amazon Web Services
 

What's hot (20)

Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applications
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descent
 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial Networks
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial Networks
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoder
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
 
linear classification
linear classificationlinear classification
linear classification
 
Explicit Density Models
Explicit Density ModelsExplicit Density Models
Explicit Density Models
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 
Flow based generative models
Flow based generative modelsFlow based generative models
Flow based generative models
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Introduction to object detection
Introduction to object detectionIntroduction to object detection
Introduction to object detection
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Brief intro : Invariance and Equivariance
Brief intro : Invariance and EquivarianceBrief intro : Invariance and Equivariance
Brief intro : Invariance and Equivariance
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Ensemble learning Techniques
Ensemble learning TechniquesEnsemble learning Techniques
Ensemble learning Techniques
 
Bayesian Global Optimization
Bayesian Global OptimizationBayesian Global Optimization
Bayesian Global Optimization
 

Similar to Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7

Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
Ono Shigeru
 
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof..."Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
Paris Women in Machine Learning and Data Science
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic track
arogozhnikov
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3
arogozhnikov
 
Machine learning (1)
Machine learning (1)Machine learning (1)
Machine learning (1)
NYversity
 
Iclr2016 vaeまとめ
Iclr2016 vaeまとめIclr2016 vaeまとめ
Iclr2016 vaeまとめ
Deep Learning JP
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
arogozhnikov
 
slides for "Supervised Model Learning with Feature Grouping based on a Discre...
slides for "Supervised Model Learning with Feature Grouping based on a Discre...slides for "Supervised Model Learning with Feature Grouping based on a Discre...
slides for "Supervised Model Learning with Feature Grouping based on a Discre...
Kensuke Mitsuzawa
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
cairo university
 
Project Paper
Project PaperProject Paper
Project Paper
Brian Whetter
 
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
Karl Rudeen
 
More on randomization semi-definite programming and derandomization
More on randomization semi-definite programming and derandomizationMore on randomization semi-definite programming and derandomization
More on randomization semi-definite programming and derandomization
Abner Chih Yi Huang
 
Bagging_and_Boosting.pptx
Bagging_and_Boosting.pptxBagging_and_Boosting.pptx
Bagging_and_Boosting.pptx
ABINASHPADHY6
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
Yoonho Lee
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
Young-Geun Choi
 
Sparse autoencoder
Sparse autoencoderSparse autoencoder
Sparse autoencoder
Devashish Patel
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
Valentin De Bortoli
 
Introduction to Machine Learning Lectures
Introduction to Machine Learning LecturesIntroduction to Machine Learning Lectures
Introduction to Machine Learning Lectures
ssuserfece35
 
A New Lagrangian Relaxation Approach To The Generalized Assignment Problem
A New Lagrangian Relaxation Approach To The Generalized Assignment ProblemA New Lagrangian Relaxation Approach To The Generalized Assignment Problem
A New Lagrangian Relaxation Approach To The Generalized Assignment Problem
Kim Daniels
 
Dynamic Feature Induction: The Last Gist to the State-of-the-Art
Dynamic Feature Induction: The Last Gist to the State-of-the-ArtDynamic Feature Induction: The Last Gist to the State-of-the-Art
Dynamic Feature Induction: The Last Gist to the State-of-the-Art
Jinho Choi
 

Similar to Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7 (20)

Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 6
 
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof..."Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
 
MLHEP Lectures - day 2, basic track
Goodfellow, Bengio, Courville (2016) "Deep Learning", Chap. 7

  • 1. Regularization for Deep Learning. Goodfellow, Bengio, & Courville (2016) Deep Learning, Chap. 7. Shigeru ONO (Insight Factory), DL Reading Group, 2020/08.
  • 2. TOC: 7.1 Parameter Norm Penalties; 7.2 Norm Penalties as Constrained Optimization; 7.3 Regularization and Under-Constrained Problems; 7.4 Dataset Augmentation; 7.5 Noise Robustness; 7.6 Semi-Supervised Learning; 7.7 Multitask Learning; 7.8 Early Stopping; 7.9 Parameter Tying and Parameter Sharing; 7.10 Sparse Representation; 7.11 Bagging and Other Ensemble Methods; 7.12 Dropout; 7.13 Adversarial Training; 7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier.
  • 3. (introduction) Regularization: any modification we make to a learning algorithm that is intended to reduce its generalization error, possibly at the expense of increased training error. In the context of DL, most regularization strategies are based on regularizing estimators. Possible situations (see Chap. 5): (1) the model family excludes the true data-generating process (DGP) (underfitting); (2) the model family matches the true DGP; (3) the model family includes the true DGP but also many other possible DGPs (overfitting). The goal of regularization is to take the model from (3) to (2). But in most applications of DL, the true DGP is outside the model family (= (1)). Controlling model complexity is therefore not a matter of finding the model of the right size, but of finding a model with appropriate regularization such that generalization error is minimized.
  • 4. 7.1 Parameter Norm Penalties. Add a parameter norm penalty $\Omega(\theta)$ to the objective function $J$: $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$, where $\alpha \ge 0$ weights the relative contribution of $\Omega$. For NNs, we typically choose an $\Omega$ that penalizes only $w$ (the weights of the affine transformation at each layer). It is reasonable to use the same $\alpha$ at all layers.
  • 5. 7.1.1 L2 Parameter Regularization. $\Omega(\theta) = \frac{1}{2}\|w\|_2^2$, also known as weight decay, ridge regression, or Tikhonov regularization. Bayesian interpretation: MAP inference with a Gaussian prior on the weights (see 5.6.1). Total objective function: $\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2} w^\top w$. Parameter gradient: $\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$. What happens in a single gradient step? The learning rule is modified to shrink $w$ by a constant factor before the usual gradient update: $w \leftarrow (1 - \epsilon\alpha) w - \epsilon \nabla_w J(w; X, y)$.
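As a small numerical check of the single-step rule on slide 5, the sketch below compares a plain gradient step on the regularized objective with the "shrink, then step" form. The quadratic loss in `grad_J` and the values of `alpha` and `eps` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def grad_J(w):
    # Hypothetical unregularized loss J(w) = 0.5 * ||w - 1||^2, so grad J = w - 1.
    return w - 1.0

w = np.array([2.0, -3.0, 0.5])
alpha, eps = 0.1, 0.01  # penalty weight and learning rate (arbitrary values)

# Form 1: gradient step on the regularized objective, gradient = alpha*w + grad J(w).
step_regularized = w - eps * (alpha * w + grad_J(w))

# Form 2: shrink w by the factor (1 - eps*alpha), then take the usual step.
step_decay = (1.0 - eps * alpha) * w - eps * grad_J(w)
```

Both forms give the same updated weights, which is why the L2 penalty is called weight decay.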
  • 6. 7.1.1 L2 Parameter Regularization. What happens over the entire course of training? (in general) Unregularized: let $w^*$ be the weights that minimize the unregularized objective function, $w^* = \arg\min_w J(w)$. Make a quadratic approximation to $J(w)$ in the neighborhood of $w^*$: $\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^\top H (w - w^*)$, where $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. The minimum of $\hat{J}$ occurs where its gradient $\nabla_w \hat{J}(w) = H(w - w^*)$ is 0.
  • 7. 7.1.1 L2 Parameter Regularization (Cont'd). Regularized: let $\tilde{w}$ be the weights that minimize the regularized objective function $\tilde{J}$. The minimum of $\tilde{J}$ occurs where $\alpha\tilde{w} + H(\tilde{w} - w^*) = 0$. It follows that $\tilde{w} = (H + \alpha I)^{-1} H w^*$. $H$ is real and symmetric, so it has an eigendecomposition $H = Q \Lambda Q^\top$, and $\tilde{w} = Q(\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*$. I.e., weight decay rescales $w^*$ along the axes defined by the eigenvectors of $H$: the component along the $i$-th eigenvector is scaled by $\lambda_i / (\lambda_i + \alpha)$.
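The two expressions for the regularized minimizer on slide 7 can be checked numerically. This is a minimal sketch: the matrix `H` is a randomly generated symmetric positive definite matrix standing in for a Hessian, and `w_star` and `alpha` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)   # stand-in Hessian: symmetric positive definite
w_star = rng.normal(size=4)     # stand-in unregularized minimizer
alpha = 0.5

# Direct form: w_tilde = (H + alpha I)^{-1} H w_star
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Eigen form: rescale each eigen-direction of H by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
scale = lam / (lam + alpha)
w_tilde_eig = Q @ (scale * (Q.T @ w_star))
```

Directions with $\lambda_i \gg \alpha$ are left almost untouched (scale near 1), while directions with $\lambda_i \ll \alpha$ are shrunk nearly to zero.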
  • 8. 7.1.1 L2 Parameter Regularization. What happens over the entire course of training? (in the case of linear regression) Unregularized: cost function $(Xw - y)^\top (Xw - y)$; solution $w = (X^\top X)^{-1} X^\top y$. Regularized: cost function $(Xw - y)^\top (Xw - y) + \frac{1}{2}\alpha w^\top w$; solution $w = (X^\top X + \alpha I)^{-1} X^\top y$ (with constant factors absorbed into $\alpha$). I.e., the regularization causes the learning algorithm to "perceive" the input $X$ as having higher variance than it really has.
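A minimal sketch of the closed-form solutions on slide 8, using synthetic data. To make the stated solution exact, the sketch assumes the cost $\frac{1}{2}\|Xw - y\|^2 + \frac{\alpha}{2}\|w\|^2$ (a factor of 1/2 on the data term as well); the data, `alpha`, and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
alpha = 2.0

# Unregularized least squares: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# L2-regularized: w = (X^T X + alpha I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# Gradient of 0.5*||Xw - y||^2 + 0.5*alpha*||w||^2 at w_ridge; should vanish.
grad_at_ridge = X.T @ (X @ w_ridge - y) + alpha * w_ridge
```

Adding $\alpha I$ to $X^\top X$ inflates the diagonal, i.e. the apparent variance of each input feature, which shrinks the solution toward zero.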
  • 9. 7.1.1 L2 Parameter Regularization
  • 10. 7.1.2 L1 Regularization. $\Omega(\theta) = \|w\|_1 = \sum_i |w_i|$. Total objective function: $\tilde{J}(w; X, y) = J(w; X, y) + \alpha\|w\|_1$. Parameter gradient: $\nabla_w \tilde{J}(w; X, y) = \alpha\,\mathrm{sign}(w) + \nabla_w J(w; X, y)$. This does not admit a clean algebraic solution in general. For a simple linear model with a quadratic cost function, $\nabla_w \hat{J}(w) = H(w - w^*)$.
  • 11. 7.1.2 L1 Regularization (Cont'd). Assume the Hessian is diagonal, $H = \mathrm{diag}([H_{11}, \dots, H_{nn}])$ with each $H_{ii} > 0$ (i.e., no correlation between the input features). Then the quadratic approximation of the regularized cost function is $\hat{J}(w; X, y) = J(w^*; X, y) + \sum_i \left[ \frac{1}{2} H_{ii}(w_i - w_i^*)^2 + \alpha|w_i| \right]$. The solution is $w_i = \mathrm{sign}(w_i^*) \max\left( |w_i^*| - \frac{\alpha}{H_{ii}}, 0 \right)$. Consider the situation where $w_i^* > 0$ for all $i$. When $w_i^* \le \frac{\alpha}{H_{ii}}$, the optimal value is $w_i = 0$. When $w_i^* > \frac{\alpha}{H_{ii}}$, the optimal value is just shifted toward zero by a distance $\frac{\alpha}{H_{ii}}$.
  • 12. 7.1.2 L1 Regularization (Cont'd). In short, the solution is more sparse (i.e., some parameters have an optimal value of zero). L1 regularization has been used as a feature selection mechanism, e.g. LASSO. Bayesian interpretation: MAP inference with an isotropic Laplace prior on the weights.
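The closed-form L1 solution under a diagonal Hessian (slide 11) is a soft-thresholding rule, sketched below. The function name `l1_solution` and the numeric values are illustrative assumptions.

```python
import numpy as np

def l1_solution(w_star, h_diag, alpha):
    """w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0), applied elementwise."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

w_star = np.array([3.0, 0.2, -1.5, -0.1])   # unregularized optimum
h_diag = np.array([1.0, 1.0, 2.0, 2.0])     # diagonal of the Hessian
alpha = 0.5

# Thresholds alpha / H_ii are [0.5, 0.5, 0.25, 0.25]:
# entries with |w*_i| below the threshold become zero, the others move toward zero.
w = l1_solution(w_star, h_diag, alpha)
```

Here the second and fourth weights are set exactly to zero, while the first and third are shifted to 2.5 and -1.25. This is the sparsity effect summarized on slide 12.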
  • 13. 7.2 Norm Penalties as Constrained Optimization. We can think of the penalties as constraints. Cost function: $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$. If we wanted to constrain $\Omega(\theta) < k$, we could construct a generalized Lagrangian $\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha(\Omega(\theta) - k)$. The solution is $\theta^* = \arg\min_\theta \max_{\alpha, \alpha \ge 0} \mathcal{L}(\theta, \alpha)$. If we fix $\alpha$ at its optimal value $\alpha^*$: $\theta^* = \arg\min_\theta \mathcal{L}(\theta, \alpha^*) = \arg\min_\theta J(\theta; X, y) + \alpha^* \Omega(\theta)$. This is the same as the problem of minimizing $\tilde{J}$.
  • 14. 7.2 Norm Penalties as Constrained Optimization. Sometimes we may wish to use explicit constraints rather than penalties: when we know the appropriate value of $k$; when the penalties can cause optimization to get stuck in local minima corresponding to small $\theta$; when we wish to impose some stability on the optimization procedure. Approach: Srebro & Shraibman (2005): constraining the norm of each column of the weight matrix of a layer.
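One way to impose an explicit column-norm constraint is to project the weight matrix back onto the constraint set after each gradient step. The slide does not specify how the constraint is enforced, so the reprojection below, the helper name `project_columns`, and the bound `c` are assumptions for illustration.

```python
import numpy as np

def project_columns(W, c):
    """Rescale any column of W whose L2 norm exceeds c so that its norm equals c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Columns already inside the ball get factor 1; the guard avoids division by zero.
    factor = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * factor

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])          # column norms are 5.0 and 0.5
W_proj = project_columns(W, c=1.0)  # first column is rescaled, second is unchanged
```

Unlike a penalty, this leaves weights that already satisfy the constraint untouched, so it does not pull small weights further toward zero.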
  • 15. 7.3 Regularization and Under-Constrained Problems. Sometimes regularization is necessary for ML problems to be properly defined: when the problem depends on $(X^\top X)^{-1}$ but $X^\top X$ is singular; when the problem has no closed-form solution, e.g. logistic regression applied to a problem where the classes are linearly separable. If a weight vector $w$ achieves perfect classification, $2w$ also achieves it, with higher likelihood.
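The separable logistic regression case can be illustrated on a toy dataset: once $w$ separates the classes, scaling it up keeps increasing the likelihood, so the unregularized problem has no finite optimum. The four data points and the vector `w` below are made up for the illustration.

```python
import numpy as np

def log_likelihood(w, X, y):
    # Labels y are in {-1, +1}; log sigmoid(y * x.w) = -log(1 + exp(-y * x.w)).
    margins = y * (X @ w)
    return -np.sum(np.log1p(np.exp(-margins)))

# Linearly separable toy data (hypothetical).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])            # classifies every point correctly

# Log-likelihood at w, 2w, 4w, 8w: strictly increasing toward 0.
lls = [log_likelihood(k * w, X, y) for k in (1, 2, 4, 8)]
```

A weight penalty gives the objective a finite minimizer by making large $\|w\|$ costly.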
  • 16. 7.4 Dataset Augmentation. Idea: create fake data and add it to the training set. This is a particularly effective technique for object recognition, e.g. translating the training images a few pixels in each direction. Injecting noise into the input of a NN is another form of augmentation; it can improve the robustness of NNs.
  • 17. 7.5 Noise Robustness. Idea: add noise to the weights. This can be interpreted as a stochastic implementation of Bayesian inference over the weights: the noise reflects our uncertainty about the model weights. It can also be interpreted as equivalent to a more traditional form of regularization. Suppose we wish to train a function $\hat{y}(x)$ using the least-squares cost function $J = \mathbb{E}_{p(x,y)}[(\hat{y}(x) - y)^2]$. Assume that we also include a random perturbation $\epsilon_W \sim \mathcal{N}(\epsilon; 0, \eta I)$ of the network weights. The objective function becomes $\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)}[(\hat{y}_{\epsilon_W}(x) - y)^2]$. For small $\eta$, this is equivalent to $J$ with a regularization term $\eta\,\mathbb{E}_{p(x,y)}[\|\nabla_W \hat{y}(x)\|^2]$. It pushes the model into regions where it is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions.
  • 18. 7.5.1 Injecting Noise at the Output Targets. Idea: explicitly model the noise on the $y$ labels. Label smoothing: regularize a model based on a softmax with $k$ output values by replacing the classification target 0 with $\epsilon/(k - 1)$ and the target 1 with $1 - \epsilon$.
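Label smoothing as described on slide 18 can be sketched in a few lines. The helper name `smooth_labels` and the example values are assumptions for illustration.

```python
import numpy as np

def smooth_labels(labels, k, eps):
    """Build smoothed targets: eps/(k-1) for the wrong classes, 1-eps for the true class."""
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Two examples with true classes 0 and 2, k = 3 classes, eps = 0.1.
targets = smooth_labels(np.array([0, 2]), k=3, eps=0.1)
# Row 0 is [0.9, 0.05, 0.05]; row 1 is [0.05, 0.05, 0.9].
```

Each row still sums to 1, so the smoothed targets remain a valid distribution for a softmax cross-entropy loss.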
  • 19. 7.6 Semi-Supervised Learning. Idea: use both unlabeled examples (from $P(x)$) and labeled examples (from $P(x, y)$) in order to estimate $P(y|x)$. In the context of DL, semi-supervised learning usually refers to learning a representation $h = f(x)$. The goal is to learn a representation such that examples from the same class have similar representations. Construct models in which a generative model of either $P(x)$ or $P(x, y)$ shares parameters with a discriminative model of $P(y|x)$. One can then trade off two types of criterion: the supervised criterion $-\log P(y|x)$ and the unsupervised (generative) criterion $-\log P(x)$ or $-\log P(x, y)$.
  • 20. 7.7 Multitask Learning. Idea: pool the examples arising out of several tasks. The model can be divided into two parts: task-specific parameters, and generic parameters shared across the tasks. It can improve generalization and generalization error bounds.
  • 21. 7.7 Multitask Learning
  • 22. 7.8 Early Stopping. Idea: return the parameters from the point in training with the lowest validation set error (rather than the latest parameters). It is the most commonly used form of regularization in DL; it can be interpreted as a hyperparameter selection algorithm (the hyperparameter being the number of training steps); it requires a validation set, which is not fed to the model during training. One can perform extra training (where all training data is used) after the initial training with early stopping. Two basic strategies: (a) initialize the model again and retrain on all the data, for the same number of steps as in the first round; (b) keep the parameters and continue training, but now using all the data. Strategy (b) is not as well behaved.
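A minimal early-stopping loop on a toy linear regression problem, keeping a copy of the parameters with the lowest validation error. The data, the learning rate of 0.01, the 500-step budget, and the train/validation split are all illustrative assumptions; practical implementations usually also stop once validation error has not improved for some number of evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
y = X[:, 0] + 0.5 * rng.normal(size=40)       # only the first feature matters
X_tr, y_tr, X_val, y_val = X[:25], y[:25], X[25:], y[25:]

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(20)
best_w, best_val, best_step = w.copy(), mse(w, X_val, y_val), 0
for step in range(1, 501):
    grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.01 * grad                           # plain gradient descent step
    val = mse(w, X_val, y_val)
    if val < best_val:                         # remember the best parameters so far
        best_w, best_val, best_step = w.copy(), val, step

final_val = mse(w, X_val, y_val)               # validation error of the latest parameters
```

The returned model is `best_w`, and `best_step` is the step count that strategy (a) on slide 22 would reuse when retraining on all the data.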
  • 23. 7.8 Early Stopping. How early stopping acts as a regularizer: restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from the initial parameter value. In a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to L2 regularization. [...skipped...]
  • 24. 7.8 Early Stopping
  • 25. 7.9 Parameter Tying and Parameter Sharing. Sometimes we may know that there should be some dependencies between the parameters. Parameter tying: e.g. two models performing the same classification task but with different input distributions, $\hat{y}^{(A)} = f(w^{(A)}, x)$ and $\hat{y}^{(B)} = f(w^{(B)}, x)$. We believe the model parameters should be close to each other, so we can use a penalty $\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$. Parameter sharing: force sets of parameters to be equal. This can lead to a significant reduction in memory. The most popular use: convolutional neural networks (CNNs) (see Chap. 9).
  • 26. 7.10 Sparse Representation. Idea: place a penalty on the activations of the units (rather than on the parameters). Norm penalty regularization of representations: let $h$ be a sparse representation of the data $x$, and add a norm penalty on the representation, $\Omega(h)$, to the loss function $J$: $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(h)$. We can use the L1 penalty $\Omega(h) = \|h\|_1$ or other types of penalties. Orthogonal matching pursuit (OMP-k): encodes $x$ with the $h$ that solves the constrained optimization problem $\arg\min_{h, \|h\|_0 < k} \|x - Wh\|^2$, where $\|h\|_0$ is the number of nonzero entries of $h$. OMP-1 can be a very effective feature extractor for DL.
  • 27. 7.11 Bagging and Other Ensemble Methods. Ensemble methods combine several models (trained separately) in order to reduce generalization error; they are an example of a general strategy called model averaging. On average, the ensemble will perform at least as well as any of its members. If the members make independent errors, the ensemble will perform significantly better. Bagging (bootstrap aggregating): construct $k$ different datasets of the same size as the original by sampling with replacement from the original dataset; model $i$ is trained on dataset $i$. Boosting: construct an ensemble with higher capacity than the individual models. Boosting of NNs: incrementally add NNs to the ensemble.
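A minimal bagging sketch with least-squares linear models as ensemble members. The synthetic data, the choice of `k = 5` members, and the use of `np.linalg.lstsq` as the base learner are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -1.0]) + 0.3 * rng.normal(size=n)

models = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)           # bootstrap: n indices drawn with replacement
    w_i, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    models.append(w_i)                         # model i is trained on dataset i

x_new = np.array([0.5, 2.0])
member_preds = np.array([x_new @ w_i for w_i in models])
ensemble_pred = member_preds.mean()            # model averaging over the k members
```

Each bootstrap dataset omits some original examples and repeats others, which is what makes the members differ from one another.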
  • 28. 7.12 Dropout. Background: bagging involves training multiple models and evaluating all of them on each test example. This seems impractical when each model is a large NN. Dropout can be thought of as a method of making bagging practical. What is dropout? (1) Consider all subnetworks that can be formed by removing non-output units from a base network. In many cases, we can remove a unit by multiplying its output value by zero. Let $\mu$ be a binary mask vector, which is applied to all the input and hidden units. (2) Train them with a minibatch-based algorithm. Each time we load an example into a minibatch, we randomly sample $\mu$ and apply it. Typically, an input unit is included with probability 0.8, and a hidden unit with probability 0.5. Then run forward propagation, back-propagation, and the learning update.
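The masking step can be sketched for a single example passing through a one-hidden-layer network. The layer sizes, the ReLU nonlinearity, the absence of biases, and the random weights are assumptions for illustration; the inclusion probabilities 0.8 and 0.5 are the ones quoted on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 3))

def forward(x, mu_in, mu_hid):
    h = np.maximum(0.0, (x * mu_in) @ W1)      # ReLU hidden layer on the masked input
    return (h * mu_hid) @ W2                   # masked hidden units feed the output

mu_in = (rng.random(4) < 0.8).astype(float)    # keep each input unit with prob. 0.8
mu_hid = (rng.random(6) < 0.5).astype(float)   # keep each hidden unit with prob. 0.5
out = forward(x, mu_in, mu_hid)

# Multiplying a hidden unit by zero is the same as deleting it from the network.
keep = mu_hid.astype(bool)
out_sub = np.maximum(0.0, (x * mu_in) @ W1[:, keep]) @ W2[keep]
```

Sampling a fresh mask for every example means each update trains a different subnetwork, with all subnetworks sharing the same weights.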
  • 29. 7.12 Dropout
  • 30. 7.12 Dropout
  • 31. 7.12 Dropout. How to make a prediction: at training time, $\mu$ is sampled from the probability distribution $p(\mu)$. Each submodel defined by $\mu$ defines a probability distribution $p(y|x, \mu)$. To make a prediction from all submodels, we can use the arithmetic mean $\sum_\mu p(\mu)\,p(y|x, \mu)$. But geometric mean performs better. Let $\tilde{p}_{\mathrm{ensemble}}(y|x)$ be the geometric mean of $p(y|x, \mu)$. $\tilde{p}_{\mathrm{ensemble}}(y|x)$ is not guaranteed to be a probability distribution, so we must renormalize: $p_{\mathrm{ensemble}}(y|x) = \tilde{p}_{\mathrm{ensemble}}(y|x) / \sum_{y'} \tilde{p}_{\mathrm{ensemble}}(y'|x)$.
  • 32. 7.12 Dropout. Weight scaling inference rule: we can approximate $p_{\mathrm{ensemble}}$ by evaluating $p(y|x)$ in one model. This model uses all units, but with the weights going out of unit $i$ multiplied by the probability of including unit $i$. If the inclusion probability of a unit is 1/2, either the outgoing weights of the unit are multiplied by 1/2 at the end of training, or the states of the unit are multiplied by 2 during training. There is not yet any theoretical argument for this rule, but empirically it performs well.
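For the special case of softmax regression with dropout on the inputs, the weight scaling rule reproduces the renormalized geometric-mean ensemble exactly, which can be checked by enumerating all masks. The sketch below assumes an inclusion probability of 1/2 (so every mask is equally likely) and uses small random `W`, `b`, and `x`; it says nothing about deep nonlinear networks, where the rule is only an approximation.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n, k = 4, 3                         # n input units, k classes
W = rng.normal(size=(n, k))
b = rng.normal(size=k)
x = rng.normal(size=n)

# Enumerate all 2^n input masks and collect log p(y | x, mu) for each submodel.
log_probs = [np.log(softmax((x * np.array(mu)) @ W + b))
             for mu in itertools.product([0.0, 1.0], repeat=n)]

geo_mean = np.exp(np.mean(log_probs, axis=0))   # unnormalized geometric mean
p_ensemble = geo_mean / geo_mean.sum()          # renormalize over the classes

# Weight scaling rule: one forward pass with the weights multiplied by 1/2.
p_scaled = softmax(x @ (0.5 * W) + b)
```

The two distributions agree because averaging the logits over masks halves each input's contribution, and the mask-dependent normalizers cancel after renormalization.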
  • 33. 7.12 Dropout. Advantages of dropout: it is very computationally cheap; it does not significantly limit the type of model or training procedure. Limitations: it reduces the effective capacity of a model, so to offset this effect we must increase the size of the model; it is less effective when extremely few labeled training examples are available. When additional unlabeled data is available, unsupervised feature learning can gain an advantage over dropout.
  • 34. 7.12 Dropout. Fast dropout: analytical approximations to the sum over all submodels; a more principled approach than the weight scaling inference rule. Interpretation of dropout: an experiment using "dropout boosting", which uses exactly the same mask noise as dropout but trains the entire ensemble to jointly (not independently) maximize the log-likelihood on the training set, shows almost no regularization effect. This demonstrates that dropout is a type of bagging. Dropout in itself has no robustness to noise.
  • 35. 7.12 Dropout. Other approaches inspired by dropout: DropConnect: each product between a single scalar weight and a single hidden unit state is considered a unit that can be dropped. Stochastic pooling: builds ensembles of CNNs. Real-valued mask: multiplying the weights by $\mu \sim \mathcal{N}(1, I)$ can outperform dropout. Another view of dropout: dropout regularizes each hidden unit to be not merely a good feature but a feature that is good in many contexts. Masking can be seen as a form of highly intelligent, adaptive destruction of the information content of the input (rather than destruction of the raw input). It allows the model to make use of all the knowledge about the input distribution that it has acquired so far.
  • 36. 7.13 Adversarial Training. Adversarial example: an input $x'$ near a data point $x$ such that the model output is very different at $x'$. In many cases, a human observer cannot tell the difference between $x$ and $x'$. One of the causes of these examples is excessive linearity in NNs. The value of a linear function can change very rapidly if it has numerous inputs.
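The remark about linear functions with numerous inputs can be made concrete: if every input of $w^\top x$ is perturbed by $\epsilon$ in the direction $\mathrm{sign}(w)$, the output changes by $\epsilon\|w\|_1$, which grows with the input dimension even though each coordinate moves only by $\epsilon$. The perturbation direction, the value of `eps`, and the random `w` and `x` are assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                                  # per-coordinate perturbation size
changes, l1_norms = {}, {}
for n in (10, 1000, 100000):
    w = rng.normal(size=n)
    x = rng.normal(size=n)
    x_adv = x + eps * np.sign(w)            # each coordinate moves by at most eps
    changes[n] = w @ x_adv - w @ x          # change in the linear output
    l1_norms[n] = np.abs(w).sum()
```

With weights of comparable scale, the change in output grows roughly in proportion to the number of inputs, so a perturbation that is imperceptible per coordinate can still move the output substantially.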
  • 37. 7.13 Adversarial Training
  • 38. 7.13 Adversarial Training. Adversarial training: training on adversarially perturbed examples from the training set; a way of explicitly introducing a local constancy prior into NNs. Virtual adversarial example: suppose the model assigns some label $\hat{y}$ at a point $x$ that has no true label. We can seek an adversarial example $x'$ that causes the model to output a label $y' \neq \hat{y}$. We can then train the model to assign the same label to $x$ and $x'$. This encourages the model to learn a function that is robust to small changes, and it provides a means of semi-supervised learning.
  • 39. 7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier. Manifold hypothesis: the data lies near a low-dimensional manifold. Tangent distance algorithm: a non-parametric nearest neighbor algorithm, where the distance between points $x_1$ and $x_2$ is the distance between the manifolds $M_1$ and $M_2$ to which they respectively belong. Each $M_i$ is approximated by its tangent plane at $x_i$. The user has to specify the tangent vectors.
  • 40. 7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier. Tangent prop algorithm: [...skipped...] Double backprop: [...skipped...]