8. Squared Error Loss
• The Mean Squared Error (MSE) either assesses the quality of a predictor
(i.e., a function mapping arbitrary inputs to a sample of values of some
random variable), or of an estimator (i.e., a mathematical function
mapping a sample of data to an estimate of a parameter of the population
from which the data is sampled). The definition of an MSE differs according
to whether one is describing a predictor or an estimator.
Predictor Estimator
variance and bias relationship
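For reference, the two standard definitions can be written as below. The symbols (n, Y_i, \hat{Y}_i, \theta, \hat{\theta}) are notation assumed for this sketch and are not taken from the slide.

```latex
% MSE of a predictor over n observations
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2

% MSE of an estimator \hat{\theta} of a parameter \theta,
% and its decomposition into variance and squared bias
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]
  = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta}, \theta)^2
```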
11. Loss Functions
A loss function is a method that evaluates how well the algorithm learns the data
and produces correct outputs. It computes the distance between the predicted
value and the actual value using a mathematical formula.
In layman's terms, a loss function measures how wrong the model is in terms of
its ability to estimate the relationship between x and y.
Loss functions can be categorized into two groups:
• Classification - which is about predicting a label, by identifying which category an object
belongs to based on different parameters.
• Regression - which is about predicting a continuous output, by finding the correlations
between dependent and independent variables.
12. Loss function by category
Regression Loss Functions
Squared Error Loss
Absolute Error Loss
Huber Loss
Binary Classification Loss Functions
Binary Cross-Entropy
Hinge Loss
Multi-class Classification Loss Functions
Multi-class Cross Entropy Loss
Kullback Leibler Divergence Loss
14. Mean Squared Error (MSE)
Advantage: The MSE is great for ensuring that our trained
model has no outlier predictions with huge errors, since the
MSE puts larger weight on these errors due to the squaring
part of the function.
Disadvantage: If our model makes a single very bad prediction,
the squaring part of the function magnifies the error. Yet in
many practical cases we don’t care much about these outliers
and are aiming for a more well-rounded model that
performs well enough on the majority.
16. Mean Absolute Error (MAE)
Advantage: The beauty of the MAE is that its advantage
directly covers the MSE disadvantage. Since we are taking the
absolute value, all of the errors will be weighted on the same
linear scale. Thus, unlike the MSE, we won’t be putting too
much weight on our outliers and our loss function provides a
generic and even measure of how well our model is
performing.
Disadvantage: If we do in fact care about the outlier
predictions of our model, then the MAE won’t be as effective.
The large errors coming from the outliers end up being
weighted exactly the same as smaller errors. This might result in
our model being great most of the time, but making a few very
poor predictions every so often.
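The difference between the two losses is easiest to see on a small example with one bad prediction. The following is a minimal sketch; the data values and the function names `mse` and `mae` are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: four reasonable predictions and one very bad one.
y_true = [10.0, 10.0, 10.0, 10.0, 10.0]
y_pred = [9.0, 11.0, 10.0, 9.0, 20.0]

print("MSE:", mse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
```

In this example the single bad prediction accounts for 100 of the 103 units of total squared error, but only 10 of the 13 units of total absolute error.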
17. The MSE is great for learning outliers while the MAE is great
for ignoring them. But what about something in the
middle?
What this equation essentially says is: for loss values less
than delta, use the MSE; for loss values greater than delta,
use the MAE.
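The standard definition of the Huber loss, which is consistent with this description, can be sketched as follows. The symbols y (target), \hat{y} (prediction) and \delta (threshold) are assumed notation.

```latex
L_\delta(y, \hat{y}) =
\begin{cases}
  \frac{1}{2}\,(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\[4pt]
  \delta\left(|y - \hat{y}| - \frac{1}{2}\,\delta\right) & \text{otherwise}
\end{cases}
```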
Consider an example where we have a dataset
of 100 values we would like our model to be
trained to predict. Out of all that data, 25% of
the expected values are 5 while the other 75%
are 10.
An MSE loss wouldn’t quite do the trick, since
we don’t really have “outliers”; 25% is by no
means a small fraction. On the other hand we
don’t necessarily want to weight that 25% too
low with an MAE. Those values of 5 aren’t close
to the median (10 — since 75% of the points
have a value of 10), but they’re also not really
outliers.
Our solution?
18. Huber Loss
• Using the MAE for larger loss
values mitigates the weight that
we put on outliers so that we still
get a well-rounded model. At the
same time we use the MSE for
the smaller loss values to
maintain a quadratic function
near the center.
• This has the effect of magnifying
the loss values as long as they
are greater than 1. Once the loss
for those data points dips below
1, the quadratic function down-
weights them to focus the
training on the higher-error data
points.
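A minimal sketch of the Huber loss in plain Python follows. It assumes the standard piecewise definition (quadratic up to `delta`, linear beyond it); the function name and the data are illustrative.

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2                # MSE-like region
        else:
            total += delta * (err - 0.5 * delta)   # MAE-like region
    return total / len(y_true)


# Hypothetical data: four small errors and one large one.
y_true = [10.0, 10.0, 10.0, 10.0, 10.0]
y_pred = [9.0, 11.0, 10.0, 9.0, 20.0]

print("Huber:", huber(y_true, y_pred, delta=1.0))
```

The large error of 10 contributes 9.5 to the total here, so it is penalized linearly rather than quadratically.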
19. Entropy
• The core idea of information theory is that the "informational value" of a
communicated message depends on the degree to which the content of
the message is surprising.
• If a highly likely event occurs, the message carries very little information.
On the other hand, if a highly unlikely event occurs, the message is much
more informative.
• For instance, the knowledge that some particular number will not be the
winning number of a lottery provides very little information, because any
particular chosen number will almost certainly not win. However,
knowledge that a particular number will win a lottery has high
informational value because it communicates the outcome of a very low
probability event.
Entropy: Origin of the Second Law of Thermodynamics
20. Entropy
• Entropy measures the expected (i.e., average) amount of information
conveyed by identifying the outcome of a random trial. This implies
that casting a die has higher entropy than tossing a coin because
each outcome of a die roll has a smaller probability (p = 1/6)
than each outcome of a coin toss (p = 1/2).
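The coin and die comparison can be checked numerically. This is a minimal sketch using the Shannon entropy in bits, H = -sum(p * log2(p)); the function name `entropy` is illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


coin = [1 / 2] * 2   # fair coin: two equally likely outcomes
die = [1 / 6] * 6    # fair die: six equally likely outcomes

print("coin:", entropy(coin))
print("die: ", entropy(die))
```

For a fair coin the entropy is 1 bit; for a fair die it is log2(6), roughly 2.58 bits.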
21. https://www.youtube.com/watch?v=Tr_gv5CKB1Y
Boltzmann's Entropy Equation: A History
from Clausius to Planck
Lecture 04, concept 12: Deriving the Boltzmann distribution -
general case
https://www.youtube.com/watch?v=tDKjLzbXYQI
The Maxwell–Boltzmann distribution | AP Chemistry | Khan
Academy
https://www.youtube.com/watch?v=xQ9D4Jz95-A&t=10s
23. Relationship to thermodynamic entropy - Boltzmann
Data compression - Shannon
Entropy as a measure of diversity
Entropy in cryptography - Kullback–Leibler divergence (statistical distance)
Data as a Markov process
Entropy for continuous random variables
Prior probability?
38. What’s the Difference between a Loss
Function and a Cost Function?
• Although the terms cost function and loss function are often used
interchangeably, they are not the same.
• A loss function is for a single training example. It is also sometimes
called an error function. A cost function, on the other hand, is the
average loss over the entire training dataset. The optimization
strategies aim at minimizing the cost function.
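The distinction can be sketched in a few lines. The squared-error loss is used here only as an example; the function names and data are assumptions for illustration.

```python
def squared_loss(y_true, y_pred):
    """Loss: error for a single training example."""
    return (y_true - y_pred) ** 2


def cost(ys_true, ys_pred, loss_fn=squared_loss):
    """Cost: average loss over the whole training set."""
    losses = [loss_fn(t, p) for t, p in zip(ys_true, ys_pred)]
    return sum(losses) / len(losses)


# Hypothetical targets and predictions.
targets = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.0]

print("loss of last example:", squared_loss(targets[-1], predictions[-1]))
print("cost over dataset:   ", cost(targets, predictions))
```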
40. Learning Rate
The learning rate affects the amount by which you adjust parameters during
optimization in order to minimize the error of the neural network’s guesses. It
is a coefficient that scales the size of the steps (updates) a neural network
takes to its parameter vector x as it crosses the loss function space.
During backpropagation we multiply the error gradient by the learning rate, and
then update the connection weight from the last iteration with the product to
reach a new weight.
The learning rate determines how much of the gradient we want to use for the
algorithm’s next step. A large error and steep gradient combine with the
learning rate to produce a large step. As we approach minimal error and the
gradient flattens out, the step size tends to shorten.
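In the usual notation for plain gradient descent, this update can be written as below. The symbols w_t (weights at step t), \eta (learning rate) and L (loss) are assumed notation.

```latex
w_{t+1} = w_t - \eta \, \nabla_{w} L(w_t)
```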
41. Learning Rate
A large learning rate coefficient (e.g., 1) will make your parameters take leaps, and
a small one (e.g., 0.00001) will make them inch along slowly. Large leaps will save time
initially, but they can be disastrous if they lead us to overshoot our minimum. A
learning rate that is too large oversteps the nadir, making the algorithm bounce back and
forth on either side of the minimum without ever coming to rest.
In contrast, small learning rates should lead you eventually to an error minimum (it
might be a local minimum rather than a global one), but they can take a very long
time and add to the burden of an already computationally intensive process. Time
matters when neural network training can take weeks on large datasets. If you can’t
wait another week for the results, choose a moderate learning rate (e.g., 0.1) and
experiment with several others in the same ballpark to get the best speed and accuracy
at once.
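The three regimes can be reproduced on a toy problem. This is a minimal sketch that minimizes f(x) = x**2 with a fixed learning rate; the function, the starting point and the step count are assumptions chosen for illustration, not taken from the slides.

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x**2 (gradient 2*x) from x0 with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x   # step against the gradient, scaled by lr
    return x


for lr in (1.0, 0.1, 0.00001):
    print(f"lr={lr}: x after 50 steps = {gradient_descent(lr)}")
```

On this function a learning rate of 1 flips x between +5 and -5 forever, 0.1 converges close to the minimum at 0, and 0.00001 has barely moved after 50 steps.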
42. Regularization
Regularization helps with the effects of out-of-control parameters by using different
methods to minimize parameter size over time.
In mathematical notation, we see regularization represented by the coefficient
lambda, controlling the trade-off between finding a good fit and keeping the value
of certain feature weights low as the exponents on features increase.
L1 and L2 regularization help fight overfitting by making certain
weights smaller. Smaller-valued weights lead to simpler hypotheses, and simpler
hypotheses tend to generalize better. Unregularized weights with several
higher-order polynomials in the feature set tend to overfit the training set.
As the input training set size grows, the effect of regularization decreases and the
parameters tend to increase in magnitude. This is appropriate, because an excess
of features relative to training set examples leads to overfitting in the first place.
Bigger data is the ultimate regularizer.
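The shrinking effect of the coefficient lambda can be seen in the simplest possible case. This is a minimal sketch of L2-regularized (ridge) least squares with one feature and no intercept, where the minimizer of sum((y - w*x)**2) + lam * w**2 has the closed form w = sum(x*y) / (sum(x*x) + lam). The function name and data are illustrative.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)**2) + lam * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam}: w = {ridge_slope(xs, ys, lam)}")
```

Larger values of lambda pull the weight toward zero. Because sum(x*x) grows with the number of examples while lambda stays fixed, the same lambda has less influence on a larger training set.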
44. Intuitive Explanation of Ridge / Lasso Regression
https://www.youtube.com/watch?v=9LNpiiKCQUo&t=524s
45. Momentum
Momentum helps the learning
algorithm get out of spots in the
search space where it would
otherwise become stuck. In the
error-scape, it helps the updater find
the gulleys that lead toward the
minima. Momentum is to the
learning rate what the learning rate
is to weights, and it helps us produce
better-quality models.
47. Sparsity
The sparsity hyperparameter recognizes that for some inputs only a few features are
relevant. For example, let’s assume that a network can classify a million images. Any
one of those images will be indicated by a limited number of features. But to effectively
classify millions of images a network must be able to recognize considerably
more features, many of which don’t appear most of the time. An example of this
would be how photos of sea urchins don’t contain noses and hooves. This contrasts to
how in submarine images the nose and hoof features will be 0.
The features that indicate sea urchins will be few and far between, in the vastness of
the neural network’s layers. That’s a problem, because sparse features can limit the
number of nodes that activate and impede a network’s ability to learn.
In response to sparsity, biases force neurons to activate and the activations stay around a
mean that keeps the network from becoming stuck.
49. Optimization algorithms for deep learning
The gradient descent algorithm
is not the only optimization
algorithm available to optimize
our network weights. However,
it is the basis for most other
algorithms.
50. Using momentum with gradient descent
Using gradient descent with momentum speeds up gradient descent by
increasing the speed of learning in directions where the gradient has kept a
constant direction, while slowing learning in directions where the gradient
fluctuates. It allows the velocity of gradient descent to increase.
Momentum works by introducing a velocity term and using a weighted
moving average of that term in the update rule.
The momentum coefficient is most typically set to 0.9, and usually this is
not a hyperparameter that needs to be changed.
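One common formulation of the momentum update keeps an exponentially weighted average of past gradients as the velocity. The sketch below assumes that formulation; the names `beta` (momentum coefficient), `lr` (learning rate) and the toy objective (w - 3)**2 are illustrative choices, not taken from the slides.

```python
def momentum_descent(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum on a single scalar weight."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + (1 - beta) * g   # weighted moving average of gradients
        w = w - lr * v                  # step along the velocity
    return w


# Toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = momentum_descent(lambda w: 2 * (w - 3), w0=0.0)
print("w after training:", w_final)
```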
51. The RMSProp algorithm (Geoffrey Hinton)
RMSProp is another algorithm that can speed up gradient descent by
speeding up learning in some directions, and dampening oscillations
in other directions, across the multidimensional space that the
network weights represent:
https://www.coursera.org/learn/neural-networks/home/welcome
52. RMSprop
• There are two ways to introduce RMSprop.
• The first is to look at it as an adaptation of the rprop algorithm for
mini-batch learning. This was the initial motivation for developing the
algorithm.
• The other is to look at its similarities with Adagrad and view
RMSprop as a way to deal with Adagrad's radically diminishing learning
rates.
53. Rprop to RMSprop
• Rprop doesn’t really work when we have very large datasets and need to perform mini-batch weight updates. Why doesn’t it
work with mini-batches? People have tried it, but found it hard to make it work. The reason is that it
violates the central idea behind stochastic gradient descent, which is that when we have a small enough learning rate, it averages the
gradients over successive mini-batches. Consider a weight that gets a gradient of 0.1 on nine mini-batches and a gradient of
-0.9 on the tenth mini-batch. What we’d like is for those gradients to roughly cancel each other out, so that the weight stays approximately the
same. But that’s not what happens with rprop. With rprop, we increment the weight nine times and decrement it only once, so the weight
grows much larger.
• To combine the robustness of rprop (which uses just the sign of the gradient), the efficiency we get from mini-batches, and the averaging over
mini-batches that allows gradients to be combined in the right way, we must look at rprop from a different perspective. Rprop is
equivalent to using the gradient but also dividing by the size of the gradient, so we get the same magnitude no matter how big or
small that particular gradient is. The problem with mini-batches is that we divide by a different gradient every time, so why not force
the number we divide by to be similar for adjacent mini-batches? The central idea of RMSprop is to keep a moving average of the
squared gradients for each weight, and then divide the gradient by the square root of that mean square, which is why it’s called
RMSprop (root mean square). With math equations the update rule looks like this:
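A standard way of writing the rule described here is given below. The symbols are assumed notation: g_t is the gradient on the current mini-batch, \beta the moving-average parameter, \eta the learning rate, and \epsilon a small constant added for numerical stability.

```latex
E[g^2]_t = \beta \, E[g^2]_{t-1} + (1 - \beta) \, g_t^2

w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t
```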
54. • As you can see from the above equation, we adapt the learning rate by
dividing by the root of the squared gradient, but since we only have an
estimate of the gradient on the current mini-batch, we need instead
to use the moving average of it. The default value for the moving average
parameter that you can use in your projects is 0.9. It works very well
for most applications. In code the algorithm might look like this:
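A minimal sketch of RMSprop for a single scalar weight, assuming the moving-average formulation described above. The names `beta`, `lr`, `eps` and the toy objective are illustrative; `eps` is a small constant added here to avoid division by zero.

```python
import math


def rmsprop(grad_fn, w0, lr=0.01, beta=0.9, eps=1e-8, steps=1000):
    """RMSprop on a single scalar weight."""
    w, avg_sq = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = beta * avg_sq + (1 - beta) * g * g   # moving average of g**2
        w = w - lr * g / (math.sqrt(avg_sq) + eps)    # divide by its root
    return w


# Toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = rmsprop(lambda w: 2 * (w - 3), w0=0.0)
print("w after training:", w_final)
```

Because the gradient is divided by its own running magnitude, the step size stays close to `lr` regardless of how large the raw gradient is, so the weight ends up hovering near the minimum rather than settling exactly on it.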
55. Similarity with Adagrad
• Adagrad is an adaptive learning rate algorithm that looks a lot like
RMSprop. Adagrad adds element-wise scaling of the gradient based
on the historical sum of squares in each dimension. This means that
we keep a running sum of squared gradients, and then we adapt the
learning rate by dividing it by the square root of that sum. In code we
can express it like this:
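A minimal sketch of Adagrad for a single scalar weight, assuming the usual form in which the step is divided by the square root of the running sum of squared gradients. The names `lr`, `eps` and the toy objective are illustrative.

```python
import math


def adagrad(grad_fn, w0, lr=0.5, eps=1e-8, steps=500):
    """Adagrad on a single scalar weight."""
    w, sum_sq = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq += g * g                               # running sum, never decays
        w = w - lr * g / (math.sqrt(sum_sq) + eps)    # effective rate shrinks
    return w


# Toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = adagrad(lambda w: 2 * (w - 3), w0=0.0)
print("w after training:", w_final)
```

The only difference from RMSprop in this sketch is that `sum_sq` accumulates without decay, which is what makes the effective learning rate shrink over training.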
Adagrad: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic
Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from
http://jmlr.org/papers/v12/duchi11a.html
57. RMSprop
• What does this scaling do when we have a high condition number? If we have two
coordinates, one that always has big gradients and one that has small gradients, we’ll
be dividing by the corresponding big or small number, so we accelerate movement along
the direction with small gradients, and in the direction where gradients are large we slow
down because we divide by some large number.
• What happens over the course of training? Steps get smaller and smaller,
because the running sum of squared gradients keeps growing over training, so we divide by a
larger number every time. In convex optimization this makes a lot of sense, because
when we approach the minimum we want to slow down. In the non-convex case it’s bad, as we can
get stuck on a saddle point. We can look at RMSprop as an algorithm that addresses that
concern a little bit.
• With RMSprop we still keep that estimate of squared gradients, but instead of letting
that estimate continually accumulate over training, we keep a moving average of it.
59. The Adam optimizer
Adam is one of the best-performing known optimizers and it's my first
choice. It works well across a wide variety of problems. It combines the
best parts of both momentum and RMSProp into a single update rule:
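A minimal sketch of the Adam update for a single scalar weight, following the commonly published form with bias-corrected first and second moment estimates. The default values for `beta1`, `beta2` and `eps` are the commonly quoted ones; `lr`, the step count and the toy objective are assumptions for illustration.

```python
import math


def adam(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam on a single scalar weight."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # momentum-style first moment
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp-style second moment
        m_hat = m / (1 - beta1 ** t)           # bias correction for m
        v_hat = v / (1 - beta2 ** t)           # bias correction for v
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = adam(lambda w: 2 * (w - 3), w0=0.0)
print("w after training:", w_final)
```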
61. Bias and variance errors in deep learning
With traditional predictive models, there is usually some compromise between
the error from bias and the error from variance. So let's see what these two
errors are:
Bias error: Bias error is the error that is introduced by the model. For example, if
you attempted to model a non-linear function with a linear model, your model
would be underspecified and the bias error would be high.
Variance error: Variance error is the error that is introduced by randomness in
the training data. When we fit our training distribution so well that our model no
longer generalizes, we have overfit, or introduced a variance error.
62. The train, val, and test datasets
The val dataset, or the validation dataset, will be used to find ideal hyperparameters, and to
measure overfitting. At the end of an epoch, which is when the network has had the opportunity to
observe every data point in the training set, we will make a prediction on the val set. That prediction will
be used to watch for overfitting and will help us know when the network has finished training. Using
the val set at the end of each epoch like this somewhat differs from the typical usage.
64. K-Fold cross-validation: Too Expensive
• Shuffle the dataset randomly.
• Split the dataset into k groups.
• For each unique group:
    • Take the group as a hold-out or test data set.
    • Take the remaining groups as a training data set.
    • Fit a model on the training set and evaluate it on the test set.
    • Retain the evaluation score and discard the model.
• Summarize the skill of the model using the sample of model
evaluation scores.
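The procedure above can be sketched in plain Python. The helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mse_score`) and the toy model, which simply predicts the training mean, are hypothetical and chosen only to keep the example self-contained.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle the dataset
    folds = [indices[i::k] for i in range(k)]   # split into k groups
    for i in range(k):
        test_idx = folds[i]                     # hold out one group
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(xs, ys, k, fit, score):
    """Fit on each training split, score on the held-out split, keep scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
        # the model is discarded here; only the score is retained
    return sum(scores) / len(scores), scores


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def mse_score(model, xs, ys):
    """Mean squared error of the constant prediction on held-out data."""
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(10))
ys = [2.0 * x for x in xs]
mean_score, fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse_score)
print("per-fold scores:", fold_scores)
print("mean score:", mean_score)
```

Each of the k models is trained from scratch, which is why the procedure becomes expensive for networks that take a long time to train.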
66. Variations on Cross-Validation
• Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single
train/test split is created to evaluate the model.
• LOOCV: Taken to another extreme, k may be set to the total number of observations in
the dataset such that each observation is given a chance to be held out of the
dataset. This is called leave-one-out cross-validation, or LOOCV for short.
• Stratified: The splitting of data into folds may be governed by criteria such as ensuring
that each fold has the same proportion of observations with a given categorical value,
such as the class outcome value. This is called stratified cross-validation.
• Repeated: This is where the k-fold cross-validation procedure is repeated n times,
where importantly, the data sample is shuffled prior to each repetition, which results in a
different split of the sample.
• Nested: This is where k-fold cross-validation is performed within each fold of cross-
validation, often to perform hyperparameter tuning during model evaluation. This is
called nested cross-validation or double cross-validation.
75. The Mathematics of Neural Networks
https://www.youtube.com/watch?v=e5xKayCBOeU
82. Hyperparameters
In machine learning, we have both model parameters and parameters
we tune to make networks train better and faster. These tuning
parameters are called hyperparameters, and they deal with controlling
optimization functions and model selection during training with our
learning algorithm.
Hyperparameter selection focuses on ensuring that the model neither
underfits nor overfits the training dataset, while learning the structure
of the data as quickly as possible.
83. Hyperparameters
Hyperparameters fall into several categories:
• Layer size
• Magnitude (momentum, learning rate)
• Regularization (dropout, drop connect, L1, L2)
• Activations (and activation function families)
• Weight initialization strategy
• Loss functions
• Settings for epochs during training (mini-batch size)
• Normalization scheme for input data (vectorization)
84. Boltzmann Machine
• A Boltzmann machine (also called Sherrington–Kirkpatrick model with
external field or stochastic Ising–Lenz–Little model) is a stochastic spin-
glass model with an external field, i.e., a Sherrington–Kirkpatrick model,
that is a stochastic Ising model. It is a statistical physics technique applied
in the context of cognitive science. It is also classified as a Markov random
field.
• Boltzmann machines are theoretically intriguing because of the locality
and Hebbian nature of their training algorithm (being trained by Hebb's
rule), and because of their parallelism and the resemblance of their
dynamics to simple physical processes.
• Boltzmann machines with unconstrained connectivity have not been
proven useful for practical problems in machine learning or inference, but
if the connectivity is properly constrained, the learning can be made
efficient enough to be useful for practical problems.
85. Boltzmann Machine
• They are named after the Boltzmann distribution
in statistical mechanics, which is used in their
sampling function. They were heavily popularized
and promoted by Geoffrey Hinton, Terry
Sejnowski and Yann LeCun in cognitive sciences
communities and in machine learning. As a more
general class within machine learning these
models are called "energy based models" (EBM),
because Hamiltonians of spin glasses are used as
a starting point to define the learning task.
88. General Boltzmann Machine
A network of symmetrically connected, neuron-like units that make
stochastic decisions about whether to be on or off.
90. Restricted Boltzmann Machines (RBMs)
• A restricted Boltzmann machine (RBM) is a generative stochastic
artificial neural network that can learn a probability distribution over
its set of inputs.
• RBMs were initially invented under the name Harmonium by Paul
Smolensky in 1986, and rose to prominence after Geoffrey Hinton and
collaborators invented fast learning algorithms for them in the
mid-2000s. RBMs have found applications in dimensionality reduction,
classification, collaborative filtering, feature learning, topic
modelling and even many-body quantum mechanics. They can be
trained in either supervised or unsupervised ways, depending on the
task.
91. Restricted Boltzmann Machines (RBMs)
• As their name implies, RBMs are a variant of Boltzmann machines, with the
restriction that their neurons must form a bipartite graph: a pair of nodes
from each of the two groups of units (commonly referred to as the
"visible" and "hidden" units respectively) may have a symmetric
connection between them; and there are no connections between nodes
within a group. By contrast, "unrestricted" Boltzmann machines may have
connections between hidden units. This restriction allows for more
efficient training algorithms than are available for the general class of
Boltzmann machines, in particular the gradient-based contrastive
divergence algorithm.
• Restricted Boltzmann machines can also be used in deep learning
networks. In particular, deep belief networks can be formed by "stacking"
RBMs and optionally fine-tuning the resulting deep network with gradient
descent and backpropagation.
92. Restricted Boltzmann Machines (RBMs)
• The “restricted” part of the name “Restricted Boltzmann Machines”
means that connections between nodes of the same layer are
prohibited (e.g., there are no visible-visible or hidden-hidden
connections along which signal passes).
• RBMs are also a type of autoencoder
• https://github.com/echen/restricted-boltzmann-machines
93. Learning More About
Restricted Boltzmann Machines
A Practical Guide to Training Restricted Boltzmann Machines
http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
Bay Area Vision Meeting: Unsupervised Feature Learning and Deep
Learning
https://www.youtube.com/watch?v=ZmNOAtZIgIk
Restricted Boltzmann Machines for Collaborative Filtering
https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf
94. Network layout (RBM)
• Visible units
• Hidden units
• Weights
• Visible bias units
• Hidden bias units
95. Contrastive Divergence (FYI)
RBMs calculate gradients by using an algorithm called contrastive
divergence. Contrastive divergence is the name of the algorithm
used in sampling for the layer-wise pretraining of an RBM. Also
called CD-k, contrastive divergence minimizes the KL divergence
(the delta between the real distribution of the data and the guess)
by sampling k steps of a Markov chain to compute a guess.
96. Reconstruction Cross-Entropy
The objective function here is usually reconstruction cross-entropy,
or KL divergence (the mathematicians and cryptanalysts Solomon
Kullback and Richard Leibler first published a paper on the technique
in 1951). “Cross” refers to the comparison between two distributions.
“Entropy” is a term from information theory that refers to
uncertainty. For example, a normal curve with a wide spread, or
variance, also implies more uncertainty about where data points
will fall. That uncertainty is called entropy.
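The relationship between entropy, cross-entropy and KL divergence can be checked on a small example. This is a minimal sketch for discrete distributions using natural logarithms; the function names and the two distributions `p` and `q` are made up for illustration.

```python
import math


def entropy(p):
    """H(p) = -sum p_i * log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]   # hypothetical "real" distribution of the data
q = [0.9, 0.1]   # hypothetical model guess

print("entropy H(p):        ", entropy(p))
print("cross-entropy H(p,q):", cross_entropy(p, q))
print("KL(p || q):          ", kl_divergence(p, q))
```

Cross-entropy equals the entropy of p plus the KL divergence from p to q. Since the entropy of the data distribution does not depend on the model, minimizing one of the two quantities also minimizes the other.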
113. More references
Lecture 11/16 : Hopfield nets and Boltzmann machines
https://www.youtube.com/watch?v=IP3W7cI01VY
Lecture 11.5 — How a Boltzmann machine models data — [ Deep
Learning | Geoffrey Hinton | UofT ]
https://www.youtube.com/watch?v=kytxEr0KK7Q
Neural networks [5.7] : Restricted Boltzmann machine - example
https://www.youtube.com/watch?v=n26NdEtma8U
S18 Lecture 22: Boltzmann Machines
https://www.youtube.com/watch?v=it_PXVIMyWg&t=6s
114. Four Major Architectures of Deep Networks
• Unsupervised Pretrained Networks (UPNs)
    • Autoencoders
    • Deep Belief Networks (DBNs)
    • Generative Adversarial Networks (GANs)
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks
• Recursive Neural Networks
115. Unsupervised Pretrained Networks (UPNs)
• In SGD optimization, one typically initializes model weights at random and tries to
move towards minimum cost by following the negative gradient of the objective
function. For deep nets, this has not shown much success, which is believed to
be a result of the extremely non-convex (and high-dimensional) nature of their
objective function.
• What Y. Bengio and others found was that, instead of starting from random
weights and hoping that SGD will take you to the minimum of such a rugged
landscape, you can pre-train each layer like an autoencoder.
• Here is how it works: you build an autoencoder with the first layer as the encoding layer
and its transpose as the decoder. You train it unsupervised, that is, you
train it to reconstruct the input (refer to autoencoders; they are great for
unsupervised feature extraction tasks). Once trained, you fix the weights of that layer
to those you just found. Then you move to the next layers and repeat the same until
you have pre-trained all layers of the deep net (a greedy approach). At this point, you go back to
the original problem that you wanted to solve with the deep net
(classification/regression) and optimize it with SGD, but starting from the weights
you just learned during pre-training.
• They found that this gives much better results. I think no one knows exactly why
this works, but the idea is that by pre-training you start from more favorable
regions of feature space.
116. Unsupervised Pretrained Networks (UPNs)
• Unsupervised pre-training initializes a discriminative neural net from
one which was trained using an unsupervised criterion, such as a
deep belief network or a deep autoencoder. This method can
sometimes help with both the optimization and the overfitting issues.
• Unsupervised Pre-training Acts as a Regularizer
117. Autoencoders
• We use autoencoders to learn compressed representations of
datasets. Typically, we use them to reduce a dataset’s
dimensionality. The output of the autoencoder network is a
reconstruction of the input data in the most efficient form.
120. Defining Features of Autoencoders
• Autoencoders differ from multilayer perceptrons in a couple of ways:
• They use unlabeled data in unsupervised learning.
• They build a compressed representation of the input data.
• Unsupervised learning of unlabeled data. The autoencoder learns
directly from unlabeled data. This is connected to the second major
difference between multilayer perceptrons and autoencoders.
• Learning to reproduce the input data. The goal of a multilayer
perceptron network is to generate predictions over a class (e.g., fraud
versus not fraud). An autoencoder is trained to reproduce its own
input data.
121. Defining Features of Autoencoders
• Autoencoders rely on backpropagation to update their weights. The main difference
between RBMs and the more general class of autoencoders is in how they calculate the
gradients.
• Two important variants of autoencoders to note are compression autoencoders and
denoising autoencoders.
Compression autoencoders. The network
input must pass through a bottleneck region of the network before being expanded
back into the output representation.
Denoising autoencoders. The denoising autoencoder is the scenario in which the
autoencoder is given a corrupted version (e.g., some features are removed randomly)
of the input and the network is forced to learn the uncorrupted output.
122. Variational Autoencoders
• A more recent type of autoencoder
model is the variational autoencoder
(VAE) introduced by Kingma and
Welling. The VAE is similar to
compression and denoising
autoencoders in that they are all
trained in an unsupervised manner to
reconstruct inputs.
• However, the mechanisms that the
VAEs use to perform training are quite
different. In a compression/denoising
autoencoder, activations are mapped
to activations throughout the layers,
as in a standard neural network;
comparatively, a VAE uses a
probabilistic approach for the
forward pass.
https://arxiv.org/abs/1312.6114
123. Deep Belief Networks
• DBNs are composed of layers of Restricted Boltzmann Machines
(RBMs) for the pretrain phase and then a feed-forward network for
the fine-tune phase.
124. Feature Extraction with RBM Layers
We use RBMs to extract higher-level features from the raw input
vectors. To do that, we want to set the hidden unit states and weights
such that when we show the RBM an input record and ask the RBM to
reconstruct the record, it generates something pretty close
to the original input vector. Hinton talks about this effect in terms of
how machines “dream about data.”
The fundamental purpose of RBMs in the context of deep learning and
DBNs is to learn these higher-level features of a dataset in an
unsupervised training fashion. It was discovered that we could train
better neural networks by letting RBMs learn progressively higher-level
features using the learned features from a lower level RBM pretrain
layer as the input to a higher-level RBM pretrain layer.
125. Learning Higher-order Features Automatically
Activation render at the beginning of training
Features emerge in later activation render
Portions of MNIST digits emerge towards end of training
126. Building Autoencoders in Keras
a simple autoencoder based on a fully-connected layer
a sparse autoencoder
a deep fully-connected autoencoder
a deep convolutional autoencoder
an image denoising model
a sequence-to-sequence autoencoder
a variational autoencoder
https://blog.keras.io/building-autoencoders-in-keras.html
128. What are autoencoders?
• "Autoencoding" is a data compression algorithm where the compression and decompression functions are
1) data-specific, 2) lossy, and 3) learned automatically from examples rather than engineered by a human.
Additionally, in almost all contexts where the term "autoencoder" is used, the compression and
decompression functions are implemented with neural networks.
• 1) Autoencoders are data-specific, which means that they will only be able to compress data similar to what
they have been trained on. This is different from, say, the MPEG-2 Audio Layer III (MP3) compression
algorithm, which only holds assumptions about "sound" in general, but not about specific types of sounds.
An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees,
because the features it would learn would be face-specific.
• 2) Autoencoders are lossy, which means that the decompressed outputs will be degraded compared to the
original inputs (similar to MP3 or JPEG compression). This differs from lossless arithmetic compression.
• 3) Autoencoders are learned automatically from data examples, which is a useful property: it means that it
is easy to train specialized instances of the algorithm that will perform well on a specific type of input. It
doesn't require any new engineering, just appropriate training data.
• The fact that autoencoders are data-specific makes them generally impractical for real-world data
compression problems: you can only use them on data that is similar to what they were trained on, and
making them more general thus requires lots of training data.
131. Deep Belief Network (DBN)
• In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively
a class of deep neural network, composed of multiple layers of latent variables ("hidden units"),
with connections between the layers but not between units within each layer.
• When trained on a set of examples without supervision, a DBN can learn to probabilistically
reconstruct its inputs. The layers then act as feature detectors. After this learning step, a DBN can
be further trained with supervision to perform classification.
• DBNs can be viewed as a composition of simple, unsupervised networks such as restricted
Boltzmann machines (RBMs) or autoencoders, where each sub-network's hidden layer serves as
the visible layer for the next. An RBM is an undirected, generative energy-based model with a
"visible" input layer and a hidden layer and connections between but not within layers. This
composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive
divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the
lowest visible layer is a training set).
• The observation that DBNs can be trained greedily, one layer at a time, led to one of the first
effective deep learning algorithms. Overall, there are many attractive implementations and
uses of DBNs in real-life applications and scenarios (e.g., electroencephalography, drug discovery).
http://www.scholarpedia.org/article/Deep_belief_networks
136. Generative Adversarial Networks (GANs)
• The Generative Adversarial Network (GAN) comprises two models: a generative model G and a
discriminative model D. The generative model can be considered a counterfeiter who is trying to
generate fake currency and use it without being caught, whereas the discriminative model is
similar to the police, trying to catch the fake currency. This competition goes on until the
counterfeiter becomes smart enough to successfully fool the police.
139. Generative Adversarial Networks (GANs)
• Discriminator: The role is to distinguish between actual and generated
(fake) data.
• Generator: The role is to create data in such a way that it can fool the
discriminator.
140. Derivation of the loss function
Discriminator loss
The objective of the discriminator is to correctly classify the fake and real data. For this, equations (1) and (2)
should be maximized, and the final loss function for the discriminator can be given as,
141. Generator loss
The generator is competing against the discriminator. So, it will try
to minimize equation (3), and its loss function is given as,
Combined loss function
Remember that the above loss function is valid only for a single
data point; to consider the entire dataset we need to take the
expectation of the above equation as
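The equations on these slides were images and are not reproduced in the text. As a hedged reconstruction, assuming the standard minimax formulation of GANs (the slide numbering (1) to (6) cannot be recovered, and the notation below is an assumption), the per-sample losses and their expectation over the dataset can be written as:

```latex
% Assumed notation: D(x) is the discriminator's estimated probability that x is real,
% G(z) is the generator's output for a noise sample z.
\begin{align*}
L_D &= \max_D \left[ \log D(x) + \log\bigl(1 - D(G(z))\bigr) \right] \\
L_G &= \min_G \left[ \log\bigl(1 - D(G(z))\bigr) \right] \\
\min_G \max_D V(D, G) &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\end{align*}
```

The first term rewards the discriminator for scoring real data highly; the second rewards it for scoring generated data low, which is exactly the term the generator tries to reduce.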
142. It can be noticed from the above algorithm that the generator and discriminator are
trained separately. In the first section, real data and fake data are fed into the
discriminator with correct labels and training takes place. Gradients are propagated
keeping the generator fixed. Also, we update the discriminator by ascending its stochastic
gradient because for the discriminator we want to maximize the loss function given in
equation (6).
On the other hand, we update the generator by keeping the discriminator fixed and passing
fake data with fake labels in order to fool the discriminator. Here, we update the generator
by descending its stochastic gradient because for the generator we want to minimize the
loss function given in equation (6).
145. Limitations: Mode Collapse
• During training, the generator may get stuck in a setting where it
always produces the same output. This is called mode collapse. It
happens because the main aim of G is to fool D, not to generate
diverse output.
149. Convolution
• A convolution is an operation on two
vectors, matrices, or tensors, that
returns a third vector, matrix, or tensor.
• https://deeplearningmath.org/convolutional-neural-networks.html
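As a small illustration of "two vectors in, a third vector out", here is a minimal sketch of a full 1-D discrete convolution in plain Python. The function name `convolve_1d` and the sample values are illustrative assumptions, not taken from the slides.

```python
def convolve_1d(signal, kernel):
    """Full discrete convolution of two 1-D sequences.

    out[n] = sum over i + j == n of signal[i] * kernel[j]
    """
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


signal = [1, 2, 3]
kernel = [1, 0, -1]
result = convolve_1d(signal, kernel)
swapped = convolve_1d(kernel, signal)  # convolution is commutative
print(result)   # [1, 2, 2, -2, -3]
print(swapped)  # [1, 2, 2, -2, -3]
```

The same idea extends to matrices and tensors by summing products over every overlapping position in each dimension.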
155. What is a Color Model?
• A color model is an abstract mathematical model that describes how colors
can be represented as a set of numbers (e.g., a triple in RGB or a quad in
CMYK). Color models can usually be described using a coordinate system,
and each color in the system is represented by a single point in the
coordinate space.
• For a given color model, to interpret a tuple or quad as a color, we can
define a set of rules and definitions used to accurately calibrate and
generate colors, i.e. a mapping function. A color space identifies a specific
combination of color models and mapping functions. Identifying the color
space automatically identifies the associated color model. For example,
Adobe RGB and sRGB are two different color spaces, both based on the
RGB color model.
156. RGB
• RGB color model stores individual values for red, green, and blue. With a color space based on the
RGB color model, the three primaries are added together to create colors from completely white to
completely black.
• The RGB color space is associated with the device. Thus, different scanners get different color image
data when scanning the same image; different monitors have different color display results when
rendering the same image.
• There are many different RGB color spaces derived from this color model; standard RGB (sRGB) is a
popular example.
157. HSV
• HSV (hue, saturation, value), also known as HSB (hue, saturation, brightness),
is often used by artists because it is often more natural to think about a color
in terms of hue and saturation than in terms of additive or subtractive color
components.
• The system is closer to people’s experience and perception of color than RGB.
For example, in painting terms, hue, saturation, and values are expressed in
terms of color, shading, and toning.
• The HSV model space can be described as an inverted hexagonal pyramid.
• The top surface is a regular hexagon, showing the change in hue in the H
direction; the range from 0° to 360° covers the entire spectrum of visible light. The six
corners of the hexagon represent the positions of the six colors red, yellow,
green, cyan, blue, and magenta, which are 60° apart.
• The saturation S is represented by the S direction from the center to the
hexagonal boundary, and the value varies from 0 to 1. The closer to the
hexagonal boundary, the higher the color saturation. The color of the
hexagonal boundary is the most saturated, i.e. S = 1; the color saturation at
the center of the hexagon is 0, i.e. S = 0.
• The height of the hexagonal pyramid (also known as the central axis) is
denoted by V, which represents a black to white gradation from bottom to
top. The bottom of V is black, V = 0; the top of V is white, V = 1.
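The geometry above can be checked with Python's standard `colorsys` module. This is a minimal sketch: the helper `rgb255_to_hsv` is a hypothetical name, and it assumes 8-bit RGB input; `colorsys.rgb_to_hsv` itself works on floats in [0, 1] and returns hue as a fraction of a full turn.

```python
import colorsys


def rgb255_to_hsv(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation 0..1, value 0..1)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 360), round(s, 3), round(v, 3)


# The six corners of the hexagon, 60 degrees apart.
corners = {
    "red": (255, 0, 0),
    "yellow": (255, 255, 0),
    "green": (0, 255, 0),
    "cyan": (0, 255, 255),
    "blue": (0, 0, 255),
    "magenta": (255, 0, 255),
}
hues = {name: rgb255_to_hsv(*rgb)[0] for name, rgb in corners.items()}
print(hues)

# The central axis: S = 0, with V running from black (0) to white (1).
print(rgb255_to_hsv(0, 0, 0))        # (0, 0.0, 0.0)
print(rgb255_to_hsv(255, 255, 255))  # (0, 0.0, 1.0)
```

Fully saturated primaries sit on the hexagon boundary (S = 1), while grays sit on the central axis (S = 0).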
158. YUV
• The Y′UV model defines a color space in terms of one luma component (Y′) and two
chrominance (UV) components. The Y′ channel stores the black-and-white data: if there is only
the Y component and there are no U and V components, the image represented is grayscale.
• The Y component can be calculated with the following equation: Y = 0.299*R + 0.587*G + 0.114*B,
which is the commonly used grayscale formula. The color differences U and V are B-Y and R-Y
scaled by different proportions.
• Compared with RGB, Y′UV does not necessarily store a triple for each pixel. Y′UV images
can be sampled in several different ways. For example, YUV420 saves one luma
component for every point and two chroma values, a Cb (U) value and a Cr (V) value,
for every 2×2 block of points, i.e., 6 bytes per 4 pixels.
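The luma formula and the "6 bytes per 4 pixels" figure can be reproduced with a short sketch. The function names are hypothetical, and the byte count assumes 8 bits per sample and even image dimensions.

```python
def luma(r, g, b):
    """Grayscale (luma) value from RGB using the weights quoted above."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def yuv420_bytes(width, height):
    """Storage for one 8-bit YUV420 frame (assumes even width and height)."""
    luma_samples = width * height                      # one Y per pixel
    chroma_samples = 2 * (width // 2) * (height // 2)  # one U and one V per 2x2 block
    return luma_samples + chroma_samples


def rgb24_bytes(width, height):
    """Storage for the same frame as 8-bit RGB triples."""
    return 3 * width * height


print(yuv420_bytes(2, 2))  # 6 bytes for 4 pixels
print(rgb24_bytes(2, 2))   # 12 bytes for 4 pixels
print(luma(255, 0, 0))     # red contributes about 76 of 255
```

For any even-sized frame, YUV420 therefore needs half the storage of 24-bit RGB.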
159. YUV
• The scope of the terms Y′UV, YUV, YCbCr, YPbPr, etc., is sometimes
ambiguous and overlapping. Historically, the terms YUV and Y′UV
were used for a specific analog encoding of color information in
television systems, while YCbCr was used for digital encoding of color
information suited for video and still-image compression and
transmission such as MPEG and JPEG. Today, the term YUV is
commonly used in the computer industry to describe file-formats that
are encoded using YCbCr.
160. Color Space
Depending on the information represented by each pixel, images can be divided into binary images, grayscale images, RGB images,
and index images, etc.
Binary Image
In a binary image, the pixel value is represented by a 0 or 1. Generally, 0 is for black and 1 is for white.
Grayscale Image
A grayscale image adds intermediate shades between the black and white of a binary image. Such images are
usually displayed as shades of gray from the darkest black to the brightest white; each shade is called a gray level, and
the number of levels is usually denoted by L. In grayscale images, pixels can take integer values between 0 and L-1.
RGB Image
In RGB, or color, images, the information for each pixel requires a tuple of numbers to represent, so we need a three-dimensional
matrix to represent an image. Almost all colors in nature can be composed of three colors: red (R), green (G), and blue (B). So each
pixel can be represented by a red/green/blue tuple in an RGB image.
Indexed Image
An indexed image consists of a colormap matrix, which uses direct mapping of pixel values in an array to colormap values. The color
of each pixel in an image is determined by using the corresponding value. We discuss this in more detail below.
161. How an Image is Stored in Memory
• The x86 hardware does not have an addressing mode that accesses elements of
multi-dimensional arrays. When an image is loaded into memory, the
multi-dimensional object is converted into a one-dimensional array.
Row-major or column-major ordering is commonly used.
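The two orderings differ only in how a (row, column) pair maps to a flat offset. A minimal sketch with hypothetical helper names and a made-up 2 x 3 single-channel image:

```python
def row_major_index(row, col, n_cols):
    """Offset of element (row, col) when rows are stored one after another."""
    return row * n_cols + col


def column_major_index(row, col, n_rows):
    """Offset of element (row, col) when columns are stored one after another."""
    return col * n_rows + row


image = [
    [10, 11, 12],
    [20, 21, 22],
]
n_rows, n_cols = len(image), len(image[0])

flat_row_major = [image[r][c] for r in range(n_rows) for c in range(n_cols)]
flat_col_major = [image[r][c] for c in range(n_cols) for r in range(n_rows)]

print(flat_row_major)  # [10, 11, 12, 20, 21, 22]
print(flat_col_major)  # [10, 20, 11, 21, 12, 22]
print(flat_row_major[row_major_index(1, 2, n_cols)])     # 22
print(flat_col_major[column_major_index(1, 2, n_rows)])  # 22
```

Both layouts hold the same pixels; only the index arithmetic, and therefore the memory access pattern, changes.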
162. Seven Grayscale Conversion Algorithms
• Method 1 - Averaging (aka “quick and dirty”)
• Method 2 - Correcting for the human eye (sometimes called “luma”
or “luminance,” though such terminology isn’t really accurate)
• Method 3 – Desaturation
• Method 4 - Decomposition (think of it as de-composition, e.g. not the
biological process!)
• Method 5 - Single color channel
• Method 6 - Custom # of gray shades
• Method 7 - Custom # of gray shades with dithering (in this example,
horizontal error-diffusion dithering)
https://tannerhelland.com/2011/10/01/grayscale-image-algorithm-vb6.html
164. M5 Gray = Red …or: Gray = Green …or: Gray = Blue
M6
ConversionFactor = 255 / (NumberOfShades - 1)
AverageValue = (Red + Green + Blue) / 3
Gray = Integer((AverageValue / ConversionFactor) + 0.5) * ConversionFactor
Notes:
-NumberOfShades is a value between 2 and 256
-technically, any grayscale algorithm could be used to calculate AverageValue; it simply
provides an initial gray value estimate
-the "+ 0.5" addition is an optional parameter that imitates rounding the value of an
integer conversion; YMMV depending on which programming language you use, as
some round automatically
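Method 6 (M6) can be sketched in Python as follows. This is not the original VB6 code: the function name `gray_with_shades` is hypothetical, `int()` stands in for the `Integer(...)` truncation, and rounding the final product to a whole gray level is an added assumption.

```python
def gray_with_shades(red, green, blue, number_of_shades):
    """Quantize a pixel to one of `number_of_shades` evenly spaced gray levels."""
    if not 2 <= number_of_shades <= 256:
        raise ValueError("number_of_shades must be between 2 and 256")
    conversion_factor = 255 / (number_of_shades - 1)
    # Any grayscale method could supply this initial estimate; averaging is used here.
    average_value = (red + green + blue) / 3
    # Adding 0.5 before truncating imitates rounding to the nearest level.
    level = int(average_value / conversion_factor + 0.5)
    return round(level * conversion_factor)


# With 4 shades the only possible outputs are 0, 85, 170, and 255.
print(gray_with_shades(100, 100, 100, 4))  # 85
print(gray_with_shades(130, 130, 130, 4))  # 170
# With 2 shades the result is pure black or pure white.
print(gray_with_shades(128, 128, 128, 2))  # 255
```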
M7
167. Check out
• How to Colorize an Image
• https://tannerhelland.com/2011/04/28/colorize-image-vb6.html
https://github.com/tannerhelland/vb6-code/tree/master/Colorize-effect
177. Sobel operator
• The Sobel operator, sometimes
called the Sobel–Feldman
operator or Sobel filter, is used
in image processing and
computer vision, particularly
within edge detection
algorithms where it creates an
image emphasising edges. It is
named after Irwin Sobel and
Gary Feldman, colleagues at the
Stanford Artificial Intelligence
Laboratory (SAIL)
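The slide does not show the kernels, so the sketch below assumes the commonly quoted 3 x 3 Sobel kernels and applies them as a sliding-window (cross-correlation) filter without padding. The helper names and the test image are illustrative; sign and orientation conventions vary between implementations.

```python
import math

# Assumed standard Sobel kernels: horizontal and vertical derivative estimates.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def correlate_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def sobel_magnitude(image):
    """Gradient magnitude sqrt(Gx^2 + Gy^2) at every valid position."""
    gx = correlate_valid(image, SOBEL_X)
    gy = correlate_valid(image, SOBEL_Y)
    return [[math.hypot(x, y) for x, y in zip(rx, ry)] for rx, ry in zip(gx, gy)]


# A 5 x 5 image with a vertical edge between the second and third columns.
image = [[0, 0, 10, 10, 10] for _ in range(5)]
edges = sobel_magnitude(image)
for row in edges:
    print(row)  # [40.0, 40.0, 0.0]
```

The response is large where the window straddles the edge and zero in the flat region, which is what "an image emphasising edges" means in practice.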
191. Convolutional Neural Networks (CNNs)
The goal of a CNN is to learn higher-order
features in the data via convolutions.
They are well suited to object recognition
with images and consistently top image
classification competitions. They can
identify faces, individuals, street signs,
platypuses, and many other aspects of
visual data. CNNs overlap with text
analysis via optical character recognition,
but they are also useful when analyzing
words as discrete textual units. They’re
also good at analyzing sound.
The efficacy of CNNs in image recognition
is one of the main reasons why the world
recognizes the power of deep learning.
CNNs are good at building position-
and (somewhat) rotation-invariant
features from raw image data.
https://poloclub.github.io/cnn-explainer/
192. CNNs and Structure in Data
CNNs tend to be most useful when there is some structure to the input data. Examples are image
and audio data, which have a specific set of repeating patterns and in which input values next to each other are related
spatially. Conversely, the columnar data exported from a relational database management system (RDBMS)
tends to have no structural relationships spatially. Columns next to one another just happen to be materialized
that way in the exported materialized view.
193. The motivation of convolutions
• Sparse interaction, or Local connectivity.
• The receptive field of the neuron, or the filter size.
• The connections are local in space (width and height), but
always full in depth
• A set of learnable filters
• Parameter sharing: the weights are tied
• Equivariant representation, translation invariant
194. Convolution and matrix multiplication
• Discrete convolution can be viewed as multiplication
by a matrix
• The kernel is a doubly block circulant matrix
• It is very sparse!
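A minimal 1-D sketch of this view: the kernel is unrolled into a banded (Toeplitz) matrix whose product with the input equals the full convolution. This is an assumed simplification; the doubly block circulant structure named above is the 2-D, circular analogue. The helper names are hypothetical.

```python
def conv_matrix(kernel, n):
    """Matrix T such that T times x equals the full convolution of kernel with x (len n)."""
    m = len(kernel)
    return [
        [kernel[i - j] if 0 <= i - j < m else 0 for j in range(n)]
        for i in range(n + m - 1)
    ]


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


kernel = [1, 0, -1]
x = [1, 2, 3]
T = conv_matrix(kernel, len(x))
for row in T:
    print(row)
y = matvec(T, x)
print(y)  # [1, 2, 2, -2, -3]

# Every row reuses the same few kernel weights; all other entries are zero.
zeros = sum(entry == 0 for row in T for entry in row)
```

As the input grows, the kernel stays the same size, so the fraction of zero entries in the matrix approaches one.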
195. The ‘convolution’ operation
• The convolution is commutative because we have flipped the kernel
• Many libraries implement a cross-correlation, without flipping the kernel
• A convolution can be defined in 1, 2, 3, or N dimensions
• The 2D convolution is different from a real 3D convolution, which
integrates spatio-temporal information; the standard CNN convolution
spreads only spatially
• In a CNN, even for 3D RGB input images, the standard convolution is
2D in each channel
• Each channel has a different filter or kernel; the per-channel convolutions are
then summed over all channels to produce a scalar that is passed to the nonlinear
activation
• The filter in each channel is not normalized, so there is no need for separate
linear combination coefficients
• A 1*1 convolution is a dot product across channels, i.e., a linear combination
of the different channels
• The backward pass of a convolution is also a convolution, with
spatially flipped filters.
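The per-channel-then-sum behaviour, and the special case of a 1*1 convolution, can be sketched in plain Python. The helper names and the tiny 2-channel input are illustrative assumptions, and the sliding window is applied without flipping (cross-correlation), as most libraries do.

```python
def correlate_valid(image, kernel):
    """2-D sliding-window sum of products, no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def conv_multichannel(volume, filt):
    """Apply one 2-D kernel per channel, then sum the results over channels."""
    per_channel = [correlate_valid(ch, k) for ch, k in zip(volume, filt)]
    rows, cols = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(m[i][j] for m in per_channel) for j in range(cols)] for i in range(rows)]


# Two channels of a 2 x 2 input.
volume = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]
# A 1*1 filter holds a single weight per channel: a linear combination of channels.
one_by_one = [[[2]], [[0.5]]]
combined = conv_multichannel(volume, one_by_one)
print(combined)  # [[7.0, 14.0], [21.0, 28.0]]
```

Each output value is 2 times the first channel plus 0.5 times the second at the same pixel, so the 1*1 case mixes channels without any spatial spreading.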
VisGraph, HKUST
196. The convolution layers
• Stacking several small convolution layers is different from
cascading convolutions
• Each small convolution is followed by a ReLU nonlinearity
• The nonlinearities make the features more expressive!
• Small filters give fewer parameters, but need more memory.
• Cascading simply enlarges the spatial extent, the receptive field
• Is each conv layer also followed by pooling?
• LeNet does not!
• AlexNet first did not.
197. The Pooling Layer
• Reduce the spatial size
• Reduce the number of parameters
• Avoid over-fitting
• Backpropagation for a max: only routing the gradient to
the input that had the highest value in the forward pass
• It is unclear whether the pooling is essential.
198. Pooling layer down-samples the volume spatially, independently in each depth
slice of the input volume.
Left: the input volume of size [224x224x64] is pooled with filter size 2, stride 2
into output volume of size [112x112x64]. Notice that the volume depth is
preserved.
Right: The most common down-sampling operation is max, giving rise to max
pooling, here shown with a stride of 2. That is, each max is taken over 4
numbers (little 2x2 square).
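A minimal sketch of max pooling on a single depth slice, plus the shape arithmetic for the [224x224x64] example. The function names and the 4 x 4 sample values are illustrative assumptions.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling over one depth slice (no padding)."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def pooled_shape(height, width, depth, size=2, stride=2):
    """Output volume shape: spatial size shrinks, depth is preserved."""
    return ((height - size) // stride + 1, (width - size) // stride + 1, depth)


slice_4x4 = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(slice_4x4)
print(pooled)                       # [[6, 8], [3, 4]]
print(pooled_shape(224, 224, 64))   # (112, 112, 64)
```

Each output value is the maximum of one 2 x 2 square, so three of every four activations in a slice are discarded.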
200. AlexNet 2012
201. The convolution and pooling act as an
infinitely strong prior!
• A strong prior has very low entropy, e.g. a Gaussian
with low variance
• An infinitely strong prior says that some parameters
are forbidden, and places zero probability on them
• The convolution ‘prior’ says that the weights of neighboring units must be identical (shared)
and that weights outside the small receptive field must be zero
• The pooling enforces invariance to small translations
202. The neuroscientific basis for CNN
• The primary visual cortex, V1, about which we know the most
• The lateral geniculate nucleus (LGN) carries the signal from the eye to V1, which is located at the
back of the head; a convolutional layer captures three aspects of V1
• It has a 2-dimensional structure
• V1 contains many simple cells, linear units
• V1 has many complex cells, corresponding to features with shift invariance, similar to pooling
• When viewing an object, info flows from the retina, through LGN, to V1, then
onward to V2, then V4, then IT, inferotemporal cortex, corresponding to the last
layer of CNN features
• Not modeled at all: the mammalian vision system has an attention mechanism
• The human eye is mostly very low resolution, except for a tiny patch called the fovea.
• The human brain makes several eye movements, called saccades, to salient parts of the scene
• Human vision perceives 3D
• A simple cell responds to a specific spatial frequency of brightness in a specific
direction at a specific location, i.e., Gabor-like functions
203. Receptive field
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of
neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to
a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there
are multiple neurons (5 in this example) along the depth, all looking at the same region in the input
- see discussion of depth columns in text below. Right: The neurons from the Neural Network
chapter remain unchanged: They still compute a dot product of their weights with the input
followed by a non-linearity, but their connectivity is now restricted to be local spatially.
208. CNN architectures
• The conventional linear structure: a linear list of layers, feedforward
• Generally a DAG, directed acyclic graph
• ResNet simply adds the input of a block back to its output
• Different terminology: complex layer and simple layer
– A complex (complete) convolutional layer includes different stages, such
as the convolution per se, batch normalization, nonlinearity, and pooling
– In the simple-layer terminology, each stage is a layer, even if it has no parameters
• Traditional CNNs are just a few complex convolutional layers that
extract features, followed by a softmax classification
output layer
• Convolutional networks can output a high-dimensional, structured
object rather than just a class label for a classification
task or a real value for a regression task; the output is a tensor
– S_i,j,k is the probability that pixel (j,k) belongs to class i
209. The popular CNN
• LeNet, 1998
• AlexNet, 2012
• VGGNet, 2014
• ResNet, 2015
213. Stochastic Gradient Descent
• Gradient descent follows the gradient of the
entire training set downhill
• Stochastic gradient descent follows the gradient
of randomly selected minibatches downhill
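The difference is only in which examples feed each gradient step. Below is a minimal sketch of minibatch SGD fitting a single slope w in y = w * x with a squared-error loss; the function name, learning rate, batch size, and toy data are all illustrative assumptions.

```python
import random


def sgd_fit_slope(xs, ys, lr=0.01, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x by following minibatch gradients of the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random partition into minibatches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the minibatch with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # step downhill
    return w


xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]  # noise-free data generated with slope 3
w = sgd_fit_slope(xs, ys)
print(round(w, 6))
```

Setting `batch_size=len(xs)` turns the same loop into full-batch gradient descent, since every step then uses the entire training set.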
214. The dropout regularization
• Randomly shut down a subset of units in training
• It is a sparse representation
• It is a different net each time, but all nets share the parameters
• A net with n units can be seen as a collection of 2^n possible thinned nets,
all of which share weights.
• At test time, it is a single net with averaging
• Avoid over-fitting
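A minimal sketch of the idea on a list of activations, assuming the original formulation in which units are dropped during training and activations are scaled by the keep probability at test time (the "averaging"). Many libraries instead rescale during training; the function names and values here are illustrative.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Training: each unit is kept with probability keep_prob, otherwise zeroed."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """Test: use the single full net, scaled to match the expected training output."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)
activations = [2.0, 4.0, 6.0]

thinned = dropout_train(activations, keep_prob=0.5, rng=rng)  # one random thinned net
averaged = dropout_test(activations, keep_prob=0.5)
print(thinned)
print(averaged)  # [1.0, 2.0, 3.0]

# Number of distinct thinned nets for n units.
n_units = len(activations)
print(2 ** n_units)  # 8
```

Averaging many training-time samples of a unit approaches its scaled test-time value, which is why the single test-time net approximates the ensemble of thinned nets.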
215. Smaller Network: CNN
• We know it is good to learn a small model.
• From this fully connected model, do we really need all the edges?
• Can some of these be shared?
216. Consider learning an image:
• Some patterns are much smaller than the whole
image
“beak” detector
Can represent a small region with fewer parameters
217. Same pattern appears in different places:
They can be compressed!
What about training a lot of such “small” detectors,
where each detector must “move around”?
“upper-left beak” detector
“middle beak” detector
They can be compressed
to the same parameters.
218. A convolutional layer
A filter
A CNN is a neural network with some convolutional layers
(and some other layers). A convolutional layer has a number
of filters that perform the convolution operation.
Beak detector
219. Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
…
…
These are the network
parameters to be learned.
Each filter detects a
small pattern (3 x 3).
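Sliding each 3 x 3 filter over the 6 x 6 image can be sketched as follows. The stride of 1, the absence of padding, and the function name `slide_filter` are assumptions; in a real network the filter values would be learned rather than fixed.

```python
IMAGE = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
FILTER_1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]   # responds to a diagonal pattern
FILTER_2 = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]   # responds to a vertical pattern


def slide_filter(image, kernel, stride=1):
    """Dot product of the kernel with every window of the image (no padding)."""
    k = len(kernel)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    return [
        [
            sum(
                image[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


feature_map_1 = slide_filter(IMAGE, FILTER_1)
feature_map_2 = slide_filter(IMAGE, FILTER_2)
for row in feature_map_1:
    print(row)
```

The largest responses of Filter 1 (value 3) appear at the top-left and bottom-left windows, where the image contains the same diagonal pattern as the filter.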
230. Why Pooling
• Subsampling pixels will not change the object
[Figure: a bird image and its subsampled version; both are still recognized as "bird"]
We can subsample the pixels to make the image
smaller, so fewer parameters are needed to characterize the image
231. A CNN compresses a fully connected network
in two ways:
• Reducing number of connections
• Shared weights on the edges
• Max pooling further reduces the complexity
232. Max pooling is a pooling operation that selects the maximum
element from the region of the feature map covered by the
filter. Thus, the output after max-pooling layer would be a
feature map containing the most prominent features of the
previous feature map
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
Conv, then Max Pooling, gives a new but smaller image:
one 2 x 2 image per filter (each filter is a channel)
3 0
3 1
Filter 1
-1 1
0 3
Filter 2
236. The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
(Convolution and Max Pooling can repeat many times)
Each Convolution and Max Pooling stage outputs a new image:
• Smaller than the original image
• The number of channels is the number of filters
239. The whole CNN
Convolution
Max Pooling (a new image)
Convolution
Max Pooling (a new image)
Flattened
Fully Connected Feedforward network
c g ……
241. CNN in Keras
Only modified the network structure and input
format (vector -> 3-D tensor)
input
Convolution
Max Pooling
Convolution
Max Pooling
There are 25 3x3 filters, for example:
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
…
Input_shape = (28, 28, 1)
28 x 28 pixels; the last value is the number of channels (1: black/white, 3: RGB)
242. CNN in Keras
Only modified the network structure and input
format (vector -> 3-D array)
Input: 1 x 28 x 28
Convolution: 25 x 26 x 26
How many parameters for each filter? 9
Max Pooling: 25 x 13 x 13
Convolution: 50 x 11 x 11
How many parameters for each filter? 225 = 25 x 9
Max Pooling: 50 x 5 x 5
243. CNN in Keras
Only modified the network structure and input
format (vector -> 3-D array)
Input: 1 x 28 x 28
Convolution: 25 x 26 x 26
Max Pooling: 25 x 13 x 13
Convolution: 50 x 11 x 11
Max Pooling: 50 x 5 x 5
Flattened: 1250
Fully connected feedforward network
Output
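The shapes and per-filter parameter counts on these slides can be reproduced with a little arithmetic. This sketch uses no Keras calls; it assumes 3 x 3 filters, stride 1, no padding, 2 x 2 pooling, and no bias terms (matching the counts 9 and 225), and the helper names are hypothetical.

```python
def conv_output(channels_in, height, width, n_filters, kernel=3):
    """Shape after a valid, stride-1 convolution, plus weights per filter (bias excluded)."""
    params_per_filter = channels_in * kernel * kernel
    shape = (n_filters, height - kernel + 1, width - kernel + 1)
    return shape, params_per_filter


def pool_output(channels, height, width, size=2):
    """Shape after non-overlapping max pooling; odd remainders are dropped."""
    return (channels, height // size, width // size)


shape = (1, 28, 28)                                  # channels x height x width
shape, params_1 = conv_output(*shape, n_filters=25)  # (25, 26, 26), 9 weights per filter
conv1_shape = shape
shape = pool_output(*shape)                          # (25, 13, 13)
pool1_shape = shape
shape, params_2 = conv_output(*shape, n_filters=50)  # (50, 11, 11), 225 weights per filter
conv2_shape = shape
shape = pool_output(*shape)                          # (50, 5, 5)
pool2_shape = shape
flattened = shape[0] * shape[1] * shape[2]           # 1250

print(conv1_shape, params_1)
print(pool1_shape)
print(conv2_shape, params_2)
print(pool2_shape, flattened)
```

Each filter in the second convolution spans all 25 input channels, which is why it has 25 x 9 = 225 weights rather than 9.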