Deep Learning
“If at first the idea is not
absurd, then there is no
hope for it.”
― Albert Einstein
Core Components
Parameters
Layers
Activation functions
Loss functions
Optimization methods
Hyperparameters
Loss Functions
• Squared error loss
• Logistic loss
• Hinge loss
• Negative log likelihood
Squared Error Loss
• The Mean Squared Error (MSE) either assesses the quality of a predictor
(i.e., a function mapping arbitrary inputs to a sample of values of some
random variable), or of an estimator (i.e., a mathematical function
mapping a sample of data to an estimate of a parameter of the population
from which the data is sampled). The definition of an MSE differs according
to whether one is describing a predictor or an estimator.
Predictor Estimator
variance and bias relationship
Proof
https://www.youtube.com/watch?v=QBbC3Cjsnjg
Loss Functions - EXPLAINED!
Loss Functions
A loss function is a method of evaluating how well the algorithm learns the data
and produces correct outputs. It computes the distance between our predicted
value and the actual value using a mathematical formula.
In layman's terms, a loss function measures how wrong the model is in terms of
its ability to estimate the relationship between x and y.
Loss functions can be categorized into two groups:
• Classification - which is about predicting a label, by identifying which category an object
belongs to based on different parameters.
• Regression - which is about predicting a continuous output, by finding the correlations
between dependent and independent variables.
Loss function by category
Regression Loss Functions
Squared Error Loss
Absolute Error Loss
Huber Loss
Binary Classification Loss Functions
Binary Cross-Entropy
Hinge Loss
Multi-class Classification Loss Functions
Multi-class Cross Entropy Loss
Kullback Leibler Divergence Loss
Mean Squared Error (MSE)
Advantage: The MSE is great for ensuring that our trained
model has no outlier predictions with huge errors, since the
MSE puts larger weight on these errors due to the squaring
part of the function.
Disadvantage: If our model makes a single very bad prediction,
the squaring part of the function magnifies the error. Yet in
many practical cases we don’t care much about these outliers
and are aiming for a more well-rounded model that
performs well enough on the majority.
Mean Absolute Error (MAE)
Advantage: The beauty of the MAE is that its advantage
directly covers the MSE disadvantage. Since we are taking the
absolute value, all of the errors will be weighted on the same
linear scale. Thus, unlike the MSE, we won’t be putting too
much weight on our outliers and our loss function provides a
generic and even measure of how well our model is
performing.
Disadvantage: If we do in fact care about the outlier
predictions of our model, then the MAE won’t be as effective.
The large errors coming from the outliers end up being
weighted the exact same as lower errors. This might result in
our model being great most of the time, but making a few very
poor predictions every so often.
The MSE is great for learning outliers while the MAE is great
for ignoring them. But what about something in the
middle?
What this equation essentially says is: for loss values less
than delta, use the MSE; for loss values greater than delta,
use the MAE.
Consider an example where we have a dataset
of 100 values we would like our model to be
trained to predict. Out of all that data, 25% of
the expected values are 5 while the other 75%
are 10.
An MSE loss wouldn’t quite do the trick, since
we don’t really have “outliers”; 25% is by no
means a small fraction. On the other hand we
don’t necessarily want to weight that 25% too
low with an MAE. Those values of 5 aren’t close
to the median (10 — since 75% of the points
have a value of 10), but they’re also not really
outliers.
Our solution?
Huber Loss
• Using the MAE for larger loss
values mitigates the weight that
we put on outliers so that we still
get a well-rounded model. At the
same time we use the MSE for
the smaller loss values to
maintain a quadratic function
near the center.
• This has the effect of magnifying
the loss values as long as they
are greater than 1. Once the loss
for those data points dips below
1, the quadratic function down-
weights them to focus the
training on the higher-error data
points.
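To make the three regression losses concrete, here is a minimal NumPy sketch (the delta value and the arrays are illustrative assumptions, not values from the slides):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: squares each residual, so large errors dominate.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: every residual is weighted on the same linear scale.
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for residuals smaller than delta, linear (MAE-like) beyond it.
    residual = np.abs(y_true - y_pred)
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return np.mean(np.where(residual <= delta, quadratic, linear))

y_true = np.array([5.0, 10.0, 10.0, 10.0])
y_pred = np.array([9.0, 10.0, 10.0, 10.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```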
Entropy
• The core idea of information theory is that the "informational value" of a
communicated message depends on the degree to which the content of
the message is surprising.
• If a highly likely event occurs, the message carries very little information.
On the other hand, if a highly unlikely event occurs, the message is much
more informative.
• For instance, the knowledge that some particular number will not be the
winning number of a lottery provides very little information, because any
particular chosen number will almost certainly not win. However,
knowledge that a particular number will win a lottery has high
informational value because it communicates the outcome of a very low
probability event.
Entropy: Origin of the Second Law of Thermodynamics
Entropy
• Entropy measures the expected (i.e., average) amount of information
conveyed by identifying the outcome of a random trial. This implies
that casting a die has higher entropy than tossing a coin because
each outcome of a die toss has smaller probability (p = 1/6)
than each outcome of a coin toss (p = 1/2).
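A small sketch comparing the two, using the Shannon entropy H = -Σ p·log2(p):

```python
import math

def entropy(probs):
    # Shannon entropy in bits: expected information of identifying the outcome.
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = [1/2] * 2   # fair coin toss
die = [1/6] * 6    # fair die cast

print(f"coin: {entropy(coin):.3f} bits")  # 1.000
print(f"die:  {entropy(die):.3f} bits")   # ~2.585, higher entropy
```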
https://www.youtube.com/watch?v=Tr_gv5CKB1Y
Boltzmann's Entropy Equation: A History
from Clausius to Planck
Lecture 04, concept 12: Deriving the Boltzmann distribution -
general case
https://www.youtube.com/watch?v=tDKjLzbXYQI
The Maxwell–Boltzmann distribution | AP Chemistry | Khan
Academy
https://www.youtube.com/watch?v=xQ9D4Jz95-A&t=10s
https://en.wikipedia.org/wiki/Entropy_(information_theory)
Relationship to thermodynamic entropy - Boltzmann
Data compression - Shannon
Entropy as a measure of diversity
Entropy in cryptography - Kullback–Leibler divergence (statistical distance)
Data as a Markov process
Entropy for continuous random variables
Prior probability?
https://www.youtube.com/watch?v=uQtzfpzLj-k
Thermodynamics (statistical): Boltzmann distribution derivation
https://www.youtube.com/watch?v=PinkT8X4cGM&list=PLvSd
MJxMoHLtVvu-oLQDyXx06H_hSAKCD
Binomial Distribution/ Bernoulli distribution
Intuitively Understanding the KL Divergence
https://www.youtube.com/watch?v=SxGYPqCgJWM
What’s the Difference between a Loss
Function and a Cost Function?
• Although the terms cost function and loss function are often used
interchangeably, they are not the same.
• A loss function is for a single training example. It is also sometimes
called an error function. A cost function, on the other hand, is the
average loss over the entire training dataset. The optimization
strategies aim at minimizing the cost function.
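A tiny sketch of the distinction (the arrays are illustrative):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

per_example_loss = (y_true - y_pred) ** 2   # loss: one value per training example
cost = per_example_loss.mean()              # cost: average loss over the dataset

print(per_example_loss, cost)
```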
Hyperparameters
Learning Rate
Regularization
Momentum
Sparsity
Learning Rate
The learning rate affects the amount by which you adjust
parameters during optimization in order to minimize the error
of the neural network’s guesses. It is a coefficient that scales
the size of the steps (updates) a neural network takes to its
parameter vector x as it crosses the loss function space.
During backpropagation we multiply the error gradient by the
learning rate, and then
update a connection weight’s last iteration with the product to
reach a new weight.
The learning rate determines how much of the gradient we
want to use for the algorithm’s
next step. A large error and steep gradient combine with the
learning rate to
produce a large step. As we approach minimal error and the
gradient flattens out, the
step size tends to shorten.
Learning Rate
A large learning rate coefficient (e.g., 1) will make your parameters take leaps, and
small ones (e.g., 0.00001) will make it inch along slowly. Large leaps will save time
initially, but they can be disastrous if they lead us to overshoot our minimum. A
learning rate too large oversteps the nadir, making the algorithm bounce back and
forth on either side of the minimum without ever coming to rest.
In contrast, small learning rates should lead you eventually to an error minimum (it
might be a local minimum rather than a global one), but they can take a very long
time and add to the burden of an already computationally intensive process. Time
matters when neural network training can take weeks on large datasets. If you can’t
wait another week for the results, choose a moderate learning rate (e.g., 0.1) and
experiment with several others in the same ballpark to get the best speed and accuracy
at once.
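A minimal sketch of a single gradient descent step scaled by the learning rate (names and values are illustrative):

```python
learning_rate = 0.1

def sgd_step(weights, gradients, lr=learning_rate):
    # Each weight moves against its error gradient, scaled by the learning rate.
    return [w - lr * g for w, g in zip(weights, gradients)]

weights = [0.5, -1.2, 0.3]
gradients = [0.9, -0.1, 0.4]   # pretend these came from backpropagation
print(sgd_step(weights, gradients))
```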
Regularization
Regularization helps with the effects of out-of-control parameters by using different
methods to minimize parameter size over time.
In mathematical notation, we see regularization represented by the coefficient
lambda, controlling the trade-off between finding a good fit and keeping the value
of certain feature weights low as the exponents on features increase.
Regularization coefficients L1 and L2 help fight overfitting by making certain
weights smaller. Smaller-valued weights lead to simpler hypotheses, and simpler
hypotheses are the most generalizable. Unregularized weights with several higher-
order polynomials in the feature set tend to overfit the training set.
As the input training set size grows, the effect of regularization decreases and the
parameters tend to increase in magnitude. This is appropriate, because an excess
of features relative to training set examples leads to overfitting in the first place.
Bigger data is the ultimate regularizer
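A sketch of how an L1/L2 penalty with coefficient lambda enters the cost and the gradient (variable names are illustrative assumptions):

```python
import numpy as np

def regularized_cost(w, data_cost, lam=0.01):
    # L2 penalty shrinks large weights; L1 pushes small weights toward zero.
    l2_penalty = lam * np.sum(w ** 2)
    l1_penalty = lam * np.sum(np.abs(w))
    return data_cost + l2_penalty          # swap in l1_penalty for a lasso-style fit

def regularized_gradient(w, data_grad, lam=0.01):
    # The L2 term adds 2*lam*w to the gradient, steadily decaying each weight.
    return data_grad + 2 * lam * w

w = np.array([0.5, -3.0, 0.01])
print(regularized_cost(w, data_cost=1.2))
print(regularized_gradient(w, data_grad=np.array([0.1, 0.2, -0.05])))
```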
Intuitive Explanation of Ridge / Lasso Regression
https://www.youtube.com/watch?v=9LNpiiKCQUo&t=524s
Momentum
Momentum helps the learning
algorithm get out of spots in the
search space where it would
otherwise become stuck. In the
error-scape, it helps the updater find
the gulleys that lead toward the
minima. Momentum is to the
learning rate what the learning rate
is to weights, and it helps us produce
better quality models
Optimization Tricks: momentum, batch-norm, and more
https://www.youtube.com/watch?v=kK8-jCCR4is
Sparsity
The sparsity hyperparameter recognizes that for some inputs only a few features are
relevant. For example, let’s assume that a network can classify a million images. Any
one of those images will be indicated by a limited number of features. But to effectively
classify millions of images a network must be able to recognize considerably
more features, many of which don’t appear most of the time. An example of this
would be how photos of sea urchins don’t contain noses and hooves. This contrasts with
how, in submarine images, the nose and hoof features will be 0.
The features that indicate sea urchins will be few and far between, in the vastness of
the neural network’s layers. That’s a problem, because sparse features can limit the
number of nodes that activate and impede a network’s ability to learn.
In response to sparsity, biases force neurons to activate and the activations stay around a
mean that keeps the network from becoming stuck.
Optimization Methods
https://www.youtube.com/watch?v=mdKjMPmcWjY
Optimization algorithms for deep learning
The gradient descent algorithm is not the only optimization
algorithm available to optimize our network weights; however,
it is the basis for most other algorithms.
Using momentum with gradient descent
Using gradient descent with momentum speeds up gradient descent by
increasing the speed of learning in directions where the gradient stays
consistent, while slowing learning in directions where the gradient
fluctuates. It allows the velocity of gradient descent to increase.
Momentum works by introducing a velocity term, and using a weighted
moving average of that term in the update rule, as in the sketch below.
Most typically the momentum coefficient is set to 0.9, and usually this is
not a hyperparameter that needs to be changed.
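A minimal sketch of the velocity-based update just described, using the typical 0.9 coefficient (variable names are assumptions):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Weighted moving average of past gradients: consistent directions build
    # up velocity, oscillating directions partially cancel out.
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

w = np.array([0.5, -1.0])
velocity = np.zeros_like(w)
for grad in [np.array([0.2, -0.3]), np.array([0.25, 0.3])]:
    w, velocity = momentum_step(w, grad, velocity)
print(w, velocity)
```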
The RMSProp algorithm (Geoffrey Hinton)
RMSProp is another algorithm that can speed up gradient descent by
speeding up learning in some directions, and dampening oscillations
in other directions, across the multidimensional space that the
network weights represent:
https://www.coursera.org/learn/neural-networks/home/welcome
RMSprop
• There are two ways to introduce RMSprop.
• The first is to look at it as an adaptation of the rprop algorithm for mini-
batch learning. That was the initial motivation for developing the
algorithm.
• Another way is to look at its similarities with Adagrad and view
RMSprop as a way to deal with Adagrad’s radically diminishing learning
rates.
Rprop to RMSprop
• Rprop doesn’t really work when we have very large datasets and need to perform mini-batch weight updates. Why doesn’t it
work with mini-batches? People have tried it, but found it hard to make it work. The reason it doesn’t work is that it
violates the central idea behind stochastic gradient descent: when we have a small enough learning rate, SGD averages the
gradients over successive mini-batches. Consider a weight that gets a gradient of 0.1 on nine mini-batches, and a gradient of
-0.9 on the tenth mini-batch. What we’d like is for those gradients to roughly cancel each other out, so that the weight stays
approximately the same. But that’s not what happens with rprop. With rprop, we increment the weight nine times and
decrement it only once, so the weight grows much larger.
• To combine the robustness of rprop (which uses just the sign of the gradient), the efficiency we get from mini-batches, and
the averaging over mini-batches that lets us combine gradients in the right way, we must look at rprop from a different
perspective. Rprop is equivalent to using the gradient but also dividing by the size of the gradient, so we get the same
magnitude no matter how big or small that particular gradient is. The problem with mini-batches is that we divide by a
different gradient every time, so why not force the number we divide by to be similar for adjacent mini-batches? The central
idea of RMSprop is to keep a moving average of the squared gradients for each weight, and then divide the gradient by the
square root of that mean square, which is why it’s called RMSprop (root mean square). With math equations, the update rule
looks like this:
• As you can see from the above equation, we adapt the learning rate by
dividing by the root of the squared gradient, but since we only have an
estimate of the gradient on the current mini-batch, we need instead
to use a moving average of it. A default value for the moving-average
parameter that you can use in your projects is 0.9. It works very well
for most applications. In code the algorithm might look like this:
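A minimal NumPy sketch of this update (variable names and the epsilon term are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(w, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    # Keep a moving average of squared gradients for each weight...
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    # ...then divide the gradient by the root of that mean square.
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq

w = np.array([0.5, -1.0])
mean_sq = np.zeros_like(w)
w, mean_sq = rmsprop_step(w, np.array([0.1, -0.9]), mean_sq)
print(w, mean_sq)
```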
Similarity with Adagrad
• Adagrad is an adaptive learning rate algorithm that looks a lot like
RMSprop. Adagrad adds element-wise scaling of the gradient based
on the historical sum of squares in each dimension. This means that
we keep a running sum of squared gradients and then adapt the
learning rate by dividing it by that sum. In code we can express it like
this:
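A minimal NumPy sketch of the Adagrad update (variable names and the epsilon term are illustrative assumptions):

```python
import numpy as np

def adagrad_step(w, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    # Accumulate the historical sum of squared gradients (it only grows)...
    grad_sq_sum = grad_sq_sum + grad ** 2
    # ...and scale the learning rate down by its square root, per dimension.
    w = w - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum

w = np.array([0.5, -1.0])
grad_sq_sum = np.zeros_like(w)
w, grad_sq_sum = adagrad_step(w, np.array([0.1, -0.9]), grad_sq_sum)
print(w, grad_sq_sum)
```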
Adagrad: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic
Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from
http://jmlr.org/papers/v12/duchi11a.html
RMSprop
• What does this scaling do when we have a high condition number? If we have two
coordinates, one that always has big gradients and one that has small gradients, we’ll
be dividing by the corresponding big or small number, so we accelerate movement along
the direction with small gradients, and in the direction where gradients are large we slow
down because we divide by a large number.
• What happens over the course of training? Steps get smaller and smaller,
because we keep updating the sum of squared gradients, which grows over training, so we
divide by a larger number every time. In convex optimization this makes a lot of sense, because
as we approach the minimum we want to slow down. In the non-convex case this is bad, as we can
get stuck at a saddle point. We can look at RMSprop as an algorithm that addresses that
concern a little bit.
• With RMSprop we still keep an estimate of the squared gradients, but instead of letting
that estimate continually accumulate over training, we keep a moving average of it.
Mini batch optimization
https://www.researchgate.net/publication/357735089_Mini-batch_optimization_enables_training_of_ODE_models_on_large-scale_datasets/figures?lo=1
The Adam optimizer
Adam is one of the best-performing known optimizers and it's my first
choice. It works well across a wide variety of problems. It combines the
best parts of both momentum and RMSProp into a single update rule:
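A minimal NumPy sketch of such a combined update (the default coefficients follow the Adam paper; variable names are assumptions):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # momentum-style first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t, grad in enumerate([np.array([0.1, -0.9]), np.array([0.2, 0.3])], start=1):
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```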
Bias and variance errors in deep learning
With traditional predictive models, there is usually some compromise between the
error from bias and the error from variance. So let's see what these two
errors are:
Bias error: Bias error is the error that is introduced by the model itself. For example, if
you attempted to model a non-linear function with a linear model, your model
would be underspecified and the bias error would be high.
Variance error: Variance error is the error that is introduced by randomness in
the training data. When we fit our training distribution so well that our model no
longer generalizes, we have overfit, or introduced variance error.
The train, val, and test datasets
The val dataset, or the validation dataset, will be used to find ideal hyperparameters, and to
measure overfitting. At the end of an epoch, which is when the network has had the opportunity to
observe every data point in the training set, we will make a prediction on the val set. That prediction will
be used to watch for overfitting and will help us know when the network has finished training. Using
the val set at the end of each epoch like this somewhat differs from the typical usage.
K-Fold cross-validation: Too Expensive
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   a. Take the group as a hold-out or test data set.
   b. Take the remaining groups as a training data set.
   c. Fit a model on the training set and evaluate it on the test set.
   d. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
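A short sketch of the procedure above using scikit-learn's KFold (the toy data and model are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, split into k groups
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on training folds
    scores.append(model.score(X[test_idx], y[test_idx]))        # evaluate on held-out fold

print(np.mean(scores), np.std(scores))   # summarize skill across folds
```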
Variations on Cross-Validation
• Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single
train/test split is created to evaluate the model.
• LOOCV: Taken to another extreme, k may be set to the total number of observations in
the dataset, such that each observation is given a chance to be held out of the
dataset. This is called leave-one-out cross-validation, or LOOCV for short.
• Stratified: The splitting of data into folds may be governed by criteria such as ensuring
that each fold has the same proportion of observations with a given categorical value,
such as the class outcome value. This is called stratified cross-validation.
• Repeated: This is where the k-fold cross-validation procedure is repeated n times,
where importantly, the data sample is shuffled prior to each repetition, which results in a
different split of the sample.
• Nested: This is where k-fold cross-validation is performed within each fold of cross-
validation, often to perform hyperparameter tuning during model evaluation. This is
called nested cross-validation or double cross-validation.
The Mathematics of Neural Networks
https://www.youtube.com/watch?v=e5xKayCBOeU
Hyperparameters
In machine learning, we have both model parameters and parameters
we tune to make networks train better and faster. These tuning
parameters are called hyperparameters, and they deal with controlling
optimization functions and model selection during training with our
learning algorithm.
Hyperparameter selection focuses on ensuring that the model neither
underfits nor overfits the training dataset, while learning the structure
of the data as quickly as possible.
Hyperparameters
Hyperparameters fall into several categories:
• Layer size
• Magnitude (momentum, learning rate)
• Regularization (dropout, drop connect, L1, L2)
• Activations (and activation function families)
• Weight initialization strategy
• Loss functions
• Settings for epochs during training (mini-batch size)
• Normalization scheme for input data (vectorization)
Boltzmann Machine
• A Boltzmann machine (also called Sherrington–Kirkpatrick model with
external field or stochastic Ising–Lenz–Little model) is a stochastic spin-
glass model with an external field, i.e., a Sherrington–Kirkpatrick model,
that is a stochastic Ising Model. It is a statistical physics technique applied
in the context of cognitive science. It is also classified as a Markov random
field.
• Boltzmann machines are theoretically intriguing because of the locality
and Hebbian nature of their training algorithm (being trained by Hebb's
rule), and because of their parallelism and the resemblance of their
dynamics to simple physical processes.
• Boltzmann machines with unconstrained connectivity have not been
proven useful for practical problems in machine learning or inference, but
if the connectivity is properly constrained, the learning can be made
efficient enough to be useful for practical problems.
Boltzmann Machine
• They are named after the Boltzmann distribution
in statistical mechanics, which is used in their
sampling function. They were heavily popularized
and promoted by Geoffrey Hinton, Terry
Sejnowski and Yann LeCun in cognitive sciences
communities and in machine learning. As a more
general class within machine learning these
models are called "energy based models" (EBM),
because Hamiltonians of spin glasses are used as
a starting point to define the learning task.
General Boltzmann Machine
A network of symmetrically connected, neuron-like units that make
stochastic decisions about whether to be on or off.
Restricted Boltzmann Machines (RBMs)
• A restricted Boltzmann machine (RBM) is a generative stochastic
artificial neural network that can learn a probability distribution over
its set of inputs.
• RBMs were initially invented under the name Harmonium by Paul
Smolensky in 1986, and rose to prominence after Geoffrey Hinton and
collaborators invented fast learning algorithms for them in the mid-
2000s. RBMs have found applications in dimensionality reduction,
classification, collaborative filtering, feature learning, topic
modelling and even many-body quantum mechanics. They can be
trained in either supervised or unsupervised ways, depending on the
task.
Restricted Boltzmann Machines (RBMs)
• As their name implies, RBMs are a variant of Boltzmann machines, with the
restriction that their neurons must form a bipartite graph: a pair of nodes
from each of the two groups of units (commonly referred to as the
"visible" and "hidden" units respectively) may have a symmetric
connection between them; and there are no connections between nodes
within a group. By contrast, "unrestricted" Boltzmann machines may have
connections between hidden units. This restriction allows for more
efficient training algorithms than are available for the general class of
Boltzmann machines, in particular the gradient-based contrastive
divergence algorithm.
• Restricted Boltzmann machines can also be used in deep learning
networks. In particular, deep belief networks can be formed by "stacking"
RBMs and optionally fine-tuning the resulting deep network with gradient
descent and backpropagation.
Restricted Boltzmann Machines (RBMs)
• The “restricted” part of the name “Restricted Boltzmann Machines”
means that connections between nodes of the same layer are
prohibited (e.g., there are no visible-visible or hidden-hidden
connections along which signal passes).
• RBMs are also a type of autoencoder
• https://github.com/echen/restricted-boltzmann-machines
Learning More About
Restricted Boltzmann Machines
A Practical Guide to Training Restricted Boltzmann Machines
http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
Bay Area Vision Meeting: Unsupervised Feature Learning and Deep
Learning
https://www.youtube.com/watch?v=ZmNOAtZIgIk
Restricted Boltzmann Machines for Collaborative Filtering
https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf
Network layout (RBM)
• Visible units
• Hidden units
• Weights
• Visible bias units
• Hidden bias units
Contrastive Divergence (FYI)
RBMs calculate gradients by using an algorithm called contrastive
divergence. Contrastive divergence is the name of the algorithm
used in sampling for the layer-wise pretraining of an RBM. Also
called CD-k, contrastive divergence minimizes the KL divergence
(the delta between the real distribution of the data and the guess)
by sampling k steps of a Markov chain to compute a guess.
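A compact NumPy sketch of a single CD-1 update for a small binary RBM, just to make the sampling idea concrete (the layer sizes, learning rate, and the omission of bias updates are simplifications, not details from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_weight_update(v0):
    # Positive phase: sample hidden units from the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruct, then recompute hiddens).
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Move toward the data statistics and away from the reconstruction statistics.
    # (Bias updates omitted for brevity.)
    return lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

v0 = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
W += cd1_weight_update(v0)
print(W)
```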
Reconstruction Cross-Entropy
The objective function here is usually reconstruction cross-entropy,
or KL divergence (the mathematicians and cryptanalysts Solomon
Kullback and Richard Leibler first published a paper on the technique
in 1951). “Cross” refers to the comparison between two distributions.
“Entropy” is a term from information theory that refers to
uncertainty. For example, a normal curve with a wide spread, or
variance, also implies more uncertainty about where data points
will fall. That uncertainty is called entropy.
What Is Cross-Entropy?
https://machinelearningmastery.com/what-is-information-entropy/
RBMs Usage
• Dimensionality reduction
• Classification
• Regression
• Collaborative filtering
• Topic modeling
https://yosinski.com/media/papers/Yosinski2012VisuallyDebuggingRestrictedBoltzmannMachine.pdf
More references
Lecture 11/16 : Hopfield nets and Boltzmann machines
https://www.youtube.com/watch?v=IP3W7cI01VY
Lecture 11.5 — How a Boltzmann machine models data — [ Deep
Learning | Geoffrey Hinton | UofT ]
https://www.youtube.com/watch?v=kytxEr0KK7Q
Neural networks [5.7] : Restricted Boltzmann machine - example
https://www.youtube.com/watch?v=n26NdEtma8U
S18 Lecture 22: Boltzmann Machines
https://www.youtube.com/watch?v=it_PXVIMyWg&t=6s
Four Major Architectures of Deep Networks
Unsupervised Pretrained Networks (UPNs)
Autoencoders
Deep Belief Networks (DBNs)
Generative Adversarial Networks (GANs)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks
Recursive Neural Networks
Unsupervised Pretrained Networks (UPNs)
• In SGD optimization, one typically initializes model weights at random and tries to
move toward minimum cost by following the opposite of the gradient of the objective
function. For deep nets, this alone has not shown much success, and this is believed to
be a result of the extremely non-convex (and high-dimensional) nature of their
objective function.
• What Y. Bengio and others found was that, instead of starting weights at
random and hoping that SGD will take you to the minimum point of such a rugged
landscape, you can pre-train each layer like an autoencoder.
• Here is how it works: you build an autoencoder with the first layer as the encoding layer
and the transpose of that as the decoder. You train it unsupervised, that is, you
train it to reconstruct the input (refer to autoencoders; they are great for
unsupervised feature extraction tasks). Once trained, you fix the weights of that layer
to those you just found. Then you move to the next layer and repeat the same until
you have pre-trained all layers of the deep net (a greedy approach). At this point, you go back to
the original problem that you wanted to solve with the deep net
(classification/regression) and you optimize it with SGD, but starting from the weights
you just learned during pre-training.
• They found that this gives much better results. I think no one knows exactly why
this works, but the idea is that by pre-training you start from more favorable
regions of feature space.
Unsupervised Pretrained Networks (UPNs)
• Unsupervised pre-training initializes a discriminative neural net from
one which was trained using an unsupervised criterion, such as a
deep belief network or a deep autoencoder. This method can
sometimes help with both the optimization and the overfitting issues.
• Unsupervised Pre-training Acts as a Regularizer
Autoencoders
• We use autoencoders to learn compressed representations of
datasets. Typically, we use them to reduce a dataset’s
dimensionality. The output of the autoencoder network is a
reconstruction of the input data in the most efficient form.
Defining Features of Autoencoders
• Autoencoders differ from multilayer perceptrons in a couple of ways:
• They use unlabeled data in unsupervised learning.
• They build a compressed representation of the input data.
• Unsupervised learning of unlabeled data. The autoencoder learns
directly from unlabeled data. This is connected to the second major
difference between multilayer perceptrons and autoencoders.
• Learning to reproduce the input data. The goal of a multilayer
perceptron network is to generate predictions over a class (e.g., fraud
versus not fraud). An autoencoder is trained to reproduce its own
input data.
Defining Features of Autoencoders
• Autoencoders rely on backpropagation to update their weights. The main difference
between RBMs and the more general class of autoencoders is in how they calculate the
gradients.
• Two important variants of autoencoders to note are compression autoencoders and
denoising autoencoders.
Compression autoencoders. The network
input must pass through a bottleneck region of the network before being expanded
back into the output representation.
Denoising autoencoders. The denoising autoencoder is the scenario in which the
autoencoder is given a corrupted version (e.g., some features are removed randomly)
of the input and the network is forced to learn the uncorrupted output.
Variational Autoencoders
• A more recent type of autoencoder model is the variational autoencoder
(VAE) introduced by Kingma and Welling. The VAE is similar to compression
and denoising autoencoders in that they are all trained in an unsupervised
manner to reconstruct inputs.
• However, the mechanisms that the VAEs use to perform training are quite
different. In a compression/denoising autoencoder, activations are mapped
to activations throughout the layers, as in a standard neural network;
comparatively, a VAE uses a probabilistic approach for the forward pass.
https://arxiv.org/abs/1312.6114
Deep Belief Networks
• DBNs are composed of layers of Restricted Boltzmann Machines
(RBMs) for the pretrain phase and then a feed-forward network for
the fine-tune phase.
Feature Extraction with RBM Layers
We use RBMs to extract higher-level features from the raw input
vectors. To do that, we want to set the hidden unit states and weights
such that when we show the RBM an input record and ask the RBM to
reconstruct the record, it generates something pretty close
to the original input vector. Hinton talks about this effect in terms of
how machines “dream about data.”
The fundamental purpose of RBMs in the context of deep learning and
DBNs is to learn these higher-level features of a dataset in an
unsupervised training fashion. It was discovered that we could train
better neural networks by letting RBMs learn progressively higher-level
features using the learned features from a lower level RBM pretrain
layer as the input to a higher-level RBM pretrain layer.
Learning Higher-order Features Automatically
Activation render at the beginning of training
Features emerge in later activation render
Portions of MNIST digits emerge towards end of training
Building Autoencoders in Keras
a simple autoencoder based on a fully-connected layer
a sparse autoencoder
a deep fully-connected autoencoder
a deep convolutional autoencoder
an image denoising model
a sequence-to-sequence autoencoder
a variational autoencoder
https://blog.keras.io/building-autoencoders-in-keras.html
What are autoencoders?
• "Autoencoding" is a data compression algorithm where the compression and decompression functions are
1) data-specific, 2) lossy, and 3) learned automatically from examples rather than engineered by a human.
Additionally, in almost all contexts where the term "autoencoder" is used, the compression and
decompression functions are implemented with neural networks.
• 1) Autoencoders are data-specific, which means that they will only be able to compress data similar to what
they have been trained on. This is different from, say, the MPEG-2 Audio Layer III (MP3) compression
algorithm, which only holds assumptions about "sound" in general, but not about specific types of sounds.
An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees,
because the features it would learn would be face-specific.
• 2) Autoencoders are lossy, which means that the decompressed outputs will be degraded compared to the
original inputs (similar to MP3 or JPEG compression). This differs from lossless arithmetic compression.
• 3) Autoencoders are learned automatically from data examples, which is a useful property: it means that it
is easy to train specialized instances of the algorithm that will perform well on a specific type of input. It
doesn't require any new engineering, just appropriate training data.
• The fact that autoencoders are data-specific makes them generally impractical for real-world data
compression problems: you can only use them on data that is similar to what they were trained on, and
making them more general thus requires lots of training data.
http://arxiv.org/abs/1603.09246
Code
• simplest possible autoencoder.ipynb
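A minimal fully connected autoencoder in Keras, along the lines of the Keras blog post linked above (the 784-dimensional input and 32-dimensional code follow that post's MNIST example; training data is assumed to be supplied separately):

```python
from tensorflow import keras
from tensorflow.keras import layers

encoding_dim = 32                       # size of the compressed representation
inputs = keras.Input(shape=(784,))      # a flattened 28x28 MNIST image
encoded = layers.Dense(encoding_dim, activation="relu")(inputs)
decoded = layers.Dense(784, activation="sigmoid")(encoded)   # reconstruct the input

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Train to reproduce the input itself: x is both the input and the target.
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256,
#                 validation_data=(x_test, x_test))
```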
Deep Belief Network (DBN)
• In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively
a class of deep neural network, composed of multiple layers of latent variables ("hidden units"),
with connections between the layers but not between units within each layer.
• When trained on a set of examples without supervision, a DBN can learn to probabilistically
reconstruct its inputs. The layers then act as feature detectors. After this learning step, a DBN can
be further trained with supervision to perform classification.
• DBNs can be viewed as a composition of simple, unsupervised networks such as restricted
Boltzmann machines (RBMs) or autoencoders, where each sub-network's hidden layer serves as
the visible layer for the next. An RBM is an undirected, generative energy-based model with a
"visible" input layer and a hidden layer and connections between but not within layers. This
composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive
divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the
lowest visible layer is a training set).
• The observation that DBNs can be trained greedily, one layer at a time, led to one of the first
effective deep learning algorithms. Overall, there are many attractive implementations and
uses of DBNs in real-life applications and scenarios (e.g., electroencephalography, drug discovery).
http://www.scholarpedia.org/article/Deep_belief_networks
http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
References
• Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007) Greedy Layer-Wise Training of Deep Networks, Advances in Neural
Information Processing Systems 19, MIT Press, Cambridge, MA.
• Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-
1554.
• Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313:504-507.
• Larochelle, H., Erhan, D., Courville, A., Bergstra, J., Bengio, Y. (2007) An Empirical Evaluation of Deep Architectures on Problems
with Many Factors of Variation. International Conference on Machine Learning.
• LeCun, Y. and Bengio, Y. (2007) Scaling Learning Algorithms Towards AI. In Bottou et al. (Eds.) Large-Scale Kernel Machines, MIT
Press.
• M. Ranzato, F.J. Huang, Y. Boureau, Y. LeCun (2007) Unsupervised Learning of Invariant Feature Hierarchies with Applications to
Object Recognition. Proc. of Computer Vision and Pattern Recognition Conference (CVPR 2007), Minneapolis, Minnesota, 2007
• Salakhutdinov, R. R. and Hinton,G. E. (2007) Semantic Hashing. In Proceedings of the SIGIR Workshop on Information Retrieval
and Applications of Graphical Models, Amsterdam.
• Sutskever, I. and Hinton, G. E. (2007) Learning multilevel distributed representations for high-dimensional sequences. AI and
Statistics, 2007, Puerto Rico.
• Taylor, G. W., Hinton, G. E. and Roweis, S. (2007) Modeling human motion using binary latent variables. Advances in Neural
Information Processing Systems 19, MIT Press, Cambridge, MA
• Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval.
Advances in Neural Information Processing Systems 17, pages 1481-1488. MIT Press, Cambridge, MA.
Generative Adversarial Networks (GANs)
• The Generative Adversarial Network
(GAN) comprises two models: a
generative model G and a discriminative
model D. The generative model can be
considered as a counterfeiter who is
trying to generate fake currency and use
it without being caught, whereas the
discriminative model is similar to the police,
trying to catch the fake currency. This
competition goes on until the
counterfeiter becomes smart enough to
successfully fool the police.
Generative Adversarial Networks (GANs)
• Discriminator: The role is to distinguish between actual and generated
(fake) data.
• Generator: The role is to create data in such a way that it can fool the
discriminator.
Derivation of the loss function
Discriminator loss
The objective of the discriminator is to correctly classify the fake and real data. For this, equations (1) and (2)
should be maximized, and the final loss function for the discriminator can be given as:
Generator loss
The generator is competing against the discriminator, so it will try
to minimize equation (3); its loss function is given as:
Combined loss function
Remember that the above loss function is valid only for a single
data point; to consider the entire dataset, we need to take the
expectation of the above equation, as shown below.
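For reference, this combined objective is the minimax value function from Goodfellow et al. (see the references at the end of this GAN section):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]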
It can be noticed from the above algorithm
that the generator and discriminator are
trained separately. In the first section, real
data and fake data are inserted into the
discriminator with correct labels and training
takes place. Gradients are propagated
keeping generator fixed. Also, we update the
discriminator by ascending its stochastic
gradient because for discriminator we want
to maximize the loss function given in
equation (6).
On the other hand, we update the generator
by keeping discriminator fixed and passing
fake data with fake labels in order to fool the
discriminator. Here, we update the generator
by descending its stochastic gradient because
for the generator we want to minimize the
loss function given in equation (6).
Global Optimality of Pg = Pdata
Limitations: Vanishing Gradient
Limitations: Mode Collapse
• During training, the generator may get stuck into a setting where it
always produces the same output. This is called mode collapse. This
happens because the main aim of G was to fool D not to generate
diverse output.
References
• Atienza, Rowel. Advanced Deep Learning with Keras: Apply deep
learning techniques, autoencoders, GANs, variational autoencoders,
deep reinforcement learning, policy gradients, and more. Packt
Publishing Ltd, 2018.
• Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in
neural information processing systems. 2014.
• Wang, Zhengwei, Qi She, and Tomas E. Ward. “Generative Adversarial
Networks: A Survey and Taxonomy.” arXiv preprint arXiv:1906.01529
(2019).
Convolution
Convolution
• A convolution is an operation on two
vectors, matrices, or tensors, that
returns a third vector, matrix, or tensor.
• https://deeplearningmath.org/convolut
ional-neural-networks.html
Edge Detection
Edge Detection
What is a Color Model?
• A color model is an abstract mathematical model that describes how colors
can be represented as a set of numbers (e.g., a triple in RGB or a quad in
CMYK). Color models can usually be described using a coordinate system,
and each color in the system is represented by a single point in the
coordinate space.
• For a given color model, to interpret a tuple or quad as a color, we can
define a set of rules and definitions used to accurately calibrate and
generate colors, i.e. a mapping function. A color space identifies a specific
combination of color models and mapping functions. Identifying the color
space automatically identifies the associated color model. For example,
Adobe RGB and sRGB are two different color spaces, both based on the
RGB color model.
RGB
• The RGB color model stores individual values for red, green, and blue. With a color space based on the
RGB color model, the three primaries are added together to create colors from completely white to
completely black.
• The RGB color space is associated with the device. Thus, different scanners get different color image
data when scanning the same image; different monitors have different color display results when
rendering the same image.
• There are many different RGB color spaces derived from this color model, standard RGB (sRGB) is a
popular example.
HSV
• HSV (hue, saturation, value), also known as HSB (hue, saturation, brightness),
is often used by artists because it is often more natural to think about a color
in terms of hue and saturation than in terms of additive or subtractive color
components.
• The system is closer to people’s experience and perception of color than RGB.
For example, in painting terms, hue, saturation, and values are expressed in
terms of color, shading, and toning.
• The HSV model space can be described as an inverted hexagonal pyramid.
• The top surface is a regular hexagon, showing the change in hue in the H
direction, from 0 ° to 360 ° is the entire spectrum of visible light. The six
corners of the hexagon represent the positions of the six colors of red, yellow,
green, cyan, blue, and magenta, each of which is 60 ° apart.
• The saturation S is represented by the S direction from the center to the
hexagonal boundary, and the value varies from 0 to 1. The closer to the
hexagonal boundary, the higher the color saturation. The color of the
hexagonal boundary is the most saturated, i.e. S = 1; the color saturation at
the center of the hexagon is 0, i.e. S = 0.
• The height of the hexagonal pyramid (also known as the central axis) is
denoted by V, which represents a black to white gradation from bottom to
top. The bottom of V is black, V = 0; the top of V is white, V = 1.
YUV
• The Y′UV model defines a color space in terms
of one luma component (Y′) and two
chrominance (UV) components. The Y′ channel
saves black and white data. If there is only the Y
component and there are no U and V
components, then the graph represented is
grayscale.
• The Y component can be calculated with the
following equation: Y = 0.299*R + 0.587*G + 0.114*B,
which is the commonly used grayscale
formula. The color differences U and V are
compressed from B - Y and R - Y in different
proportions.
• Compared with RGB, Y′UV does not necessarily
store a triple tuple for each pixel. Y′UV images
can be sampled in several different ways. For
example, with YUV420, it saves one luma
component for every point and two chroma
values, a Cb (U) value and a Cr (V) value, for
every 2×2 points, i.e., 6 bytes per 4 pixels.
YUV
• The scope of the terms Y′UV, YUV, YCbCr, YPbPr, etc., is sometimes
ambiguous and overlapping. Historically, the terms YUV and Y′UV
were used for a specific analog encoding of color information in
television systems, while YCbCr was used for digital encoding of color
information suited for video and still-image compression and
transmission such as MPEG and JPEG. Today, the term YUV is
commonly used in the computer industry to describe file-formats that
are encoded using YCbCr.
Color Space
Depending on the information represented by each pixel, images can be divided into binary images, grayscale images, RGB images,
and index images, etc.
Binary Image
In a binary image, the pixel value is represented by a 0 or 1. Generally, 0 is for black and 1 is for white.
Grayscale Image
The grayscale image adds a color depth between black and white in the binary image to form a grayscale image. Such images are
usually displayed as grayscales from the darkest black to the brightest white, and each color depth is called a grayscale, usually
denoted by L. In grayscale images, pixels can take integer values between 0 and L-1.
RGB Image
In RGB, or color, images, the information for each pixel requires a tuple of numbers to represent, so we need a three-dimensional
matrix to represent an image. Almost all colors in nature can be composed of three colors: red (R), green (G), and blue (B). So each
pixel can be represented by a red/green/blue tuple in an RGB image.
Indexed Image
An indexed image consists of a colormap matrix, which uses direct mapping of pixel values in an array to colormap values. The color
of each pixel in an image is determined by using the corresponding value. We discuss this in more detail below.
How an Image is Stored in Memory
• The x86 hardware does not
have an addressing mode
that accesses elements of
multi-dimensional arrays.
When loading an image into
memory space, the multi-
dimensional object is
converted into a one-
dimensional array. Row
major ordering or column
major ordering are
commonly used.
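A small sketch of row-major (and column-major) index arithmetic in NumPy:

```python
import numpy as np

height, width = 4, 6
image = np.arange(height * width).reshape(height, width)   # 2-D view

flat = image.flatten()            # row-major (C order) one-dimensional array
row, col = 2, 3
# The address arithmetic the hardware actually performs on the flat array:
assert flat[row * width + col] == image[row, col]

flat_col_major = image.flatten(order="F")   # column-major (Fortran order)
assert flat_col_major[col * height + row] == image[row, col]
```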
Seven Grayscale Conversion Algorithms
• Method 1 - Averaging (aka “quick and dirty”)
• Method 2 - Correcting for the human eye (sometimes called “luma”
or “luminance,” though such terminology isn’t really accurate)
• Method 3 – Desaturation
• Method 4 - Decomposition (think of it as de-composition, e.g. not the
biological process!)
• Method 5 - Single color channel
• Method 6 - Custom # of gray shades
• Method 7 - Custom # of gray shades with dithering (in this example,
horizontal error-diffusion dithering)
https://tannerhelland.com/2011/10/01/grayscale-image-algorithm-vb6.html
M1: Gray = (Red + Green + Blue) / 3
M2:
Gray = (Red * 0.3 + Green * 0.59 + Blue * 0.11) Photoshop, GIMP
Gray = (Red * 0.2126 + Green * 0.7152 + Blue * 0.0722) Luma
Gray = (Red * 0.299 + Green * 0.587 + Blue * 0.114)
http://poynton.ca/
M3: Gray = ( Max(Red, Green, Blue) + Min(Red, Green, Blue) ) / 2
M4:
Maximum decomposition: Gray = Max(Red, Green, Blue)
Minimum decomposition: Gray = Min(Red, Green, Blue)
M5 Gray = Red …or: Gray = Green …or: Gray = Blue
M6
ConversionFactor = 255 / (NumberOfShades - 1)
AverageValue = (Red + Green + Blue) / 3
Gray = Integer((AverageValue / ConversionFactor) + 0.5) * ConversionFactor
Notes:
-NumberOfShades is a value between 2 and 256
-technically, any grayscale algorithm could be used to calculate AverageValue; it simply
provides an initial gray value estimate
-the "+ 0.5" addition is an optional parameter that imitates rounding the value of an
integer conversion; YMMV depending on which programming language you use, as
some round automatically
M7
Dither
https://en.wikipedia.org/wiki/Dither#Algorithms
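A sketch of a few of the methods above applied to a single RGB pixel (the sample pixel is arbitrary; method numbers refer to the list above):

```python
def averaging(r, g, b):                    # Method 1: quick and dirty
    return (r + g + b) / 3

def luma(r, g, b):                         # Method 2: ITU-R BT.709 weights
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def desaturation(r, g, b):                 # Method 3
    return (max(r, g, b) + min(r, g, b)) / 2

def custom_shades(r, g, b, num_shades=4):  # Method 6
    factor = 255 / (num_shades - 1)
    return int(averaging(r, g, b) / factor + 0.5) * factor

pixel = (70, 130, 180)   # an arbitrary RGB value
print(averaging(*pixel), luma(*pixel), desaturation(*pixel), custom_shades(*pixel))
```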
Check out
• How to Colorize an Image
• https://tannerhelland.com/2011/04/28/colorize-image-vb6.html
https://github.com/tannerhelland/vb6-code/tree/master/Colorize-effect
https://www.youtube.com/watch?v=ExOOElyZ2Hk
This researcher created an algorithm that removes the water from underwater images
https://www.youtube.com/watch?v=-sdGCvSfWFk
https://www.youtube.com/watch?v=Pi8v6i8Y32s
Edge Detection
Edge Detection
• Canny edge detector
• Kovalevsky
• First-order Methods (Sobel)
• Thresholding and linking
• Edge thinning
• Second-order approaches (early Marr–Hildreth operator)
• Differential
• Phase congruency-based
• Phase Stretch Transform (PST)
• Subpixel
Sobel operator
• The Sobel operator, sometimes
called the Sobel–Feldman
operator or Sobel filter, is used
in image processing and
computer vision, particularly
within edge detection
algorithms where it creates an
image emphasising edges. It is
named after Irwin Sobel and
Gary Feldman, colleagues at the
Stanford Artificial Intelligence
Laboratory (SAIL)
Sobel operator
Sobel operator
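A short sketch applying the standard Sobel kernels with SciPy (the tiny synthetic image is an illustrative assumption):

```python
import numpy as np
from scipy.signal import convolve2d

# Standard Sobel kernels for horizontal and vertical intensity changes.
Kx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
Ky = Kx.T

image = np.zeros((8, 8))
image[:, 4:] = 1.0            # a vertical edge down the middle

gx = convolve2d(image, Kx, mode="same", boundary="symm")
gy = convolve2d(image, Ky, mode="same", boundary="symm")
magnitude = np.hypot(gx, gy)  # gradient magnitude emphasises the edge
print(magnitude.round(1))
```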
Convolutional Neural Networks (CNNs)
The goal of a CNN is to learn higher-order
features in the data via convolutions.
They are well suited to object recognition
with images and consistently top image
classification competitions. They can
identify faces, individuals, street signs,
platypuses, and many other aspects of
visual data. CNNs overlap with text
analysis via optical character recognition,
but they are also useful when analyzing
words as discrete textual units. They’re
also good at analyzing sound.
The efficacy of CNNs in image recognition
is one of the main reasons why the world
recognizes the power of deep learning.
CNNs are good at building position-
and (somewhat) rotation-invariant
features from raw image data
https://poloclub.github.io/cnn-explainer/
CNNs and Structure in Data
CNNs tend to be most useful when there is some structure to the input data. An example would be how images
and audio data that have a specific set of repeating patterns and input values next to each other are related
spatially. Conversely, the columnar data exported from a relational database management system (RDBMS)
tends to have no such spatial structure; columns next to one another just happen to be materialized
that way in the exported view.
The motivation of convolutions
• Sparse interaction, or Local connectivity.
• The receptive field of the neuron, or the filter size.
• The connections are local in space (width and height), but
always full in depth
• A set of learnable filters
• Parameters sharing, the weights are tied
• Equivariant representation, translation invariant
Convolution and matrix multiplication
• Discrete convolution can be viewed as multiplication
by matrix
• The kernel is a doubly block circulant matrix
• It is very sparse!
The ‘convolution’ operation
• The convolution is commutative because we have flipped the kernel
• Many implement a cross-correlation without flipping
• A convolution can be defined for 1, 2, 3, and N D
• The 2D convolution is different from a real 3D convolution, which
integrates spatio-temporal information; the standard CNN convolution
has only ‘spatial’ spreading
• In a CNN, even for 3-channel RGB input images, the standard convolution is
2D in each channel
• Each channel has a different filter or kernel; the convolution per channel is
then summed up over all channels to produce a scalar for the non-linearity
activation
• The filter in each channel is not normalized, so there is no need for different
linear combination coefficients
• A 1*1 convolution is a dot product across channels, a linear combination
of the different channels
• The backward pass of a convolution is also a convolution with
spatially flipped filters.
The convolution layers
• Stacking several small convolution layers is different from
cascading convolutions
• As each small convolution is followed by the nonlinearity ReLU
• The nonlinearities make the features more expressive!
• Have fewer parameters with small filters, but more memory.
• Cascading simply enlarges the spatial extent, the receptive field
• Whether each conv layer is also followed by a pooling?
• Lenet does not!
• AlexNet first did not.
The Pooling Layer
• Reduce the spatial size
• Reduce the amount of parameters
• Avoid over-fitting
• Backpropagation for a max: only routing the gradient to
the input that had the highest value in the forward pass
• It is unclear whether the pooling is essential.
Pooling layer down-samples the volume spatially, independently in each depth
slice of the input volume.
Left: the input volume of size [224x224x64] is pooled with filter size 2, stride 2
into output volume of size [112x112x64]. Notice that the volume depth is
preserved.
Right: The most common down-sampling operation is max, giving rise to max
pooling, here shown with a stride of 2. That is, each max is taken over 4
numbers (little 2x2 square).
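A sketch of the 2x2, stride-2 max pooling described above (assuming the spatial dimensions are divisible by 2; names are illustrative):

```python
import numpy as np

def max_pool_2x2(volume):
    # volume has shape (H, W, depth); pool each depth slice independently.
    h, w, d = volume.shape
    reshaped = volume.reshape(h // 2, 2, w // 2, 2, d)
    return reshaped.max(axis=(1, 3))     # max over each 2x2 spatial window

volume = np.random.rand(224, 224, 64)
pooled = max_pool_2x2(volume)
print(pooled.shape)   # (112, 112, 64): spatial size halved, depth preserved
```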
The spatial hyperparameters
• Depth
• Stride
• Zero-padding
AlexNet 2012
The convolution and pooling act as an
infinitely strong prior!
• A strong prior has very low entropy, e.g. a Gaussian
with low variance
• An infinitely strong prior says that some parameters
are forbidden, and places zero probability on them
• The convolution ‘prior’ says the identical and zero
weights
• The pooling forces the invariance of small translations
The neuroscientific basis for CNN
• The primary visual cortex, V1, about which we know the most
• The brain region LGN, lateral geniculate nucleus, at the back of the head carries the
signal from the eye to V1, a convolutional layer captures three aspects of V1
• It has a 2-dimensional structure
• V1 contains many simple cells, linear units
• V1 has many complex cells, corresponding to features with shift invariance, similar to pooling
• When viewing an object, info flows from the retina, through LGN, to V1, then
onward to V2, then V4, then IT, inferotemporal cortex, corresponding to the last
layer of CNN features
• Not modeled at all. The mammalian vision system develops an attention mechanism
• The human eye is mostly very low resolution, except for a tiny patch fovea.
• The human brain makes several eye movements saccades to salient parts of the scene
• The human vision perceives 3D
• A simple cell responds to a specific spatial frequency of brightness in a specific
direction at a specific location --- Gabor-like functions
Receptive field
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of
neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to
a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there
are multiple neurons (5 in this example) along the depth, all looking at the same region in the input
- see discussion of depth columns in text below. Right: The neurons from the Neural Network
chapter remain unchanged: They still compute a dot product of their weights with the input
followed by a non-linearity, but their connectivity is now restricted to be local spatially.
Receptive field
Gabor functions
Gabor-like learned features
CNN architectures and algorithms
CNN architectures
• The conventional linear structure, linear list of layers, feedforward
• Generally a DAG, directed acyclic graph
• ResNet simply adds back
• Different terminology: complex layer and simple layer
– A complex (complete) convolutional layer, including different stages such
as convolution per se, batch normalization, nonlinearity, and pooling
– Each stage is a layer, even if there are no parameters
• The traditional CNNs are just a few complex convolutional layers to
extract features, then are followed by a softmax classification
output layer
• Convolutional networks output a high-dimensional, structured
object, rather than just predicting a class label for a classification
task or a real value for a regression task; the output is a tensor
– S_i,j,k is the probability that pixel (j, k) belongs to class i
The popular CNN
• LeNet, 1998
• AlexNet, 2012
• VGGNet, 2014
• ResNet, 2015
VGGNet
• 16 layers
• Only 3*3
convolutions
• 138 million
parameters
ResNet
• 152 layers
• ResNet50
Computational complexity
• The memory bottleneck
• GPU, a few GB
Stochastic Gradient Descent
• Gradient descent follows the gradient of an
entire training set downhill
• Stochastic gradient descent follows the gradient
of randomly selected minibatches downhill
The dropout regularization
• Randomly shutdown a subset of units in training
• It is a sparse representation
• It is a different net each time, but all nets share the parameters
• A net with n units can be seen as a collection of 2^n possible thinned nets,
all of which share weights.
• At test time, it is a single net with averaging
• Avoid over-fitting
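A sketch of the inverted-dropout variant for one layer's activations (the keep probability and names are illustrative assumptions; this variant folds the test-time averaging into the training-time rescaling):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.8, training=True):
    if not training:
        return activations            # at test time the single net is used as-is
    # Randomly shut down a subset of units...
    mask = (rng.random(activations.shape) < keep_prob).astype(activations.dtype)
    # ...and rescale so the expected activation matches test time.
    return activations * mask / keep_prob

a = np.ones((2, 5))
print(dropout(a))
```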
Smaller Network: CNN
• We know it is good to learn a small model.
• From this fully connected model, do we really need all the edges?
• Can some of these be shared?
Consider learning an image:
• Some patterns are much smaller than the whole
image
“beak” detector
Can represent a small region with fewer parameters
Same pattern appears in different places:
They can be compressed!
What about training a lot of such “small” detectors
and each detector must “move around”.
“upper-left beak” detector
“middle beak” detector
They can be compressed
to the same parameters.
A convolutional layer
A filter
A CNN is a neural network with some convolutional layers
(and some other layers). A convolutional layer has a number
of filters that does convolutional operation.
Beak detector
Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
…
…
These are the network
parameters to be learned.
Each filter detects a
small pattern (3 x 3).
Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
3 -1
stride=1 (the dot product of the filter with each 3 x 3 patch of the image)
Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
3 -3
If stride=2
Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
stride=1
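The 4 x 4 feature map above can be reproduced with a few lines of NumPy. This is a minimal sketch of valid (no-padding) convolution with an adjustable stride, not a production implementation.

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def convolve2d(img, kernel, stride=1):
    """Valid (no-padding) convolution as used on the slides."""
    k = kernel.shape[0]
    out_size = (img.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)       # dot product of patch and filter
    return out

print(convolve2d(image, filter1, stride=1))  # 4 x 4 map whose first row is 3, -1, -3, -1
print(convolve2d(image, filter1, stride=2))  # first row becomes 3, -3
```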
Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
-1 -1 -1 -1
-1 -1 -2 1
-1 -1 -2 1
-1 0 -4 3
Repeat this for each filter
stride=1
Two 4 x 4 images,
forming a 2 x 4 x 4 feature map
Color image: RGB 3 channels
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
1 -1 -1
-1 1 -1
-1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 1 -1
-1 1 -1
-1 1 -1
-1 1 -1
-1 1 -1
-1 1 -1
Color image
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
image
convolution
-1 1 -1
-1 1 -1
-1 1 -1
1 -1 -1
-1 1 -1
-1 -1 1
(The 6 x 6 image is flattened into a vector of inputs x_1, x_2, …, x_36 before being fed to the fully connected view of the convolution.)
Convolution vs. Fully Connected
Fully-
connected
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
When the 6 x 6 image is flattened into a 36-dimensional input vector, the first output of the
feature map (value 3) is connected to only 9 of the 36 inputs (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15),
not fully connected: fewer parameters!
(Same 6 x 6 image and Filter 1 as above.) The second output of the feature map (value -1) is
connected to pixels 2, 3, 4, 8, 9, 10, 14, 15, and 16, and it reuses exactly the same 9 weights.
Shared weights mean fewer parameters still: even fewer parameters.
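To make the savings concrete, a quick back-of-the-envelope count for this 6 x 6 example; the "16 outputs" mirror the 4 x 4 feature map, and the dense baseline is an assumption for comparison only.

```python
inputs = 6 * 6          # flattened image
outputs = 4 * 4         # one 4 x 4 feature map

fully_connected = inputs * outputs    # every output sees every pixel: 576 weights
locally_connected = outputs * 3 * 3   # each output sees only a 3 x 3 patch: 144 weights
shared_filter = 3 * 3                 # all outputs share one 3 x 3 filter: 9 weights

print(fully_connected, locally_connected, shared_filter)  # 576 144 9
```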
The whole CNN
Fully Connected
Feedforward network
cat, dog, ……
Convolution
Max Pooling
Convolution
Max Pooling
Flattened
Can
repeat
many
times
Max Pooling
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
-1 -1 -1 -1
-1 -1 -2 1
-1 -1 -2 1
-1 0 -4 3
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
Why Pooling
• Subsampling pixels will not change the object
(a subsampled bird image still shows a bird)
• We can subsample the pixels to make the image
smaller, so fewer parameters are needed to characterize the image
A CNN compresses a fully connected network
in two ways:
• Reducing number of connections
• Shared weights on the edges
• Max pooling further reduces the complexity
Max pooling is a pooling operation that selects the maximum
element from the region of the feature map covered by the
filter. Thus, the output after max-pooling layer would be a
feature map containing the most prominent features of the
previous feature map
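Applying 2 x 2 max pooling to the two 4 x 4 feature maps computed earlier reproduces the 2 x 2 outputs shown in this example; a minimal NumPy sketch of non-overlapping max pooling:

```python
import numpy as np

feature_map1 = np.array([
    [ 3, -1, -3, -1],
    [-3,  1,  0, -3],
    [-3, -3,  0,  1],
    [ 3, -2, -2, -1],
])

def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = fmap.shape
    return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

print(max_pool(feature_map1))   # [[3, 0], [3, 1]]
```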
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
After convolution and 2 x 2 max pooling, the outputs are two 2 x 2 images:
Filter 1 → [[3, 0], [3, 1]]; Filter 2 → [[-1, 1], [0, 3]]
Each filter
is a channel
New image
but smaller
Conv
Max
Pooling
Max Pooling
The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
Can
repeat
many
times
A new image
The number of channels
is the number of filters
Smaller than the original
image
Pooled feature maps from the example: [[3, 0], [3, 1]] and [[-1, 1], [0, 3]]
The whole CNN
Fully Connected
Feedforward network
cat, dog, ……
Convolution
Max Pooling
Convolution
Max Pooling
Flattened
A new image
A new image
Flattening
The pooled 2 x 2 feature maps ([[3, 0], [3, 1]] and [[-1, 1], [0, 3]]) are
flattened into a single vector: 3, 0, 1, 3, -1, 1, 0, 3
Fully Connected
Feedforward network
Only modified the network structure and input
format (vector -> 3-D tensor)
CNN in Keras
Convolution
Max Pooling
Convolution
Max Pooling
input
1 -1 -1
-1 1 -1
-1 -1 1
-1 1 -1
-1 1 -1
-1 1 -1
There are
25 3x3
filters.
…
…
Input_shape = ( 28 , 28 , 1)
1: black/white, 3: RGB
28 x 28 pixels
(e.g., a 2 x 2 region [[3, -1], [-3, 1]] is max-pooled to 3)
Only modified the network structure and input
format (vector -> 3-D array)
CNN in Keras
Convolution
Max Pooling
Convolution
Max Pooling
Input
1 x 28 x 28
25 x 26 x 26
25 x 13 x 13
50 x 11 x 11
50 x 5 x 5
How many parameters for each filter in the first convolution layer? 9 (a 3 x 3 x 1 filter).
How many parameters for each filter in the second convolution layer? 225 = 25 x 9 (a 3 x 3 x 25 filter).
Only modified the network structure and input
format (vector -> 3-D array)
CNN in Keras
Convolution
Max Pooling
Convolution
Max Pooling
Input
1 x 28 x 28
25 x 26 x 26
25 x 13 x 13
50 x 11 x 11
50 x 5 x 5
Flattened
1250
Fully connected
feedforward network
Output
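Putting the shapes above together, a minimal Keras sketch of this architecture, written with a channels-last input of (28, 28, 1); the ReLU activations, dense-layer width, and 10-class output are assumptions for an MNIST-style task, while the filter counts and resulting shapes follow the slides.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(25, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # -> 26 x 26 x 25
    layers.MaxPooling2D((2, 2)),                                            # -> 13 x 13 x 25
    layers.Conv2D(50, (3, 3), activation="relu"),                           # -> 11 x 11 x 50
    layers.MaxPooling2D((2, 2)),                                            # ->  5 x  5 x 50
    layers.Flatten(),                                                       # -> 1250
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.summary()
```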
AlphaGo
Neural
Network
(19 x 19
positions)
Next move
19 x 19 matrix
Black: 1
white: -1
none: 0
Fully-connected feedforward network
can be used
But CNN performs much better
AlphaGo’s policy network
Note: AlphaGo does not use max pooling.
(The slide shows the relevant quotation from their Nature article.)
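As a hedged illustration only (not AlphaGo's exact configuration), a policy-network-style CNN keeps the full 19 x 19 spatial resolution throughout, using padded convolutions and no pooling, and ends with a softmax over the 361 board positions. The number of filters and of layers below are arbitrary assumptions.

```python
from tensorflow.keras import layers, models

board_planes = 1   # the slide's simple encoding: black = 1, white = -1, none = 0

model = models.Sequential([
    layers.Conv2D(64, (3, 3), padding="same", activation="relu",
                  input_shape=(19, 19, board_planes)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(1, (1, 1)),   # one score per board position, resolution preserved
    layers.Flatten(),           # 19 * 19 = 361 logits
    layers.Softmax(),           # probability distribution over the next move
])
```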
CNN in speech recognition
Time
Frequency
Spectrogram
CNN
Image
The filters move in the
frequency direction.
CNN in text classification
Source of image:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
  • 18. Huber Loss • Using the MAE for larger loss values mitigates the weight that we put on outliers so that we still get a well-rounded model. At the same time we use the MSE for the smaller loss values to maintain a quadratic function near the center. • This has the effect of magnifying the loss values as long as they are greater than 1. Once the loss for those data points dips below 1, the quadratic function down- weights them to focus the training on the higher-error data points.
  • 19. Entropy • The core idea of information theory is that the "informational value" of a communicated message depends on the degree to which the content of the message is surprising. • If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative. • For instance, the knowledge that some particular number will not be the winning number of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that a particular number will win a lottery has high informational value because it communicates the outcome of a very low probability event. Entropy: Origin of the Second Law of Thermodynamics
  • 20. Entropy • Entropy measures the expected (i.e., average) amount of information conveyed by identifying the outcome of a random trial. This implies that casting a die has higher entropy than tossing a coin because each outcome of a die toss has smaller probability (about p = 1 / 6 than each outcome of a coin toss ( p = 1 / 2 ).
  • 21. https://www.youtube.com/watch?v=Tr_gv5CKB1Y Boltzmann's Entropy Equation: A History from Clausius to Planck Lecture 04, concept 12: Deriving the Boltzmann distribution - general case https://www.youtube.com/watch?v=tDKjLzbXYQI The Maxwell–Boltzmann distribution | AP Chemistry | Khan Academy https://www.youtube.com/watch?v=xQ9D4Jz95-A&t=10s
  • 23. Relationship to thermodynamic entropy - Boltzmann Data compression - Shannon Entropy as a measure of diversity Entropy in cryptography - Kullback–Leibler divergence (statistical distance) Data as a Markov process Entropy for continuous random variables Prior probability?
  • 24.
  • 28.
  • 29. Intuitively Understanding the KL Divergence https://www.youtube.com/watch?v=SxGYPqCgJWM
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38. What’s the Difference between a Loss Function and a Cost Function? • although cost function and loss function are synonymous and used interchangeably, they are different. • A loss function is for a single training example. It is also sometimes called an error function. A cost function, on the other hand, is the average loss over the entire training dataset. The optimization strategies aim at minimizing the cost function.
  • 40. Learning Rate The learning rate affects the amount by which you adjust parameters during optimization in order to minimize the error of neural network’s guesses. It is a coefficient that scales the size of the steps (updates) a neural network takes to its parameter vector x as it crosses the loss function space. During backpropagation we multiply the error gradient by the learning rate, and then update a connection weight’s last iteration with the product to reach a new weight. The learning rate determines how much of the gradient we want to use for the algorithm’s next step. A large error and steep gradient combine with the learning rate to produce a large step. As we approach minimal error and the gradient flattens out, the step size tends to shorten.
  • 41. Learning Rate A large learning rate coefficient (e.g., 1) will make your parameters take leaps, and small ones (e.g., 0.00001) will make it inch along slowly. Large leaps will save time initially, but they can be disastrous if they lead us to overshoot our minimum. A learning rate too large oversteps the nadir, making the algorithm bounce back and forth on either side of the minimum without ever coming to rest. In contrast, small learning rates should lead you eventually to an error minimum (it might be a local minimum rather than a global one), but they can take a very long time and add to the burden of an already computationally intensive process. Time matters when neural network training can take weeks on large datasets. If you can’t wait another week for the results, choose a moderate learning rate (e.g., 0.1) and experiment with several others in the same ballpark to get the best speed and accuracy at once
  • 42. Regularization Regularization helps with the effects of out-of-control parameters by using different methods to minimize parameter size over time. In mathematical notation, we see regularization represented by the coefficient lambda, controlling the trade-off between finding a good fit and keeping the value of certain feature weights low as the exponents on features increase. Regularization coefficients L1 and L2 help fight overfitting by making certain weights smaller. Smaller-valued weights lead to simpler hypotheses, and simpler hypotheses are the most generalizable. Unregularized weights with several higher- order polynomials in the feature set tend to overfit the training set. As the input training set size grows, the effect of regularization decreases and the parameters tend to increase in magnitude. This is appropriate, because an excess of features relative to training set examples leads to overfitting in the first place. Bigger data is the ultimate regularizer
  • 43.
  • 44. Intuitive Explanation of Ridge / Lasso Regression https://www.youtube.com/watch?v=9LNpiiKCQUo&t=524s
  • 45. Momentum Momentum helps the learning algorithm get out of spots in the search space where it would otherwise become stuck. In the error-scape, it helps the updater find the gulleys that lead toward the minima. Momentum is to the learning rate what the learning rate is to weights, and it helps us produce better quality models
  • 46. Optimization Tricks: momentum, batch-norm, and more https://www.youtube.com/watch?v=kK8-jCCR4is
  • 47. Sparsity The sparsity hyperparameter recognizes that for some inputs only a few features are relevant. For example, let’s assume that a network can classify a million images. Any one of those images will be indicated by a limited number of features. But to effectively classify millions of images a network must be able to recognize considerably more features, many of which don’t appear most of the time. An example of this would be how photos of sea urchins don’t contain noses and hooves. This contrasts to how in submarine images the nose and hoof features will be 0. The features that indicate sea urchins will be few and far between, in the vastness of the neural network’s layers. That’s a problem, because sparse features can limit the number of nodes that activate and impede a network’s ability to learn. In response to sparsity, biases force neurons to activate and the activations stay around a mean that keeps the network from becoming stuck. Hoofs
  • 49. Optimization algorithms for deep learning The gradient descent algorithm is not the only optimization algorithm available to optimize our network weights, however it's the basis for most other algorithms.
  • 50. Using momentum with gradient descent Using gradient descent with momentum speeds up gradient descent by increasing the speed of learning in directions the gradient has been constant in direction while slowing learning in directions the gradient fluctuates in direction. It allows the velocity of gradient descent to increase. Momentum works by introducing a velocity term, and using a weighted moving average of that term in the update rule, as follows: Most typically is set to 0.9 in the case of momentum, and usually this is not a hyperparameter that needs to be changed.
  • 51. The RMSProp algorithm (Geoffrey Hinton) RMSProp is another algorithm that can speed up gradient descent by speeding up learning in some directions, and dampening oscillations in other directions, across the multidimensional space that the network weights represent: https://www.coursera.org/learn/neural-networks/home/welcome
  • 52. RMSprop • There are two ways to introduce RMSprop. • First, is to look at it as the adaptation of rprop algorithm for mini- batch learning. It was the initial motivation for developing this algorithm. • Another way is to look at its similarities with Adagrad and view RMSprop as a way to deal with its radically diminishing learning rates.
  • 53. Rprop to RMSprop • Rprop doesn’t really work when we have very large datasets and need to perform mini-batch weights updates. Why it doesn’t work with mini-batches ? Well, people have tried it, but found it hard to make it work. The reason it doesn’t work is that it violates the central idea behind stochastic gradient descent, which is when we have small enough learning rate, it averages the gradients over successive mini-batches. Consider the weight, that gets the gradient 0.1 on nine mini-batches, and the gradient of - 0.9 on tenths mini-batch. What we’d like is to those gradients to roughly cancel each other out, so that the stay approximately the same. But it’s not what happens with rprop. With rprop, we increment the weight 9 times and decrement only once, so the weight grows much larger. • To combine the robustness of rprop (by just using sign of the gradient), efficiency we get from mini-batches, and averaging over mini-batches which allows to combine gradients in the right way, we must look at rprop from different perspective. Rprop is equivalent of using the gradient but also dividing by the size of the gradient, so we get the same magnitude no matter how big a small that particular gradient is. The problem with mini-batches is that we divide by different gradient every time, so why not force the number we divide by to be similar for adjacent mini-batches ? The central idea of RMSprop is keep the moving average of the squared gradients for each weight. And then we divide the gradient by square root the mean square. Which is why it’s called RMSprop(root mean square). With math equations the update rule looks like this:
  • 54. • As you can see from the above equation we adapt learning rate by dividing by the root of squared gradient, but since we only have the estimate of the gradient on the current mini-batch, wee need instead to use the moving average of it. Default value for the moving average parameter that you can use in your projects is 0.9. It works very well for most applications. In code the algorithm might look like this:
  • 55. Similarity with Adagrad • Adagrad is adaptive learning rate algorithms that looks a lot like RMSprop. Adagrad adds element-wise scaling of the gradient based on the historical sum of squares in each dimension. This means that we keep a running sum of squared gradients. And then we adapt the learning rate by dividing it by that sum. In code we can express it like this: Adagrad: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from http://jmlr.org/papers/v12/duchi11a.html
  • 56.
  • 57. RMSprop • What’s this scaling does when we have high condition number ? If we have two coordinates — one that has always big gradients and one that has small gradients we’ll be diving by the corresponding big or small number so we accelerate movement among small direction, and in the direction where gradients are large we’re going to slow down as we divide by some large number. • What happens over the course of training ? Steps get smaller and smaller and smaller, because we keep updating the squared grads growing over training. So we divide by the larger number every time. In the convex optimization, this makes a lot of sense, because when we approach minina we want to slow down. In non-convex case it’s bad as we can get stuck on saddle point. We can look at RMSprop as algorithms that addresses that concern a little bit. • With RMSprop we still keep that estimate of squared gradients, but instead of letting that estimate continually accumulate over training, we keep a moving average of it.
  • 59. The Adam optimizer Adam is one of the best performing known optimizer and it's my first choice. It works well across a wide variety of problems. It combines the best parts of both momentum and RMSProp into a single update rule:
  • 60.
  • 61. Bias and variance errors in deep learning With traditional predictive models, there is usually some compromise when we try to find an error from bias and an error from variance. So let's see what these two errors are: Bias error: Bias error is the error that is introduced by the model. For example, if you attempted to model a non-linear function with a linear model, your model would be under specified and the bias error would be high. Variance error: Variance error is the error that is introduced by randomness in the training data. When we fit our training distribution so well that our model no longer generalizes, we have overfit or introduce a variance error.
  • 62. The train, val, and test datasets The val dataset, or the validation dataset, will be used to find ideal hyperparameters, and to measure overfitting. At the end of an epoch, which is when the network has has the opportunity to observe every data point in the training set, we will make a prediction on the val set. That prediction will be used to watch for overfitting and will help us know when the network has finished training. Using the val set at the end of each epoch like this somewhat differs from the typical usage.
  • 63.
  • 64. K-Fold cross-validation: Too Expensive Shuffle the dataset randomly. Split the dataset into k groups For each unique group: Take the group as a hold out or test data set Take the remaining groups as a training data set Fit a model on the training set and evaluate it on the test set Retain the evaluation score and discard the model Summarize the skill of the model using the sample of model evaluation scores
  • 65.
  • 66. Variations on Cross-Validation • Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single train/test split is created to evaluate the model. • LOOCV: Taken to another extreme, k may be set to the total number of observations in the dataset such that each observation is given a chance to be the held out of the dataset. This is called leave-one-out cross-validation, or LOOCV for short. • Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation. • Repeated: This is where the k-fold cross-validation procedure is repeated n times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample. • Nested: This is where k-fold cross-validation is performed within each fold of cross- validation, often to perform hyperparameter tuning during model evaluation. This is called nested cross-validation or double cross-validation.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75. The Mathematics of Neural Networks https://www.youtube.com/watch?v=e5xKayCBOeU
  • 76.
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.
  • 82. Hyperparameters In machine learning, we have both model parameters and parameters we tune to make networks train better and faster. These tuning parameters are called hyperparameters, and they deal with controlling optimization functions and model selection during training with our learning algorithm. Hyperparameter selection focuses on ensuring that the model neither underfits nor overfits the training dataset, while learning the structure of the data as quickly as possible.
  • 83. Hyperparameters Hyperparameters fall into several categories: • Layer size • Magnitude (momentum, learning rate) • Regularization (dropout, drop connect, L1, L2) • Activations (and activation function families) • Weight initialization strategy • Loss functions • Settings for epochs during training (mini-batch size) • Normalization scheme for input data (vectorization)
  • 84. Boltzmann Machine • A Boltzmann machine (also called Sherrington–Kirkpatrick model with external field or stochastic Ising–Lenz–Little model) is a stochastic spin- glass model with an external field, i.e., a Sherrington–Kirkpatrick model, that is a stochastic Ising Model. It is a statistical physics technique applied in the context of cognitive science. It is also classified as Markov random field. • Boltzmann machines are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes. • Boltzmann machines with unconstrained connectivity have not been proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems.
  • 85. Boltzmann Machine • They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function. They were heavily popularized and promoted by Geoffrey Hinton, Terry Sejnowski and Yann LeCun in cognitive sciences communities and in machine learning. As a more general class within machine learning these models are called "energy based models" (EBM), because Hamiltonian of spin glasses are used as a starting point to define the learning task.
  • 86.
  • 87.
  • 88. General Boltzmann Machine A network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off.
  • 89.
  • 90. Restricted Boltzmann Machines (RBMs) • A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. • RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986, and rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid- 2000. RBMs have found applications in dimensionality reduction, classification, collaborative filtering, feature learning, topic modelling and even many body quantum mechanics. They can be trained in either supervised or unsupervised ways, depending on the task.
  • 91. Restricted Boltzmann Machines (RBMs) • As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the "visible" and "hidden" units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm. • Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by "stacking" RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.
  • 92. Restricted Boltzmann Machines (RBMs) • The “restricted” part of the name “Restricted Boltzmann Machines” means that connections between nodes of the same layer are prohibited (e.g., there are no visible-visible or hidden-hidden connections along which signal passes). • RBMs are also a type of autoencoder • https://github.com/echen/restricted-boltzmann-machines
  • 93. Learning More About Restricted Boltzmann Machines A Practical Guide to Training Restricted Boltzmann Machines http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf Bay Area Vision Meeting: Unsupervised Feature Learning and Deep Learning https://www.youtube.com/watch?v=ZmNOAtZIgIk Restricted Boltzmann Machines for Collaborative Filtering https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf
  • 94. Network layout (RBM) • Visible units • Hidden units • Weights • Visible bias units • Hidden bias units
  • 95. Contrastive Divergence (FYI) RBMs calculate gradients by using an algorithm called contrastive divergence. Contrastive divergence is the name of the algorithm used in sampling for the layer-wise pretraining of a RBM. Also called CD-k, contrastive divergence minimizes the KL divergence (the delta between the real distribution of the data and the guess) by sampling k steps of a Markov chain to compute a guess.
  • 96. Reconstruction Cross-Entropy The objective function here is usually reconstruction cross-entropy, or KL divergence (the mathematicians and cryptanalysts Solomon Kullback and Richard Leibler first published a paper on the technique in 1951). “Cross” refers to the comparison between two distributions. “Entropy” is a term from information theory that refers to uncertainty. For example, a normal curve with a wide spread, or variance, also implies more uncertainty about where data points will fall. That uncertainty is called entropy.
  • 99.
  • 100.
  • 101.
  • 102.
  • 103.
  • 104.
  • 105.
  • 106.
  • 107.
  • 108.
  • 109.
  • 110.
  • 111. RBMs Usage • Dimensionality reduction • Classification • Regression • Collaborative filtering • Topic modeling
  • 113. More references Lecture 11/16 : Hopfield nets and Boltzmann machines https://www.youtube.com/watch?v=IP3W7cI01VY Lecture 11.5 — How a Boltzmann machine models data — [ Deep Learning | Geoffrey Hinton | UofT ] https://www.youtube.com/watch?v=kytxEr0KK7Q Neural networks [5.7] : Restricted Boltzmann machine - example https://www.youtube.com/watch?v=n26NdEtma8U S18 Lecture 22: Boltzmann Machines https://www.youtube.com/watch?v=it_PXVIMyWg&t=6s
  • 114. Four Major Architectures of Deep Networks Unsupervised Pretrained Networks (UPNs) Autoencoders Deep Belief Networks (DBNs) Generative Adversarial Networks (GANs) Convolutional Neural Networks (CNNs) Recurrent Neural Networks Recursive Neural Networks
  • 115. Unsupervised Pretrained Networks (UPNs) • In SGD optimization, one typically initiates model weights at random and tries to go towards minimum cost by following the opposite of gradient of objective function. For deep nets, this has not shown much of success and it is believed to be result of extremely non-convex (and high-dimensional) nature of their objective function. • What Y. Bengio and others found out was that, instead of starting weights at random and hoping that SGD will take you to minimum point of such a rugged landscape, you can pre-train each layer like an autoencoder. • Here is how it works: you build an autoencoder with first layer as encoding layer and the transpose of that as decoder. And you train it unsupervised, that is you train it to reconstruct the input (refer to AutoEncoders, they are great for unsupervised feature extraction tasks). Once trained, you fix weights of that layer to those you just found. Then, you move to next layers and repeat the same until you pre-train all layers of deep net (greedy approach). At this point, you go back to the original problem that you wanted to solve with deep net (classification/regression) and you optimize it with SGD but starting from weights you just learned during pre-training. • They found that this gives much better results. I think no one knows why exactly this works, but the idea is that by pre-training you start from more favorable regions of feature space.
  • 116. Unsupervised Pretrained Networks (UPNs) • Unsupervised pre-training initializes a discriminative neural net from one which was trained using an unsupervised criterion, such as a deep belief network or a deep autoencoder. This method can sometimes help with both the optimization and the overfitting issues. • Unsupervised Pre-training Acts as a Regularizer
  • 117. Autoencoders • We use autoencoders to learn compressed representations of datasets. Typically, we use them to reduce a dataset’s dimensionality. The output of the autoencoder network is a reconstruction of the input data in the most efficient form.
  • 118.
  • 119.
  • 120. Defining Features of Autoencoders • Autoencoders differ from multilayer perceptrons in a couple of ways: • They use unlabeled data in unsupervised learning. • They build a compressed representation of the input data. • Unsupervised learning of unlabeled data. The autoencoder learns directly from unlabeled data. This is connected to the second major difference between multilayer perceptrons and autoencoders. • Learning to reproduce the input data. The goal of a multilayer perceptron network is to generate predictions over a class (e.g., fraud versus not fraud). An autoencoder is trained to reproduce its own input data.
  • 121. Defining Features of Autoencoders • Autoencoders rely on backpropagation to update their weights. The main difference between RBMs and the more general class of autoencoders is in how they calculate the gradients. • Two important variants of autoencoders to note are compression autoencoders and denoising autoencoders. Compression autoencoders. The network input must pass through a bottleneck region of the network before being expanded back into the output representation. Denoising autoencoders. The denoising autoencoder is the scenario in which the autoencoder is given a corrupted version (e.g., some features are removed randomly) of the input and the network is forced to learn the uncorrupted output.
  • 122. Variational Autoencoders • A more recent type of autoencoder model is the variational autoencoder (VAE) introduced by Kingma and Welling. The VAE is similar to compression • and denoising autoencoders in that they are all trained in an unsupervised manner to reconstruct inputs. • However, the mechanisms that the VAEs use to perform training are quite different. In a compression/denoising autoencoder, activations are mapped to activations throughout the layers, as in a standard neural network; comparatively, a VAE uses a probabilistic approach for the forward pass. https://arxiv.org/abs/1312.6114
  • 123. Deep Belief Networks • DBNs are composed of layers of Restricted Boltzmann Machines (RBMs) for the pretrain phase and then a feed-forward network for the fine-tune phase.
  • 124. Feature Extraction with RBM Layers We use RBMs to extract higher-level features from the raw input vectors. To do that, we want to set the hidden unit states and weights such that when we show the RBM an input record and ask the RBM to reconstruct the record the record, it generates something pretty close to the original input vector. Hinton talks about this effect in terms of how machines “dream about data.” The fundamental purpose of RBMs in the context of deep learning and DBNs is to learn these higher-level features of a dataset in an unsupervised training fashion. It was discovered that we could train better neural networks by letting RBMs learn progressively higher-level features using the learned features from a lower level RBM pretrain layer as the input to a higher-level RBM pretrain layer.
  • 125. Learning Higher-order Features Automatically Activation render at the beginning of training Features emerge in later activation render Portions of MNIST digits emerge towards end of training
  • 126. Building Autoencoders in Keras a simple autoencoder based on a fully-connected layer a sparse autoencoder a deep fully-connected autoencoder a deep convolutional autoencoder an image denoising model a sequence-to-sequence autoencoder a variational autoencoder https://blog.keras.io/building-autoencoders-in-keras.html
  • 127.
  • 128. What are autoencoders? • "Autoencoding" is a data compression algorithm where the compression and decompression functions are 1) data-specific, 2) lossy, and 3) learned automatically from examples rather than engineered by a human. Additionally, in almost all contexts where the term "autoencoder" is used, the compression and decompression functions are implemented with neural networks. • 1) Autoencoders are data-specific, which means that they will only be able to compress data similar to what they have been trained on. This is different from, say, the MPEG-2 Audio Layer III (MP3) compression algorithm, which only holds assumptions about "sound" in general, but not about specific types of sounds. An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees, because the features it would learn would be face-specific. • 2) Autoencoders are lossy, which means that the decompressed outputs will be degraded compared to the original inputs (similar to MP3 or JPEG compression). This differs from lossless arithmetic compression. • 3) Autoencoders are learned automatically from data examples, which is a useful property: it means that it is easy to train specialized instances of the algorithm that will perform well on a specific type of input. It doesn't require any new engineering, just appropriate training data. • The fact that autoencoders are data-specific makes them generally impractical for real-world data compression problems: you can only use them on data that is similar to what they were trained on, and making them more general thus requires lots of training data.
  • 130. Code • simplest possible autoencoder.ipynb
  • 131. Deep Belief Network (DBN) • In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not between units within each layer. • When trained on a set of examples without supervision, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors. After this learning step, a DBN can be further trained with supervision to perform classification. • DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each sub-network's hidden layer serves as the visible layer for the next. An RBM is an undirected, generative energy-based model with a "visible" input layer and a hidden layer and connections between but not within layers. This composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the lowest visible layer is a training set). • The observation that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms.:6 Overall, there are many attractive implementations and uses of DBNs in real-life applications and scenarios (e.g., electroencephalography, drug discovery http://www.scholarpedia.org/article/Deep_belief_networks
  • 132.
  • 133.
  • 135. References • Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007) Greedy Layer-Wise Training of Deep Networks, Advances in Neural Information Processing Systems 19, MIT Press, Cambridge, MA. • Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527- 1554. • Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313:504-507. • Larochelle, H., Erhan, D., Courville, A., Bergstra, J., Bengio, Y. (2007) An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation. International Conference on Machine Learning. • LeCun, Y. and Bengio, Y. (2007) Scaling Learning Algorithms Towards AI. In Bottou et al. (Eds.) Large-Scale Kernel Machines, MIT Press. • M. Ranzato, F.J. Huang, Y. Boureau, Y. LeCun (2007) Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. Proc. of Computer Vision and Pattern Recognition Conference (CVPR 2007), Minneapolis, Minnesota, 2007 • Salakhutdinov, R. R. and Hinton,G. E. (2007) Semantic Hashing. In Proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models, Amsterdam. • Sutskever, I. and Hinton, G. E. (2007) Learning multilevel distributed representations for high-dimensional sequences. AI and Statistics, 2007, Puerto Rico. • Taylor, G. W., Hinton, G. E. and Roweis, S. (2007) Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems 19, MIT Press, Cambridge, MA • Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. Advances in Neural Information Processing Systems 17, pages 1481-1488. MIT Press, Cambridge, MA.
  • 136. Generative Adversarial Networks (GANs) • The Generative Adversarial Network (GAN) comprises of two models: a generative model G and a discriminative model D. The generative model can be considered as a counterfeiter who is trying to generate fake currency and use it without being caught, whereas the discriminative model is similar to police, trying to catch the fake currency. This competition goes on till the counterfeiter becomes smart enough to successfully fool the police.
  • 139. Generative Adversarial Networks (GANs) • Discriminator: its role is to distinguish between real and generated (fake) data. • Generator: its role is to create data in such a way that it can fool the discriminator.
  • 140. Derivation of the loss function • Discriminator loss: the objective of the discriminator is to correctly classify the fake and the real data. For this, equations (1) and (2) should be maximized, and the final loss function for the discriminator combines these two terms.
  • 141. Generator loss: the generator is competing against the discriminator, so it tries to minimize equation (3); its loss function is the corresponding term. • Combined loss function: remember that this loss is valid only for a single data point; to consider the entire dataset we take the expectation of the equation, as shown below.
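For reference, the expectation form referred to above is the minimax objective from Goodfellow et al. (2014), cited on the references slide later in this section:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]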
  • 142. It can be noticed from the above algorithm that the generator and discriminator are trained separately. In the first section, real data and fake data are fed to the discriminator with the correct labels and training takes place; gradients are propagated while keeping the generator fixed. The discriminator is updated by ascending its stochastic gradient, because for the discriminator we want to maximize the loss function given in equation (6). In the second section, the generator is updated while keeping the discriminator fixed, passing fake data with fake labels in order to fool the discriminator; here the generator is updated by descending its stochastic gradient, because for the generator we want to minimize the loss function given in equation (6).
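A minimal sketch of this alternating procedure, assuming PyTorch; the toy data, network sizes, learning rates, and the non-saturating generator update (maximizing log D(G(z)) rather than minimizing log(1 - D(G(z)))) are illustrative choices, not taken from the slides.

```python
# Minimal GAN training loop sketch: alternate discriminator and generator updates.
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 3.0        # stand-in for real data
    z = torch.randn(64, latent_dim)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))), generator fixed.
    fake = G(z).detach()
    loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: fool D (non-saturating form), discriminator fixed.
    loss_G = bce(D(G(z)), torch.ones(64, 1))            # fake data, "real" labels
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```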
  • 143. Global Optimality of Pg = Pdata
  • 145. Limitations: Mode Collapse • During training, the generator may get stuck in a setting where it always produces the same output. This is called mode collapse. It happens because the main aim of G is to fool D, not to generate diverse outputs.
  • 146. References • Atienza, Rowel. Advanced Deep Learning with Keras: Apply deep learning techniques, autoencoders, GANs, variational autoencoders, deep reinforcement learning, policy gradients, and more. Packt Publishing Ltd, 2018. • Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014. • Wang, Zhengwei, Qi She, and Tomas E. Ward. “Generative Adversarial Networks: A Survey and Taxonomy.” arXiv preprint arXiv:1906.01529 (2019).
  • 149. Convolution • A convolution is an operation on two vectors, matrices, or tensors, that returns a third vector, matrix, or tensor. • https://deeplearningmath.org/convolutional-neural-networks.html
  • 155. What is a Color Model? • A color model is an abstract mathematical model that describes how colors can be represented as a set of numbers (e.g., a triple in RGB or a quad in CMYK). Color models can usually be described using a coordinate system, and each color in the system is represented by a single point in the coordinate space. • For a given color model, to interpret a tuple or quad as a color, we can define a set of rules and definitions used to accurately calibrate and generate colors, i.e. a mapping function. A color space identifies a specific combination of color models and mapping functions. Identifying the color space automatically identifies the associated color model. For example, Adobe RGB and sRGB are two different color spaces, both based on the RGB color model.
  • 156. RGB • The RGB color model stores individual values for red, green, and blue. With a color space based on the RGB color model, the three primaries are added together in varying amounts to produce colors ranging from black to white. • RGB color spaces are device-dependent: different scanners produce different color data when scanning the same image, and different monitors give different display results when rendering the same image. • There are many different RGB color spaces derived from this color model; standard RGB (sRGB) is a popular example.
  • 157. HSV • HSV (hue, saturation, value), also known as HSB (hue, saturation, brightness), is often used by artists because it is often more natural to think about a color in terms of hue and saturation than in terms of additive or subtractive color components. • The system is closer to people's experience and perception of color than RGB. For example, in painting terms, hue, saturation, and value are expressed in terms of color, shading, and toning. • The HSV model space can be described as an inverted hexagonal pyramid. • The top surface is a regular hexagon, showing the change in hue in the H direction; from 0° to 360° covers the entire spectrum of visible light. The six corners of the hexagon represent the positions of the six colors red, yellow, green, cyan, blue, and magenta, each of which is 60° apart. • The saturation S is represented by the direction from the center to the hexagonal boundary, and varies from 0 to 1. The closer to the hexagonal boundary, the higher the color saturation: the colors on the boundary are the most saturated (S = 1), while the saturation at the center of the hexagon is 0 (S = 0). • The height of the hexagonal pyramid (also known as the central axis) is denoted by V, which represents a black-to-white gradation from bottom to top. The bottom of V is black (V = 0); the top of V is white (V = 1).
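A quick way to see these hue/saturation/value ranges is Python's standard-library colorsys module (RGB and V in [0, 1], hue returned as a fraction of a full 360° turn); the sample colors are arbitrary.

```python
# RGB -> HSV with the standard library; pure red should map to H=0 deg, S=1, V=1.
import colorsys

samples = {"red": (1.0, 0.0, 0.0),
           "yellow": (1.0, 1.0, 0.0),
           "gray": (0.5, 0.5, 0.5)}

for name, (r, g, b) in samples.items():
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    print(f"{name:6s} -> hue={h * 360:5.1f} deg, saturation={s:.2f}, value={v:.2f}")
```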
  • 158. YUV • The Y′UV model defines a color space in terms of one luma component (Y′) and two chrominance (UV) components. The Y′ channel stores the black-and-white data; if there is only the Y component and no U or V components, the image is grayscale. • The Y component can be calculated with the equation Y = 0.299*R + 0.587*G + 0.114*B, which is the commonly used grayscale formula. The color-difference components U and V are derived from B-Y and R-Y, scaled by different proportions. • Compared with RGB, Y′UV does not necessarily store a triple for each pixel. Y′UV images can be sampled in several different ways. For example, YUV420 stores one luma value per pixel but only one pair of chroma values (a Cb (U) value and a Cr (V) value) for every 2×2 block of pixels, i.e. 6 bytes per 4 pixels.
  • 159. YUV • The scope of the terms Y′UV, YUV, YCbCr, YPbPr, etc., is sometimes ambiguous and overlapping. Historically, the terms YUV and Y′UV were used for a specific analog encoding of color information in television systems, while YCbCr was used for digital encoding of color information suited for video and still-image compression and transmission such as MPEG and JPEG. Today, the term YUV is commonly used in the computer industry to describe file-formats that are encoded using YCbCr.
  • 160. Color Space • Depending on the information represented by each pixel, images can be divided into binary images, grayscale images, RGB images, indexed images, etc. • Binary Image: each pixel value is a 0 or 1. Generally, 0 is black and 1 is white. • Grayscale Image: adds color depths between black and white. Such images are usually displayed as grayscales from the darkest black to the brightest white; each color depth is called a gray level, and the number of levels is usually denoted by L, so pixels take integer values between 0 and L-1. • RGB Image: in RGB (color) images, the information for each pixel requires a tuple of numbers, so a three-dimensional matrix is needed to represent the image. Almost all colors in nature can be composed of three colors: red (R), green (G), and blue (B), so each pixel can be represented by a red/green/blue tuple. • Indexed Image: consists of a colormap matrix, which uses direct mapping of pixel values in an array to colormap values; the color of each pixel is determined by looking up the corresponding value. We discuss this in more detail below.
  • 161. How an Image is Stored in Memory • The x86 hardware does not have an addressing mode that accesses elements of multi-dimensional arrays. When loading an image into memory space, the multi-dimensional object is converted into a one-dimensional array. Row-major ordering or column-major ordering are commonly used.
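A small NumPy illustration of the two orderings, using a toy 2 x 3 "image".

```python
# Row-major (C order) vs column-major (Fortran order) flattening of a tiny image.
import numpy as np

img = np.array([[1, 2, 3],
                [4, 5, 6]])          # 2 rows x 3 columns
print(img.flatten(order='C'))        # row major:    [1 2 3 4 5 6]
print(img.flatten(order='F'))        # column major: [1 4 2 5 3 6]
```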
  • 162. Seven Grayscale Conversion Algorithms • Method 1 - Averaging (aka “quick and dirty”) • Method 2 - Correcting for the human eye (sometimes called “luma” or “luminance,” though such terminology isn’t really accurate) • Method 3 – Desaturation • Method 4 - Decomposition (think of it as de-composition, e.g. not the biological process!) • Method 5 - Single color channel • Method 6 - Custom # of gray shades • Method 7 - Custom # of gray shades with dithering (in this example, horizontal error-diffusion dithering) https://tannerhelland.com/2011/10/01/grayscale-image-algorithm-vb6.html
  • 163. • M1: Gray = (Red + Green + Blue) / 3 • M2: Gray = Red * 0.3 + Green * 0.59 + Blue * 0.11 (Photoshop, GIMP); Gray = Red * 0.2126 + Green * 0.7152 + Blue * 0.0722 (luma); Gray = Red * 0.299 + Green * 0.587 + Blue * 0.114 (http://poynton.ca/) • M3: Gray = (Max(Red, Green, Blue) + Min(Red, Green, Blue)) / 2 • M4: Maximum decomposition: Gray = Max(Red, Green, Blue); Minimum decomposition: Gray = Min(Red, Green, Blue)
  • 164. • M5: Gray = Red, or Gray = Green, or Gray = Blue • M6: ConversionFactor = 255 / (NumberOfShades - 1); AverageValue = (Red + Green + Blue) / 3; Gray = Integer((AverageValue / ConversionFactor) + 0.5) * ConversionFactor • Notes: NumberOfShades is a value between 2 and 256; technically, any grayscale algorithm could be used to calculate AverageValue, it simply provides an initial gray value estimate; the "+ 0.5" addition is an optional term that imitates rounding during the integer conversion, and YMMV depending on which programming language you use, as some round automatically • M7: custom # of gray shades with dithering (method 7 above)
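A sketch of a few of these methods applied to one RGB pixel in plain Python; the pixel values and the number of shades are arbitrary examples.

```python
# Grayscale conversions for one RGB pixel (0-255 per channel); pixel is an example.
red, green, blue = 200, 120, 40

gray_m1 = (red + green + blue) / 3                              # M1: simple average
gray_m2 = red * 0.299 + green * 0.587 + blue * 0.114            # M2: luma weighting
gray_m3 = (max(red, green, blue) + min(red, green, blue)) / 2   # M3: desaturation

# M6: reduce to a custom number of gray shades
number_of_shades = 4                                            # any value from 2 to 256
conversion_factor = 255 / (number_of_shades - 1)
average_value = (red + green + blue) / 3
gray_m6 = int(average_value / conversion_factor + 0.5) * conversion_factor

print(gray_m1, gray_m2, gray_m3, gray_m6)
```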
  • 167. Check out • How to Colorize an Image • https://tannerhelland.com/2011/04/28/colorize-image-vb6.html https://github.com/tannerhelland/vb6-code/tree/master/Colorize-effect
  • 171. https://www.youtube.com/watch?v=ExOOElyZ2Hk This researcher created an algorithm that removes the water from underwater images
  • 176. Edge Detection • Canny edge detector • Kovalevsky • First-order Methods (Sobel) • Thresholding and linking • Edge thinning • Second-order approaches (early Marr–Hildreth operator) • Differential • Phase congruency-based • Phase Stretch Transform (PST) • Subpixel
  • 177. Sobel operator • The Sobel operator, sometimes called the Sobel–Feldman operator or Sobel filter, is used in image processing and computer vision, particularly within edge detection algorithms, where it creates an image emphasising edges. It is named after Irwin Sobel and Gary Feldman, colleagues at the Stanford Artificial Intelligence Laboratory (SAIL).
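A minimal sketch of the Sobel response, assuming SciPy; the tiny step-edge image is a made-up example.

```python
# Sobel edge response: horizontal and vertical gradients, then gradient magnitude.
import numpy as np
from scipy.signal import convolve2d

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])     # responds to vertical edges
sobel_y = sobel_x.T                  # responds to horizontal edges

image = np.zeros((8, 8))
image[:, 4:] = 1.0                   # a simple vertical step edge

gx = convolve2d(image, sobel_x, mode='same', boundary='symm')
gy = convolve2d(image, sobel_y, mode='same', boundary='symm')
magnitude = np.hypot(gx, gy)
print(magnitude.round(1))            # large values along the step, ~0 elsewhere
```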
  • 191. Convolutional Neural Networks (CNNs) • The goal of a CNN is to learn higher-order features in the data via convolutions. They are well suited to object recognition with images and consistently top image classification competitions. They can identify faces, individuals, street signs, platypuses, and many other aspects of visual data. CNNs overlap with text analysis via optical character recognition, but they are also useful when analyzing words as discrete textual units, and they are good at analyzing sound. The efficacy of CNNs in image recognition is one of the main reasons why the world recognizes the power of deep learning. • CNNs are good at building position- and (somewhat) rotation-invariant features from raw image data. https://poloclub.github.io/cnn-explainer/
  • 192. CNNs and Structure in Data • CNNs tend to be most useful when there is some structure to the input data: for example, images and audio have specific repeating patterns, and input values next to each other are related spatially. Conversely, the columnar data exported from a relational database management system (RDBMS) tends to have no such spatial relationship: columns that sit next to one another simply happen to be materialized that way in the exported view.
  • 193. The motivation of convolutions • Sparse interaction, or local connectivity • The receptive field of the neuron, i.e. the filter size • The connections are local in space (width and height), but always full in depth • A set of learnable filters • Parameter sharing: the weights are tied • Equivariant representation (translation equivariance)
  • 194. Convolution and matrix multiplication • Discrete convolution can be viewed as multiplication by matrix • The kernel is a doubly block circulant matrix • It is very sparse!
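A 1-D illustration of this view, assuming NumPy/SciPy: a sparse, banded (Toeplitz-structured) matrix built from a toy kernel reproduces np.convolve.

```python
# 1-D convolution expressed as a matrix-vector product.
import numpy as np
from scipy.linalg import toeplitz

x = np.array([1.0, 2.0, 3.0, 4.0])      # toy signal
k = np.array([1.0, -1.0])               # toy kernel

# Full convolution has length len(x) + len(k) - 1 = 5.
col = np.concatenate([k, np.zeros(len(x) - 1)])       # first column of the matrix
row = np.concatenate([[k[0]], np.zeros(len(x) - 1)])  # first row of the matrix
K = toeplitz(col, row)                                # 5 x 4 convolution matrix

print(K @ x)                 # [ 1.  1.  1.  1. -4.]
print(np.convolve(x, k))     # identical result
```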
  • 195. The ‘convolution’ operation • The convolution is commutative because the kernel is flipped • Many implementations compute a cross-correlation without flipping • A convolution can be defined in 1, 2, 3, and N dimensions • The 2D convolution used in CNNs is different from a true 3D convolution, which would integrate spatio-temporal information; the standard CNN convolution spreads only spatially • In a CNN, even for 3-channel RGB input images, the standard convolution is 2D within each channel • Each channel has a different filter (kernel) slice; the per-channel convolutions are summed over all channels to produce a scalar that feeds the nonlinearity • The filter in each channel is not normalized, so there is no need for separate linear-combination coefficients • A 1×1 convolution is a dot product across channels, i.e. a linear combination of the different channels • The backward pass of a convolution is also a convolution, with spatially flipped filters
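A small NumPy sketch of the per-channel-then-sum behaviour described above (implemented as cross-correlation, i.e. without flipping the kernel); the shapes and random values are illustrative.

```python
# Multi-channel "convolution" as used in CNNs: one 2-D kernel slice per input
# channel, cross-correlated with that channel, then summed into one feature map.
import numpy as np

H = W = 6
image = np.random.rand(3, H, W)          # 3-channel input (C, H, W), toy values
kernel = np.random.rand(3, 3, 3)         # one 3x3 slice per channel

out = np.zeros((H - 2, W - 2))           # single output map ("valid" positions)
for i in range(H - 2):
    for j in range(W - 2):
        patch = image[:, i:i+3, j:j+3]       # 3x3x3 patch across all channels
        out[i, j] = np.sum(patch * kernel)   # dot product, summed over channels
print(out.shape)                         # (4, 4)
```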
  • 196. The convolution layers • Stacking several small convolution layers is different from cascading convolutions, because each small convolution is followed by the ReLU nonlinearity • The nonlinearities make the features more expressive • Small filters give fewer parameters, but use more memory • Cascading simply enlarges the spatial extent of the receptive field • Is each conv layer also followed by a pooling layer? LeNet is not; AlexNet at first was not.
  • 197. The Pooling Layer • Reduce the spatial size • Reduce the amount of parameters • Avoid over-fitting • Backpropagation for a max: only routing the gradient to the input that had the highest value in the forward pass • It is unclear whether the pooling is essential.
  • 198. Pooling layer down-samples the volume spatially, independently in each depth slice of the input volume. Left: the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common down-sampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square).
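A NumPy sketch of exactly this down-sampling (2x2 filter, stride 2) on a volume of the size quoted above; the reshape trick is just one convenient way to express it.

```python
# 2x2 max pooling with stride 2 on an (H, W, C) volume: depth is preserved,
# spatial size is halved.
import numpy as np

x = np.random.rand(224, 224, 64)                 # the volume from the slide
h, w, c = x.shape
pooled = x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))  # max over each 2x2
print(pooled.shape)                              # (112, 112, 64)
```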
  • 199. The spatial hyperparameters • Depth • Stride • Zero-padding
  • 200. AlexNet 2012
  • 201. The convolution and pooling act as an infinitely strong prior! • A strong prior has very low entropy, e.g. a Gaussian with low variance • An infinitely strong prior says that some parameters are forbidden, and places zero probability on them • The convolution ‘prior’ says the identical and zero weights • The pooling forces the invariance of small translations
  • 202. The neuroscientific basis for CNNs • The primary visual cortex, V1, is the area we know the most about • The LGN (lateral geniculate nucleus), at the back of the head, carries the signal from the eye to V1 • A convolutional layer captures three aspects of V1: it has a 2-dimensional structure; V1 contains many simple cells, which act like linear units; and V1 has many complex cells, which respond to features with some shift invariance, similar to pooling • When viewing an object, information flows from the retina, through the LGN, to V1, then onward to V2, then V4, then IT (inferotemporal cortex), which corresponds to the last layer of CNN features • Much is not modeled at all: the mammalian vision system has an attention mechanism; the human eye is mostly very low resolution, except for a tiny patch, the fovea; the brain makes several eye movements (saccades) to salient parts of the scene; human vision perceives 3D • A simple cell responds to a specific spatial frequency of brightness, in a specific direction, at a specific location --- Gabor-like functions
  • 203. Receptive field Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.
  • 207. CNN architectures and algorithms
  • 208. CNN architectures • The conventional structure is a linear list of layers, feedforward • More generally a DAG, a directed acyclic graph • ResNet simply adds the input back (skip connections) • Different terminology: complex layer and simple layer – A complex (complete) convolutional layer includes different stages such as the convolution per se, batch normalization, nonlinearity, and pooling – Alternatively, each stage is called a layer, even when it has no parameters • Traditional CNNs are just a few complex convolutional layers to extract features, followed by a softmax classification output layer • Convolutional networks can also output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task; the output is then a tensor – e.g. S_{i,j,k} is the probability that pixel (j, k) belongs to class i
  • 209. The popular CNN • LeNet, 1998 • AlexNet, 2012 • VGGNet, 2014 • ResNet, 2015
  • 210. VGGNet • 16 layers • Only 3×3 convolutions • 138 million parameters
  • 212. Computational complexity • The memory bottleneck • GPU, a few GB
  • 213. Stochastic Gradient Descent • Gradient descent follows the gradient of the entire training set downhill • Stochastic gradient descent follows the gradient of randomly selected minibatches downhill
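A minimal NumPy sketch of the difference, using minibatch SGD on a toy linear-regression problem; the data, learning rate, and batch size are arbitrary.

```python
# Minibatch SGD: each step follows the gradient of a randomly selected minibatch
# rather than the gradient of the full training set.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.randn(1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # MSE gradient
    w -= lr * grad                                            # descend
print(w.round(2))   # should be close to [ 2.  -1.   0.5]
```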
  • 214. The dropout regularization • Randomly shut down a subset of units during training • It yields a sparse representation • It is a different net each time, but all nets share the parameters • A net with n units can be seen as a collection of 2^n possible thinned nets, all of which share weights • At test time, it is a single net with averaging • Helps avoid over-fitting
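A sketch of (inverted) dropout on a layer's activations, assuming NumPy; the keep probability is an arbitrary example.

```python
# Inverted dropout: randomly zero a fraction of activations during training and
# rescale the survivors, so no change is needed at test time (the "averaged" net).
import numpy as np

def dropout(activations, keep_prob=0.8, training=True, rng=np.random):
    if not training:
        return activations                      # test time: use the full network
    mask = (rng.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask                   # drop and rescale the survivors

h = np.ones((2, 5))
print(dropout(h, keep_prob=0.8))                # some entries zeroed, the rest 1.25
```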
  • 215. Smaller Network: CNN • We know it is good to learn a small model. • From this fully connected model, do we really need all the edges? • Can some of these be shared?
  • 216. Consider learning an image: • Some patterns are much smaller than the whole image “beak” detector Can represent a small region with fewer parameters
  • 217. Same pattern appears in different places: they can be compressed! What about training a lot of such "small" detectors, where each detector must "move around"? An "upper-left beak" detector and a "middle beak" detector can be compressed to the same parameters.
  • 218. A convolutional layer • A CNN is a neural network with some convolutional layers (and some other layers). • A convolutional layer has a number of filters that each perform a convolution operation (e.g. a "beak detector" filter).
  • 219. Convolution • 6 x 6 image: [1 0 0 0 0 1; 0 1 0 0 1 0; 0 0 1 1 0 0; 1 0 0 0 1 0; 0 1 0 0 1 0; 0 0 1 0 1 0] • Filter 1: [1 -1 -1; -1 1 -1; -1 -1 1] • Filter 2: [-1 1 -1; -1 1 -1; -1 1 -1] • These are the network parameters to be learned. Each filter detects a small pattern (3 x 3).
  • 220. Convolution • Place Filter 1 on the upper-left 3 x 3 patch of the 6 x 6 image and take the dot product: the result is 3. Sliding the filter one column to the right (stride = 1) gives -1.
  • 221. Convolution • If stride = 2, the filter jumps two columns at a time: the first two dot products with Filter 1 are 3 and -3.
  • 222. Convolution • Sliding Filter 1 over the whole 6 x 6 image with stride = 1 gives a 4 x 4 output: [3 -1 -3 -1; -3 1 0 -3; -3 -3 0 1; 3 -2 -2 -1]
  • 223. Convolution • Repeat this for each filter. Filter 2 (stride = 1) gives a second 4 x 4 map: [-1 -1 -1 -1; -1 -1 -2 1; -1 -1 -2 1; -1 0 -4 3]. The two maps together form a 2 x 4 x 4 feature map.
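The 4 x 4 maps above can be reproduced directly; a SciPy cross-correlation (which is what the slides compute, i.e. no kernel flip) over the 6 x 6 image gives the same numbers.

```python
# Reproduce the slides' stride-1 result for Filter 1 on the 6x6 image.
import numpy as np
from scipy.signal import correlate2d   # CNN-style convolution = cross-correlation

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

feature_map = correlate2d(image, filter1, mode='valid')   # 4x4 output
print(feature_map)
# [[ 3 -1 -3 -1]
#  [-3  1  0 -3]
#  [-3 -3  0  1]
#  [ 3 -2 -2 -1]]
# A stride of 2 simply keeps every other row and column: feature_map[::2, ::2]
```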
  • 224. Color image: RGB, 3 channels • For a color image, the same 6 x 6 grid is stacked three times (one slice per channel), and each 3 x 3 filter also has three channel slices (3 x 3 x 3): Filter 1 and Filter 2 each see all three channels.
  • 225. Convolution vs. fully connected • If the 6 x 6 image is flattened into a 36-dimensional vector and fed to a fully connected layer, every output unit connects to all 36 inputs; a convolutional filter connects each output to only a small patch.
  • 226. Only connect to 9 inputs, not fully connected • The output value 3 (Filter 1 on the upper-left patch) is connected to only 9 of the 36 flattened inputs (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15): fewer parameters!
  • 227. Shared weights • The output values 3 and -1 use the same 9 filter weights on different patches of the 6 x 6 image: even fewer parameters.
  • 228. The whole CNN • Input → Convolution → Max Pooling → Convolution → Max Pooling (these two steps can repeat many times) → Flattened → Fully Connected Feedforward network → output.
  • 229. Max Pooling • Max pooling is applied to the two 4 x 4 feature maps produced by Filter 1 and Filter 2.
  • 230. Why Pooling • Subsampling pixels will not change the object: a subsampled bird is still a bird. • We can subsample the pixels to make the image smaller, so fewer parameters are needed to characterize the image.
  • 231. A CNN compresses a fully connected network in two ways: • Reducing number of connections • Shared weights on the edges • Max pooling further reduces the complexity
  • 232. Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after max-pooling layer would be a feature map containing the most prominent features of the previous feature map 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 3 0 1 3 -1 1 3 0 2 x 2 image Each filter is a channel New image but smaller Conv Max Pooling
  • 236. The whole CNN • Each Convolution → Max Pooling block (which can repeat many times) produces a new image: smaller than the original, with the number of channels equal to the number of filters.
  • 239. The whole CNN • The repeated Convolution → Max Pooling blocks produce a new, smaller image each time; the final one is Flattened and fed to the Fully Connected Feedforward network.
  • 240. Flattening 3 0 1 3 -1 1 3 0 Flattened 3 0 1 3 -1 1 0 3 Fully Connected Feedforward network
  • 241. CNN in Keras • Only the network structure and input format change (vector -> 3-D tensor). • The first convolution has 25 3x3 filters. • Input_shape = (28, 28, 1): 28 x 28 pixels, last dimension 1 for black/white or 3 for RGB.
  • 242. CNN in Keras • The shapes through the network: input 1 x 28 x 28 → Convolution → 25 x 26 x 26 → Max Pooling → 25 x 13 x 13 → Convolution → 50 x 11 x 11 → Max Pooling → 50 x 5 x 5. • How many parameters for each filter in the first convolution? 9 (a 3x3 filter on 1 channel). In the second convolution? 225 = 25 x 9 (a 3x3 filter across 25 input channels).
  • 243. CNN in Keras • Input 1 x 28 x 28 → Convolution → 25 x 26 x 26 → Max Pooling → 25 x 13 x 13 → Convolution → 50 x 11 x 11 → Max Pooling → 50 x 5 x 5 → Flattened (1250) → Fully connected feedforward network → Output.
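A minimal tf.keras sketch that matches these shapes; the ReLU activations and the 10-way softmax output are assumptions, since the slides do not show them.

```python
# CNN matching the slides' shapes: (28,28,1) -> 25@26x26 -> 25@13x13
# -> 50@11x11 -> 50@5x5 -> flatten 1250 -> fully connected output.
# Assumes TensorFlow 2.x / tf.keras.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),               # 28x28 grayscale input
    layers.Conv2D(25, (3, 3), activation='relu'),  # 25 filters, 3x3 -> 26x26x25
    layers.MaxPooling2D((2, 2)),                   # -> 13x13x25
    layers.Conv2D(50, (3, 3), activation='relu'),  # 50 filters, 3x3x25 -> 11x11x50
    layers.MaxPooling2D((2, 2)),                   # -> 5x5x50
    layers.Flatten(),                              # -> 1250
    layers.Dense(10, activation='softmax'),        # 10-class output (assumed)
])
model.summary()   # with biases: 25*(3*3*1+1)=250 and 50*(3*3*25+1)=11300 conv params
```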
  • 244. AlphaGo • Input: a 19 x 19 matrix of board positions (Black: 1, White: -1, none: 0); output: the next move (19 x 19 positions). • A fully-connected feedforward network can be used, but a CNN performs much better.
  • 245. AlphaGo’s policy network • Note: AlphaGo does not use Max Pooling. The following is a quotation from their Nature article:
  • 246. CNN in speech recognition • The spectrogram (time x frequency) is treated as an image and fed to a CNN; the filters move in the frequency direction.
  • 247. CNN in text classification • Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf