Hands on Machine Learning with Scikit-Learn
Presented By:
Ahmed Yousry
Agenda
❑Introduction to Artificial Neural Networks.
❑Training Deep Neural Nets.
❑Convolutional Neural Networks.
❑Recurrent Neural Network.
❑Reinforcement Learning.
Introduction to ANN
• First introduced in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts.
• Early successes of ANNs continued until the 1960s.
• In the early 1980s there was a revival of interest in ANNs as new network architectures were invented.
• By the 1990s, powerful alternative Machine Learning techniques had overshadowed ANNs.
Reasons why ANNs now have a much more profound impact
❑There is now a huge quantity of data.
❑The tremendous increase in computing power.
❑The training algorithms have been improved.
❑Theoretical limitations of ANNs have turned
out to be benign.
❑A virtuous circle of funding, progress, and products.
Biological Neurons
ANN simulation
The Perceptron
• One of the simplest ANN architectures, invented in
1957 by Frank Rosenblatt.
• It is based on a linear threshold unit (LTU).
Z = w1 x1 + w2 x2 + ⋯ + wn xn = wᵀ · x
hw(x) = step(Z) = step(wᵀ · x)
Multioutput perceptron
❑ A Perceptron with two inputs and three outputs.
Note: a Perceptron has no hidden layers.
Training Algorithm
While epoch produces an error
Present network with next inputs from epoch
Err = T – O
If Err <> 0 then
Wj new = Wj old + LR * Ij * Err
End If
End While
• T: target (actual) output, O: predicted output
• LR: learning rate, I: input
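A minimal NumPy sketch of this update rule. The toy AND dataset and variable names below are illustrative assumptions, not part of the slides:

import numpy as np

def step(z):
    return (z >= 0).astype(int)

# Illustrative toy dataset: logical AND (assumption, not from the slides)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])           # target (actual) outputs
w = np.zeros(2)
b = 0.0
LR = 0.1                             # learning rate

for epoch in range(20):
    errors = 0
    for x, t in zip(X, T):
        o = step(w @ x + b)          # predicted output O
        err = t - o                  # Err = T - O
        if err != 0:
            w += LR * err * x        # Wj_new = Wj_old + LR * Ij * Err
            b += LR * err
            errors += 1
    if errors == 0:                  # stop once an epoch produces no error
        break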
XOR classification problem and an MLP
that solves it
XOR Function
X1 XOR X2 = (X1 AND NOT X2) OR (X2 AND NOT X1)
[Diagram: a two-layer network with inputs X1 and X2, hidden units Z1 and Z2, and output Y; connection weights of 2 and thresholds of −1 implement the XOR function.]
Multi-Layer Perceptron and
Backpropagation
• An MLP is composed of one input layer, one or more
layers of LTUs, called hidden layers, and one final
output layer
• When an ANN has two
or more hidden layers, it is called
a deep neural network (DNN).
A modern MLP (including ReLU and
softmax) for classification
Deep learning Problems
• The vanishing gradients problem (or the related exploding gradients problem) makes lower layers very hard to train.
• Second, with such a large network, training
would be extremely slow.
• Third, a model with millions of parameters
would severely risk overfitting the training
set.
Gradients problems
• Gradients often get smaller as the algorithm
progresses down to the lower layers.
• As a result, the Gradient Descent update leaves the lower-layer weights virtually unchanged, and training never converges to a good solution.
• This is called the vanishing gradients problem.
• The gradients can grow bigger and bigger, so many
layers get insanely large weight updates and the
algorithm diverges. This is the exploding gradients
problem
Solving the first problem (vanishing gradients)
A paper titled “Understanding the Difficulty of Training Deep Feedforward Neural Networks” by Xavier Glorot and Yoshua Bengio identified the main causes:
1. The popular logistic sigmoid activation function.
2. Weight initialization using a normal distribution with a mean of 0 and a standard deviation of 1.
3. The hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in DNNs.
Sigmoid activation function
You can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0.
σ(x) = 1 / (1 + e^−x)
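A quick sketch that shows this saturation numerically (the sample input values are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(x, sigmoid(x), sigmoid_derivative(x))
# At x = ±10 the derivative is about 4.5e-5: almost no gradient flows back.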
The problem of ReLU (max(0, z))
• It suffers from a problem known as dying ReLUs: during training, some neurons effectively die.
• They stop outputting anything other than 0.
• In some cases, you may find that half of your network’s neurons are dead during training.
• To solve this problem, you may want to use a
variant of the ReLU function, such as the
leaky ReLU.
Leaky ReLU and its variants
• Leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak).
• They also evaluated the randomized leaky ReLU (RReLU).
• They also evaluated the parametric leaky ReLU (PReLU).
Exponential linear unit (ELU)
• Outperformed all the ReLU variants in their
experiments: training time was reduced and the
neural network performed better on the test set.
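The activations just discussed can be written in a few lines of NumPy; this is a minimal sketch, with the α defaults taken as illustrative values:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for negative inputs keeps neurons from dying
    return np.where(z < 0, alpha * z, z)

def elu(z, alpha=1.0):
    # smooth exponential curve below zero
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)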
Batch Normalization
• The technique consists of adding an operation in the model just before
the activation function of each layer.
• Simply zero-centering and normalizing the inputs, then scaling and
shifting the result using two new parameters per layer (one for scaling,
the other for shifting).
• In other words, this operation lets the model learn the optimal scale and
mean of the inputs for each layer.
• γ is the scaling parameter for the layer.
• β is the shifting parameter (offset) for the
layer.
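A sketch of the per-feature computation at training time; the variable names and the ε smoothing term are assumptions consistent with the description above, not code from the slides:

import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # X: mini-batch of layer inputs, shape (batch_size, n_features)
    mu = X.mean(axis=0)                      # mini-batch mean
    var = X.var(axis=0)                      # mini-batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)    # zero-center and normalize
    return gamma * X_hat + beta              # scale (gamma) and shift (beta)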
Activation functions
Reusing Pretrained Layers
• It is generally not a good idea to train a very
large DNN from scratch.
• Try to find an existing neural network that
accomplishes a similar task.
• Reuse the lower layers of this network.
• This is called transfer learning.
Example
• DNN that was trained to
classify pictures into 100
different categories.
• You now want to train a
DNN to classify specific
types of vehicles.
• Freezing the lower layers’ weights.
• Tweaking, dropping, or replacing the upper layers.
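A hedged Keras-style sketch of this idea, assuming the original model is a simple Sequential one; the file name and the 10-class vehicle head are illustrative assumptions:

import tensorflow as tf

base = tf.keras.models.load_model("pictures_100_classes.h5")   # assumed file name

# Reuse all but the original output layer, and freeze the lower layers
new_model = tf.keras.Sequential(base.layers[:-1])
for layer in new_model.layers:
    layer.trainable = False            # freeze lower-layer weights

new_model.add(tf.keras.layers.Dense(10, activation="softmax"))  # e.g. 10 vehicle types
new_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")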
Understanding AlexNet
Consists of 5 Convolutional Layers and 3 Fully Connected Layers (classify 1000 classes)
Faster Optimizers
• Five ways to speed up training (and reach a better
solution):
➢ Applying a good initialization strategy for the connection weights.
➢ Using a good activation function.
➢ Using Batch Normalization.
➢ Reusing parts of a pretrained network.
➢ Using a faster optimizer than the regular Gradient Descent optimizer.
• The most popular ones are Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.
Momentum Optimization Algorithm
• Gradient Descent simply updates the weights θ by directly subtracting
the gradient of the cost function J(θ) with regards to the weights
(∇θJ(θ)) multiplied by the learning rate η (equation 1)
• Momentum optimization cares a great deal about what previous
gradients were.
• It updates the weights by simply subtracting this momentum vector.
• A new hyperparameter β, simply called the momentum, which must be
set between 0 and 1, typically 0.9. (equation 2)
Gradient Descent (1): θ ← θ − η ∇θJ(θ)
Momentum Optimization (2): m ← βm + η ∇θJ(θ); θ ← θ − m
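A minimal NumPy sketch of one momentum update under these equations; grad is assumed to hold the gradient of the cost with respect to theta:

import numpy as np

def momentum_step(theta, m, grad, lr=0.01, beta=0.9):
    # m accumulates an exponentially weighted sum of past gradients
    m = beta * m + lr * grad
    theta = theta - m          # update the weights by subtracting the momentum vector
    return theta, m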
Nesterov Momentum optimization
▪ The only difference from vanilla Momentum
optimization is that the gradient is measured
at θ + βm rather than at θ.
▪ This small tweak works because in general the
momentum vector will be pointing in the
right direction
▪ where ∇1 represents the gradient of the cost
function measured at the starting point θ,
and ∇2 represents the gradient at the point
located at θ + βm.
RMSProp Optimization
• Accumulating only the gradients from the most recent iterations (as
opposed to all the gradients since the beginning of training).
• It does so by using exponential decay in the first step.
• Generally performs better than Momentum optimization and Nesterov Accelerated Gradient.
• In fact, it was the preferred optimization algorithm of many
researchers until Adam optimization came around.
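A sketch of one RMSProp update step; the hyperparameter defaults are illustrative values, and s is the running average of squared gradients described above:

import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-10):
    # s: exponentially decaying average of squared gradients (recent iterations dominate)
    s = beta * s + (1.0 - beta) * grad ** 2
    theta = theta - lr * grad / np.sqrt(s + eps)
    return theta, s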
Adam Optimization
• Stands for adaptive moment estimation.
• Combines the ideas of Momentum optimization and RMSProp.
• Steps 3 and 4 are somewhat of a technical detail: since m and s
are initialized at 0, they will be biased toward 0 at the beginning of
training, so these two steps will help boost m and s at the
beginning of training.
Typical initialization: β1 = 0.9, β2 = 0.999, η = 0.001.
The smoothing term ϵ is initialized to a tiny number such as 10^−8 to avoid division by 0.
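A sketch of one Adam step following the standard formulation (t is assumed to be the iteration number, starting at 1, so the bias-correction steps 3 and 4 boost m and s early in training):

import numpy as np

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad            # 1st moment (momentum-like)
    s = beta2 * s + (1.0 - beta2) * grad ** 2       # 2nd moment (RMSProp-like)
    m_hat = m / (1.0 - beta1 ** t)                  # step 3: bias correction
    s_hat = s / (1.0 - beta2 ** t)                  # step 4: bias correction
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s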
Difference between optimizers
Learning rate curves
Learning rate techniques
❑ Predetermined piecewise
constant learning rate For example, set the learning rate to η0 = 0.1 at first,
then to η1 = 0.001 after 50 epochs.
❑ Performance scheduling
Measure the validation error every N steps (just like for early stopping) and
reduce the learning rate by a factor of λ when the error stops dropping.
❑ Exponential scheduling
Set the learning rate to a function of the iteration number t: η(t) = η0 · 10^(−t/r).
This works great, but it requires tuning η0 and r. The learning rate will drop by a factor of 10 every r steps.
❑ Power scheduling
Set the learning rate to η(t) = η0 (1 + t/r)^(−c). The hyperparameter c is typically set to 1.
This is similar to exponential scheduling, but the learning rate drops much more slowly (both are sketched in code below).
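A small sketch of the exponential and power schedules described above; the η0, r, and c defaults are illustrative:

def exponential_schedule(t, eta0=0.1, r=1000):
    # drops by a factor of 10 every r steps
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=1000, c=1):
    # drops more slowly than exponential scheduling
    return eta0 / (1 + t / r) ** c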
Dropout
❑ It is a fairly simple algorithm: at every training step, every neuron
(including the input neurons but excluding the output neurons) has
a probability p of being temporarily “dropped out,” meaning it will
be entirely ignored during this training step, but it may be active
during the next step
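A sketch of the training-time mask, using the common "inverted dropout" rescaling (an implementation detail not covered on the slide):

import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                             # all neurons active at test time
    keep = (np.random.rand(*activations.shape) >= p)   # each neuron dropped with probability p
    return activations * keep / (1.0 - p)              # rescale so the expected activation is unchanged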
Data Augmentation
❑ Consists of generating new training instances from existing ones (by rotating, resizing, flipping, and cropping), artificially boosting the size of the training set.
❑ This will reduce overfitting, making this a regularization technique.
The trick is to generate realistic training instances.
Convolutional Neural Networks
❑ A convolutional neural network (or ConvNet) is a type of feed-forward
artificial neural network.
❑ The architecture of a ConvNet is designed to take advantage of the 2D
structure of an input image.
❑ A ConvNet is comprised of one or more convolutional layers (often
with a pooling step) and then followed by one or more fully connected
layers as in a standard multilayer neural network.
How CNN works
• For example, a ConvNet takes the input as an image which
can be classified as ‘X’ or ‘O’
ConvNet Layers
▪CONV layer will compute the output of neurons that are connected
to local regions in the input, each computing a dot product between
their weights and a small region they are connected to in the input
volume.
▪RELU layer will apply an elementwise activation function, such as
the max(0,x) thresholding at zero. This leaves the size of the
volume unchanged.
▪POOL layer will perform a down sampling operation along the
spatial dimensions (width, height).
▪FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1xN], where each of the N numbers corresponds to the score of one of the N categories.
Convolutional Layer - Filters
▪ The CONV layer’s parameters consist of a set of learnable
filters.
▪ Every filter is small spatially (along width and height), but
extends through the full depth of the input volume.
▪ During the forward pass, we slide (more precisely, convolve)
each filter across the width and height of the input volume and
compute dot products between the entries of the filter and the
input at any position.
Convolutional Layer - Filters
• Sliding the filter over the width and height of the input gives a 2-dimensional activation map that responds to that filter at every spatial position.
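A naive sketch of that sliding computation for a single-channel input and one filter (as in most deep learning libraries, this is technically cross-correlation rather than a flipped convolution):

import numpy as np

def conv2d(image, kernel):
    # "valid" convolution: the kernel stays entirely inside the image
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the filter and the local input region
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out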
Convolutional Layer – Filters –Example
Convolutional Layer – Filters – Computation Example
Convolutional Layer – Filters – Output Feature Map
Relu Layer
Pool Layer
▪ The pooling layers down-sample the previous layer’s feature maps.
▪ Its function is to progressively reduce the spatial size of the
representation to reduce the amount of parameters and
computation in the network
▪ The pooling layer often uses the Max operation to perform
the down sampling process.
Pooling Filter example Size = 2 X 2, Stride = 2
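A minimal max-pooling sketch for the 2 × 2, stride-2 case shown above (single-channel input assumed, with even height and width):

import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            # keep only the maximum value of each 2x2 window
            out[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
    return out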
Fully connected layer
❑ Fully connected layers are the normal
flat feed-forward neural network
layers.
❑ These layers may have a non-linear
activation function or a softmax
activation in order to predict classes.
❑ To compute our output, we simply rearrange the output matrices as a 1-D array.
SoftMax operation
❑ A special kind of activation layer, usually placed after the final FC layer.
❑ Can be viewed as a fancy normalizer
(a.k.a. Normalized exponential
function)
❑ Produce a discrete probability
distribution vector
❑ Very convenient when combined
with cross-entropy loss
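Softmax in a few lines; the max-subtraction is a standard numerical-stability trick not mentioned on the slide:

import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp / exp.sum()                # a discrete probability distribution over classes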
Recurrent Neural Network
❑ Some problems require previous history/context in order to be able to give proper output (speech recognition, stock forecasting, target tracking, etc.).
❑ One way to do that is to just provide all the necessary context in one "snap-shot" and use standard learning methods.
➢ How big should the snap-shot be? It varies for different instances of the problem.
✓ If the input sequences are of fixed length, or can be easily padded to a fixed length, they can be collapsed into a single input vector and fed to any of the standard pattern classification algorithms.
Sequential data
❑ There are many tasks that require learning a temporal sequence
of events
❑ These problems can be broken into 3 distinct types of tasks
➢ Sequence Recognition: Produce a particular output pattern
when a specific input sequence is seen. Applications:
Sentiment Analysis, handwriting recognition
➢ Sequence Reproduction: Generate the rest of a sequence
when the network sees only part of the sequence.
Applications: Time series prediction (stock market, sun spots,
etc), language model.
➢ Temporal Association: Produce a particular output sequence
in response to a specific input sequence. Applications:
machine translation, speech generation
✓ Recurrent networks are flexible enough to solve these problems.
Recurrent Networks offer a lot of flexibility:
(1) Fixed-sized input to fixed-sized output (e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
Recurrent Neural Networks
❑ Recurrent neural network lets the
network dynamically learn how much
context it needs in order to solve the
problem.
❑ RNN is a multilayer NN with the previous
set of hidden unit activations feeding
back into the network along with the
inputs.
❑ RNNs have a “memory” which captures
information about what has been
calculated so far.
Recurrent neural networks
❑ Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them.
❑ It means local connections are shared (same weights) across different
temporal instances of the hidden units.
❑ If we have to define a different function Gt for each possible sequence
length, each with its own parameters, we would not get any
generalization to sequences of a size not seen in the training set.
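A sketch of a vanilla RNN forward pass that makes the parameter sharing explicit: the same W, U, and b are applied at every time step (shapes and names are my own assumptions):

import numpy as np

def rnn_forward(x_seq, W, U, b, s0):
    # x_seq: list of input vectors; the SAME W, U, b are reused at every time step
    s = s0
    states = []
    for x_t in x_seq:
        s = np.tanh(W @ s + U @ x_t + b)   # previous hidden state feeds back in with the input
        states.append(s)
    return states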
Dynamic systems
❑ A means of describing how one state develops into another state
over the course of time.
❑ Consider the classical form of a dynamical system: st = fθ(st−1)
✓ Where st is the system state at time t and fθ is a mapping function with parameters θ.
❑ The same parameters (the same function fθ) are used for all time steps.
❑ The unfolded flow graph of such a system is:
Dynamic systems
❑ Now consider a dynamical system driven by an external signal xt: st = fθ(st−1, xt)
❑ The state st now contains information about the whole past sequence.
Recurrent Neural Networks
Cost function
❑ The total loss for a given input/target sequence pair (x, y), measured in cross entropy:
L(y, ŷ) = Σt Lt = −Σt yt log ŷt
• where yt is the category that should be associated with time step t in the output sequence, and ŷt is the predicted output.
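The total loss as a short sketch, assuming y_true holds one-hot target vectors per time step and y_pred holds the corresponding softmax outputs:

import numpy as np

def sequence_cross_entropy(y_true, y_pred):
    # y_true, y_pred: arrays of shape (T, n_classes); sum of the per-step losses Lt
    return -np.sum(y_true * np.log(y_pred))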
Computing the gradient in RNN
Using the generalized back-propagation one can obtain the so-
called Back-propagation Through Time (BPTT) algorithm.
We can then iterate backwards in time to back-propagate
gradients through time, from t = T − 1 down to t = 1,
noting that st (for t < T) has as descendants both ot and
st+1
Exploding or vanishing gradient
❑ In recurrent nets (also in very deep nets), the final output is the
composition of a large number of non-linear transformations.
❑ Even if each of these non-linear transformations is smooth. Their
composition might not be.
❑ The derivative (i.e. Jacobian matrix) through the whole composition
will tend to be either very small or very large.
❑ Example: suppose all the numbers in the product are scalars with the same value α. As the number of multiplications T goes to ∞, α^T → ∞ if α > 1 and α^T → 0 if α < 1.
Gradient clipping
❑ Once the gradient value grows extremely large, it causes an overflow
(i.e. NaN) which is easily detectable at runtime.
❑ A simple heuristic solution is to clip the gradients whenever they explode: whenever they reach a certain threshold, they are scaled back to a smaller value, as sketched below.
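A sketch of norm-based clipping; the threshold value is illustrative and this does not reproduce the slides' exact algorithm figure:

import numpy as np

def clip_gradient(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # rescale the gradient back to the threshold norm
    return grad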
Error surface of a single hidden unit RNN
Facing the vanishing gradient problem
❑ Echo State Networks (ESN)
❑ Long delays
❑ Leaky Units
❑ Gated Recurrent Neural Networks
Echo State Networks (ESN)
❑ How do we set the input and recurrent weights so that a rich set of
histories can be represented in the recurrent neural network state?
❑ Answer: make the dynamical system associated with the recurrent net nearly be on the edge of stability, i.e., more precisely with values around 1 for the leading eigenvalue of the Jacobian of the state-to-state transition function.
❑ ESNs fix the input→hidden and hidden→hidden weights at carefully chosen random values that make the Jacobians slightly contractive. This is achieved by making the leading eigenvalue λ of the weight matrix large but slightly less than 1.
❑ ESNs only learn the hidden→output connections.
Skip Connections (Long delays)
❑ Adding longer-delay connections allows past states to be connected to future states through short paths.
❑ If we only have a connection at every time step, the gradients vanish or explode over T time steps as O(λ^T).
❑ Instead, if we have recurrent connections with a time delay of D, gradients shrink or grow as O(λ^(T/D)), which mitigates vanishing, although exploding may still occur over T steps.
❑ Because the number of effective steps is T/D, this allows the learning algorithm to capture longer dependencies.
Gated Recurrent Neural Networks
❑ GRNNs are a special kind of RNN, capable of learning long-term
dependencies by having more persistent memory. Two popular
architectures:
➢ Long short-term memory (LSTM) [Hochreiter and Schmidhuber,
1997].
➢ Gated recurrent unit (GRU), [Cho et al., 2014]
❑ Applications: handwriting recognition (Graves et al., 2009), speech
recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting
generation (Graves, 2013), machine translation (Sutskever et al., 2014a),
image to text conversion (captioning) (Kiros et al., 2014b; Vinyals et al.,
2014b; Xu et al., 2015b) and parsing (Vinyals et al., 2014a).
Long Short-Term Memory (LSTM)
❑ Standard RNNs have a very
simple repeating module
structure, such as a single tanh
layer.
❑ LSTMs also have this chain like
structure, but the repeating
module has a different
structure. Instead of having a
single neural network layer,
there are four, interacting in a
very special way.
Generate image caption
❑ Vinyals et al., Show and Tell: A Neural Image Caption Generator, arXiv 2014
❑ Use a CNN as an image encoder and transform it to a fixed-length
vector
❑ It is used as the initial hidden state of a “decoder” RNN that generates
the target sequence
Translate videos to sentences
❑ Venugopalan et al. arXiv 2014
❑ The challenge is to capture the joint dependencies of a sequence of
frames and a corresponding sequence of words
Reinforcement Learning
❑ One of the most exciting fields of Machine Learning today,
and also one of the oldest.
❑ It has been around since the 1950s, producing many
interesting applications over the years in particular in
games (e.g., TD-Gammon, a Backgammon playing program).
❑ Revolution took place in 2013 when researchers from an
English startup called DeepMind demonstrated a system
that could learn to play just about any Atari game from
scratch.
❑ DeepMind was bought by Google for over 500 million
dollars in 2014.
Learning to Optimize Rewards
❑ In Reinforcement Learning, a software agent makes
observations and takes actions within an environment, and
in return it receives rewards.
❑ Its objective is to learn to act in a way that will maximize its
expected long-term rewards.
❑ The agent acts in the environment and learns by trial and
error to maximize its pleasure and minimize its pain.
Examples of RL agents
• (a) walking robot, (b) Ms. Pac-Man, (c) Go player, (d) thermostat, (e) automatic trader
Policy Search
❑ The algorithm used by the software agent to determine its actions is called its policy.
❑ For example, the policy could be a neural network taking
observations as inputs and outputting the action to take
Stochastic policy
❑ The policy can be any algorithm you can think of, and it
does not even have to be deterministic.
❑ For example, consider a robotic vacuum cleaner whose
reward is the amount of dust it picks up in 30 minutes. Its
policy could be to move forward with some probability p
every second, or randomly rotate left or right with
probability 1 – p.
❑ The rotation angle would be a random angle between –r
and +r. Since this policy involves some randomness, it is
called a stochastic policy.
Introduction to OpenAI Gym
❑ One of the challenges of Reinforcement Learning is that in
order to train an agent, you first need to have a working
environment.
❑ If you want to program an agent that will learn to play an
Atari game, you will need an Atari game simulator.
❑ If you want to program a walking robot, then the
environment is the real world and you can directly train
your robot in that environment.
Example of environment
❑ CartPole environment . This is a 2D simulation in which a
cart can be accelerated left or right in order to balance a
pole placed on top of it
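A minimal loop for this environment with a random policy; this assumes the classic gym interface where reset returns the observation and step returns (obs, reward, done, info):

import gym

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()         # random action: 0 = push left, 1 = push right
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)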
Neural Network Policies
❑ In the case of the CartPole
environment, there are just two
possible actions (left or right)
❑ For example, if it outputs 0.7,
then we will pick action 0 with
70% probability, and action 1 with
30% probability.
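A sketch of sampling from such a stochastic policy, assuming p_left is the probability the network outputs for action 0:

import numpy as np

def sample_action(p_left):
    # e.g. p_left = 0.7 -> action 0 with 70% probability, action 1 with 30%
    return 0 if np.random.rand() < p_left else 1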
Markov Decision Processes
❑ In the early 20th century, the mathematician Andrey
Markov studied stochastic processes with no memory,
called Markov chains.
❑ Such a process has a fixed number of states, and it
randomly evolves from one state to another at each step.
❑ The probability for it to evolve from a state s to a state s′ is
fixed, and it depends only on the pair (s,s′), not on past
states (the system has no memory).
❑ Markov chains can have very different dynamics, and they
are heavily used in thermodynamics, chemistry, statistics,
and much more.
MDP Example
❑ Suppose that the process starts in
state s0, and there is a 70% chance
that it will remain in that state at the
next step.
❑ Eventually it is bound to leave that
state and never come back since no
other state points back to s0.
❑ If it goes to state s1, it will then most
likely go to state s2 (90% probability),
then immediately back to state s1
(with 100% probability).
Another Example
Example: Grid World
❑ Noisy movement: actions do not always go as planned
❑ 80% of the time, the action North takes the agent North
(if there is no wall there)
❑ 10% of the time, North takes the agent West; 10% East
❑ If there is a wall in the direction the agent would have been
taken, the agent stays put.
❑ The agent receives rewards each time step
▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)
❑ Goal: maximize sum of rewards
Grid World Actions: [Diagram comparing the action outcomes in the deterministic Grid World and the stochastic Grid World.]
Markov Decision Processes
❑ An MDP is defined by:
▪ A set of states s ∈ S
▪ A set of actions a ∈ A
▪ A transition function T(s, a, s’)
▪ Probability that a from s leads to s’, i.e., P(s’|
s, a)
▪ Also called the model or the dynamics
▪ A reward function R(s, a, s’)
▪ Sometimes just R(s) or R(s’)
▪ A start state
▪ Maybe a terminal state
What is Markov about MDPs?
❑ “Markov” generally means that given the present state, the future
and the past are independent
❑ For Markov decision processes, “Markov” means action outcomes
depend only on the current state
❑ This is just like search, where the successor function could only
depend on the current state (not the history)
Andrey Markov
(1856-1922)
Markov Property: P(St+1 | S0, S1, …, St−1, St) = P(St+1 | St)
Policies
❑ In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of
actions, from start to a goal
❑ For MDPs, we want an optimal policy π*: S → A
▪ A policy π gives an action for each state
▪ An optimal policy is one that maximizes
expected utility if followed
▪ An explicit policy defines a reflex agent
Optimal policy when
R(s, a, s’) = -0.03 for all
non-terminals s
Optimal Policies
[Diagram: optimal policies for different living rewards R(s) = −0.01, R(s) = −0.03, R(s) = −0.4, and R(s) = −2.0.]
Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ More or less?
▪ Now or later?
[1, 2, 2] or [2, 3, 4]?
[0, 0, 1] or [1, 0, 0]?
Discounting
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later
▪ One solution: values of rewards decay exponentially
Worth now: 1. Worth next step: γ. Worth in two steps: γ².
Discounting
▪ How to discount?
▪ Each time we descend a level, we
multiply in the discount once
▪ Why discount?
▪ Sooner rewards probably do have higher
utility than later rewards
▪ Also helps our algorithms converge
▪ Example: discount of 0.5
▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
▪ U([1,2,3]) < U([3,2,1])
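The same calculation as a short helper, using the example's discount γ = 0.5:

def discounted_utility(rewards, gamma=0.5):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3]))   # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1]))   # 3*1 + 0.5*2 + 0.25*1 = 4.25, so U([1,2,3]) < U([3,2,1])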
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite
rewards?
▪ Solutions:
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g. life)
▪ Gives nonstationary policies (π depends on time left)
▪ Discounting: use 0 < γ < 1
▪ Smaller γ means smaller “horizon” – shorter term focus
▪ Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached
THANKS
QUESTIONS?

More Related Content

What's hot

Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hakky St
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
Mohit Rajput
 
Supervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSupervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine Learning
Spotle.ai
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
Pınar Yahşi
 
Machine Learning - Splitting Datasets
Machine Learning - Splitting DatasetsMachine Learning - Splitting Datasets
Machine Learning - Splitting Datasets
Andrew Ferlitsch
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
Milind Gokhale
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
Prof. Neeta Awasthy
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
Knoldus Inc.
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
Jon Lederman
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
Rupak Roy
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine Learning
Upekha Vandebona
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
Kien Le
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
Andrew Ferlitsch
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
Student
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
Abhimanyu Dwivedi
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
Md. Ariful Hoque
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 

What's hot (20)

Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Supervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSupervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine Learning
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
 
Machine Learning - Splitting Datasets
Machine Learning - Splitting DatasetsMachine Learning - Splitting Datasets
Machine Learning - Splitting Datasets
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine Learning
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 

Similar to Hands on machine learning with scikit-learn and tensor flow by ahmed yousry

ML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptx
DebabrataPain1
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
Akash Goel
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Simplilearn
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
Value Amplify Consulting
 
Machine learning Module-2, 6th Semester Elective
Machine learning Module-2, 6th Semester ElectiveMachine learning Module-2, 6th Semester Elective
Machine learning Module-2, 6th Semester Elective
MayuraD1
 
08 neural networks
08 neural networks08 neural networks
08 neural networks
ankit_ppt
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
Junaid Bhat
 
Lec 6-bp
Lec 6-bpLec 6-bp
Lec 6-bp
Taymoor Nazmy
 
nural network ER. Abhishek k. upadhyay
nural network ER. Abhishek  k. upadhyaynural network ER. Abhishek  k. upadhyay
nural network ER. Abhishek k. upadhyay
abhishek upadhyay
 
Introduction to Neural networks (under graduate course) Lecture 9 of 9
Introduction to Neural networks (under graduate course) Lecture 9 of 9Introduction to Neural networks (under graduate course) Lecture 9 of 9
Introduction to Neural networks (under graduate course) Lecture 9 of 9
Randa Elanwar
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
MoctardOLOULADE
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
Anirban Santara
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
TensorFlow.pptx
TensorFlow.pptxTensorFlow.pptx
TensorFlow.pptx
Jayesh Patil
 
Machine learning project
Machine learning projectMachine learning project
Machine learning project
Harsh Jain
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
Owin Will
 
EE5180_G-5.pptx
EE5180_G-5.pptxEE5180_G-5.pptx
EE5180_G-5.pptx
MandeepChaudhary10
 
A Survey of Convolutional Neural Networks
A Survey of Convolutional Neural NetworksA Survey of Convolutional Neural Networks
A Survey of Convolutional Neural Networks
Rimzim Thube
 
Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep Learning
Sourya Dey
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
Pierre de Lacaze
 

Similar to Hands on machine learning with scikit-learn and tensor flow by ahmed yousry (20)

ML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptx
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
 
Machine learning Module-2, 6th Semester Elective
Machine learning Module-2, 6th Semester ElectiveMachine learning Module-2, 6th Semester Elective
Machine learning Module-2, 6th Semester Elective
 
08 neural networks
08 neural networks08 neural networks
08 neural networks
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
Lec 6-bp
Lec 6-bpLec 6-bp
Lec 6-bp
 
nural network ER. Abhishek k. upadhyay
nural network ER. Abhishek  k. upadhyaynural network ER. Abhishek  k. upadhyay
nural network ER. Abhishek k. upadhyay
 
Introduction to Neural networks (under graduate course) Lecture 9 of 9
Introduction to Neural networks (under graduate course) Lecture 9 of 9Introduction to Neural networks (under graduate course) Lecture 9 of 9
Introduction to Neural networks (under graduate course) Lecture 9 of 9
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
 
TensorFlow.pptx
TensorFlow.pptxTensorFlow.pptx
TensorFlow.pptx
 
Machine learning project
Machine learning projectMachine learning project
Machine learning project
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 
EE5180_G-5.pptx
EE5180_G-5.pptxEE5180_G-5.pptx
EE5180_G-5.pptx
 
A Survey of Convolutional Neural Networks
A Survey of Convolutional Neural NetworksA Survey of Convolutional Neural Networks
A Survey of Convolutional Neural Networks
 
Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep Learning
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 

Recently uploaded

special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 

Recently uploaded (20)

special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 

Hands on machine learning with scikit-learn and tensor flow by ahmed yousry

  • 1. Hands on Machine Learning with Scikit Learn Presented By: Ahmed Yousry
  • 2. Agenda ❑Introduction to Artificial Neural Networks. ❑Training Deep Neural Nets. ❑Convolutional Neural Networks. ❑Recurrent Neural Network. ❑Reinforcement Learning.
  • 3. Introduction to ANN • First introduced back in 1943 by the Warren McCulloch . • Successes of ANNs until the 1960s. • In the early 1980s there was a revival of interest in ANNs as new network architectures. • By the 1990s, powerful alternative Machine Learning techniques.
  • 4. Reasons why ANN is much more profound impact ❑There is now a huge quantity of data. ❑The tremendous increase in computing power. ❑The training algorithms have been improved. ❑Theoretical limitations of ANNs have turned out to be benign. ❑virtuous circle of funding and progress and products.
  • 7. The Perceptron • One of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. • It is based on a linear threshold unit (LTU). Z = w1 x1 + w2 x2 + ⋯ + wn xn = wT ・ x hw(x) = step (Z) = step (wT ・x)
  • 8. Multioutput perceptron ❑ A Perceptron with two inputs and three outputs. Note : No hidden layers in perceptron.
  • 9. Training Algorithm While epoch produces an error Present network with next inputs from epoch Err = T – O If Err <> 0 then Wj new = Wj old + LR * Ij * Err End If End While • T: actual output , O: predicted output • LR : learning rate , I :input
  • 10. XOR classification problem and an MLP that solves it XOR Function X1 XOR X2 = (X1 AND NOT X2) OR (X2 AND NOT X1) 2 2 2 2 -1 -1 Z1 Z2 Y X1 X2
  • 11. Multi-Layer Perceptron and Backpropagation • An MLP is composed of one input layer, one or more layers of LTUs, called hidden layers, and one final output layer • When an ANN has two or more hidden layers, it is called a deep neural network (DNN).
  • 12. A modern MLP (including ReLU and softmax) for classification
  • 13. Deep learning Problems • Vanishing gradients problem (or the related exploding gradients problem) lower layers very hard to train. • Second, with such a large network, training would be extremely slow. • Third, a model with millions of parameters would severely risk overfitting the training set.
  • 14. Gradients problems • Gradients often get smaller as the algorithm progresses down to the lower layers. • The Gradient Descent update leaves the lower layer weights unchanged, and training never converges to a good solution. • This is called the vanishing gradients problem. • The gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem
  • 15. Solving the first problem(Van…) A paper titled “Understanding the Difficulty of Training Deep Feedforward Neural Networks” by Xavier Glorot and Yoshua. 1. popular logistic sigmoid activation function. 2. using a normal distribution with a mean of 0 and a standard deviation of 1. 3. the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in DNN.
  • 16. Sigmoid activation function you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. 1/1+e^-x
  • 17. The problem of RELU (0,max) • It suffers from a problem (dying ReLUs) during training, some neurons effectively die. • they stop outputting anything other than 0. • In some cases, you may find that half of your network’s neurons are dead training. • To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU.
  • 18. leaky ReLU (RReLU). • leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak). • They also evaluated the randomized leaky ReLU (RReLU). • also evaluated the parametric leaky ReLU (PReLU),
  • 19. Exponential linear unit (ELU) • Outperformed all the ReLU variants in their experiments: training time was reduced and the neural network performed better on the test set.
  • 20. Batch Normalization • The technique consists of adding an operation in the model just before the activation function of each layer. • Simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). • In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer. • γ is the scaling parameter for the layer. • β is the shifting parameter (offset) for the layer.
  • 22. Reusing Pretrained Layers • It is generally not a good idea to train a very large DNN from scratch. • Try to find an existing neural network that accomplishes a similar task. • Reuse the lower layers of this network. • This is called transfer learning.
  • 23. Example • DNN that was trained to classify pictures into 100 different categories. • You now want to train a DNN to classify specific types of vehicles. • Freezing the Lower Layers weights. • Tweaking, Dropping, or Replacing the Upper Layers.
  • 24. Understanding AlexNet Consists of 5 Convolutional Layers and 3 Fully Connected Layers (classify 1000 classes)
  • 25. Faster Optimizers • Five ways to speed up training (and reach a better solution): ➢ Applying a good initialization strategy for the connection weights. ➢ using a good activation function. ➢ Using Batch Normalization. ➢ Reusing parts of a pretrained network. ➢ Using a faster optimizer than the regular Gradient Descent optimizer. • the most popular ones: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.
  • 26. Momentum Optimization Algorithm • Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regards to the weights (∇θJ(θ)) multiplied by the learning rate η (equation 1) • Momentum optimization cares a great deal about what previous gradients were. • It updates the weights by simply subtracting this momentum vector. • A new hyperparameter β, simply called the momentum, which must be set between 0 and 1, typically 0.9. (equation 2) Gradient Descent (1) Momentum Optimization (2)
  • 27.
  • 28. Nesterov Momentum optimization ▪ The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ. ▪ This small tweak works because in general the momentum vector will be pointing in the right direction ▪ where ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm)
  • 29. RMS Optimization • Accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). • It does so by using exponential decay in the first step. • generally performs better than Momentum optimization and Nesterov Accelerated Gradients. • In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.
  • 30. Adam Optimization • Stands for adaptive moment estimation. • Combines the ideas of Momentum optimization and RMSProp. • Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost m and s at the beginning of training. Initialize β1 = 0.9, β2 =0.999, η = 0.001 term ϵ initialized to a tiny number 10–8 to avoid division by 0.
  • 33. Learning rate techniques ❑ Predetermined piecewise constant learning rate For example, set the learning rate to η0 = 0.1 at first, then to η1 = 0.001 after 50 epochs. ❑ Performance scheduling Measure the validation error every N steps (just like for early stopping) and reduce the learning rate by a factor of λ when the error stops dropping. ❑ Exponential scheduling Set the learning rate to a function of the iteration number t: This works great, but it requires tuning η0 and r. The learning rate will drop by a factor of 10 every r steps. ❑ Power scheduling Set the learning rate to η(t) = η0 (1 + t/r)–c The hyperparameter c is set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.
  • 34. Dropout ❑ It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step
  • 35. Data Augmentation ❑ Consists of generating new training (rotating, resizing, flipping, and cropping) instances from existing ones, artificially boosting the size of the training set. ❑ This will reduce overfitting, making this a regularization technique. The trick is to generate realistic training instances.
  • 36. Convolutional Neural Networks ❑ A convolutional neural network (or ConvNet) is a type of feed-forward artificial neural network. ❑ The architecture of a ConvNet is designed to take advantage of the 2D structure of an input image. ❑ A ConvNet is comprised of one or more convolutional layers (often with a pooling step) and then followed by one or more fully connected layers as in a standard multilayer neural network.
  • 37. How CNN works • For example, a ConvNet takes the input as an image which can be classified as ‘X’ or ‘O’
  • 38. ConvNet Layers ▪CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. ▪RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged. ▪POOL layer will perform a down sampling operation along the spatial dimensions (width, height). ▪FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1xN], where each of the N numbers correspond to a class score, such as among the N categories.
  • 39. Convolutional Layer - Filters ▪ The CONV layer’s parameters consist of a set of learnable filters. ▪ Every filter is small spatially (along width and height), but extends through the full depth of the input volume. ▪ During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.
  • 40. Convolutional Layer - Filters • Sliding the filter over the width and height of the input gives 2-dimensional activation map that responds to that filter at every spatial position.
  • 41. Convolutional Layer – Filters –Example
  • 42. Convolutional Layer – Filters – Computation Example
  • 43. Convolutional Layer – Filters – Output Feature Map
  • 45. Pool Layer ▪ The pooling layers down-sample the previous layers feature map. ▪ Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network ▪ The pooling layer often uses the Max operation to perform the down sampling process.
  • 46. Pooling Filter example Size = 2 X 2, Stride = 2
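A small NumPy sketch of 2x2 max pooling with stride 2, matching the example above:

import numpy as np

def max_pool_2x2(feature_map):
    # Keep only the largest value in each non-overlapping 2x2 block
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2                            # drop an odd border if present
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fm))          # each output entry is the max of one 2x2 block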
  • 47. Fully connected layer ❑ Fully connected layers are the normal flat feed-forward neural network layers. ❑ These layers may have a non-linear activation function or a softmax activation in order to predict classes. ❑ To compute our output, we simply rearrange the output matrices as a 1-D array.
  • 48. SoftMax operation ❑ A special kind of activation layer, usually applied to the outputs of the final FC layer ❑ Can be viewed as a fancy normalizer (a.k.a. normalized exponential function) ❑ Produces a discrete probability distribution vector ❑ Very convenient when combined with cross-entropy loss
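A one-function NumPy sketch of the softmax operation:

import numpy as np

def softmax(scores):
    # Normalized exponential: raw class scores -> discrete probability vector
    exps = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; the largest score gets the highest probability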
  • 49. Recurrent Neural Network ❑ Some problems require previous history/context in order to be able to give proper output (speech recognition, stock forecasting, target tracking, etc.) ❑ One way to do that is to just provide all the necessary context in one "snap-shot" and use standard learning ➢ How big should the snap-shot be? Varies for different instances of the problem. ✓ If the input sequences are of fixed length, or can be easily padded to a fixed length, they can be collapsed into a single input vector and fed to any of the standard pattern classification algorithms.
  • 50. Sequential data ❑ There are many tasks that require learning a temporal sequence of events ❑ These problems can be broken into 3 distinct types of tasks ➢ Sequence Recognition: Produce a particular output pattern when a specific input sequence is seen. Applications: sentiment analysis, handwriting recognition ➢ Sequence Reproduction: Generate the rest of a sequence when the network sees only part of the sequence. Applications: time series prediction (stock market, sun spots, etc.), language models ➢ Temporal Association: Produce a particular output sequence in response to a specific input sequence. Applications: machine translation, speech generation ✓ Recurrent networks are flexible enough to solve these problems.
  • 51. Recurrent Networks offer a lot of flexibility: (1) Fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
  • 52. Recurrent Neural Networks ❑ Recurrent neural network lets the network dynamically learn how much context it needs in order to solve the problem. ❑ RNN is a multilayer NN with the previous set of hidden unit activations feeding back into the network along with the inputs. ❑ RNNs have a “memory” which captures information about what has been calculated so far.
  • 53. Recurrent neural networks ❑ Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them. ❑ It means local connections are shared (same weights) across different temporal instances of the hidden units. ❑ If we had to define a different function g_t for each possible sequence length, each with its own parameters, we would not get any generalization to sequences of a length not seen in the training set.
  • 54. Dynamic systems ❑ A means of describing how one state develops into another state over the course of time. ❑ Consider the classical form of a dynamical system: s_t = f_θ(s_{t−1}) ✓ where s_t is the system state at time t and f_θ is a mapping function. ❑ The same parameters (the same function f_θ) are used for all time steps. ❑ The unfolded flow graph of such a system is:
  • 55. Dynamic systems ❑ Now consider a dynamical system driven by an external signal x_t: s_t = f_θ(s_{t−1}, x_t). The state s_t now contains information about the whole past sequence.
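A toy sketch of such a driven system unrolled for a few steps; the scalar "weights" W and U are purely illustrative:

import numpy as np

def f(s_prev, x, W=0.9, U=0.5):
    # One step of s_t = f_theta(s_{t-1}, x_t); the same parameters are reused at every step
    return np.tanh(W * s_prev + U * x)

s = 0.0                                     # initial state s_0
for t, x in enumerate([1.0, -0.5, 0.3], start=1):
    s = f(s, x)                             # the state now summarizes the whole past input
    print(t, round(float(s), 4))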
  • 57. Cost function ❑ The total loss for a given input/target sequence pair (x, y), measured in cross entropy: L(y, ŷ) = Σ_t L_t = Σ_t −y_t log ŷ_t • where y_t is the category that should be associated with time step t in the output sequence and ŷ_t is the predicted output.
  • 58. Computing the gradient in RNN Using generalized back-propagation, one can obtain the so-called Back-Propagation Through Time (BPTT) algorithm. We can then iterate backwards in time to back-propagate gradients through time, from t = T − 1 down to t = 1, noting that s_t (for t < T) has as descendants both o_t and s_{t+1}.
  • 59. Exploding or vanishing gradient ❑ In recurrent nets (also in very deep nets), the final output is the composition of a large number of non-linear transformations. ❑ Even if each of these non-linear transformations is smooth, their composition might not be. ❑ The derivative (i.e. Jacobian matrix) through the whole composition will tend to be either very small or very large. ❑ Example: suppose all the factors in the product are scalars with the same value α. As the number of time steps T grows, α^T → ∞ if α > 1 and α^T → 0 if α < 1.
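The scalar example can be checked in two lines of Python:

for alpha in (0.9, 1.1):
    print(alpha, alpha ** 100)   # 0.9**100 ≈ 2.7e-5 (vanishes), 1.1**100 ≈ 1.4e4 (explodes)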
  • 60. Gradient clipping ❑ Once the gradient value grows extremely large, it causes an overflow (i.e. NaN) which is easily detectable at runtime. ❑ A simple heuristic solution is to clip gradients whenever they explode: whenever they reach a certain threshold, they are set back to a smaller value, as shown in the algorithm and in the sketch below. (Figure: error surface of a single-hidden-unit RNN.)
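A minimal sketch of clipping by norm (the threshold value is illustrative; other variants clip each component instead):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    # If the gradient norm exceeds the threshold, rescale it back down, keeping its direction
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])                  # an "exploded" gradient with norm 50
print(clip_gradient(g, threshold=5.0))       # rescaled to norm 5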
  • 61. Facing the vanishing gradient problem ❑ Echo State Networks (ESN) ❑ Long delays ❑ Leaky Units ❑ Gated Recurrent Neural Networks
  • 62. Echo State Networks (ESN) ❑ How do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? ❑ Answer: make the dynamical system associated with the recurrent net nearly be on the edge of stability, i.e., more precisely with values around 1 for the leading eigenvalue of the Jacobian of the state-to-state transition function. ❑ ESNs fix the weights of the input→hidden connections and the hidden→hidden connections at carefully chosen random values that make the Jacobians slightly contractive. This is achieved by making the leading eigenvalue λ of the recurrent weight matrix large but slightly less than 1. ❑ ESNs only learn the hidden→output connections.
  • 63. Skip Connections (Long delays) ❑ Adding longer-delay connections allows past states to reach future states through short paths. ❑ If we only have a connection at every time step, gradients vanish or explode over T time steps as O(λ^T). ❑ Instead, if we have recurrent connections with a time delay of D, gradients shrink or grow as O(λ^(T/D)), so they vanish much more slowly, although they may still explode for large T, ❑ because the number of effective steps is T/D. This allows the learning algorithm to capture longer dependencies.
  • 64. Gated Recurrent Neural Networks ❑ GRNNs are a special kind of RNN, capable of learning long-term dependencies by having more persistent memory. Two popular architectures: ➢ Long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997]. ➢ Gated recurrent unit (GRU), [Cho et al., 2014] ❑ Applications: handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014a), image to text conversion (captioning) (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015b) and parsing (Vinyals et al., 2014a).
  • 65. Long Short-Term Memory (LSTM) ❑ Standard RNNs have a very simple repeating module structure, such as a single tanh layer. ❑ LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
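A hedged tf.keras sketch of a sequence classifier built around a single LSTM layer; the shapes (100 time steps, 8 features, 3 classes) are made up for the example:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(100, 8)),   # gated recurrent layer with a persistent cell state
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()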
  • 66. Generate image caption ❑ Vinyals et al., Show and Tell: A Neural Image Caption Generator, arXiv 2014 ❑ Use a CNN as an image encoder and transform it to a fixed-length vector ❑ It is used as the initial hidden state of a “decoder” RNN that generates the target sequence
  • 67. Translate videos to sentences ❑ Venugopalan et al. arXiv 2014 ❑ The challenge is to capture the joint dependencies of a sequence of frames and a corresponding sequence of words
  • 68. Reinforcement Learning ❑ One of the most exciting fields of Machine Learning today, and also one of the oldest. ❑ It has been around since the 1950s, producing many interesting applications over the years in particular in games (e.g., TD-Gammon, a Backgammon playing program). ❑ Revolution took place in 2013 when researchers from an English startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch. ❑ DeepMind was bought by Google for over 500 million dollars in 2014.
  • 69. Learning to Optimize Rewards ❑ In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards. ❑ Its objective is to learn to act in a way that will maximize its expected long-term rewards. ❑ The agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.
  • 70. Examples of RL agents • (a) walking robot, (b) Ms. Pac-Man, (c) Go player, • (d) thermostat, • (e) automatic trader
  • 71. Policy Search ❑ The algorithm used by the software agent to determine its actions is called its policy. ❑ For example, the policy could be a neural network taking observations as inputs and outputting the action to take
  • 72. Stochastic policy ❑ The policy can be any algorithm you can think of, and it does not even have to be deterministic. ❑ For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with some probability p every second, or randomly rotate left or right with probability 1 – p. ❑ The rotation angle would be a random angle between –r and +r. Since this policy involves some randomness, it is called a stochastic policy.
  • 73. Introduction to OpenAI Gym ❑ One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working environment. ❑ If you want to program an agent that will learn to play an Atari game, you will need an Atari game simulator. ❑ If you want to program a walking robot, then the environment is the real world and you can directly train your robot in that environment.
  • 74. Example of environment ❑ CartPole environment. This is a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it
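A minimal interaction loop for CartPole, assuming the gym package and its classic reset/step API (newer Gym/Gymnasium releases return slightly different tuples):

import gym

env = gym.make("CartPole-v1")
obs = env.reset()                         # classic API; newer versions return (obs, info)
total_reward, done = 0.0, False
while not done:
    angle = obs[2]                        # hard-coded policy: push toward the side the pole leans to
    action = 0 if angle < 0 else 1        # 0 = accelerate left, 1 = accelerate right
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("Episode reward:", total_reward)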
  • 75. Neural Network Policies ❑ In the case of the CartPole environment, there are just two possible actions (left or right) ❑ For example, if it outputs 0.7, then we will pick action 0 with 70% probability, and action 1 with 30% probability.
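A rough sketch of such a neural network policy for CartPole (4 observations in, one sigmoid output interpreted as the probability of action 0); the architecture is illustrative only:

import numpy as np
import tensorflow as tf

policy = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # P(action 0 = push left)
])

obs = np.random.randn(1, 4).astype("float32")          # a dummy observation
p_left = float(policy(obs).numpy()[0, 0])
action = 0 if np.random.rand() < p_left else 1         # e.g. 0.7 -> action 0 with 70% probability
print(p_left, action)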
  • 76. Markov Decision Processes ❑ In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains. ❑ Such a process has a fixed number of states, and it randomly evolves from one state to another at each step. ❑ The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s,s′), not on past states (the system has no memory). ❑ Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more.
  • 77. MDP Example ❑ Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step. ❑ Eventually it is bound to leave that state and never come back since no other state points back to s0. ❑ If it goes to state s1, it will then most likely go to state s2 (90% probability), then immediately back to state s1 (with 100% probability).
  • 79. Example: Grid World ❑ Noisy movement: actions do not always go as planned ❑ 80% of the time, the action North takes the agent North (if there is no wall there) ❑ 10% of the time, North takes the agent West; 10% East ❑ If there is a wall in the direction the agent would have been taken, the agent stays put. ❑ The agent receives rewards each time step ▪ Small “living” reward each step (can be negative) ▪ Big rewards come at the end (good or bad) ❑ Goal: maximize sum of rewards
  • 80. Grid World Actions Deterministic Grid World Stochastic Grid World
  • 81. Markov Decision Processes ❑ An MDP is defined by: ▪ A set of states s ∈ S ▪ A set of actions a ∈ A ▪ A transition function T(s, a, s’) ▪ Probability that a from s leads to s’, i.e., P(s’| s, a) ▪ Also called the model or the dynamics ▪ A reward function R(s, a, s’) ▪ Sometimes just R(s) or R(s’) ▪ A start state ▪ Maybe a terminal state
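A toy encoding of such an MDP in plain Python, loosely based on the transition probabilities of the earlier example (the rewards and the single action are made up for illustration):

import random

# T[(s, a)] is a list of (probability, next_state, reward) triples
T = {
    ("s0", "a0"): [(0.7, "s0", 0.0), (0.3, "s1", 0.0)],
    ("s1", "a0"): [(0.9, "s2", -1.0), (0.1, "s1", 0.0)],
    ("s2", "a0"): [(1.0, "s1", 5.0)],
}

def step(state, action):
    # Sample a next state and reward according to T(s, a, s') and R(s, a, s')
    r, cumulative = random.random(), 0.0
    for prob, next_state, reward in T[(state, action)]:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return next_state, reward          # guard against floating-point rounding

print(step("s0", "a0"))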
  • 82. What is Markov about MDPs? ❑ “Markov” generally means that given the present state, the future and the past are independent ❑ For Markov decision processes, “Markov” means action outcomes depend only on the current state ❑ This is just like search, where the successor function could only depend on the current state (not the history) Andrey Markov (1856-1922)
  • 83. Markov Property P(S_{t+1} | S_t) = P(S_{t+1} | S_0, S_1, …, S_{t−1}, S_t)
  • 84. Policies ❑ In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal ❑ For MDPs, we want an optimal policy π*: S → A ▪ A policy π gives an action for each state ▪ An optimal policy is one that maximizes expected utility if followed ▪ An explicit policy defines a reflex agent Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
  • 85. Optimal Policies (panels for different living rewards: R(s) = −0.01, R(s) = −0.03, R(s) = −0.4, R(s) = −2.0)
  • 86. Utilities of Sequences ▪ What preferences should an agent have over reward sequences? ▪ More or less? [1, 2, 2] or [2, 3, 4]? ▪ Now or later? [0, 0, 1] or [1, 0, 0]?
  • 87. Discounting ▪ It’s reasonable to maximize the sum of rewards ▪ It’s also reasonable to prefer rewards now to rewards later ▪ One solution: values of rewards decay exponentially (a reward r is worth r now, γ·r one step later, and γ²·r two steps later)
  • 88. Discounting ▪ How to discount? ▪ Each time we descend a level, we multiply in the discount once ▪ Why discount? ▪ Sooner rewards probably do have higher utility than later rewards ▪ Also helps our algorithms converge ▪ Example: discount of 0.5 ▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75 ▪ U([3,2,1]) = 1*3 + 0.5*2 + 0.25*1 = 4.25 ▪ So U([1,2,3]) < U([3,2,1])
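The example can be checked with a two-line helper:

def discounted_utility(rewards, gamma=0.5):
    # U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3]))   # 2.75
print(discounted_utility([3, 2, 1]))   # 4.25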
  • 89. Infinite Utilities?! ▪ Problem: What if the game lasts forever? Do we get infinite rewards? ▪ Solutions: ▪ Finite horizon: (similar to depth-limited search) ▪ Terminate episodes after a fixed number of steps T (e.g. life) ▪ Gives nonstationary policies (π depends on time left) ▪ Discounting: use 0 < γ < 1 ▪ Smaller γ means a smaller “horizon” – shorter-term focus ▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached