Reinforcement Learning
AAA-Python Edition
Plan
● 1- Introduction
● 2- OpenAI Gym
● 3- Policy Gradient
● 4- Example
● 5- Q-Learning
1-Introduction: Concept
● Reinforcement Learning is a type of machine learning.
● It is based on simple concepts:
➢ A program called an Agent has to learn how to achieve a certain goal.
➢ It will learn from its interactions with the surrounding environment related to that goal.
➢ The agent learns by taking actions that affect the state of the surrounding environment.
➢ Two elements of information affect the choice of which action to take:
➔ The state of the environment
➔ The reward (the reinforcement) the agent will receive from the environment after taking an action.
1-Introduction: Learning
● The interaction between the agent and its environment can be modeled by the following diagram:
(Diagram: the Agent receives the input state and, through its Policy, takes actions on the Environment; the Environment returns a Reward and a new state, affected by the agent's actions. The Policy is a function that determines the action to be taken by the Agent given an input state. The goal of the agent is to maximize the reward.)
1-Introduction: Applications & Simulation
● Reinforcement Learning can be used for different types of applications:
➢ Robotics: the agent is the robot, and the environment is either the real world or a simulation of it. Its goal can be, for example, to reach a certain location, to clean a room, or to explore a building.
➢ Games: the agent is a player, and the environment is a simulation of the game. Its goal is to win the game.
● In most cases, the Agent needs a simulation of the environment.
● OpenAI Gym is a library that provides a set of simulated environments that can be used in different types of applications.
● Installation of OpenAI Gym:
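The installation cell itself appears only as a screenshot in the slides; a minimal equivalent, assuming a notebook such as Google Colaboratory, would be:

    # In a notebook cell (assumed, not the original screenshot):
    !pip install gym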
2-OpenAI Gym: Elements to consider
● Here is a list of elements to consider when using the OpenAI Gym library (a usage sketch follows the list):
➢ The available environments:
➔ There is a set of predefined environments that you select by their associated name.
➔ The name is passed as an argument to the function make.
➔ Before using the environment, you have to initialize it with the environment's method reset.
➢ The agent is modeled by:
➔ An action: a value from the possible action values described by the environment's attribute action_space.
➔ Which action it takes at a certain point: you define a function (the policy) that will return the value of the action to be taken.
➔ How it takes the action: you call the environment's method step.
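A minimal sketch of these elements, assuming the CartPole-v1 environment and the classic Gym API of that era (reset returning only the state, step returning 4 values):

    import gym

    env = gym.make("CartPole-v1")       # select a predefined environment by its name
    state = env.reset()                 # initialize the environment; returns the initial state
    print(env.action_space)             # the possible actions (Discrete(2) for CartPole)

    action = env.action_space.sample()  # a trivial "policy": a random action
    state, reward, done, info = env.step(action)   # apply the action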
2-OpenAI Gym: Elements to consider (continued)
➢ The simulation: the display of the simulated environment is available through the environment's method render.
➢ The reward and the state of the environment are returned by the call to the method step. The method reset returns only the state values (after initialization).
● In order to render the environment in Google Colaboratory, we had to install certain additional libraries and add some statements for each render (note: you have to uninstall the library first if it is already installed).
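The exact Colab cells are shown only as screenshots; one common way of doing this (the packages and the helper below are an assumption, not the author's exact code) is to start a virtual display and draw the rendered frames with matplotlib:

    # !apt-get install -y xvfb
    # !pip install pyvirtualdisplay
    from pyvirtualdisplay import Display
    import matplotlib.pyplot as plt

    display = Display(visible=0, size=(1400, 900))   # headless display for env.render
    display.start()

    def show_frame(env):
        # render to an RGB array and draw it inline instead of opening a window
        plt.imshow(env.render(mode="rgb_array"))
        plt.axis("off")
        plt.show()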
2-OpenAI Gym: CartPole-V1 environment
● This environment simulates a pole attached by an un-actuated joint to a cart, which moves along a frictionless track.
● The goal for the agent will be to prevent the pole from falling.
● There are only 2 (discrete) possible actions to take: push the cart to the right (action == 1) or to the left (action == 0).
● The actions correspond to applying a +1 or -1 force to the cart.
● The observation (state) consists of 4 values: the cart position, the cart velocity, the pole angle, and the angular velocity.
2-OpenAI Gym: CartPole-V1 environment (continued)
● The reward given to the agent for each step taken: in this environment, it is equal to +1 for any step taken as long as the pole remains upright. The termination step is also included.
● The sequence of all steps done in a simulation, from its reset until its termination, is called an episode.
● An episode in this environment is terminated if:
➢ The pole angle is more than 12 degrees
➢ The cart position is more than 2.4 (away from the center)
➢ The episode length is greater than 200
● Solved requirements: the average reward is greater than or equal to 195.0 over 100 consecutive trials (episodes). (Note: these two thresholds are those of CartPole-v0; CartPole-v1 raises them to 500 steps and a 475.0 average reward.)
2-OpenAI Gym: A simple example
● Initial state before taking any action (generated after a reset).
● The reset assigns to each of the state values a uniform random value in [-0.05, 0.05].
● In the code (shown as a screenshot; a sketch follows): we first move to the right, and we run one episode (one trial) with a maximum length of 600 steps.
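A sketch reproducing the described run (the hand-written policy in the loop is an illustration of mine, not the author's code):

    import gym

    env = gym.make("CartPole-v1")
    state = env.reset()                  # state values drawn uniformly in [-0.05, 0.05]
    print("initial state:", state)

    total_reward = 0.0
    for step in range(600):              # one episode with a maximum length of 600 steps
        if step == 0:
            action = 1                   # we first move to the right
        else:
            action = 1 if state[2] > 0 else 0   # naive policy: push toward the pole's lean
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:                         # episode terminated (pole fell or cart left the track)
            break
    print("episode length:", step + 1, "total reward:", total_reward)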
2-OpenAI Gym: A simple example (continued)
It is clear, from the picture representing the last state of the environment after its termination, that the move that caused that termination was a move to the left.
3-Policy Gradient: The problem
● We have to define the policy that the agent will use to choose the right action.
● This action must allow the agent to maximize the total rewards obtained over the whole episode.
● So it is not only the reward obtained from the current action that must be maximized, but the sum of this reward and the following rewards obtained after applying the following actions until the end of the episode.
● Several approaches exist to solve this issue. Most of them involve the use of a neural network.
3-Policy Gradient: Use of neural networks
● Neural networks can be used in different manners.
● We will talk about 2 different ways to use a neural network:
➢ The neural network is itself the policy: it will predict the action to apply.
➢ The neural network is combined with a policy:
➔ The neural network learns to predict a certain value, known as the Q-Value, computed for a (state, action) pair.
➔ The policy is then used to select the action to apply: for example, after predicting the Q-values of a state for all the possible actions, select the action that corresponds to the maximum Q-value.
3-Policy Gradient: The discounted rewards
● As we said previously, a neural network can either learn to predict the action to apply directly, or learn to predict a Q-value that is then used to select the action to apply.
● Each action can be evaluated by the reward the agent gets immediately by taking this action, plus the following discounted rewards it gets when it applies the following actions until the end of the episode.
● For a sequence of actions A1, A2, ..., An with rewards r1, r2, ..., rn (An being the last action just before the end of the episode):
Value(A1) = r1 + γ⋅r2 + γ²⋅r3 + ⋯ + γⁿ⁻¹⋅rn = r1 + γ⋅Value(A2)
● More generally, for any action Ai: Value(Ai) = ri + γ⋅Value(Ai+1)
● γ is the discount factor: it represents the importance of the future actions. A good value of γ is around 0.9.
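A small helper illustrating this recursion (mine, not from the slides):

    def action_values(rewards, gamma=0.9):
        """Value(Ai) = ri + gamma * Value(Ai+1), computed backwards over one episode."""
        values = [0.0] * len(rewards)
        running = 0.0
        for i in reversed(range(len(rewards))):
            running = rewards[i] + gamma * running
            values[i] = running
        return values

    print(action_values([1, 1, 1]))   # [2.71, 1.9, 1.0]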
3-Policy Gradient: Policy Gradient
● It implies the use of gradients: values computed from the derivative of a cost function (for example the MSE error function) and applied by an optimizer (for example the SGD algorithm) to update the parameters of the policy (in our case a neural network policy, so the optimizer updates the weights of the layers of the ANN).
● The gradients must also maximize the rewards obtained.
● One of the available algorithms is the REINFORCE algorithm.
● We will discuss and apply a variant of that algorithm explained in [Géron, 2017].
3-Policy Gradient: Example
● To train the neural network to predict the optimum action to apply, we follow these steps (a code sketch is given after the full list):
➢ Apply only feed-forward computations for a certain number of episodes:
➔ The neural network gets a state as input. In our example (the CartPole environment), the state is represented by 4 values.
➔ The neural network will output probabilities related to each action to take. In our case it will be one probability: the probability to go left (to take action 0).
➔ From this probability, take the corresponding action ==> this implies new state values and a returned reward.
➔ To compute the gradients corresponding to this probability, we need the corresponding labels. We will consider that the taken action was the action to take ==> if the action was 0 the probability must be 1 (100% to go left), if the action was 1 the probability must be 0 (0% to go left).
3-Policy Gradient: Example (continued)
➔ Compute the gradients with this label.
➔ Use the new state values as the new input, and save the obtained reward.
➔ Repeat this process for a certain number of episodes.
➢ Compute the values of the taken actions:
➔ Since for each action we have all the following rewards, we can compute its value.
➔ Normalize the computed values, so that good values will be positive and the bad ones will be negative.
➢ To take these action values into account, we multiply them with the gradients.
➢ Compute the mean of the new gradients, and use them to update the weights ==> apply one backpropagation step.
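A sketch of the per-step feed-forward pass and its gradients, adapted to Keras/TensorFlow 2 eager execution (the layer sizes, the GradientTape mechanics and the old 4-value Gym step API are assumptions; the original notebook, adapted from [Géron, 2017], is only shown as screenshots):

    import numpy as np
    import tensorflow as tf

    # Policy network: 4 state values in, one probability out (P(action == 0), i.e. go left).
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation="elu", input_shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    loss_fn = tf.keras.losses.binary_crossentropy

    def play_one_step(env, state, model):
        with tf.GradientTape() as tape:
            left_proba = model(state[np.newaxis])                    # feed-forward only
            action = tf.cast(tf.random.uniform([1, 1]) > left_proba, tf.float32)  # 0 = left, 1 = right
            y_target = 1.0 - action                                  # label: the taken action was "the right one"
            loss = tf.reduce_mean(loss_fn(y_target, left_proba))     # cross-entropy against that label
        grads = tape.gradient(loss, model.trainable_variables)       # gradients for this single step
        state, reward, done, info = env.step(int(action.numpy()[0, 0]))
        return state, reward, done, grads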
3-Policy Gradient: Selecting the action while training
● To select the action to take (while training) from the output probability, we will use a multinomial distribution.
● Instead of selecting the action corresponding to the largest probability (in our case, since we have one probability, we would select action 0 if the probability is > 0.5 and action 1 if it is <= 0.5), we will use these probabilities in a sampling function (the multinomial distribution) to generate the action values with these probabilities.
● For example, in our case, if the neural network outputs 0.7 ==> action 1 will have the probability 0.3. The multinomial distribution will then have to return one value. That value will have a 0.7 probability of being a 0, and a 0.3 probability of being a 1.
● To do this, we will give the multinomial distribution the log of the output probabilities (in our example, log(0.7) and log(0.3)).
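A sketch of that sampling step with tf.random.categorical, the TensorFlow 2 name of the multinomial sampler (the TensorFlow 1 code in [Géron, 2017] uses tf.multinomial for the same purpose):

    import tensorflow as tf

    left_proba = tf.constant([[0.7]])                           # network output: P(action == 0)
    probas = tf.concat([left_proba, 1.0 - left_proba], axis=1)  # [[0.7, 0.3]]
    log_probas = tf.math.log(probas)                            # the sampler expects log-probabilities
    action = tf.random.categorical(log_probas, num_samples=1)   # 0 with probability 0.7, 1 with probability 0.3
    print(int(action[0, 0]))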
3-Policy Gradient: Implementation
● We will adapt the implementation presented in [Géron, 2017] by using Keras and TensorFlow.
● Before going further, we will introduce the optimizer and the cost function that we will use:
➢ The cost function will be the cross-entropy function. We already defined it in the previous course.
➢ The optimizer will be the Adam optimizer. Adam stands for Adaptive Moment Estimation. It is an adaptive learning rate algorithm: such algorithms diminish the effect of the learning rate for the steepest (fastest) dimensions (features).
➢ It also keeps track of an exponentially decaying average of past gradients (momentum principle) and of past squared gradients.
3-Policy Gradient: Adam Optimizer
● The Adam algorithm can be described as follows (from [Kingma and Ba, 2014]):
➢ Initialize the parameter vector θ₀ (the parameters correspond to the parameters that we noted w in the previous course)
➢ m₀ ← 0, v₀ ← 0, t ← 0
➢ While the parameters θ_t have not converged:
➔ t ← t + 1
➔ g_t ← ∇θ f_t(θ_{t-1})  (the gradients: derivative of the cost function f)
➔ m_t ← β1⋅m_{t-1} + (1 − β1)⋅g_t  (decaying average of the gradients)
➔ v_t ← β2⋅v_{t-1} + (1 − β2)⋅g_t²  (decaying average of the squared gradients)
➔ m̂_t ← m_t / (1 − β1^t)
➔ v̂_t ← v_t / (1 − β2^t)
➔ θ_t ← θ_{t-1} − η⋅m̂_t / (√v̂_t + ϵ)
➢ End while
● θ_t, g_t, m_t, v_t, m̂_t and v̂_t are all vectors; the index t indicates the value of a vector at iteration t. The multiplications and the divisions between vectors are element-wise.
● β1 and β2 are the momentum decay hyperparameters, η is the learning rate, and ϵ is a smoothing term used to avoid division by zero. Good default values are: η = 0.002, β1 = 0.9, β2 = 0.999.
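In Keras this optimizer is available ready-made; a hedged instantiation with the default values quoted above (the parameter names are those of recent tf.keras releases; older standalone Keras used lr instead of learning_rate):

    from tensorflow import keras

    optimizer = keras.optimizers.Adam(learning_rate=0.002,  # η
                                      beta_1=0.9,           # β1
                                      beta_2=0.999,         # β2
                                      epsilon=1e-7)         # ϵ, the smoothing term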
4-Example: Normalization of discounted rewards
● The function and its test are shown as code screenshots: we define a function that discounts and normalizes the rewards, then test it on a small example.
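A sketch consistent with that description (discount each episode's rewards, then normalize with the overall mean and standard deviation, in the spirit of [Géron, 2017]; not the author's exact code):

    import numpy as np

    def discount_rewards(rewards, gamma):
        discounted = np.array(rewards, dtype=np.float64)
        for step in range(len(rewards) - 2, -1, -1):
            discounted[step] += gamma * discounted[step + 1]
        return discounted

    def discount_and_normalize_rewards(all_rewards, gamma):
        all_discounted = [discount_rewards(rewards, gamma) for rewards in all_rewards]
        flat = np.concatenate(all_discounted)
        mean, std = flat.mean(), flat.std()
        # after normalization, good actions get positive values and bad ones negative values
        return [(discounted - mean) / std for discounted in all_discounted]

    # quick test on two tiny episodes
    print(discount_and_normalize_rewards([[10, 0, -50], [10, 20]], gamma=0.8))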
4-Example: The training
● We set num_iterations to 25 (instead of 250) and max_episodes to 5 (instead of 10) to keep the execution time reasonable.
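The training cells are screenshots; a condensed sketch of the loop they describe, reusing the model, play_one_step and discount_and_normalize_rewards from the earlier sketches (the discount rate and learning rate below are illustrative assumptions):

    import tensorflow as tf
    import gym

    env = gym.make("CartPole-v1")
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    num_iterations, max_episodes, max_steps, gamma = 25, 5, 200, 0.95

    for iteration in range(num_iterations):
        all_rewards, all_grads = [], []
        for episode in range(max_episodes):
            rewards, grads_list, state = [], [], env.reset()
            for step in range(max_steps):                 # feed-forward only, gradients kept aside
                state, reward, done, grads = play_one_step(env, state, model)
                rewards.append(reward)
                grads_list.append(grads)
                if done:
                    break
            all_rewards.append(rewards)
            all_grads.append(grads_list)

        # weight each step's gradients by its normalized action value, average, then backpropagate once
        all_values = discount_and_normalize_rewards(all_rewards, gamma)
        mean_grads = []
        for var_index in range(len(model.trainable_variables)):
            mean_grads.append(tf.reduce_mean(
                [float(value) * all_grads[ep][st][var_index]
                 for ep, values in enumerate(all_values)
                 for st, value in enumerate(values)], axis=0))
        optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))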
4-Example: The training (2)
(The rest of the training code is shown as a screenshot.)
4-Example: The prediction
● Since we didn't let the model train long enough, we will not get good results.
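Running the (under-trained) policy greedily for prediction might look like this (a sketch continuing the code above, not the original cell):

    import numpy as np

    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        left_proba = float(model(state[np.newaxis])[0, 0])
        action = 0 if left_proba > 0.5 else 1     # greedy choice: no more sampling at prediction time
        state, reward, done, info = env.step(action)
        total_reward += reward
    print("total reward:", total_reward)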
5-Q-Learning: Q-Learning
● As before, a neural network can either learn to predict the action to apply, or learn to estimate a Q-value that is then used to select the action to apply.
● A Q-value is an estimation of the maximum discounted future reward when an action a is performed on a state s: Q(s_t, a_t) = max(R_{t+1}), considering that the RL problem can be modeled as a Markov Decision Process.
● It is computed using the Bellman equation:
Q(s_t, a_t) = r + γ⋅max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
● The neural network will have to learn to estimate these Q-values ==> Q-Learning.
● If the neural network is a deep NN, we talk about Deep Q-Learning (DQN).
● In general, the learning involves the use of an initialized Q-table containing the Q-value of each possible (state, action) pair (a table-based sketch follows this slide).
● In general, the NN will perform random actions (exploration), and use the obtained rewards to correct the Q-table values.
● When its estimations become better, it will use them to choose the actions (exploitation), and continue the training.
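For comparison, the table-based version of this update (plain Q-learning with a learning rate α; the discrete FrozenLake environment is just an illustration of mine, not from the slides):

    import numpy as np
    import gym

    env = gym.make("FrozenLake-v0")                   # a small discrete environment
    Q = np.zeros((env.observation_space.n, env.action_space.n))   # the Q-table
    alpha, gamma, epsilon = 0.1, 0.95, 0.1

    for episode in range(2000):
        s, done = env.reset(), False
        while not done:
            if np.random.rand() < epsilon:            # exploration
                a = env.action_space.sample()
            else:                                     # exploitation
                a = int(np.argmax(Q[s]))
            s_next, r, done, info = env.step(a)
            # Bellman-based update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next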
5-Q-Learning: Example - Introduction
● For our example, we will use the library keras-rl.
● It is a library that implements some deep reinforcement learning algorithms in Python.
● It integrates with Keras, and works with OpenAI Gym environments.
● The example uses a Memory object. In fact, DQN learning involves the use of a replay memory:
➢ The actions performed by the neural network are collected in a replay memory together with all the information related to each action (the starting state, the obtained reward, and the resulting state after performing the action).
➢ During the training, the DNN will use random batches from this memory as training data.
● To write our example, we used the tutorial available at [Keras rl] with some modifications.
5-Q-Learning: The example
● The code (shown as a screenshot) builds the Q-network: it flattens the input, and prints the configuration of the model.
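A sketch of that model, consistent with the annotations and with the keras-rl CartPole tutorial (the number and size of the hidden layers are assumptions):

    import gym
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Flatten

    env = gym.make("CartPole-v1")
    nb_actions = env.action_space.n                   # 2 actions for CartPole

    model = Sequential()
    model.add(Flatten(input_shape=(1,) + env.observation_space.shape))  # flatten the input
    model.add(Dense(16))
    model.add(Activation('relu'))
    model.add(Dense(nb_actions))
    model.add(Activation('linear'))                   # one Q-value per action
    print(model.summary())                            # prints the configuration of the model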
5-Q-Learning: The example (continued)
● The code (shown as a screenshot) then builds and trains the agent; the chosen policy returns the current best action according to the q_values.
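And a sketch of the agent setup and training, based on the [Keras rl] example (the hyperparameters are illustrative, not the author's exact values):

    from keras.optimizers import Adam
    from rl.agents.dqn import DQNAgent
    from rl.policy import EpsGreedyQPolicy
    from rl.memory import SequentialMemory

    memory = SequentialMemory(limit=50000, window_length=1)   # the replay memory
    policy = EpsGreedyQPolicy()    # usually returns the current best action according to q_values,
                                   # sometimes a random one (exploration)
    dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
                   nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
    dqn.compile(Adam(lr=1e-3), metrics=['mae'])

    dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)   # train on random batches from the memory
    dqn.test(env, nb_episodes=5, visualize=False)             # evaluate the trained agent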
5-Q-Learning: Some elements about the last 2 examples
● The ReLU activation function: F(x) = relu(x, alpha=0.0, max_value=None, threshold=0.0)
➢ For the default values (our case), it is defined as: F(x) = max(0, x)
➢ In the other cases:
➔ f(x) = max_value for x >= max_value
➔ f(x) = x for threshold <= x < max_value
➔ f(x) = alpha * (x - threshold) otherwise
● VarianceScaling initializer of the weights: VarianceScaling(scale=1.0, mode='fan_in', distribution='normal', seed=None)
➢ With the normal distribution (our case), it draws from a truncated normal distribution centered on zero. The standard deviation depends on the mode.
➢ With distribution="uniform", samples are drawn from a uniform distribution within [-limit, limit], with limit = sqrt(3 * scale / n) (n depends on the mode). For more details, see the Keras documentation.
● Sigmoid activation function: F(x) = 1 / (1 + e^(−x))
● The linear activation function is simply the identity function.
References
● [Géron, 2017] Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc.
● [Gulli and Pal, 2017] Gulli, A. and Pal, S. (2017). Deep Learning with Keras. Packt Publishing Ltd.
● [Joshi, 2017] Joshi, P. (2017). Artificial Intelligence with Python. Packt Publishing.
● [Keras] Keras documentation. https://keras.io/. Accessed on 25-03-2019.
● [Keras rl] keras-rl, DQN CartPole example. https://github.com/keras-rl/keras-rl/blob/master/examples/. Accessed on 25-03-2019.
● [Kingma and Ba, 2014] Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. International Conference on Learning Representations.
● [Matthew, 2019] Matthew, B. (2019). How to deep control back propagation with keras. https://github.com/keras-team/keras/issues/956#issuecomment-458801928. Accessed on 25-03-2019.
Thank you for all your time!