Reinforcement Learning (RL) is a particular type of learning, useful when we want to learn from an unknown environment. This means that our model has to explore the environment in order to collect the data needed for its training. The model is represented as an Agent trying to achieve a certain goal in a particular environment. The Agent affects this environment by taking actions that change the state of the environment and generate rewards produced by that environment.
The learning relies on the generated rewards, and the goal is to maximize them. To choose the actions to apply, the agent uses a policy, which can be defined as the process the agent follows to choose the actions that allow it to optimize the overall reward. In this course, we will see two methods used to develop these policies: policy gradient and Q-Learning. We will implement our examples using the following libraries: OpenAI Gym, keras, tensorflow and keras-rl.
[Notebook 1](https://colab.research.google.com/drive/1395LU6jWULFogfErI8CIYpi35Y00YiRj)
[Notebook 2](https://colab.research.google.com/drive/1MpDS5rj-PwzzLIZtAGYnZ_jjEwhWZEdC)
1-Introduction
[By Amina Delali]
Concept
● Reinforcement Learning is a type of machine learning.
● It is based on simple concepts:
  ➢ A program called an Agent has to learn how to achieve a certain goal.
  ➢ It will learn from its interaction with the surrounding environment related to that goal.
  ➢ The agent learns by taking actions that affect the state of the surrounding environment.
  ➢ Two elements of information affect the choice of which action to take:
    ➔ The state of the environment
    ➔ The reward (the reinforcement) the agent will receive from the environment after taking an action.
1-Introduction
Learning
● The interaction between the agent and its environment can be modeled by the following diagram:
[Diagram: the Agent takes actions on the Environment; the Environment returns an input state and a reward to the Agent]
● Policy: a function that determines the action to be taken by the Agent given an input state.
● State: affected by the agent's actions.
● The goal of the agent will be to maximize the reward.
1-Introduction
Applications & Simulation
● Reinforcement Learning can be used for different types of applications:
  ➢ Robotics: the agent is the robot, and the environment is either the real world or a simulation of it. Its goal can be, for example, to reach a certain location, to clean a room, or to explore a building.
  ➢ Games: the agent is a player, and the environment is a simulation of the game. Its goal is to win the game.
● In most cases, the Agent needs a simulation of the environment.
● OpenAI Gym is a library that provides a set of simulated environments that can be used in different types of applications.
● Installation of OpenAI Gym:
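For example, from a notebook such as Google Colab (a minimal cell; the slides' screenshot may pin a specific version or add extras):

```python
# Install OpenAI Gym from a notebook cell (the leading "!" runs a shell command).
!pip install gym
```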
2-OpenAIGym
Elements to consider
● Here is a list of elements to consider when using the OpenAI Gym library:
  ➢ The available environments:
    ➔ There is a set of predefined environments that you select by their associated name.
    ➔ The name is passed as an argument to the function make.
    ➔ Before using the environment, you have to initialize it with the environment's method reset.
  ➢ The agent is modeled by:
    ➔ An action: a value from the available possible action values, obtained from the environment's action_space.
    ➔ Which action it takes at a certain point: you define a function (the policy) that will return the value of the action to be taken.
    ➔ How it takes the action: you call the environment's method step.
2-OpenAIGym
Elements to consider (continued)
  ➢ The simulation: the display of the simulated environment is available using the environment's method render.
  ➢ The reward and the new state of the environment are returned by the call of the method step. The method reset returns only the state values (after initialization). A minimal usage sketch is shown below.
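This sketch summarizes the points above (a generic illustration, not the notebook's exact code; the random policy is just a placeholder):

```python
import gym

env = gym.make("CartPole-v1")      # select a predefined environment by its name
state = env.reset()                # initialize: reset returns only the state values

def policy(state):
    # placeholder policy: sample a random action from the action space
    return env.action_space.sample()

action = policy(state)
state, reward, done, info = env.step(action)  # step returns the new state and the reward
env.render()                                  # display the simulated environment
env.close()
```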
● In order to render the environment in Google Colaboratory, we had to install certain additional libraries and add some statements for each render:
[Screenshots: (1) uninstall the library if it is already installed, (2) install the additional rendering libraries, (3) the statements added for each render]
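The exact Colab setup is in the notebook's screenshots; a common workaround (an assumption on our part, not necessarily the slides' exact approach) is to install a virtual display (for example xvfb and pyvirtualdisplay), render to an RGB array, and display it inline:

```python
import gym
import matplotlib.pyplot as plt

env = gym.make("CartPole-v1")
env.reset()

# One possible "statement added for each render" in Colab: render to an RGB
# array and show it with matplotlib instead of opening a window.
frame = env.render(mode='rgb_array')
plt.imshow(frame)
plt.axis('off')
plt.show()
```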
2-OpenAIGym
CartPole-V1 environment
● This environment simulates a pole attached by an un-actuated joint to a cart, which moves along a frictionless track.
● The goal for the agent will be to prevent the pole from falling.
● There are only 2 (discrete) possible actions to take: push the cart to the right (action == 1) or to the left (action == 0).
● The actions correspond to applying a +1 or -1 force to the cart.
● The state is described by 4 values: the cart position, the cart velocity, the pole angle, and the angular velocity.
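These action and observation spaces can be inspected directly (a small sketch using the gym API):

```python
import gym

env = gym.make("CartPole-v1")
print(env.action_space)        # Discrete(2): 0 = push left, 1 = push right
print(env.observation_space)   # Box with 4 values: cart position, cart velocity,
                               # pole angle, angular velocity
```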
2-OpenAIGym
CartPole-V1 environment (continued)
● The reward given to the agent for each step taken: in this environment, it is equal to +1 for any step taken as long as the pole remains upright. The termination step is also included.
● All the steps done in a simulation, from its reset until its termination, are called an episode.
● An episode in this environment is terminated if:
  ➢ The pole angle is more than 12 degrees
  ➢ The cart position is more than 2.4
  ➢ The episode length is greater than 200
● Solved requirements: the average reward is greater than or equal to 195.0 over 100 consecutive trials (episodes).
2-OpenAIGym
A simple example
● Initial state before taking any action (generated after a reset).
● The reset assigns to each state value a uniform random value in [-0.05, 0.05].
● We first move to the right (screenshots 1 and 2 in the notebook).
● One episode (one trial) with a maximum length of 600 steps.
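A sketch of such an episode (assumed structure; the notebook's code may differ in details such as the policy used after the first step):

```python
import gym

env = gym.make("CartPole-v1")
state = env.reset()            # state values drawn uniformly in [-0.05, 0.05]

total_reward = 0
for step in range(600):        # one episode with a maximum length of 600 steps
    if step == 0:
        action = 1             # we first move to the right
    else:
        action = env.action_space.sample()   # placeholder policy for the remaining steps
    state, reward, done, info = env.step(action)
    total_reward += reward
    if done:                   # the episode terminated (e.g. the pole fell)
        break

print("Episode length:", step + 1, "Total reward:", total_reward)
```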
2-OpenAIGym
A simple example (continued)
It is clear, from the picture representing the last state of the environment after its termination, that the move that caused the termination was a move to the left.
3-PolicyGradient
The problem
● We have to define the policy that the agent will use to choose the right action.
● This action must allow the agent to maximize the total reward obtained over the whole episode.
● So it is not only the reward obtained from the current action that must be maximized, but the sum of this reward and the following rewards obtained after applying the subsequent actions, up to the end of the episode.
● Several approaches exist to solve this problem. Most of them involve the use of neural networks.
3-PolicyGradient
Use of neural networks
● Neural networks can be used in different manners.
● We will talk about 2 different ways to use a neural network:
  ➢ The neural network is itself the policy: it will predict the action to apply.
  ➢ The neural network is combined with a policy:
    ➔ The neural network learns to predict a certain value, known as the Q-Value, computed for a (state, action) pair.
    ➔ The policy is used to select the action to apply: for example, after predicting the Q-values of a state for all the possible actions, select the action that corresponds to the maximum Q-value.
3-PolicyGradient
The discounted rewards
● As we said previously, a neural network can either learn to predict the action to apply directly, or learn to predict a Q-value that a policy then uses to select the action to apply.
● Each action can be evaluated by the reward the agent gets immediately by taking this action, plus the following discounted rewards it gets when it applies the subsequent actions until the end of the episode.
● For a sequence of actions A1, A2, ..., An with rewards r1, r2, ..., rn (An being the last action just before the end of the episode):

  Value(A1) = r1 + γ⋅r2 + γ²⋅r3 + ⋯ + γ^(n−1)⋅rn = r1 + γ⋅Value(A2)

  More generally, for any action Ai: Value(Ai) = ri + γ⋅Value(Ai+1)
● γ is the discount factor: it represents the importance of the future actions. A good value of γ is around 0.9.
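These recursive values can be computed from the list of rewards of one episode, for example as follows (a small helper sketch; the names are ours):

```python
def discounted_values(rewards, gamma=0.9):
    """Compute Value(Ai) = ri + gamma * Value(Ai+1) for every action of an episode."""
    values = [0.0] * len(rewards)
    running = 0.0
    # walk the episode backwards: the value of the last action is just its reward
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        values[i] = running
    return values

print(discounted_values([1.0, 1.0, 1.0]))  # approximately [2.71, 1.9, 1.0]
```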
3-PolicyGradient
Policy Gradient
● It implies the use of gradients: the values computed from the derivation of a cost function (for example the MSE error function) and applied by an optimizer (for example the SGD algorithm) to update the parameters of the policy (in our case, the neural network policy: it updates the weights of the layers of the ANN).
● The gradients must also maximize the rewards obtained.
● One of the available algorithms is the REINFORCE algorithm.
● We will discuss and apply a variant of that algorithm explained in [Géron, 2017].
3-PolicyGradient
Policy Gradient: Example
● To train the neural network to predict the optimum action to apply, we follow these steps:
  ➢ Apply only feedforward computations for a certain number of episodes:
    ➔ The neural network gets a state as input. In our example (the CartPole environment), the state is represented by 4 values.
    ➔ The neural network will output probabilities related to each action to take. In our case it will be one probability: the probability to go left (to take the action 0).
    ➔ From this probability, take the corresponding action ==> this implies new state values and a returned reward.
    ➔ To compute the gradients corresponding to this probability, we need the corresponding labels. We will consider that the taken action was the action to take ==> if the action was 0 the target probability must be 1 (100% chance to go left); if the action was 1, the target probability must be 0 (0% chance to go left).
3-PolicyGradient
Policy Gradient: Example (continued)
    ➔ Compute the gradients with this label.
    ➔ Use the new state values as the new input, and save the obtained reward.
    ➔ Repeat this process for a certain number of episodes.
  ➢ Compute the values of the taken actions:
    ➔ Since for each action we have all the following ones, we can compute its value.
    ➔ Normalize the computed values, so that good values will be positive and bad ones will be negative.
  ➢ To take these action values into account, multiply them with the gradients.
  ➢ Compute the mean of the new gradients, and use them to update the weights ==> apply one backpropagation step. A sketch of this procedure is shown below.
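A condensed sketch of this variant (our own simplification inspired by [Géron, 2017], not the notebook's exact code; layer sizes and hyperparameters are placeholders):

```python
import numpy as np
import gym
import tensorflow as tf
from tensorflow import keras

env = gym.make("CartPole-v1")

model = keras.Sequential([
    keras.layers.Dense(4, activation="elu", input_shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),   # probability of going left (action 0)
])
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.binary_crossentropy

def play_one_step(env, state):
    # feedforward pass + gradient computation for one step
    with tf.GradientTape() as tape:
        left_proba = model(state[np.newaxis])                          # P(action == 0)
        logits = tf.math.log(tf.concat([left_proba, 1.0 - left_proba], axis=1))
        action = tf.random.categorical(logits, num_samples=1)          # multinomial sampling
        y_target = 1.0 - tf.cast(action, tf.float32)                   # pretend the chosen action was right
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    next_state, reward, done, info = env.step(int(action[0, 0].numpy()))
    return next_state, reward, done, grads

def action_values(rewards, gamma=0.9):
    # Value(Ai) = ri + gamma * Value(Ai+1), computed backwards over one episode
    values = np.array(rewards, dtype=np.float32)
    for i in range(len(values) - 2, -1, -1):
        values[i] += gamma * values[i + 1]
    return values

# 1) play a few episodes with feedforward computations only
all_values, all_grads = [], []
for episode in range(5):
    state, rewards, episode_grads = env.reset(), [], []
    for step in range(200):
        state, reward, done, grads = play_one_step(env, state)
        rewards.append(reward)
        episode_grads.append(grads)
        if done:
            break
    all_values.append(action_values(rewards))
    all_grads.append(episode_grads)

# 2) normalize the action values (good actions > 0, bad actions < 0)
flat = np.concatenate(all_values)
all_values = [(v - flat.mean()) / flat.std() for v in all_values]

# 3) weight the gradients by the action values, average them, apply one update
mean_grads = []
for var_index in range(len(model.trainable_variables)):
    weighted = [value * ep_grads[step][var_index]
                for ep_grads, ep_values in zip(all_grads, all_values)
                for step, value in enumerate(ep_values)]
    mean_grads.append(tf.reduce_mean(weighted, axis=0))
optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
```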
3-PolicyGradient
Selecting the action while training
● To select the action to take (while training) from the outputted probability, we will use a multinomial distribution.
● Instead of selecting the action corresponding to the biggest probability (in our case, since we have one probability, we would select the action 0 if the probability is > 0.5, and the action 1 if the probability is <= 0.5), we will use these probabilities in a sampling function (the multinomial distribution) to generate the action values with these probabilities.
● For example, in our case, if the neural network outputs 0.7 ==> the action 1 will have the probability 0.3. The multinomial distribution will then have to return one value. That value will have a 0.7 probability of being a 0, and a 0.3 probability of being a 1.
● To do this, we will give the multinomial distribution the log of the outputted probabilities (in our example, log(0.7) and log(0.3)).
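In TensorFlow, this sampling can be done for instance with tf.random.categorical (tf.multinomial in older versions), which expects the log-probabilities (a small sketch reproducing the example above):

```python
import tensorflow as tf

left_proba = tf.constant([[0.7]])                              # network output: P(action == 0)
probas = tf.concat([left_proba, 1.0 - left_proba], axis=1)     # [P(left), P(right)] = [0.7, 0.3]
log_probas = tf.math.log(probas)                               # the sampler expects log-probabilities
action = tf.random.categorical(log_probas, num_samples=1)      # 0 with probability 0.7, 1 with probability 0.3
print(int(action[0, 0]))
```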
3-PolicyGradient
Implementation
● We will adapt the implementation presented in [Géron, 2017] by using keras and tensorflow.
● Before going further, we will introduce the optimizer and the cost function that we will use:
  ➢ The cost function will be the cross-entropy function. We already defined it in the previous course.
  ➢ The optimizer will be the Adam optimizer. Adam stands for Adaptive Moment Estimation. It is an adaptive learning rate algorithm: these algorithms diminish the effect of the learning rate for the steepest (fastest) dimensions (features).
  ➢ It also keeps track of an exponentially decaying average of past gradients (momentum principle) and of past squared gradients.
3-PolicyGradient
Adam Optimizer
● The Adam algorithm can be described as follows (from [Kingma and Ba, 2014]):
  ➢ Initialize the parameter vector θ0 (the parameters correspond to the parameters that we noted w in the previous course)
  ➢ Initialize m0 ← 0, v0 ← 0, t ← 0
  ➢ While the parameters θt have not converged:
    ➔ t ← t + 1
    ➔ gt ← ∇θ ft(θt−1)    (the gradients: derivation of the cost function f)
    ➔ mt ← β1⋅mt−1 + (1 − β1)⋅gt    (the decaying average of the gradients)
    ➔ vt ← β2⋅vt−1 + (1 − β2)⋅gt²    (the decaying average of the squared gradients)
    ➔ m̂t ← mt / (1 − β1^t)    (bias correction)
    ➔ v̂t ← vt / (1 − β2^t)    (bias correction)
    ➔ θt ← θt−1 − η⋅m̂t / (√v̂t + ϵ)
  ➢ End while
● θt, gt, mt, vt, m̂t and v̂t are all vectors. The index t indicates a timestep: the values of the vectors at time t (iteration t). The multiplications and the division between vectors are element-wise.
● β1 and β2 are the momentum decay hyperparameters, η is the learning rate, and ϵ is a smoothing term used to avoid division by zero. Good default values are: η = 0.002, β1 = 0.9, β2 = 0.999.
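As an illustration, one Adam update step can be written directly in NumPy (a didactic sketch; with keras you would simply use keras.optimizers.Adam, which does this for you):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One parameter update of the Adam algorithm described above (element-wise)."""
    t += 1
    m = beta1 * m + (1 - beta1) * grad            # decaying average of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # decaying average of the squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t

theta = np.array([0.5, -0.3])
m, v, t = np.zeros_like(theta), np.zeros_like(theta), 0
theta, m, v, t = adam_step(theta, np.array([0.1, -0.2]), m, v, t)
print(theta)
```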
4-Example
The training
● We set num_iterations to 25 (instead of 250), and max_episodes to 5 (instead of 10), for execution time reasons.
4-Example
The Prediction
● Since we didn't let the model train long enough, we will not get good results.
5-Q-Learning
Q-Learning
● As before, a neural network can either learn to predict the action to apply, or learn to estimate a Q-value that is then used to select the action to apply.
● A Q-value is an estimation of the maximum discounted future reward when an action a is performed on a state s: Q(st, at) = max(Rt+1), considering that the RL problem can be modeled as a Markov Decision Process.
● It is computed using the Bellman equation:

  Q(st, at) = r + γ⋅max_{at+1} Q(st+1, at+1)

● The neural network will have to learn to estimate these Q-values ==> Q-Learning.
● If the neural network is a deep NN, we talk about Deep Q-Learning (DQN).
● In general, the learning involves the use of an initialized Q-table containing the Q-value of each possible (state, action) pair.
● In general, the NN will perform random actions (exploration), and use the obtained rewards to correct the Q-table values.
● When its estimations become better, it will use them to choose the actions (exploitation), and continue the training. A tabular sketch of this update is shown below.
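A tabular illustration of the Bellman update (a generic Q-table sketch with a learning rate, independent of the neural-network case; the state and action counts are placeholders):

```python
import numpy as np

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))          # initialized Q-table
gamma, alpha = 0.9, 0.1                      # discount factor and learning rate

def q_update(state, action, reward, next_state):
    """Move Q(s, a) towards the Bellman target r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=3)
print(Q[0])
```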
5-Q-Learning
Example: Introduction
● For our example, we will use the library keras-rl.
● It is a library that implements some deep reinforcement learning algorithms in Python.
● It integrates with keras and works with OpenAI Gym environments.
● In the example, there is a use of a Memory object. In fact, DQN learning involves the use of a replay memory:
  ➢ The actions performed by the neural network are collected in a replay memory with all the information related to each action (the starting state, the obtained reward, and the resulting state after performing the action).
  ➢ During the training, the DNN will use random batches from this memory as training data.
● To write our example, we used the tutorial available at [keras rl] with some modifications. A condensed sketch in the same spirit is shown below.
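This sketch follows the spirit of the [keras rl] CartPole example (our abbreviation of it; the network size and the numbers of steps are arbitrary here):

```python
import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('CartPole-v1')
nb_actions = env.action_space.n

# simple network that maps a state to one Q-value per action
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(16, activation='relu'),
    Dense(nb_actions, activation='linear'),
])

memory = SequentialMemory(limit=50000, window_length=1)   # the replay memory
policy = EpsGreedyQPolicy()                               # exploration / exploitation policy
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)   # training
dqn.test(env, nb_episodes=5, visualize=False)             # evaluation
```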
5-Q-Learning
Some elements about the last 2 examples
● The ReLU activation function:
  F(x) = relu(x, alpha=0.0, max_value=None, threshold=0.0)
  ➢ For the default values (our case), it is defined as: F(x) = max(0, x)
  ➢ In the other cases:
    ➔ f(x) = max_value for x >= max_value
    ➔ f(x) = x for threshold <= x < max_value
    ➔ f(x) = alpha * (x - threshold) otherwise
● The VarianceScaling initializer of the weights:
  VarianceScaling(scale=1.0, mode='fan_in', distribution='normal', seed=None)
  ➢ With the normal distribution (our case), it draws samples from a truncated normal distribution centred on zero. The standard deviation depends on the mode.
  ➢ With distribution="uniform", samples are drawn from a uniform distribution within [-limit, limit], with limit = sqrt(3 * scale / n) (n depends on the mode). For more details, see the keras documentation.
● The sigmoid activation function:
  F(x) = 1 / (1 + e^(−x))
● The linear activation function is simply the identity function.
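For reference, these elements appear in the examples roughly as follows (a sketch of individual keras layers using them; the layer sizes are placeholders):

```python
from keras.layers import Dense
from keras.initializers import VarianceScaling

# hidden layer: ReLU activation (max(0, x) with the default parameters) and
# VarianceScaling weight initializer (truncated normal centred on zero by default)
hidden = Dense(16,
               activation='relu',
               kernel_initializer=VarianceScaling(scale=1.0, mode='fan_in',
                                                  distribution='normal'))

# output layer for a probability: sigmoid activation, F(x) = 1 / (1 + e^(-x))
output = Dense(1, activation='sigmoid')
```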
References
● [Géron, 2017] Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc.
● [Gulli and Pal, 2017] Gulli, A. and Pal, S. (2017). Deep Learning with Keras. Packt Publishing Ltd.
● [Joshi, 2017] Joshi, P. (2017). Artificial Intelligence with Python. Packt Publishing.
● [Keras] Keras. Keras documentation. https://keras.io/. Accessed on 25-03-2019.
● [keras rl] keras-rl. DQN CartPole example. https://github.com/keras-rl/keras-rl/blob/master/examples/. Accessed on 25-03-2019.
● [Kingma and Ba, 2014] Kingma, D. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations.
● [Matthew, 2019] Matthew, B. (2019). How to deep control back propagation with keras. https://github.com/keras-team/keras/issues/956#issuecomment-458801928. Accessed on 25-03-2019.